My assumption is that there could be higher dimensional 'token' representations that can be used instead of vision tokens (though obviously not human interpretable). I wonder if there is an optimisation for compression that takes the vector space at a specific level in the network to provide the most context for minimal memory space.