LLMs share more with a Markov chain than many would like to admit, but the fundamental improvement is that the model learns the state representation, and that latent representation is essentially a lossy compression of nearly all written human text.
People really underestimate both the size of the parameter space needed to learn this and the scale of the training data.
But, at the end of the day, the model is still just sampling one token at a time and updating its state (and of course, that state is conditioned on all previous tokens, so that's another point of departure).
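To make that loop concrete, here's a minimal Python sketch of autoregressive sampling. `next_token_probs` is a hypothetical stand-in for the model (in a real LLM it would be a forward pass over the whole prefix); the point is only the loop structure, where every step conditions on the entire prefix rather than a fixed-order window as in a classical Markov chain.

```python
import numpy as np

VOCAB_SIZE = 8
rng = np.random.default_rng(0)

def next_token_probs(prefix):
    # Hypothetical stand-in for the model: a real LLM would run a forward
    # pass over the full prefix and return a distribution over the vocabulary.
    logits = rng.normal(size=VOCAB_SIZE) + 0.1 * len(prefix)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def sample(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)                      # conditioned on all previous tokens
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))   # sample exactly one token, append, repeat
    return tokens

print(sample([1, 2, 3]))
```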