
Why is it that LLMs are ‘stochastic’? Shouldn’t the same input lead to the same output? Is the LLM somehow modifying itself in production, or is it just bit flips caused by cosmic radiation?


For Mixture of Experts models (which the GPTs are), the same input sequence can produce different results if it is retried together with a different set of sequences in its inference batch, because the (“expert”) routing depends on the batch, not on the single sequence alone: https://152334h.github.io/blog/non-determinism-in-gpt-4/
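
To make the batch dependence concrete, here is a toy sketch in Python (a hypothetical top-1 router with a per-expert capacity of 1; real serving stacks are far more elaborate, but the mechanism is the same): whether a token keeps its preferred expert depends on what else landed in the batch.

    import numpy as np

    def route_top1(scores, capacity):
        # Greedy top-1 routing with a per-expert capacity limit,
        # processed in token order (so batch mates compete for slots).
        load = [0] * scores.shape[1]
        out = []
        for tok in scores:
            for expert in np.argsort(tok)[::-1]:  # best-scoring expert first
                if load[expert] < capacity:
                    out.append(int(expert))
                    load[expert] += 1
                    break
        return out

    my_tok = [0.9, 0.1]  # our token: prefers expert 0

    batch_a = np.array([[0.2, 0.8], my_tok])  # neighbour wants expert 1
    batch_b = np.array([[0.8, 0.2], my_tok])  # neighbour wants expert 0

    print(route_top1(batch_a, capacity=1))  # [1, 0]: our token gets expert 0
    print(route_top1(batch_b, capacity=1))  # [0, 1]: expert 0 full, ours is bumped

Same token, same router scores, different expert, hence a (slightly) different output.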

And in general, binary floating-point arithmetic does not guarantee associativity - i.e. `(a + b) + c` might not equal `a + (b + c)`. That in turn can lead to the model picking a different token in rare cases, and because generation is auto-regressive, the entire remainder of the generated sequence may then differ: https://www.ingonyama.com/blog/solving-reproducibility-chall...
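
You can see the non-associativity with ordinary doubles in any language; in Python:

    # IEEE-754 addition is not associative: grouping changes the rounding.
    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)                 # 0.6000000000000001
    print(a + (b + c))                 # 0.6
    print((a + b) + c == a + (b + c))  # False

On a GPU, parallel reductions can sum in a different order from run to run, so two logits that are essentially tied can swap ranks and flip the argmax.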

Edit: Of course, my answer assumes the case where the model lets you set its token-generation temperature (its sampling randomness) to exactly zero. With default parameter settings, all LLMs I know of randomly pick among the most likely tokens.
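
For a concrete picture of what temperature does, here is a minimal sketch with made-up logits for three candidate tokens:

    import numpy as np

    rng = np.random.default_rng()
    logits = np.array([2.0, 1.5, 0.3])  # model scores for 3 candidate tokens

    def pick_token(logits, temperature):
        if temperature == 0:
            return int(np.argmax(logits))  # greedy: always the same token
        probs = np.exp(logits / temperature)
        probs /= probs.sum()               # softmax over scaled logits
        return int(rng.choice(len(logits), p=probs))

    print(pick_token(logits, 0))    # deterministic: always token 0
    print(pick_token(logits, 1.0))  # stochastic: usually 0, sometimes 1 or 2

Lower temperatures sharpen the distribution toward the argmax; higher ones flatten it.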


They always return the same output for the same input. That is how tests are done for llama.cpp, for example.

To get variety, you give each person a different seed. That way each user gets consistent answers, but different ones from other users. You can add some per-call randomness if you don’t want the same person getting the same output for the same input.
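
A sketch of that seeding scheme (a hypothetical helper, not llama.cpp’s actual API; the three-token distribution stands in for the model):

    import zlib
    import numpy as np

    def generate(prompt, user_seed, n_tokens=5):
        # Derive a deterministic seed from the prompt and the user's seed,
        # so the same (prompt, user) pair always samples the same tokens.
        rng = np.random.default_rng(zlib.crc32(prompt.encode()) ^ user_seed)
        probs = [0.6, 0.3, 0.1]  # stand-in for the model's token distribution
        return [int(rng.choice(3, p=probs)) for _ in range(n_tokens)]

    print(generate("hello", user_seed=1))  # reproducible for user 1
    print(generate("hello", user_seed=1))  # identical to the line above
    print(generate("hello", user_seed=2))  # user 2: different but equally stable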

It would be impossible to test and benchmark llama.cpp et al otherwise!

By the time you get to a UI someone has made these decisions for you.

It’s just math underneath!

Hope this helps.


They choose each output token probabilistically. Check out 3b1b's series on LLMs for a better understanding!



