Hacker News

> Small models only trained on 800B tokens, compared to 1T for llama-7B

LLaMA is trained far beyond Chinchilla optimality, so this is not as surprising to me.



But Chinchilla optimality, while an interesting result, is a strange target for most practical purposes. Training happens once; inference happens many times. Stopping at the point where it becomes cheaper to train a larger model for the same (proxy for) quality effectively discounts the cost of inference to zero.


Yep, but if Stability has the goal of training the best possible model, then that would explain the choices they made.


I mean, 800B tokens on a 3B or 7B model is still way beyond the Chinchilla-optimal scale.
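For reference, a minimal sketch of that arithmetic, using the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter (the 20x constant is an approximation of the fitted scaling law, not a figure from this thread):

```python
# Rough Chinchilla rule of thumb: compute-optimal training uses
# ~20 tokens per model parameter. The exact optimum depends on the
# fitted scaling-law constants, so treat these numbers as ballpark.
CHINCHILLA_TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training token count for a model size."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

for n_params in (3e9, 7e9):
    optimal = chinchilla_optimal_tokens(n_params)
    ratio = 800e9 / optimal  # how far past "optimal" an 800B-token run goes
    print(f"{n_params/1e9:.0f}B params: optimal ~{optimal/1e9:.0f}B tokens, "
          f"800B tokens is {ratio:.1f}x that")
```

By this estimate, 800B tokens is roughly 13x the compute-optimal budget for a 3B model and nearly 6x for a 7B model.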


They're going to 1.5T and possibly 3T. The 800B is just for the "Alpha" checkpoints released today. New checkpoints will be released later.


According to this, LLaMA still didn't go far enough: https://www.harmdevries.com/post/model-size-vs-compute-overh...


Yep, it depends on what your goal is.


This doesn't say that LLaMA didn't go far enough.


Not exactly, but it did say they could have gone further than they did without wasting time and energy on infinitesimally small gains.



