Hacker News

But Chinchilla optimality, while an interesting result, is a strange target for most practical purposes. Training happens once; inference happens many times. Not training past the point where it's cheaper to train a larger model to the same (proxy for) quality implicitly discounts the cost of inference to zero.
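A toy illustration of that tradeoff (all numbers here are made up for the sketch, not real training or serving costs): an over-trained smaller model can cost more to train to a given quality than a Chinchilla-optimal larger model, yet win on lifetime cost once enough inference tokens are served.

```python
# Hypothetical illustration: total cost = one-time training cost plus
# lifetime inference cost.  Numbers are invented for the example.

def total_cost(train_cost: float, cost_per_1k_tokens: float,
               tokens_served: float) -> float:
    """One-time training cost plus cumulative inference cost."""
    return train_cost + cost_per_1k_tokens * tokens_served / 1_000

# Assumed: both models reach the same quality.  The larger model is
# cheaper to train to that quality but costs more per inference token.
tokens = 1e12  # lifetime tokens served (assumed)
big   = total_cost(train_cost=1.0e6, cost_per_1k_tokens=0.002, tokens_served=tokens)
small = total_cost(train_cost=1.5e6, cost_per_1k_tokens=0.001, tokens_served=tokens)

print(f"larger, Chinchilla-optimal model: ${big:,.0f}")    # $3,000,000
print(f"smaller, over-trained model:      ${small:,.0f}")  # $2,500,000
```

At low serving volume the larger model wins; past a break-even point the over-trained smaller model is cheaper overall, which is why serving-heavy deployments train well past Chinchilla-optimal token counts.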


Yep, but if Stability's goal is to train the best possible model, that would explain the choices they made.


I mean, 800B tokens on a 3B model and a 7B model is still way beyond the Chinchilla-optimal scale.
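To make "way beyond" concrete, here's a quick sketch using the commonly cited ~20-tokens-per-parameter summary of the Chinchilla result (an approximation of the paper's fitted scaling law, not the exact fit):

```python
# Assumed heuristic: Chinchilla compute-optimal training is often
# summarized as roughly 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a given parameter count."""
    return TOKENS_PER_PARAM * n_params

for params in (3e9, 7e9):
    optimal = chinchilla_optimal_tokens(params)
    actual = 800e9
    print(f"{params/1e9:.0f}B params: ~{optimal/1e9:.0f}B optimal tokens, "
          f"trained on {actual/1e9:.0f}B ({actual/optimal:.1f}x over)")
```

Under that heuristic, 800B tokens is roughly 13x the compute-optimal count for a 3B model and roughly 6x for a 7B model.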


They're going to 1.5T and possibly 3T tokens. The 800B is just for the "Alpha" checkpoints released today; new checkpoints will be released later.



