Hacker News

> Small models only trained on 800B tokens, compared to 1T for llama-7B

LLaMA is trained far beyond Chinchilla optimality, so this is not as surprising to me.



But Chinchilla optimality, while an interesting result, is a strange target for most practical purposes. Training happens once; inference happens many times. Stopping at the point where it becomes cheaper to train a larger model for the same (proxy for) quality effectively discounts the cost of inference to zero.


Yep, but if Stability has the goal of training the best possible model, then that would explain the choices they made.


I mean, 800B tokens on a 3B or 7B model is still way beyond the Chinchilla-optimal scale.
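For reference, a minimal sketch of that arithmetic, using the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter (the 20x constant is an approximation of the fitted scaling law, not a figure from this thread):

```python
# Rough Chinchilla rule of thumb: compute-optimal training uses
# ~20 tokens per model parameter. The exact optimum depends on the
# fitted scaling-law constants, so treat these numbers as ballpark.
CHINCHILLA_TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training token count for a model size."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

for n_params in (3e9, 7e9):
    optimal = chinchilla_optimal_tokens(n_params)
    ratio = 800e9 / optimal  # how far past "optimal" an 800B-token run goes
    print(f"{n_params/1e9:.0f}B params: optimal ~{optimal/1e9:.0f}B tokens, "
          f"800B tokens is {ratio:.1f}x that")
```

By this estimate, 800B tokens is roughly 13x the compute-optimal budget for a 3B model and nearly 6x for a 7B model.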


They're going to 1.5T and possibly 3T. The 800B is just for the "Alpha" checkpoints released today. New checkpoints will be released later.


According to this, LLaMA still didn't go far enough: https://www.harmdevries.com/post/model-size-vs-compute-overh...


Yep, it depends on what your goal is.


This doesn't say that LLaMA didn't go far enough.


Not exactly, but it did say they could have gone further than they did without wasting time and energy on infinitesimally small gains.



