
https://huggingface.co/docs/transformers/perf_train_gpu_one#...

You can't train a 100T-parameter model with "only" 100 TB of VRAM. Each parameter needs 4 bytes (weights) + 4 bytes (gradients) + 8 bytes (AdamW optimizer states), plus forward activations whose size depends on batch size, sequence length, etc., and possibly more under mixed precision. On top of that, you also need to distribute the weights across devices.
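A quick back-of-the-envelope sketch in Python, assuming the 16 bytes/parameter figure above (fp32 weights + gradients + two AdamW moments) and ignoring activations entirely:

    def min_training_memory_tb(num_params: float, bytes_per_param: int = 16) -> float:
        """Lower bound on training memory in TB, ignoring activations.

        bytes_per_param = 4 (fp32 weights) + 4 (gradients)
                        + 8 (AdamW first and second moments).
        Mixed precision changes the breakdown but doesn't remove
        the optimizer-state overhead.
        """
        return num_params * bytes_per_param / 1e12

    print(min_training_memory_tb(100e12))  # 1600.0 TB for model states alone

So a 100T model needs at least ~1600 TB just for the model states, 16x the 100 TB budget, before counting a single activation.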


