You can't train a 100T-parameter model with "only" 100 TB of VRAM. For each parameter you need roughly 4 bytes for the weight (fp32) + 4 bytes for the gradient + 8 bytes for the AdamW optimizer state (the two moments), so about 16 bytes per parameter before you even count forward activations, which depend on batch size, sequence length, etc. Mixed precision can add even more, since you keep an fp32 master copy alongside the half-precision weights. And on top of all that, you still have to shard the weights across devices.
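A quick back-of-the-envelope in Python, just to make the 16 bytes/param figure concrete (a rough sketch assuming plain fp32 training with AdamW; activations are left out entirely):

```python
# Rough memory math for fp32 training with AdamW.
# Ignores activations, which depend on batch size, sequence
# length, and whether you use activation checkpointing.

N_PARAMS = 100e12  # 100T parameters

bytes_per_param = (
    4    # fp32 weight
    + 4  # fp32 gradient
    + 8  # AdamW state: two fp32 moments (m and v)
)

total_bytes = N_PARAMS * bytes_per_param
print(f"{total_bytes / 1e15:.1f} PB")  # -> 1.6 PB, 16x the "100 TB"
```

So even under these generous assumptions you're at ~1.6 PB, an order of magnitude over 100 TB, before a single activation is stored.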