
The general rule of thumb that I'm familiar with is that you need about 80 bytes of VRAM per parameter when you are doing training. Inference is different and a lot more efficient, and LoRA is also different and more efficient, but training a base model requires a LOT of memory.

A machine like this would top out below 2 trillion parameters using the training algorithms that I'm familiar with.
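As a rough sanity check on that rule of thumb, here is a back-of-envelope sketch; the total memory figure is a hypothetical placeholder, not the actual spec of this machine:

    # Back-of-envelope: how many parameters fit at ~80 bytes per parameter?
    TOTAL_BYTES = 160 * 10**12      # hypothetical ~160 TB of memory (placeholder value)
    BYTES_PER_PARAM = 80            # rule of thumb for full training, as above

    max_params = TOTAL_BYTES / BYTES_PER_PARAM
    print(f"{max_params / 1e12:.1f} trillion parameters")   # ~2 trillion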



I suppose it would be 12 bytes? 4 bytes for the base model weights, 4 bytes for the optimizer's first moment (momentum), and 4 bytes for the optimizer's second-moment EMA.
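A sketch of that accounting, assuming plain fp32 weights and Adam-style optimizer state:

    FP32 = 4
    weight = FP32    # model parameter
    adam_m = FP32    # optimizer first moment (momentum)
    adam_v = FP32    # optimizer second moment (EMA of squared gradients)
    print(weight + adam_m + adam_v)   # 12 bytes per trainable parameter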


I don't know what the breakdown is, but I know there was code for training the LLaMA models on a DGX (640 GB of VRAM, repo is now gone), and it could only train the 7b model without using DeepSpeed ZeRO stage 3 (offloading).

The ML engineers in my group chat say "1 DGX to train 7b at full speed, 2 for 13b, 4 for 30b, 8 for 65b"


Why 80? It's matrix operations on 4-byte numbers for single precision.


Because you need a lot more information to perform back-propagation.


It's not "a lot more" information; it's holding one derivative (a single number) per parameter, right?


For automatic differentiation (backpropagation) you need to store the intermediate results (activations) of the forward pass for each layer. With checkpointing you can store only every nth layer's activations and recompute the rest, trading memory for extra compute.
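A minimal PyTorch sketch of that idea; the layer count and segment split here are illustrative, not from the thread:

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint_sequential

    # 24 identical blocks; only the activations at the 4 segment boundaries
    # are kept, the rest are recomputed during the backward pass.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(24)])
    x = torch.randn(8, 4096, requires_grad=True)

    y = checkpoint_sequential(model, 4, x)
    y.sum().backward()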


What intermediate results do you need to store?

For backpropagation you take the difference between the actual and expected output and go backwards to calculate the derivative and apply it with the optimiser - that's 8 extra bytes per trainable parameter for single-precision floats.

Why do you need 80?


You also need the optimizer's state (e.g. Adam's), which is usually double the parameter's size. So with fp16, one parameter takes up 6 bytes in memory.
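For example, in PyTorch you can see Adam's two extra buffers directly (a small sketch, not from the thread):

    import torch

    w = torch.nn.Parameter(torch.randn(1000, 1000))
    opt = torch.optim.Adam([w])

    w.sum().backward()
    opt.step()

    s = opt.state[w]
    # Both buffers have the same shape as the parameter itself.
    print(s["exp_avg"].shape, s["exp_avg_sq"].shape)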


Yes, if you use Adam - but it doesn't add up to 80, does it?

Even with fp64 it adds only 16 bytes.

RMSProp and Adagrad have half of this overhead.

SGD has no optimizer overhead, of course.
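Putting that comparison in one place (rough fp32 numbers matching the claims above; plain SGD, i.e. without momentum):

    FP32 = 4
    optimizer_state_bytes_per_param = {
        "SGD (no momentum)": 0,
        "Adagrad / RMSProp": 1 * FP32,   # one accumulator per parameter
        "Adam / AdamW":      2 * FP32,   # first + second moment per parameter
    }
    for name, b in optimizer_state_bytes_per_param.items():
        print(f"{name}: {b} bytes per parameter")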


It's not just per-parameter state; you also need to hold activations for backprop to work.


You need activations for inference as well.

But all of that (trainable parameters, activations, optimizer state) is like 12 bytes per trainable parameter, not 80.


Not the GP, but I believe that they are talking about the size of the training data set in relation to the model size.


You don't need to, and can't really, load all the training data.

For LLMs you only need to load a single context-sized row, i.e. a vector of ~8k numbers, which is 32 kB for single-precision floats.
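A quick check of that number (assuming an 8k context and fp32 values):

    context_len = 8 * 1024      # 8k entries
    fp32 = 4                    # bytes per single-precision float
    print(context_len * fp32)   # 32768 bytes = 32 kB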



