
The general rule of thumb that I'm familiar with is that you need about 80 bytes of VRAM per parameter when you are doing training. Inference is different and a lot more efficient, and LoRA is also different and more efficient, but training a base model requires a LOT of memory.

A machine like this would top out below 2 trillion parameters using the training algorithms that I'm familiar with.
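As a rough sanity check on that rule of thumb, here is a back-of-envelope sketch; the total memory figure is a hypothetical placeholder, not the actual spec of this machine:

    # Back-of-envelope: how many parameters fit at ~80 bytes per parameter?
    TOTAL_BYTES = 160 * 10**12      # hypothetical ~160 TB of memory (placeholder value)
    BYTES_PER_PARAM = 80            # rule of thumb for full training, as above

    max_params = TOTAL_BYTES / BYTES_PER_PARAM
    print(f"{max_params / 1e12:.1f} trillion parameters")   # ~2 trillion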



I suppose it would be 12 bytes? 4 bytes for the base model weights, 4 bytes for the optimizer's first moment (momentum), and 4 bytes for the optimizer's second-moment EMA.
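A sketch of that accounting, assuming plain fp32 weights and Adam-style optimizer state:

    FP32 = 4
    weight = FP32    # model parameter
    adam_m = FP32    # optimizer first moment (momentum)
    adam_v = FP32    # optimizer second moment (EMA of squared gradients)
    print(weight + adam_m + adam_v)   # 12 bytes per trainable parameter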


I don't know what the breakdown is, but I know there was code for training the LLaMA models on a DGX (640 GB of VRAM, repo is now gone), and it could only train the 7b model without using DeepSpeed ZeRO stage 3 (offloading).

The ML engineers in my group chat say "1 DGX to train 7b at full speed, 2 for 13b, 4 for 30b, 8 for 65b"


Why 80? It's matrix operations on 4-byte numbers for single precision.


Because you need a lot more information to perform back-propagation.


It's not "a lot more" information; it's holding one derivative (a single number) per parameter, right?


For automatic differentiation (backpropagation) you need to store the intermediate results (activations) of the forward pass for each layer. With checkpointing you can store only every nth layer's activations and recompute the rest, trading memory for extra compute.
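A minimal PyTorch sketch of that idea; the layer count and segment split here are illustrative, not from the thread:

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint_sequential

    # 24 identical blocks; only the activations at the 4 segment boundaries
    # are kept, the rest are recomputed during the backward pass.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(24)])
    x = torch.randn(8, 4096, requires_grad=True)

    y = checkpoint_sequential(model, 4, x)
    y.sum().backward()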


What intermediate results do you need to store?

For backpropagation you take the difference between the actual and expected output and go backwards to calculate the derivative and apply it with the optimiser - that's 8 extra bytes per trainable parameter for single-precision floats.

Why do you need 80?


You also need the optimizer's state (e.g. Adam's), which is usually double the parameter's size. So with fp16, one parameter takes up 6 bytes in memory.
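For example, in PyTorch you can see Adam's two extra buffers directly (a small sketch, not from the thread):

    import torch

    w = torch.nn.Parameter(torch.randn(1000, 1000))
    opt = torch.optim.Adam([w])

    w.sum().backward()
    opt.step()

    s = opt.state[w]
    # Both buffers have the same shape as the parameter itself.
    print(s["exp_avg"].shape, s["exp_avg_sq"].shape)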


Yes, if you use Adam - but it doesn't add up to 80, does it?

Even with fp64 it adds only 16 bytes.

RMSProp and Adagrad have half of this overhead.

SGD has no optimizer overhead, of course.
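Putting that comparison in one place (rough fp32 numbers matching the claims above; plain SGD, i.e. without momentum):

    FP32 = 4
    optimizer_state_bytes_per_param = {
        "SGD (no momentum)": 0,
        "Adagrad / RMSProp": 1 * FP32,   # one accumulator per parameter
        "Adam / AdamW":      2 * FP32,   # first + second moment per parameter
    }
    for name, b in optimizer_state_bytes_per_param.items():
        print(f"{name}: {b} bytes per parameter")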


It's not just per-parameter state; you also need to hold activations for backprop to work.


You need activations for inference as well.

But all of that (trainable parameters, activations, optimizer state) is like 12 bytes per trainable parameter, not 80.


Not the GP, but I believe that they are talking about the size of the training data set in relation to the model size.


You don't need to, and can't really, load all the training data.

For LLMs you only need to load a single context-sized row, i.e. a vector of ~8k numbers, which is 32 kB for single-precision floats.
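A quick check of that number (assuming an 8k context and fp32 values):

    context_len = 8 * 1024      # 8k entries
    fp32 = 4                    # bytes per single-precision float
    print(context_len * fp32)   # 32768 bytes = 32 kB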



