The general rule of thumb that I'm familiar with is that you need about 80 bytes of VRAM per parameter when you are doing training. Inference is different and a lot more efficient, and LoRA is also different and more efficient, but training a base model requires a LOT of memory.
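To make that rule of thumb concrete, here's a trivial back-of-the-envelope calculator using the ~80 bytes/parameter figure above (the function name and the figure itself are just the assumption from this comment, not a measured number):

```python
def training_vram_gb(n_params: float, bytes_per_param: int = 80) -> float:
    """Rough VRAM estimate (in GB) to train a model of n_params
    parameters, using the ~80 bytes/parameter rule of thumb."""
    return n_params * bytes_per_param / 1e9

# A 7B-parameter model under this rule:
print(f"{training_vram_gb(7e9):.0f} GB")  # 560 GB
```

Interestingly, 560 GB for a 7B model lines up with the anecdote below about a 640 GB DGX only managing the 7b llama without offloading.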
A machine like this would top out below 2 trillion parameters using the training algorithms that I'm familiar with.
I don't know what the breakdown is, but I know there was code for training the llama models on a DGX (640 GB of VRAM, repo is now gone), and it could only train the 7b model without resorting to DeepSpeed ZeRO stage 3 (offloading).
The ML engineers in my group chat say "1 DGX to train 7b at full speed, 2 for 13b, 4 for 30b, 8 for 65b"
For automatic differentiation (backpropagation) you need to store the intermediate results of each layer of the forward pass. With checkpointing you store only every nth layer's activations and recompute the rest as needed, reducing memory requirements in exchange for more compute.
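A toy accounting of that tradeoff (all numbers and the cost model are illustrative assumptions, not measurements; real frameworks do this per-segment):

```python
def activation_memory(n_layers: int, per_layer: int, checkpoint_every: int = 1) -> int:
    """Peak activation memory (in per-layer units) during the backward pass.

    checkpoint_every=1: no checkpointing, every layer's activations are kept.
    checkpoint_every=n: only every nth layer is stored; the segment between
    two checkpoints is recomputed on the fly, so it must also fit in memory.
    """
    if checkpoint_every <= 1:
        return n_layers * per_layer
    stored = -(-n_layers // checkpoint_every)  # ceil: checkpointed layers
    return (stored + checkpoint_every) * per_layer  # checkpoints + live recomputed segment

# 32 layers at 1 unit each: 32 units without checkpointing,
# but only 12 units when storing every 8th layer.
print(activation_memory(32, 1), activation_memory(32, 1, 8))
```

This is where the classic "checkpoint roughly every sqrt(n) layers" heuristic comes from: it balances the stored checkpoints against the recomputed segment.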
For backpropagation you take the diff between actual and expected output and go backwards through the network, calculating derivatives and applying them with the optimiser - that's 8 extra bytes per trainable parameter with single-precision floats.
You also need the optimizer's (e.g. Adam's) state, which is usually double the parameter's size. So with fp16, one parameter takes up 6 bytes in memory.
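Spelling out that accounting (this follows the comment's bookkeeping only: weight plus optimizer state at 2x the weight's size; gradients and activations are counted separately, and the 2x factor for Adam is the assumption stated above):

```python
BYTES = {"fp32": 4, "fp16": 2}

def bytes_per_param(dtype: str = "fp16", optimizer_state_factor: int = 2) -> int:
    """Per-parameter memory: the weight itself plus optimizer state
    (Adam keeps two moments, hence roughly 2x the weight's size)."""
    weight = BYTES[dtype]
    return weight + optimizer_state_factor * weight

print(bytes_per_param("fp16"))  # 6 bytes, as in the comment above
print(bytes_per_param("fp32"))  # 12 bytes
```

Note that common mixed-precision setups keep the optimizer state (and often a master copy of the weights) in fp32, which pushes the real per-parameter cost well above 6 bytes.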