Hacker News

1 CPU, DDR4 RAM


How many channels of DDR4 RAM? What speed is it running at? DDR4-3200?

The (approximate) equation for milliseconds per token is:

Time for token generation = (number of params active in the model) * (quantization size in bits) / 8 bits * [(fraction of active params in common weights)/(memory bandwidth of GPU) + (fraction of active params in experts)/(memory bandwidth of system RAM)]

This equation ignores prefill (prompt processing) time. It assumes the CPU and GPU are fast enough compute-wise to do the math, so that the bottleneck is memory bandwidth (this is usually true).

So for example: if you are running Kimi K2 (32b active params per token, 74% of those params in experts, 26% in common params/shared expert) at Q4 quantization (4 bits per param), with a 3090 GPU (935GB/sec memory bandwidth) and an AMD Epyc 9005 CPU with 12-channel DDR5-6400 (614GB/sec memory bandwidth), then:

Time for token generation = (32b params)*(4 bits/param)/8 bits*[(26%)/(935GB/sec) + (74%)/(614GB/sec)] = 23.73 ms/token, or ~42 tokens/sec. https://www.wolframalpha.com/input?i=1+sec+%2F+%2816GB+*+%5B...
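The formula above is easy to sanity-check in code. A minimal sketch (function and variable names are mine, not from the comment; the numbers are the Kimi K2 example):

```python
def ms_per_token(active_params_b, quant_bits, common_frac, expert_frac,
                 gpu_bw_gbs, ram_bw_gbs):
    """Approximate decode latency per token, assuming memory bandwidth
    (not compute) is the bottleneck and ignoring prefill time."""
    read_gb = active_params_b * quant_bits / 8  # GB read per token
    seconds = read_gb * (common_frac / gpu_bw_gbs + expert_frac / ram_bw_gbs)
    return seconds * 1000

# Kimi K2 at Q4: 32B active params, 26% common weights on a 3090 (935 GB/s),
# 74% expert weights in 12-channel DDR5-6400 system RAM (614 GB/s).
ms = ms_per_token(32, 4, 0.26, 0.74, 935, 614)  # ~23.7 ms, i.e. ~42 tokens/sec
```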

Notice how this equation explains why the second 3090 is pretty much useless. If you load the common weights onto the first 3090 (which is standard procedure), then the 2nd 3090 is just "fast memory" for some of the expert weights.

If the quantized model is 256GB (rough estimate; I don't know the model size off the top of my head) and the common weights are 11GB (true for Kimi K2; I don't know if it's true for Qwen3, but it's a decent rough estimate), then you have 245GB of "expert" weights. Yes, this is generally the correct ratio for MoE models, Deepseek R1 included. Put 24GB of that 245GB on your second 3090 and you get 935GB/sec speed on 24/245, or ~10%, of each token.

In my Kimi K2 example above, you start with 19.28ms per token spent reading expert weights from RAM, so even if the 24GB on your second GPU were infinitely fast, you would still spend about 17.4ms per token reading from RAM. In total that is about 21.8ms/token, or ~46 tokens/sec. Even with an infinitely fast 2nd GPU, you get a speedup of merely ~4 tokens/sec.


Inspired me to write this, since it seems like most people don't understand how fast models run:

https://unexcitedneurons.substack.com/p/how-to-calculate-hom...


Thank you for that writeup!

In my case it is a fairly old system I built from cheap eBay parts: a Threadripper 3970X with 8x32GB of dual-channel 2666MHz DDR4.
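That answers the channel/speed question, and peak DDR bandwidth follows directly: channels × transfer rate × 8 bytes per channel. A quick sketch (the dual-channel DDR4-2666 figure is from the comment above; the 12-channel DDR5-6400 case matches the 614GB/sec used earlier):

```python
def ddr_bandwidth_gbs(channels, mt_per_s):
    """Peak theoretical DDR bandwidth: each channel is 64 bits (8 bytes) wide."""
    return channels * mt_per_s * 8 / 1000  # GB/s

dual_ddr4_2666 = ddr_bandwidth_gbs(2, 2666)     # ~42.7 GB/s
twelve_ddr5_6400 = ddr_bandwidth_gbs(12, 6400)  # ~614 GB/s, the Epyc figure
```

At ~43GB/sec, reading the expert weights from RAM dominates per-token time by more than an order of magnitude compared to the Epyc example.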



