Way back in the day, I built and ran the platform for a business on Pentium grade web & database servers which gave me 1 "core" in 2 rack units.
That's 24 cores per 48 unit rack, so 288 cores would be a dozen racks or pretty much an entire aisle of a typical data center.
I guess all of Palo Alto Internet eXchange (where two of my boxen lived) didn't have much more than a couple of thousand cores back in 98/99. I'm guessing there are homelabs with more cores than that entire PAIX data center had back then.
Oh yeah, it is not that many cores for the cluster-universe. Just neat to see the number of cores per socket increase.
A while ago I had access to an 8-socket shared memory machine… but this was the semi-olden days, so it was “only” 80 cores. It was a fun machine at the time! We’re so spoiled these days, haha.
On Zen4 and Zen4c the register is 512 bits wide. However, internally, many “datapaths” (execution units, floating-point units, vector ALUs, etc.) are 256 bits wide for much of the AVX-512 functional units…
Zen5 is supposed to be different, and again, I wrote the kernels for Zen5 last year, but still have no hardware to profile the impact of this implementation difference on practical systems :(
This is an often repeated myth, which is only half true.
On Zen 4 and Zen 4c, for most vector instructions the vector datapaths have the same width as in Intel's best Xeons, i.e. they can do two 512-bit instructions per clock cycle.
The exceptions where AMD has half throughput are the vector load and store instructions from the first level cache memory and the FMUL and FMA instructions, where the most expensive Intel Xeons can do two FMUL/FMA per clock cycle while Zen 4/4c can do only 1 FMUL/FMA + 1 FADD per clock cycle.
So only the link between the L1 cache and the vector registers and also the floating-point multiplier have half-width on Zen 4/4c, while the rest of the datapaths have the same width (2 x 512-bit) on both Zen 4/4c and Intel's Xeons.
The server and desktop variants of Zen 5/5c (and also the laptop Fire Range and Strix Halo CPUs) double the width of all vector datapaths, exceeding the throughput of all past or current Intel CPUs. Only the server CPUs expected to be launched in 2026 by Intel (Diamond Rapids) are likely to be faster than Zen 5, but by then AMD might also launch Zen 6, so it remains to be seen which will be better by the end of 2026.
A basic block simulator like llvm-mca is unlikely to give useful information here, as memory access is going to play a significant part in the overall performance.
It is pretty wide, but 288 cores with 8x FP32 lanes each is still only about a tenth of the lanes on an RTX 5090. GPUs are really, really, really wide.
AVX-512 is on the P-cores only (along with AMX now). The E-cores only support 256-bit vectors.
If you're doing a lot of loading and storing, these E-core chips are probably going to outperform the chips with huge cores because they will be idling a lot. For CPU-bound tasks, the P-cores will win hands down.
I mean, yeah, it's "a lot" because we've been starved for so long, but having run analytics aggregation workloads I now sometimes wonder if 1k or 10k cores with a lot of memory bandwidth could be useful for some ad-hoc queries, or just being able to serve an absurd number of website requests...
CPU on PCIe card seems like it matches with the Intel Xeon Phi... I've wondered if that could boost something like an Erlang mesh cluster...
Damn, when I first landed on the page I saw $7,600 and thought "for 320 cores that's pretty amazing!" but that's the default configuration with 32 cores & 64GB of memory.
320 cores starts at $28,000.. $34k with 1TB of memory..
Do these things have AVX512? It looks like some of the Sierra Forest chips do have AVX512 with 2xFMA…
That’s pretty wide. Wonder if they should put that thing on a card and sell it as a GPU (a totally original idea that has never been tried, sure…).