It’s a very nice and detailed benchmark suite! Great effort!
Can you please share the CPU model you are running on?
I suspect it’s an x86 CPU without AVX-512 support.
Makes sense! I mostly focus on newer AVX-512 variants as opposed to older AVX2-only CPUs. As for aarch64, it is supported with NEON, SVE, and SVE2 kernels for some tasks. The last two are rarely useful unless you run on AWS Graviton 3 (previous gen) or on one of the supercomputers with custom chips like the Fujitsu A64FX.
> newer AVX-512 variants as opposed to older AVX2-only CPUs
This is exactly my issue with targeting AVX-512. It isn't just absent on "older AVX2-only CPUs." It's also absent on many "newer AVX2-only CPUs." For example, the i9-14900K. I don't think any of the other newer Intel CPUs have AVX-512 either. And historically, whether an x86-64 CPU supported AVX-512 at all was hit or miss.
AVX-512 has been around for a very long time now, and it has just never been consistently available.
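To make the "hit or miss" point concrete, here is a minimal sketch of how code usually copes with it: check the CPU at runtime and fall back to a narrower kernel. The names `simd_tier`, `pick_tier`, and `detect_tier` are hypothetical and only for illustration, and the detection relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, so other compilers and non-x86 targets simply take the serial path here.

```cpp
// Hypothetical tiers for illustration; a real library would map these to kernels.
enum class simd_tier { serial, avx2, avx512 };

// Pure selection logic, kept apart from detection so it can be tested anywhere.
constexpr simd_tier pick_tier(bool has_avx2, bool has_avx512f) {
    if (has_avx512f) return simd_tier::avx512; // widest registers first
    if (has_avx2) return simd_tier::avx2;
    return simd_tier::serial;
}

// Asks the running CPU what it supports. Assumes GCC or Clang on x86;
// everything else falls through to the serial tier in this sketch.
simd_tier detect_tier() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return pick_tier(__builtin_cpu_supports("avx2") != 0,
                     __builtin_cpu_supports("avx512f") != 0);
#else
    return simd_tier::serial;
#endif
}
```

On something like the i9-14900K mentioned above, I would expect `detect_tier()` to land on the AVX2 tier, which is why the fallback has to exist at all.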
It’s mainly available in data centers, but yes, it’s missing in consumer parts. And for a while, even in data centers, you wanted to be careful about using it due to Intel’s issues with clock downscaling, but that hasn’t been true for a few years.
The consumer situation is changing. A few years ago, when I was working with a team on some closed-source HPC stuff, we got everyone Tiger Lake-based laptops to simplify AVX-512 R&D. Now, Zen4-based desktop CPUs also support it.
But it’s fair to say that I’m mostly focusing on the datacenter/supercomputing hardware, both on the x86 and Arm side.
If you’re targeting AVX-512 on Intel consumer parts, it’s pointless. But yes, AMD does continue to ship AVX-512 chips, so completely ignoring AVX-512 on consumer isn’t ideal.
Yes, at the scale of 128-bit registers NEON is mostly enough, except for a few categories of instructions missing in that ISA subset, like scatter/gather ops, which can yield a 30% boost over serial memory accesses: https://github.com/ashvardanian/less_slow.cpp/releases/tag/v...
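For anyone unfamiliar with the term, a gather is just `out[i] = table[indices[i]]`. Below is a rough sketch, not the code from the linked release: the function name `gather_f32` is made up for illustration, and the SVE branch assumes the ACLE intrinsics from `<arm_sve.h>` and a build with SVE enabled. Without SVE it compiles to the plain serial loop, which is the kind of serial memory access that NEON leaves you with.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Gather: out[i] = table[indices[i]] for i in [0, n).
// The caller must guarantee that every index is within the table.
void gather_f32(float const *table, std::uint32_t const *indices, float *out, std::size_t n) {
#if defined(__ARM_FEATURE_SVE)
    // One gather instruction loads a full vector of indexed elements;
    // the predicate masks off lanes past the end, so no scalar tail is needed.
    for (std::size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(static_cast<std::uint64_t>(i), static_cast<std::uint64_t>(n));
        svuint32_t idx = svld1_u32(pg, indices + i);
        svfloat32_t vals = svld1_gather_u32index_f32(pg, table, idx);
        svst1_f32(pg, out + i, vals);
    }
#else
    // Serial fallback: one scalar load per element.
    for (std::size_t i = 0; i < n; ++i) out[i] = table[indices[i]];
#endif
}
```

Whether the vector path actually wins depends on the chip and the access pattern, so treat the 30% figure as specific to that benchmark.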