I know many people who have evaluated ARM for a variety of server applications, and ARM's advantages, in terms of tradeoffs, are not what many people assume. Where ARM shows well is in applications that spend a significant amount of time idle in power-restricted environments (battery-powered analytics clusters being a common example) and that run software weakly optimized for efficiency (some web app stacks look like this). In practice, this is a niche set of requirements for servers.
The challenge ARM has long had in servers is that highly optimized, efficient code on Intel has competitive performance per watt and performance per dollar. Cavium is quite open that their cores are optimized for low-ILP/IPC software, which is essentially arbitraging software that is poorly designed for x86 silicon. As data and analysis scales have increased, more standard server software is specifically designed to be hyper-efficient on Intel, not just HPC codes as in the old days. This makes it increasingly difficult for ARM to compete with the operation throughput per watt/dollar that is possible with Intel and well-optimized software.
There is the additional issue with platforms like ThunderX2 that, even though it could be competitive in theory, it would require software to be specifically engineered around the peculiarities of the microarchitecture. Software that assumes Intel microarchitectures and optimizes toward them, either explicitly or implicitly, will inherently be suboptimal on ThunderX2. This isn't something that can be fixed in a compiler; it is intrinsic to the architecture of the software at a higher level, e.g. C++ codes embed many assumptions about CPU cache topology and properties.
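A minimal sketch of what I mean (the 64-byte constant is the usual x86 assumption, not taken from any particular codebase):

    // Per-thread counters padded to a hard-coded 64-byte cache line to
    // avoid false sharing. On a part with a different line size or
    // sharing behaviour the constant is simply wrong, and no compiler
    // flag will re-derive it: it's baked into the data layout.
    #include <cstddef>

    constexpr std::size_t kAssumedCacheLine = 64;  // x86 assumption

    struct alignas(kAssumedCacheLine) PerThreadCounter {
        long value = 0;
        char pad[kAssumedCacheLine - sizeof(long)];
    };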
I worked in HPC for over a decade. Most of the codes are not hyper-efficient on Intel, particularly not on current-generation Intel parts, which need efficient use of very wide vectors to reach peak performance.
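To put rough numbers on that (illustrative assumptions: a Skylake-SP-style core with two AVX-512 FMA ports and a 2.1 GHz clock):

    #include <cstdio>

    int main() {
        const double ghz = 2.1;                 // assumed clock
        const double vec = ghz * 2 * 8 * 2;     // 2 FMA ports x 8 doubles x 2 flops
        const double scalar = ghz * 2 * 1 * 2;  // same ports, one lane each
        std::printf("vector %.1f vs scalar %.1f GFLOP/s (%.0fx gap)\n",
                    vec, scalar, vec / scalar);
    }

Code that doesn't vectorize is capped at roughly an eighth of peak before any other inefficiency enters the picture.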
Many codes are memory-bound, and the TX2 has excellent bandwidth. This shows up in real-world simulation codes such as OpenFOAM.
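The canonical shape of such code is a STREAM-style triad: almost no arithmetic per byte moved, so runtime is set by memory bandwidth rather than FLOPS. A minimal sketch, with N chosen only so the arrays overflow the caches:

    #include <vector>

    constexpr long N = 1L << 24;  // 16M doubles, 128 MB per array

    // ~2 flops per 24 bytes of traffic: bandwidth-bound on any modern core.
    void triad(double* a, const double* b, const double* c, double s) {
        for (long i = 0; i < N; ++i)
            a[i] = b[i] + s * c[i];
    }

    int main() {
        std::vector<double> a(N), b(N, 1.0), c(N, 2.0);
        triad(a.data(), b.data(), c.data(), 3.0);
    }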
Indeed -- memory- or communication-bound. Somewhere there's a Dell "roofline" plot showing the IPC of various codes, though run in an unspecified way. See also John McCalpin's writings.
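The roofline model itself is just one min(): attainable throughput is capped by either peak compute or bandwidth times arithmetic intensity. A sketch with made-up machine numbers:

    #include <algorithm>
    #include <cstdio>

    double roofline(double peak_gflops, double bw_gbs, double flops_per_byte) {
        return std::min(peak_gflops, bw_gbs * flops_per_byte);
    }

    int main() {
        // Triad-like kernel: 2 flops / 24 bytes ~= 0.083 flop/byte, so a
        // hypothetical 1 TFLOP/s, 120 GB/s machine delivers ~10 GFLOP/s.
        std::printf("%.1f GFLOP/s attainable\n",
                    roofline(1000.0, 120.0, 2.0 / 24.0));
    }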
Things doing 3-D FFTs (common in materials science) tend to spend a lot of time in an MPI collective at sufficient scale. I spent half an hour profiling and then changing an MPI parameter for a 30% improvement in one case, at not very large scale.
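For the curious: the collective in question is typically the all-to-all implementing the transpose between 1-D FFT passes. A minimal sketch (counts are placeholders, and which tuning parameter to change depends on the MPI implementation):

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int nranks = 0;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int chunk = 1 << 16;  // doubles per peer rank (placeholder)
        std::vector<double> send(std::size_t(nranks) * chunk, 1.0);
        std::vector<double> recv(send.size());

        // The transpose step of a distributed 3-D FFT boils down to this;
        // at scale it dominates the profile, and most MPI implementations
        // expose knobs to switch the underlying algorithm.
        MPI_Alltoall(send.data(), chunk, MPI_DOUBLE,
                     recv.data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD);

        MPI_Finalize();
    }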
But, as usual, you don't know what the profile of the calculations was, in particular because you don't know the mode it's operating in or the data it's operating on. ("OpenFOAM" is actually many different programs.)
That said, the indications are that the performance is decent for HPC. What I haven't seen is a comparison with Ryzen (or POWER9). There's also a lack of data even on the SIMD hardware in ThunderX2 and POWER9, at least that I've been able to find.
For cache tuning, there is polyhedral optimization, available in LLVM via Polly and in GCC via Graphite. Of course they don't fix everything, but they are great at automating loop-nesting decisions with the goal of matching data locality to cache size(s) and, potentially, even SSD/HDD storage (think compiling the software so that it makes good use of the 1 GB of RAM per core you can give it, and isn't awfully inefficient with its swapping).
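A sketch of the kind of loop nest these passes retile. The naive form is deliberate: with an LLVM built with Polly you'd pass something like -O3 -mllvm -polly to clang, and with Graphite, -O3 -floop-nest-optimize to gcc, and let the optimizer derive the tiling:

    constexpr int N = 1024;

    // Written naively on purpose: the polyhedral pass derives a tiling
    // that keeps blocks of A, B, and C resident in cache across the
    // k-loop, rather than streaming B column-wise from DRAM every time.
    void matmul(const double (&A)[N][N], const double (&B)[N][N],
                double (&C)[N][N]) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                for (int k = 0; k < N; ++k)
                    C[i][j] += A[i][k] * B[k][j];
    }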
My tests on a recent git snapshot from master (about six weeks old; I'm too lazy to rebuild) worked reasonably well, but I didn't quite reach the full testing stage. It was still doing something (I was trying to benchmark CockroachDB on Goldmont, the server Atom, and -march=native support for it is just a couple of weeks old on the mailing list, and a month younger in git master) and I didn't have time to check back. But it appeared to work when I activated the command-line flags.