I know many people who have evaluated ARM for a variety of server applications, and ARM's advantages, in terms of tradeoffs, are not what many people assume. Where ARM shows well is in applications that spend a significant amount of time idle in power-restricted environments (battery-powered analytics clusters being a common example) and that run software weakly optimized for efficiency (some web app stacks look like this). In practice, this is a niche set of requirements for servers.
The challenge ARM has long had in servers is that highly optimized, efficient code on Intel has competitive performance per watt and performance per dollar. Cavium is quite open that their cores are optimized for low-ILP/IPC software, which is essentially arbitraging software that is poorly designed for x86 silicon. As data and analysis scales have increased, more standard server software is specifically designed to be hyper-efficient on Intel, not just HPC codes as in the old days. This makes it increasingly difficult for ARM to compete with the operation throughput per watt/dollar that is possible with Intel and well-optimized software.
There is the additional issue with platforms like ThunderX2 that, even though it could be competitive in theory, it would require software to be specifically engineered around the peculiarities of the microarchitecture. Software that assumes Intel microarchitectures and optimizes toward them, either explicitly or implicitly, will inherently be suboptimal on ThunderX2. This isn't something that can be fixed in a compiler; it is intrinsic to the architecture of the software at a higher level, e.g. C++ codes embed many assumptions about CPU cache topology and properties.
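A minimal sketch of what I mean (the 64-byte constant is the usual x86 assumption, not taken from any particular codebase):

    // Per-thread counters padded to a hard-coded 64-byte cache line to
    // avoid false sharing. On a part with a different line size or
    // sharing behaviour the constant is simply wrong, and no compiler
    // flag will re-derive it: it's baked into the data layout.
    #include <cstddef>

    constexpr std::size_t kAssumedCacheLine = 64;  // x86 assumption

    struct alignas(kAssumedCacheLine) PerThreadCounter {
        long value = 0;
        char pad[kAssumedCacheLine - sizeof(long)];
    };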
I worked in HPC for over a decade. Most of the codes are not hyper-efficient on Intel, particularly not on current-generation Intel parts, which need efficient use of very wide vectors to reach peak performance.
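To put rough numbers on that (illustrative assumptions: a Skylake-SP-style core with two AVX-512 FMA ports and a 2.1 GHz clock):

    #include <cstdio>

    int main() {
        const double ghz = 2.1;                 // assumed clock
        const double vec = ghz * 2 * 8 * 2;     // 2 FMA ports x 8 doubles x 2 flops
        const double scalar = ghz * 2 * 1 * 2;  // same ports, one lane each
        std::printf("vector %.1f vs scalar %.1f GFLOP/s (%.0fx gap)\n",
                    vec, scalar, vec / scalar);
    }

Code that doesn't vectorize is capped at roughly an eighth of peak before any other inefficiency enters the picture.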
Many codes are memory-bound, and the TX2 has excellent bandwidth. This shows up in real-world simulation codes such as OpenFOAM.
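The canonical shape of such code is a STREAM-style triad: almost no arithmetic per byte moved, so runtime is set by memory bandwidth rather than FLOPS. A minimal sketch, with N chosen only so the arrays overflow the caches:

    #include <vector>

    constexpr long N = 1L << 24;  // 16M doubles, 128 MB per array

    // ~2 flops per 24 bytes of traffic: bandwidth-bound on any modern core.
    void triad(double* a, const double* b, const double* c, double s) {
        for (long i = 0; i < N; ++i)
            a[i] = b[i] + s * c[i];
    }

    int main() {
        std::vector<double> a(N), b(N, 1.0), c(N, 2.0);
        triad(a.data(), b.data(), c.data(), 3.0);
    }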
Indeed -- memory- or communication-bound. Somewhere there's a Dell "roofline" plot showing the IPC of various codes, though run in an unspecified way. See also John McCalpin's writings.
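The roofline model itself is just one min(): attainable throughput is capped by either peak compute or bandwidth times arithmetic intensity. A sketch with made-up machine numbers:

    #include <algorithm>
    #include <cstdio>

    double roofline(double peak_gflops, double bw_gbs, double flops_per_byte) {
        return std::min(peak_gflops, bw_gbs * flops_per_byte);
    }

    int main() {
        // Triad-like kernel: 2 flops / 24 bytes ~= 0.083 flop/byte, so a
        // hypothetical 1 TFLOP/s, 120 GB/s machine delivers ~10 GFLOP/s.
        std::printf("%.1f GFLOP/s attainable\n",
                    roofline(1000.0, 120.0, 2.0 / 24.0));
    }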
Things doing 3-D FFTs (common in materials science) tend to spend a lot of time in an MPI collective at sufficient scale. I spent half an hour profiling and then changing an MPI parameter for a 30% improvement in one case, at not very large scale.
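For the curious: the collective in question is typically the all-to-all implementing the transpose between 1-D FFT passes. A minimal sketch (counts are placeholders, and which tuning parameter to change depends on the MPI implementation):

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int nranks = 0;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int chunk = 1 << 16;  // doubles per peer rank (placeholder)
        std::vector<double> send(std::size_t(nranks) * chunk, 1.0);
        std::vector<double> recv(send.size());

        // The transpose step of a distributed 3-D FFT boils down to this;
        // at scale it dominates the profile, and most MPI implementations
        // expose knobs to switch the underlying algorithm.
        MPI_Alltoall(send.data(), chunk, MPI_DOUBLE,
                     recv.data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD);

        MPI_Finalize();
    }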
But, as usual, you don't know what the profile of the calculations was, in particular because you don't know the mode it's operating in or the data it's operating on. ("OpenFOAM" is actually many different programs.)
That said, the indications are that the performance is decent for HPC. What I haven't seen is a comparison with Ryzen (or POWER9). There's also a lack of data even on the SIMD hardware in ThunderX2 and POWER9, at least that I've been able to find.
For cache tuning, there is polyhedral optimization, available in LLVM via Polly and in GCC via Graphite. Of course they don't fix everything, but they are great at automating loop-nesting decisions with the goal of matching data locality to cache size(s) and, potentially, even SSD/HDD storage (think compiling the software so that it makes good use of the 1 GB of RAM per core you can give it, and isn't awfully inefficient with its swapping).
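A sketch of the kind of loop nest these passes retile. The naive form is deliberate: with an LLVM built with Polly you'd pass something like -O3 -mllvm -polly to clang, and with Graphite, -O3 -floop-nest-optimize to gcc, and let the optimizer derive the tiling:

    constexpr int N = 1024;

    // Written naively on purpose: the polyhedral pass derives a tiling
    // that keeps blocks of A, B, and C resident in cache across the
    // k-loop, rather than streaming B column-wise from DRAM every time.
    void matmul(const double (&A)[N][N], const double (&B)[N][N],
                double (&C)[N][N]) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                for (int k = 0; k < N; ++k)
                    C[i][j] += A[i][k] * B[k][j];
    }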
My tests on a recent git snapshot from master (about six weeks old; I'm too lazy to rebuild) worked reasonably well, but I didn't quite reach the full testing stage. It was still doing something (I was trying to benchmark CockroachDB on Goldmont, the server Atom, and -march=native support for it is just a couple of weeks old on the mailing list, and a month younger in git master) and I didn't have time to check back. But it appeared to work when I activated the command-line flags.