"Leela Chess Zero is losing with quite a gap"
Check https://tcec-chess.com/; at the moment Leela is in front again.
That being said, it is not clear which approach is better: a very smart but relatively slow evaluation function (Leela) or SF's approach of using a relatively dumb eval function with a more sophisticated search. It is pretty clear that Leela's search can be improved (check https://github.com/dje-dev/Ceres for example).
On the other hand, assuming you can fully saturate multiple GPUs, the SF approach has the disadvantage that it cannot benefit from the performance improvements of GPUs/TPUs, which are still growing fast, whereas CPU performance only grows slowly these days.
The Stockfish mentality is: not everybody owns a GPU, and Stockfish should be available to everybody and perform well [1]. So they went for a neural net optimized for CPU micro-architectures, which is great. Maybe this is ultimately not the best for tournaments.
What I would find interesting is if they could give engines an energy budget instead of a time limit. Maybe that would make CPU vs GPU games more interesting and fair.
IMHO they did this because it was the only incremental way to improve strength. They have always relied on a fast search with a relatively simplistic eval function, and for this approach alpha-beta search works well. What is not clear is whether it works well for the Leela approach; at least, all attempts so far have not been successful (which could be for other reasons, such as the training approach). SF's strength seems to be its well-optimized search function, and rewriting that would IMHO be equivalent to creating a new chess engine. That being said, I still think the success of SF NNUE is very remarkable.
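To make the "fast search plus simplistic eval" recipe concrete, here is a minimal sketch of negamax with alpha-beta pruning over an abstract game tree. Everything here is invented for illustration (the tree encoding, the leaf scores); a real engine like Stockfish layers move ordering, transposition tables, and many pruning heuristics on top of this skeleton:

```python
# Leaves are ints: the static evaluation from the perspective of the side
# to move at that node. Internal nodes are lists of child subtrees.
def alphabeta(node, alpha=float("-inf"), beta=float("inf")):
    if isinstance(node, int):
        return node                        # cheap static evaluation at the leaf
    best = float("-inf")
    for child in node:
        # Negamax: the child's score from our view is the negation of its
        # score from the opponent's view, with the window flipped.
        score = -alphabeta(child, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:                  # cutoff: opponent won't allow this line
            break
    return best

# Two moves at the root, each leading to two replies:
print(alphabeta([[3, 17], [2, 12]]))       # prints 3
```

The point of the recipe: because each leaf evaluation is so cheap, the engine can afford enormous depth, and alpha-beta's cutoffs prune most of the tree when move ordering is good.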
Interesting point. Nvidia have been improving the int performance for quantized inference on their GPUs a lot. It might be a lot of work but could it be possible to scale up this NNUE approach to the point where it would be worthwhile to run on a GPU?
For single-position "batches" (which seems to be what's used now?) it might never be worthwhile, but if multiple positions could be searched in parallel and the NN evaluation batched, this might start to look tempting.
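As a sketch of what batching leaf evaluations might look like (the `BatchedEvaluator` and `fake_net` names are hypothetical, not from any real engine): the search queues positions instead of evaluating them one by one, and a single call evaluates the whole batch, which is what lets a GPU amortize kernel-launch and transfer overhead.

```python
def fake_net(batch):
    """Stand-in for a real NN: one scalar score per position (here: feature sum)."""
    return [sum(features) for features in batch]

class BatchedEvaluator:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []          # (position id, features) waiting for evaluation
        self.results = {}          # position id -> score

    def request(self, pos_id, features):
        self.pending.append((pos_id, features))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        ids = [i for i, _ in self.pending]
        scores = fake_net([f for _, f in self.pending])   # one call, many positions
        self.results.update(zip(ids, scores))
        self.pending.clear()

# A search would queue leaves as it reaches them, then read scores after a flush:
ev = BatchedEvaluator(batch_size=4)
for i, feats in enumerate([[1, 2], [3, 4], [5, 6], [7, 8]]):
    ev.request(i, feats)
print(ev.results)   # {0: 3, 1: 7, 2: 11, 3: 15}
```

The hard part, as noted elsewhere in the thread, is that an alpha-beta search wants each leaf's score immediately before deciding where to go next, so filling such a batch requires restructuring the search.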
I'm not sure what the effect of running PVS with multiple parallel search threads is. Presumably the penalty of searching with less information means you hit the scaling ceiling a lot quicker than MCTS-like searches, as the pruning is much more sensitive to having up-to-date information about the principal variation.
Disclaimer: My understanding of PVS is very limited.
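For what it's worth, a bare-bones sketch of the PVS idea (toy tree encoding invented for illustration, not engine code): the first move at each node is searched with the full (alpha, beta) window, and every remaining move with a cheap null window around alpha, re-searching fully only on a fail-high. The null-window searches only pay off when alpha is already a good bound from the PV, which is exactly why stale PV information across parallel threads hurts:

```python
# Leaves are ints (score for the side to move); internal nodes are child lists.
def pvs(node, alpha=float("-inf"), beta=float("inf")):
    if isinstance(node, int):
        return node
    first = True
    for child in node:
        if first:
            score = -pvs(child, -beta, -alpha)        # full window for the PV move
            first = False
        else:
            # Null-window "scout" search: only tries to prove this move is
            # no better than alpha.
            score = -pvs(child, -alpha - 1, -alpha)
            if alpha < score < beta:                  # fail-high: re-search fully
                score = -pvs(child, -beta, -alpha)
        alpha = max(alpha, score)
        if alpha >= beta:                             # beta cutoff
            break
    return alpha

print(pvs([[3, 17], [2, 12]]))   # prints 3, same result as plain alpha-beta
```

If the first move really is best (good move ordering), almost every other move is refuted by a cheap null-window search; if alpha is stale or wrong, the re-searches eat the savings.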
Sure, if someone can come up with an approach to run an NNUE (efficiently updatable) network on GPUs, that might really be another breakthrough. But at first glance it looks to me like this could be very difficult, because AFAIK the SF search is relatively complicated. Even for Leela, implementing an efficient batched search on multiple GPUs seems to be difficult (some improvements are coming with Ceres), and Leela is using a much simpler MCTS search. That doesn't mean that Leela's search could not be improved: it does not necessarily give higher priority to forced sequences of moves (at least not explicitly) or to high-risk moves, which is IMHO why she sometimes does not see relatively simple tactics.
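For readers unfamiliar with the "efficiently updatable" part: the first layer of an NNUE net takes a huge sparse binary input, and a move only flips a few input features, so the layer's output (the "accumulator") can be updated by adding/subtracting a few weight rows instead of recomputing the full matrix-vector product. A toy sketch of that idea, with invented sizes and names (not Stockfish's actual representation):

```python
import random

random.seed(0)                      # deterministic toy weights
N_FEATURES, HIDDEN = 64, 8          # real NNUE nets are vastly larger
W = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(N_FEATURES)]

def full_accumulator(active):
    """Recompute from scratch: sum the weight rows of all active features."""
    acc = [0.0] * HIDDEN
    for f in active:
        for j in range(HIDDEN):
            acc[j] += W[f][j]
    return acc

def update_accumulator(acc, removed, added):
    """Incremental update after a move: cost scales with changed features only."""
    acc = acc[:]
    for f in removed:
        for j in range(HIDDEN):
            acc[j] -= W[f][j]
    for f in added:
        for j in range(HIDDEN):
            acc[j] += W[f][j]
    return acc

before = {1, 5, 10, 33}
acc = full_accumulator(before)
after = (before - {5}) | {12}                         # a "move": feature 5 off, 12 on
fast = update_accumulator(acc, removed=[5], added=[12])
slow = full_accumulator(after)
assert all(abs(a - b) < 1e-9 for a, b in zip(fast, slow))
```

This incrementality is exactly what is CPU-friendly and awkward on a GPU: each position's update is tiny and sequential along the search path, the opposite of the large uniform batches GPUs want.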
Generally the GPU vs CPU gap on the evaluation side isn't nearly as big as in training. In theory the gap should remain large, but in practice things like the inability to have data ready for batch evaluation mean it is harder to saturate the GPU (which, as you note, is important).
But I'm not sure how much of this applies to chess engines. I see some comments noting that the search part makes it hard to generate batches, but I'm not an expert in this.