An important thing omitted in this post, which makes work-stealing less attractive, is that one core being idle can actually improve performance of other cores. Today's CPUs basically have a fixed energy budget, and if one core is idle that means more of that budget can go to other cores.
In other words, core utilization is less relevant today - what you care about is energy utilization (which is shared across cores).
Of course, there's a point at which this stops being relevant - if you have multiple sockets for example, this won't apply. But work stealing across multiple sockets is so expensive anyway that you would never want to do it. You might as well work-steal across machines at that point - something which is indeed useful sometimes, but usually niche.
If a CPU is being cooled enough to not throttle, it is much more time and energy efficient to use all the cores you can rather than have another core run at a slightly higher frequency.
Higher frequencies have diminishing returns and exponential heat loss.
You might as well work-steal across machines at that point
Shared memory is extremely fast, it crushes using local loopback networking, let alone using actual networking.
You can practice energy-aware scheduling at higher levels, too. If you have to send an RPC and you can choose between multiple peers, choose the one with the coldest CPU temperature.
In other words, core utilization is less relevant today - what you care about is energy utilization (which is shared across cores).
Of course, there's a point at which this stops being relevant - if you have multiple sockets for example, this won't apply. But work stealing across multiple sockets is so expensive anyway that you would never want to do it. You might as well work-steal across machines at that point - something which is indeed useful sometimes, but usually niche.