My experience from a job with ~2000 servers, all with ECC RAM, is that errors cluster a lot. Easily 90%+ of the systems never had a single ECC error (correctable or not) in their whole life. Some systems would get one error a day. A small number got hundreds per hour until we shut them down. Even fewer got thousands per hour. One in my whole career got something like 50,000 in the time between when it started running very slowly (from processing ECC interrupts) and the next hourly reporting interval. You might get close to the literature numbers by taking that one instance and dividing it over the total RAM byte-hours. That was probably hardware failure and not cosmic rays, though, but then you don't get a different ECC code for cosmic rays.
Also, there was an expectation that cosmic-ray-induced bit errors would grow as RAM circuit features shrank, but it ended up not happening; reasons unknown, or at least I never saw anything suggesting a reason.
Getting errors in a small sample of RAM is unlikely unless you specifically induce them by using debug features or misconfiguring your system. (Some systems conspire to misconfigure themselves, though, making it easier to observe! A couple of years ago, retail motherboards really liked setting the RAM voltage too low.)
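For anyone who wants to try that division, here is a rough sketch. Only the 2000 servers and the 50,000 errors come from the numbers above; the 32 GiB per server and the three years of uptime are made-up placeholders, and `fit_per_mbit` is just a name picked for illustration. It reports the rate in errors per billion hours per Mbit, which is the unit a lot of the literature uses.

```python
def fit_per_mbit(total_errors, servers, gib_per_server, hours):
    """Fleet-wide average error rate in errors per 10^9 hours per Mbit."""
    mbit_per_server = gib_per_server * 1024 * 8  # GiB -> MiB -> Mbit
    mbit_hours = servers * mbit_per_server * hours
    return total_errors / mbit_hours * 1e9


# 2000 servers and 50,000 errors are from the comment above.
# 32 GiB per server and 3 years of uptime are ASSUMED placeholders.
rate = fit_per_mbit(
    total_errors=50_000,
    servers=2000,
    gib_per_server=32,
    hours=3 * 365 * 24,
)
print(f"{rate:.2f} errors per 10^9 Mbit-hours")
```

Plug in your own fleet size, memory per box, and uptime; the result swings by orders of magnitude depending on those, which is sort of the point about one bad machine dominating the average.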
Were some of your servers in lead-lined coffins while others weren’t?
Jokes aside, it seems really unlikely that cosmic rays would be clustered past, like, a couple hours, right? It isn’t like some neutron star is, like, tracking your server as the world spins (well, I hope not, I mean who’d you piss off for that to happen?).
Anyway, this fits my totally unscientific expectation that "cosmic rays" is just sort of an informal description of hardware bugs that nobody can reasonably find beforehand. A server that seems to be hit by lots of cosmic rays probably has a dodgy connection somewhere inside it. Maybe it’s the RAM, so swap that out or replace it… but maybe it’s inside the SoC, so what are we going to do, bust out the electron microscope to check all those connections?
Yes, these kinds of error rates almost certainly indicate either hardware faults of some kind, unshielded electrical noise, or possibly poor cooling pushing temps past the stable operating range.
You'd have to ask our host; these were all rental dedicated servers, none of us ever got to see them at all. Pictures of their racks never included coffins though. I don't think we pissed anybody off enough for them to get out the cosmic ray gun, but I'm also thinking the people we did piss off didn't have cosmic ray guns anyway. ;)
In terms of diagnostics, we pretty much just asked for RAM replacement; if that didn't work, CPU replacement; if that didn't work, motherboard replacement. If that didn't work, the chassis / rack position is clearly cursed, don't give us anything there again, please. :D I don't know what they did with the hardware we didn't like: maybe send it to the manufacturer, maybe give it to customers they don't like, maybe surplus it.
Joking aside, that's incredibly fascinating; I never thought ECC memory could have that much of a performance impact. Might be more optimal to just get a large jerry can of water and put that over the server as radiation shielding lol.
ECC interrupts were never a problem at reasonable counts. It's just this machine where the memory was falling apart where it was a problem. Our system was robust to a machine halting, but not so great at dealing with a machine running very very slow. Plus, it wasn't easy to connect and shut down the service (maybe we should have just killed it from IPMI, but this was the only time it happened, so learning experience).
I had a similar issue one time where a Pentium III era server rebooted and came up with a comically small amount of memory, maybe 2-4 MB instead of 128 MB. That wasn't too bad, because it was a very lightly used service; important to be on its own machine for reasons, but didn't need much. Just ran a little slow when it was running from swap. I think it did trigger a swap usage alert, and then it was like: why is it swapping, why is it so slow, wait, why doesn't it have any memory!?
I think this study https://www.cs.toronto.edu/~bianca/papers/ASPLOS2012.pdf (full disclosure: authors were my labmates) supports your observations, and supports the notion that cosmic rays are not the leading cause of random bit-flips in RAM.
This particular server wasn't mine as a product manager. But we had what passed for a distributed server in the mid-80s, and our biggest customer was a retail insurance/finance company of some sort. No ECC.
So, at scale, they were getting failures constantly. (It didn't help that the QIC tape backup was basically write-only.)