My experience from a ~ 2000 server job with all ECC RAM is that errors cluster a...

bee_rider · on April 16, 2024

Were some of your servers in lead-lined coffins while others weren’t?

Jokes aside, it seems really unlikely that cosmic rays would be clustered past, like, a couple hours, right? It isn’t like some neutron star is, like, tracking your server as the world spins (well, I hope not, I mean who’d you piss off for that to happen?).

Anyway, this fits my totally unscientific expectation that cosmic rays are just sort of like an informal description of hardware bugs that nobody can reasonably find beforehand. A server that seems to be hit by lots of cosmic rays probably has a dodgy connection somewhere inside it, but I mean maybe it’s the RAM, swap that out or replace it… but maybe inside the chip SOC, so what are we going to do, bust out the electron microscope to check all those connections?

naasking · on April 17, 2024

Yes, these kinds of error rates almost certainly indicate either hardware faults of some kind, unshielded electrical noise, or possibly poor cooling pushing temps past the stable operating range.

toast0 · on April 16, 2024

You'd have to ask our host; these were all rental dedicated servers, none of us ever got to see them at all. Pictures of their racks never included coffins though. I don't think we pissed anybody off enough for them to get out the cosmic ray gun, but I'm also thinking the people we did piss off didn't have cosmic ray guns anyway. ;)

In terms of diagnostics, we pretty much just asked for RAM replacement; if that didn't work, CPU replacement; if that didn't work, motherboard replacement. If that didn't work, the chassis / rack position is clearly cursed, don't give us anything there again, please. :D I don't know what they did with the hardware we didn't like, maybe send it to the manufacturer, maybe give it to customers they don't like, maybe surplus it.

moffkalast · on April 16, 2024

> 50,000

It's not 3.6 roentgen...

Joking aside, that's incredibly fascinating. I never thought that ECC memory could have that much of a performance impact. Might be more optimal to just get a large jerry can of water and put that over the server as radiation shielding lol.

toast0 · on April 16, 2024

ECC interrupts were never a problem at reasonable counts. It's just this machine where the memory was falling apart where it was a problem. Our system was robust to a machine halting, but not so great at dealing with a machine running very very slow. Plus, it wasn't easy to connect and shut down the service (maybe we should have just killed it from IPMI, but this was the only time it happened, so learning experience).

I had a similar issue one time where a Pentium III era server rebooted and came up with a comically small amount of memory, maybe 2-4 MB instead of 128 MB. That wasn't too bad, because it was a very lightly loaded service; important to be on its own machine for reasons, but didn't need much. Just ran a little slow when it was running from swap. I think it did trigger a swap usage alert, and then it was like why is it swapping, why is it so slow, wait why doesn't it have any memory!?

thedrexster · on April 16, 2024

> It's not 3.6 roentgen...

I'm told it's the equivalent of a chest x-ray...

I see you, brother! :D

dfryer · on April 17, 2024

I think this study https://www.cs.toronto.edu/~bianca/papers/ASPLOS2012.pdf (full disclosure: authors were my labmates) supports your observations, and supports the notion that cosmic rays are not the leading cause of random bit-flips in RAM.

ghaff · on April 16, 2024

This particular server wasn't mine as a product manager. But we had what passed as a distributed server in the mid-80s and our biggest customer was a retail insurance/finance company of some sort. No ECC.

So, at scale, they were getting failures constantly. (It didn't help that the QIC tape backup was basically write-only.)