We used to have a horribly written node process that was running in a Mesos clus...

bombolo · on Nov 22, 2022

USA at some point had an anti-missile system that needed periodical reboots because it was originally designed for short deployments, so the floating point variable for the clock would start to lose precision after a while.

mlindner · on Nov 22, 2022

Fixed point, not floating point. https://en.wikipedia.org/wiki/Fixed-point_arithmetic

gcr · on Nov 22, 2022

Floating point clocks do lose precision after long enough time though; see https://randomascii.wordpress.com/2012/02/13/dont-store-that...

Storing floating point coordinates for example is what causes the "farlands" world generation behavior in Minecraft, for example.

saagarjha · on Nov 22, 2022

Which, of course, led to people dying when the drift was too great.

perihelions · on Nov 22, 2022

Context for those unfamiliar:

https://www-users.cse.umn.edu/~arnold/disasters/patriot.html

https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dha...

https://hn.algolia.com/?query=patriot%20missile (HN threads)

aftbit · on Nov 22, 2022

I once managed a cluster of worker servers with an @reboot cronjob that scheduled another reboot after $(random /8..24/) hours. They took jobs from a rabbitmq queue and launched docker containers to run them, but had some kind of odd resource leak that would lead to the machines becoming unresponsive after a few days. The whole thing was cursed honestly but that random reboot script got us through for a few more years until it could be replaced with a more modern design.

throwawaaarrgh · on Nov 22, 2022

This is a feature in many HTTPDs, WSGI/FastCGI apps, possibly even in K8s. After X requests/time, restart worker process. Old tricks are the best tricks ;)

AndyMcConachie · on Nov 22, 2022

"Have you tried turning it off and on again?"

aeyes · on Nov 22, 2022

You don't even need that, the kernel OOM killer would take care of this eventually. Unless its something like Java where the garbage collector would begin to burn CPU.

j3th9n · on Nov 22, 2022

The OOM killer doesn't restart (randomly, unless configured) killed processes, it just kills.

ce4 · on Nov 22, 2022

Unless the OOM-killer kills the wrong process. Ages ago we had a userspace filesystem (gpfs) that was of course one of the oldest processes around and it consumed lots of RAM. When the oom killer started looking for a target, of course one of the mmfsd processes was selected and it resulted in instantaneous machine lockup (any access to that filesystem would be blocked forever in the system call which depended on the userspace daemon to return, alas never returning). Was funny to debug

j3th9n · on Nov 22, 2022

You can prevent a process from being killed by OOM killer: https://backdrift.org/oom-killer-how-to-create-oom-exclusion...

pkolaczk · on Nov 22, 2022

If it's deployed in K8s, it would be restarted automatically after dying.

TylerE · on Nov 22, 2022

Then you have two problems.