The computer errors from outer space

defrost · on Oct 12, 2022

> In one much-discussed incident, a 2008 Air Qantas flight over Western Australia fell hundreds of feet twice within 10 minutes, injuring dozens of passengers on board

This was Qantas Flight 72 [1] and should be of interest to all critical system engineers.

There was a fault, an unidentified cause that corrupted data in ONE of three redundant Air Data Inertial Reference Unit (ADIRU)s.

The fault might have been a cosmic ray incident, it may have been EM interference, the usual array of software bugs, software corruption, hardware faults, etc.

The serious design issue was the spike in one unit was badly and incorrectly handled by the "failsafe" logic that was there to ride out bad juju coming from one of three redundant units.

Shit happens, its nice to have toilet paper thats effective when things are about to hit a fan.

[1] https://en.wikipedia.org/wiki/Qantas_Flight_72

Full Air Safety Report (313 pages)

https://www.atsb.gov.au/media/3532398/ao2008070.pdf

contingencies · on Oct 12, 2022

the spike in one unit was badly and incorrectly handled by the "failsafe" logic

In other words, this sounds like a reasonably easily detectable bug that was allowed in to production due to insufficient system verification, the fault for which lies squarely on Airbus and approving regulators. Arguably the simplest design for redundancy-based high availability systems is that when the system enters a non-quorum state, dissenting inputs should be flagged and discarded and if recovery is initiated their subsystem fully reset.

A more complicated design would be to have a mechanism for evaluating the extent to which inputs differ from those anticipated based upon other known state or inputs (last known position, inertia, airspeed, etc.), and to discard those most unlikely / least supported. However, the tiny fraction of potential failure conditions for which this provides an enhanced recovery path is largely outweighed by the greater complexity of state, processing overhead, lack of transparency in decision making and increased development and testing time (thus system cost).

A better investment of additional system design resources may be in creating a trust metric within the higher-level flight control systems that can reduce risk by avoiding autonomous actions based on subsystems that have entered a low-trust (eg. non-quorum) state.

And indeed, the report (page 21) states:

At 0440:26, one of the aircraft’s three air data inertial reference units (ADIRU 1) started providing incorrect data to other aircraft systems. At 0440:28, the autopilot automatically disconnected, and the captain took manual control of the aircraft.

The report reveals they had nominally independent autopilots running on nominally independent computers and nominally independent flight displays, all of which were of use during incident recovery. However, the number of systems that are reported to have broken (autotrim, cabin pressure, GNSS/RNAV, autobrake, third computer) strongly suggests a deep and systemic failure in the core flight control systems, probably stemming from systemic systems architecture failure to isolate and discard bad data from the malreporting subsystem.

A heterogeneous array of redundant subsystems (ie. from different manufacturers, or with differing dates or places of manufacture) are nominally more likely to survive a fault event. In this event, all the ADIRU units were identical LTN-101 models from Northrop Grumman (who, being a major military avionics contractor, one would have incorrectly assumed would have understood the value of neutron shielding https://www.sciencedirect.com/science/article/pii/B978012819...).

It is also worth noting that the ADIRU units are designed to calculate, maintain and report inertial navigation state. Having this state, sensor errors may compound or persist over time.

However, page 41 reveals that while each autopilot runs on an independent computer, Autopilot 1 on FMGEC 1 trusts ADIRU 1 as its "main" source, and likewise for #2. This suggests a "true quorum feed" is not obtained, possibly for reasons of redundancy (SPOF).

It would be interesting to discuss the current design of such systems with an Airbus engineer and to what extent that incident changed their internal test and design processes and sensor data architecture.

defrost · on Oct 13, 2022

Good comment, appreciated.

> In other words, this sounds like a reasonably easily detectable bug that was allowed in to production due to insufficient system verification, the fault for which lies squarely on Airbus and approving regulators.

Indeed - cosmic rays capable of bitflips are expected in aircraft systems so it is disapointing to see a failure to sheild and|or correctly mitigate for the expected.

> It is also worth noting that the ADIRU units are designed to calculate, maintain and report inertial navigation state. Having this state, sensor errors may compound or persist over time.

A very particular bugbear of mine. I've come across several examples where stats accumulators for low frequency error events are poorly implemented .. leading to situations where "this looks like it's working" and being passed over for any closer examination; years later some threshold is crossed and Bam! something bites hard.

contingencies · on Oct 13, 2022

People should learn from the experts. Lamport is always the wizard:

You're not going to come up with a simple design through any kind of coding techniques or any kind of programming language concepts. Simplicity has to be achieved above the code level before you get to the point which you worry about how you actually implement this thing in code. - Leslie Lamport

Then there's Wiener's Eighth and Final Law: You can never be too careful about what you put into a digital flight-guidance system. - Earl Wiener, Professor of Engineering, University of Miami (1980)

The corollary of this is to treat any and all state within subsystems as a red flashing lights level liability.

.. via https://github.com/globalcitizen/taoup

Tomte · on Oct 12, 2022

People love to talk about cosmic rays, but in the business (of safety-related systems) it is well-known that most causes of single event upsets (SEU)/soft errors are the microchips (or their packaging) themselves.

"In the terrestrial environment, the key radiations of concern are alpha particles emitted by trace impurities in the chip materials themselves"

(https://www.ti.com/support-quality/faqs/soft-error-rate-faqs...)

rkagerer · on Oct 12, 2022

Do pacemakers not use ECC?

Not saying that makes 'em cosmic ray proof, but my understanding is it can harden you by an order of magnitude or more (and help guard against glitches from other sources).

If I was designing something this life-crucial from silicon up, every single bit of memory (including registers) would have extra bits for this, and checks would occur everywhere it's moved or used (even over buses inside IC's, regardless of whether they shift data, addresses, control logic, etc). My code would go to great lengths to verify it hasn't been corrupted. The culture of redundancy might be akin to that seen in the Space Shuttle control systems.

Silicon is so damn cheap nowadays it shouldn't be the constraint. That it's hard to source components of such pedigree is disheartening. Basic ECC should be an industry norm for all but the lowest-end chips, and as feature sizes shrink further I hope sound engineering will become a bigger product differentiator. (Look at how HDD's have been getting more bits of ECC as platters become denser).

If more die for our buck doesn't equal less die for ourselves then we're doing something wrong.

Also, here's an interesting paper about radiation effects on pacemakers (from the perspective of cancer treatment) that I came across when searching whether radiation-hardening in this field is a thing: https://aapm.onlinelibrary.wiley.com/doi/pdf/10.1118/1.59725...

smallpipe · on Oct 12, 2022

> every single bit of memory (including registers) would have extra bits for this, and checks would occur everywhere it's moved or used (even over buses inside IC's, regardless of whether they shift data, addresses, control logic, etc).

That's one of the two approaches used for automotive silicon. The other, which is a bit more expensive but usually considered safer, is to run the same circuit twice in parallel (usually with a few cycles of offset) and compare the outputs. If those outputs don't match, the system is reset into a fail-safe mode.

abfan1127 · on Oct 12, 2022

its been 10 years since I designed them, but back then, no they didn't. Silicon is pretty cheap, but power is very expensive. ECC drives up memory power quite a bit. Both static and dynamic power are impacted.

Gordonjcp · on Oct 12, 2022

In practice what you do is design it to crash easily and recover quickly. Computers are fast, hearts are slow. Your microcontroller can go "huh, that set of readings is bollocks, let's restart and try again" several times between beats.

Even better, a pacemaker is not literally constantly driving a patient's heart, continuously. It's giving it a nudge every now and again if it's drifting out of whack. Think PLL rather than master clock ;-)

shagie · on Oct 12, 2022

A YouTube video on the subject : Veritasium - The Universe is Hostile to Computers https://youtu.be/AaZ_RSt0KP8

Some of the same examples are in the article and the video.

pknerd · on Oct 12, 2022

Google computers also become a victim of outer space "enemies"

FTA(https://www.newyorker.com/magazine/2018/12/10/the-friendship...)

On Sanjay’s monitor, a thick column of 1s and 0s appeared, each row representing an indexed word. Sanjay pointed: a digit that should have been a 0 was a 1. When Jeff and Sanjay put all the missorted words together, they saw a pattern—the same sort of glitch in every word. Their machines’ memory chips had somehow been corrupted.

Sanjay looked at Jeff. For months, Google had been experiencing an increasing number of hardware failures. The problem was that, as Google grew, its computing infrastructure also expanded. Computer hardware rarely failed, until you had enough of it—then it failed all the time. Wires wore down, hard drives fell apart, motherboards overheated. Many machines never worked in the first place; some would unaccountably grow slower. Strange environmental factors came into play. When a supernova explodes, the blast wave creates high-energy particles that scatter in every direction; scientists believe there is a minute chance that one of the errant particles, known as a cosmic ray, can hit a computer chip on Earth, flipping a 0 to a 1. The world’s most robust computer systems, at nasa, financial firms, and the like, used special hardware that could tolerate single bit-flips. But Google, which was still operating like a startup, bought cheaper computers that lacked that feature. The company had reached an inflection point. Its computing cluster had grown so big that even unlikely hardware failures were inevitable.

4gotunameagain · on Oct 12, 2022

This is why aerospace is fun..

You get to design systems that are supposed to work under a shower of particles. A solar storm during a solar maximum is no joke and requires some serious fault tolerance

Razengan · on Oct 12, 2022

Wow I was JUST thinking about this today:

Can there be a hardware random number generator that solely relies on such cosmic particles etc. hitting it?

Bonus: Use it as signs from God (reminiscing of Terry Davis of TempleOS)

shagie · on Oct 12, 2022

Hotbits uses radioactive decay as its source https://www.fourmilab.ch/hotbits/

It measures the time between two events (call this T1) and then the time between the next two events (call this T2).

If T1 > T2, you've got a 1. If T1 < T2, you've got a 0. If T1 = T2 throw it out.

As this is measuring the time between events rather than the rate of events, a decaying source just gives less bits over time rather than less randomness.

> The trick I use was dreamed up in a conversation in 1985 with John Nagle, who is doing some fascinating things these days with artificial animals. Since the time of any given decay is random, then the interval between two consecutive decays is also random. What we do, then, is measure a pair of these intervals, and emit a zero or one bit based on the relative length of the two intervals. If we measure the same interval for the two decays, we discard the measurement and try again, to avoid the risk of inducing bias due to the resolution of our clock.

vikingerik · on Oct 12, 2022

There would be a tiny tiny bias here, right? There are fewer atoms available to decay for the second interval (because the first two that decayed no longer exist), so the expected time to see two more decays would be slightly longer.

You could compensate for this mathematically - you'd look at the intervals relative to the expected interval for the current radioactive mass, rather than relative to each other.

(Of course, the bias would be on the order of the reciprocal of Avogadro's number, and probably ignorable for all practical uses.)

shagie · on Oct 12, 2022

That change in "fewer atoms" would likely be beyond the detection of clock jitter.

If you've got a mole of cesium, that's 132 grams and contains on the order of 10^23 atoms. Even after one half life (33 years), that's still on the order of 10^23 atoms. The difference between 10^23 and (10^23) - 1 isn't going to be noticeable.

The statistical tests of some hotbits data - https://www.fourmilab.ch/hotbits/statistical_testing/stattes...

The test that you're likely most interested in is ent ( https://www.fourmilab.ch/random/ ) and the entropy per byte and serial correlation test

    Entropy = 7.999975 bits per byte.
    Serial correlation coefficient is -0.000053 (totally uncorrelated = 0.0).

From the docs:

> Serial Correlation Coefficient

> This quantity measures the extent to which each byte in the file depends upon the previous byte. For random sequences, this value (which can be positive or negative) will, of course, be close to zero. A non-random byte stream such as a C program will yield a serial correlation coefficient on the order of 0.5. Wildly predictable data such as uncompressed bitmaps will exhibit serial correlation coefficients approaching 1. See [Knuth, pp. 64–65] for more details.

If you had a sufficiently precise clock where the difference between the 10^23 and 10^23 - 1 became noticeable, you'd likely see other effects. Cesium 137 decays via beta decay which is influenced by the weak force and neutrino. There is evidence that solar neutrinos slightly perturb the rate of beta decay ( https://physicsworld.com/a/do-solar-neutrinos-affect-nuclear... )

> Further evidence that solar neutrinos affect radioactive decay rates on Earth has been put forth by a trio of physicists in the US. While previous research looked at annual fluctuations in decay rates, the new study presents evidence of oscillations that occur with frequencies around 11 and 12.5 cycles per year. The latter oscillation appears to match patterns in neutrino-detection data from the Super-Kamiokande observatory, in Japan. Other physicists, however, are not convinced by the claim.

... and if you are capable of measuring the differences between N and N-1 atoms in the decay rate or the influence of neutrinos in the decay rate your clock is way too expensive to be used for generating random numbers.

----

(late edit)

https://www.fourmilab.ch/hotbits/how3.html specifically addresses this though:

> For example, you might worry about the fact that the intensity of the radiation source is slowly decreasing over time. Cæsium-137's 30.17 year half-life isn't all that long. One half-life in the future, we'll measure T1 and T2 intervals, on the average, twice as long as today. This means, then, that even on consecutive measurements there is a small bias in favour of T2 being longer than T1. How serious is this? Well, expressed in seconds, the half-life is about 9.5×108 and we receive count pulses at a rate of 1000 per second or so. So the time needed to perform the measurements to produce one random bit is on the order of 10−12 half-lives, and T2 will then tend to be longer by a factor of the same magnitude. Since the inter-count interval is around a millisecond, this means T2 will be, on average, 10−15 seconds longer than T1. This is comparable to the long-term accuracy of the best atomic time standards and is entirely negligible for our purposes. The crystal oscillator which provides the time base for the computer making the measurement is only accurate to 100 parts per million, or one part in ten thousand, and thus can induce errors ten million times as large as those due to the slow decay of the source. (This is, again, unlikely to be a real problem because most computer clocks, while prone to drifting as temperature and supply voltage vary, do not change significantly on the millisecond scale. Still, jitter due to where the clock generator happens to trigger on the oscillator waveform will still dwarf the effects of decay of the source during one measurement.)

dmurray · on Oct 12, 2022

Doesn't the time between events increase over time as there are fewer radioactive atoms left to decay? So E[T1] < E[T2] and you get slightly more 0 bits than 1s.

I'm sure the device is on a scale where this effect is tiny, but the description sounds like the creator is aiming for absolute theoretical correctness rather than "good enough" randomness which is not so hard to come by.

shagie · on Oct 12, 2022

If you've got a mole of cesium (137 grams), that's 10^23 atoms.

The difference between 10^23 and (10^23) - 1 would be lost in the clock resolution of the computer it is hooked up to.

The statistical tests of randomness are described at https://www.fourmilab.ch/hotbits/statistical_testing/stattes... from ent ( https://www.fourmilab.ch/random/ )

    Entropy = 7.999975 bits per byte.
    
    Optimum compression would reduce the size
    of this 11468800 byte file by 0 percent.
    
    Chi square distribution for 11468800 samples is 402.53, and randomly
    would exceed this value 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 127.5423 (127.5 = random).
    Monte Carlo value for Pi is 3.141486168 (error 0.00 percent).
    Serial correlation coefficient is -0.000053 (totally uncorrelated = 0.0).

I'd be impressed to find any other data source that is more random than hotbits (autocorrect keeps wanting to make this hobbits).

----

(late edit)

https://www.fourmilab.ch/hotbits/how3.html specifically addresses this though:

> For example, you might worry about the fact that the intensity of the radiation source is slowly decreasing over time. Cæsium-137's 30.17 year half-life isn't all that long. One half-life in the future, we'll measure T1 and T2 intervals, on the average, twice as long as today. This means, then, that even on consecutive measurements there is a small bias in favour of T2 being longer than T1. How serious is this? Well, expressed in seconds, the half-life is about 9.5×108 and we receive count pulses at a rate of 1000 per second or so. So the time needed to perform the measurements to produce one random bit is on the order of 10−12 half-lives, and T2 will then tend to be longer by a factor of the same magnitude. Since the inter-count interval is around a millisecond, this means T2 will be, on average, 10−15 seconds longer than T1. This is comparable to the long-term accuracy of the best atomic time standards and is entirely negligible for our purposes. The crystal oscillator which provides the time base for the computer making the measurement is only accurate to 100 parts per million, or one part in ten thousand, and thus can induce errors ten million times as large as those due to the slow decay of the source. (This is, again, unlikely to be a real problem because most computer clocks, while prone to drifting as temperature and supply voltage vary, do not change significantly on the millisecond scale. Still, jitter due to where the clock generator happens to trigger on the oscillator waveform will still dwarf the effects of decay of the source during one measurement.)

dmurray · on Oct 13, 2022

Great analysis, but the argument at the end still seems wrong. The methodology has been carefully designed so that the clock jitter only adds random noise, not a directional bias. Even if the oscillator consistently runs fast by triggering 90 ppm early, it doesn't predictably increase the probability of a 0 or 1 bit.

I suppose you could get more randomness using von Neumann's method for a biased coin - either waiting for multiple time periods, or with multiple decay sources in parallel. Or XOR the bits with a shitty but cheap source of randomness. But it's more valuable to be able to put an upper bound on your bias is - here, something like 10^-12 - and ensure that's acceptable for your application.

dekhn · on Oct 12, 2022

Is that the same John Nagle as Nagle's algorithm (he is also active on this site)?

shagie · on Oct 12, 2022

I would be surprised if there were two, and the links go through to some other work that appears to be the same person.

As to his activity on this (HN) site? https://news.ycombinator.com/item?id=9048947 ... and chasing links...

Animats 13 hours ago

dekhn · on Oct 12, 2022

There's another John Nagle who is a scientist at CMU (at least, from what I can tell, he is the not the Nagle of Nagle's algorithm).

shagie · on Oct 12, 2022

So... tying all the threads together...

The John Nagle linked in hotbits:

> The trick I use was dreamed up in a conversation in 1985 with John Nagle, who is doing some fascinating things these days with artificial animals.

Artificial animals in that part links to http://www.animats.com

The link at https://news.ycombinator.com/item?id=9048947 for the user Animats (who has an about and signs posts as John Nagle and links to the above site also)

> Animats on Feb 14, 2015 | root | parent | next [–]

> Yes, it's me. I did my networking work at Ford Aerospace in the early 1980s. But I left in 1986. It still bothers me that the Nagle algorithm (which I called tinygram prevention) and delayed ACKs interact so badly. ...

V__ · on Oct 12, 2022

This is what random.org does:

> A binary digit (bit) can be either 0 or 1. There are several Random.org radios located in Copenhagen, Dublin, and Ballsbridge, each generating 12,000 bits per second[8] from the atmospheric noise picked up.[9] The generators produce a continuous string of random bits which are converted into the form requested (integer, Gaussian distribution, etc.)

[1] https://en.wikipedia.org/wiki/Random.org

dmurray · on Oct 12, 2022

Should that read "Ballsbridge, Dublin and Copenhagen"?

idiocrat · on Oct 12, 2022

For increased entropy, you can use a thermal noise from a cheap camera or a static noise from a microphone.

Bayart · on Oct 12, 2022

As far as I know, the counts per minute for background radiation you get with cheap compact captors (say the stuff you find on Geiger counters) is pretty low, a dozen to a hundred counts per minute. That's not a lot of entropy. You could much more precise captors, but then you couldn't just plug it into a PCIe slot. At this scale, you probably get more entropy from just jitter on the electrical circuit. And it's pointless to have one huge centralized RNG, at least from a security standpoint.

tgflynn · on Oct 12, 2022

> You could much more precise captors

What do you mean by that ? There are only so many particles of ionizing radiation traversing a given surface per unit time (fortunately for us).

Bayart · on Oct 13, 2022

I meant bigger. But it doesn't sound quite as nice as "precise".

idiocrat · on Oct 19, 2022

You can open up a cheap USB camera to get the nacked CCD sensor and put a radioactive source (Americium-241) from a smoke detector directly onto the chip.

The CCD sensor, connected as USB camera, will record thousands of flashes/sec capturing the alpha radiation.

defrost · on Oct 12, 2022

Sure .. although you'll find it will have a (perhaps surprisingly) consistant distribution of energies and timings and here on earth will fluctuate by density of atmosphere above (height, humidity, tempreture) and relationship to earths magnetic flux lines with a partial coupling to solar output.

Airborne radiometric ground surveys run calibration flights at varying altitude to build an estimate of cosmic activity in order to subtract that from ground events originating from Uranium, Potassium, Thorium, Radon, etc.

sebzim4500 · on Oct 12, 2022

Those kind of trends don't matter very much if you e.g. only look at the least significant bits of the detection count.

defrost · on Oct 12, 2022

They might suggest more interesting applications such as using cosmic rays to map the interior of pyramids though.

bgirard · on Oct 12, 2022

Wouldn't there be a more likely explanation for an unintended bit flip than a cosmic ray? Perhaps some random hardware effect like an unintentional 'Row hammer' bit flip in other parts of the system, a very rare hardware race condition, a very unlikely quantum tunnel for the node size, etc...?

GrabbinD33ze69 · on Oct 12, 2022

Wow, a pacemaker malfunctioning due to a stray cosmic ray is incredible/alarming; I've only heard of these single upset events in the context of aerospace applications, not a medical device.

gl-prod · on Oct 12, 2022

I guess give it enough time and it will happen

airbreather · on Oct 15, 2022

Digital cameras are shipped by sea to avoid/lessen sensor damage they would be subjected to 7 miles high from greater cosmic radiation.

7373737373 · on Oct 12, 2022

Is critical infrastructure usually protected against this?