
> the spike in one unit was badly and incorrectly handled by the "failsafe" logic

In other words, this sounds like a reasonably easily detectable bug that was allowed into production due to insufficient system verification, the fault for which lies squarely on Airbus and the approving regulators. Arguably the simplest design for redundancy-based high-availability systems is that when the system enters a non-quorum state, dissenting inputs are flagged and discarded, and their subsystem is fully reset before any recovery is initiated.
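As a hedged sketch of that "simplest" design (the function name, tolerance parameter and use of the median are all my invention, not Airbus's actual logic), a majority vote that flags and discards dissenters might look like:

```python
from statistics import median

def quorum_vote(readings, tolerance):
    """Return (consensus, dissenting_indices) over redundant readings.

    A reading dissents if it lies more than `tolerance` from the median
    of all readings; dissenters would be flagged for discard, and their
    subsystem fully reset before being readmitted to the quorum.
    """
    m = median(readings)
    dissenters = [i for i, r in enumerate(readings) if abs(r - m) > tolerance]
    if len(dissenters) * 2 >= len(readings):
        raise RuntimeError("no quorum: too many inputs disagree")
    consensus = median(r for i, r in enumerate(readings) if i not in dissenters)
    return consensus, dissenters
```

Note the deliberately conservative failure mode: if a majority of inputs disagree, the function refuses to synthesise a consensus at all rather than guessing.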

A more complicated design would be to have a mechanism for evaluating the extent to which inputs differ from those anticipated based upon other known state or inputs (last known position, inertia, airspeed, etc.), and to discard those most unlikely / least supported. However, the tiny fraction of potential failure conditions for which this provides an enhanced recovery path is largely outweighed by the greater complexity of state, processing overhead, lack of transparency in decision making and increased development and testing time (thus system cost).
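For illustration only, the more complicated design might rank inputs by how far they sit from a prediction derived from other known state. This toy uses a naive constant-rate extrapolation (the function name and parameters are invented; a real system would fuse inertia, airspeed, etc.):

```python
def least_supported(readings, last_value, rate, dt):
    """Index of the reading least supported by predicted state.

    The prediction is a naive extrapolation from the last known value
    and its rate of change over the interval `dt`.
    """
    predicted = last_value + rate * dt
    return max(range(len(readings)), key=lambda i: abs(readings[i] - predicted))
```

Even this trivial version hints at the objection above: the answer now depends on hidden state (`last_value`, `rate`) whose own trustworthiness must be established, which is where the complexity and opacity creep in.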

A better investment of additional system design resources may be in creating a trust metric within the higher-level flight control systems that can reduce risk by avoiding autonomous actions based on subsystems that have entered a low-trust (eg. non-quorum) state.
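A minimal sketch of such a trust metric, with all names and constants invented for illustration: non-quorum events decay a subsystem's trust quickly, sustained agreement restores it slowly, and autonomous actions are gated on a threshold.

```python
class SubsystemTrust:
    """Track a per-subsystem trust score in [0, 1]."""

    def __init__(self, initial=1.0):
        self.trust = initial

    def record(self, in_quorum):
        if in_quorum:
            self.trust = min(1.0, self.trust + 0.01)  # slow recovery
        else:
            self.trust = max(0.0, self.trust * 0.5)   # fast decay

    def allow_autonomous_action(self, threshold=0.8):
        return self.trust >= threshold
```

The asymmetry (fast decay, slow recovery) is the point: one non-quorum event should suppress autonomous action for a long time, rather than the subsystem being quietly re-trusted on its next agreeable sample.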

And indeed, the report (page 21) states:

At 0440:26, one of the aircraft’s three air data inertial reference units (ADIRU 1) started providing incorrect data to other aircraft systems. At 0440:28, the autopilot automatically disconnected, and the captain took manual control of the aircraft.

The report reveals they had nominally independent autopilots running on nominally independent computers, with nominally independent flight displays, all of which were of use during incident recovery. However, the number of systems reported to have broken (autotrim, cabin pressure, GNSS/RNAV, autobrake, third computer) strongly suggests a deep and systemic failure in the core flight control systems, probably stemming from an architectural failure to isolate and discard bad data from the malreporting subsystem.

A heterogeneous array of redundant subsystems (ie. from different manufacturers, or with differing dates or places of manufacture) is nominally more likely to survive a fault event. In this event, all the ADIRU units were identical LTN-101 models from Northrop Grumman (who, being a major military avionics contractor, one would have incorrectly assumed would have understood the value of neutron shielding https://www.sciencedirect.com/science/article/pii/B978012819...).

It is also worth noting that the ADIRU units are designed to calculate, maintain and report inertial navigation state. Because this state is carried forward over time, sensor errors may compound or persist rather than wash out.
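To illustrate the compounding with a toy double integration (not the ADIRU's actual filter): a small constant accelerometer bias, carried in inertial state, grows into position error roughly quadratically with time.

```python
def position_error_from_bias(bias, dt, steps):
    """Naively integrate a constant acceleration bias (m/s^2) twice.

    Returns the accumulated position error in metres after `steps`
    intervals of `dt` seconds; the error grows ~0.5 * bias * t**2,
    which is why inertial errors compound instead of averaging out.
    """
    velocity_err = 0.0
    position_err = 0.0
    for _ in range(steps):
        velocity_err += bias * dt
        position_err += velocity_err * dt
    return position_err
```

A bias of just 0.001 m/s^2 accumulates to roughly 180 m of position error over ten minutes, which is why inertial units must be periodically corrected against external references.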

However, page 41 reveals that while each autopilot runs on an independent computer, Autopilot 1 on FMGEC 1 trusts ADIRU 1 as its "main" source, and likewise for #2. This suggests a true quorum feed is not obtained, possibly because a shared voting stage would itself become a single point of failure (SPOF).

It would be interesting to discuss the current design of such systems with an Airbus engineer and to what extent that incident changed their internal test and design processes and sensor data architecture.



Good comment, appreciated.

> In other words, this sounds like a reasonably easily detectable bug that was allowed in to production due to insufficient system verification, the fault for which lies squarely on Airbus and approving regulators.

Indeed - cosmic rays capable of bit flips are expected in aircraft systems, so it is disappointing to see a failure to shield and/or correctly mitigate for the expected.

> It is also worth noting that the ADIRU units are designed to calculate, maintain and report inertial navigation state. Having this state, sensor errors may compound or persist over time.

A very particular bugbear of mine. I've come across several examples where stats accumulators for low-frequency error events are poorly implemented .. leading to situations where "this looks like it's working" and the code is passed over for any closer examination; years later some threshold is crossed and Bam! something bites hard.
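A toy illustration of this failure mode (entirely hypothetical, not from the report): a low-frequency error counter that looks healthy for years of testing, then silently wraps at a fixed width.

```python
class ErrorCounter:
    """A buggy stats accumulator for low-frequency error events."""

    WIDTH = 16  # fixed-width counter, as in many embedded status registers

    def __init__(self):
        self.count = 0

    def record_error(self):
        # Bug: wraps silently instead of saturating or raising an alarm,
        # so downstream error-rate checks go quiet exactly when the
        # accumulated count matters most.
        self.count = (self.count + 1) % (1 << self.WIDTH)
```

At a rate of a few errors a day, the counter behaves correctly for decades of testing and then resets to near zero in service; a saturating counter (or an overflow alarm) would make the same bug loudly visible.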


People should learn from the experts. Lamport is always the wizard:

You're not going to come up with a simple design through any kind of coding techniques or any kind of programming language concepts. Simplicity has to be achieved above the code level before you get to the point which you worry about how you actually implement this thing in code. - Leslie Lamport

Then there's Wiener's Eighth and Final Law: You can never be too careful about what you put into a digital flight-guidance system. - Earl Wiener, Professor of Engineering, University of Miami (1980)

The corollary of this is to treat any and all state within subsystems as a flashing-red-lights-level liability.

.. via https://github.com/globalcitizen/taoup



