Where are the dead bodies?
The possibility of faults in software causing death or serious injury is often talked about, and in some cases large amounts of money are invested in work to reduce the possibility of these events occurring (or at least in doing things that will support the view that a company took reasonable precautions, should a case end up in court). The Therac-25 accidents are an often-quoted example of a software fault that directly resulted in deaths. These accidents occurred over a 19-month period in the mid 1980s and are believed to have resulted in the death of six people. I don’t wish to disrespect the memory of the people who died, but six people 20 years ago; is that it? That is fewer than the number of people killed every day (around 10) in traffic accidents in the UK.
If faults in software really do have a non-trivial impact on human safety then we would expect this fact to be reflected in accident statistics. After searching the accident statistics for the UK I cannot find any accidents whose cause is directly attributed to software. If there are people who have died as a direct result of faults in software, the death rate has not yet reached the minimum level needed to be recorded as such (or are these deaths ‘hidden’ away in ones and twos within other causes?).
The US National Transportation Safety Board carries out a thorough investigation of all US aviation accidents. Searching the Aviation Accident Database for the query “software” between the dates 1 Jan 2000 and 9 Aug 2005 returns 44 matches. Reading these 44 reports I did not find any accident attributed to a software-related issue.
If faults in software are not killing or seriously injuring many people why is so much effort invested in reducing the probability of these events occurring? The following are some of the possibilities:
- The investment actually made is small, but it is talked up.
- The investment is made for economic reasons (e.g., more reliable products are likely to reduce support costs) and increased ‘safety’ is a side effect.
- In situations where there is a likelihood of death or serious injury, the procedures and reliability of non-software items are sufficient to short-circuit the effects of any life-threatening faults that may exist in the software used (at least until the fault can be corrected).
As any developer knows, replicating faulty behavior in software can be very difficult, if not impossible. It may be that software faults are not given as the root cause of death or serious injury because the necessary proof is not available. Or perhaps software faults have yet to be the root cause of such events on any non-trivial scale.
Existing practice affects what people are willing to put up with. Many users of Microsoft Windows now accept that it is necessary to reboot the computer they are using on a daily, or even hourly, basis. Users of cars accept that the tool they are using can result in serious injuries or even death (usually rating nothing more than a story in the local town newspaper). Will there be a public hue and cry once software faults start to be recorded as a primary factor in accidental death or serious injury? As this paper shows, it can take a lot of dead bodies before existing practices are changed.
The lack of dead bodies attributed to a software root cause suggests that it is still very early days for the field of high integrity software development.
This material was originally written in 2005 and appeared in an earlier blog of mine which I did not keep up.
I don’t know if you looked or not, and I don’t know if they’d be included in the US National Transportation Safety Board’s database, but have you looked into military deaths related to software? I know at least one pilot died in Lockheed Martin’s new F-22 stealth fighter. How much of that death was software related, though, I don’t know.
The Patriot 0.1s bug is also often cited, and certainly at least one life was lost that could have been saved — but this was war. Or maybe we hear a lot about it because we have colleagues who have a numerical precision analysis tool they want to sell you, although in this case, a simple compiler warning that a floating-point literal cannot be represented exactly, properly understood, would have been enough to avoid embarrassment.
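To make the arithmetic behind that bug concrete, here is a minimal sketch of how chopping 0.1 to a fixed number of binary digits lets a tick counter drift. The 23 fractional bits are an assumption taken from commonly repeated accounts of the incident, and the names `chop` and `clock_drift_seconds` are hypothetical helpers invented for this illustration; none of this is the actual system code.

```python
from fractions import Fraction

# Assumption: the tick length is stored with 23 bits after the binary point
# and the remaining bits are chopped off. The figure is illustrative only.
FRACTION_BITS = 23
TICK = Fraction(1, 10)  # the intended tick length: exactly one tenth of a second


def chop(value: Fraction, bits: int) -> Fraction:
    """Truncate a non-negative fraction to `bits` binary fractional digits."""
    scale = 2 ** bits
    return Fraction((value.numerator * scale) // value.denominator, scale)


def clock_drift_seconds(hours: int, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after counting 0.1 s ticks for `hours` hours."""
    per_tick_error = TICK - chop(TICK, bits)
    ticks = hours * 3600 * 10  # ten ticks per second
    return float(per_tick_error * ticks)


if __name__ == "__main__":
    # Even a 64-bit float does not hold 0.1 exactly, so this prints False.
    print(Fraction(0.1) == TICK)
    for hours in (1, 8, 100):
        print(hours, clock_drift_seconds(hours))
```

Under these assumptions the error per tick is under a ten-millionth of a second, yet after 100 hours of continuous counting it adds up to roughly a third of a second. Exact rational arithmetic is used here only so that the illustration does not suffer from the very rounding it is describing.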
Generally speaking, it is true that research in the field of software safety, for instance, uses the same examples again and again to justify its existence, even when use of the methods it advocates remains marginal (as of 2009).
On the other hand, software is deterministic and often replicated exactly into thousands of devices. Perhaps it is feared, and perhaps even with reason, that the death toll would immediately become unacceptably high if the slightest compromise in safety were allowed.
Well…I’ll guess two things:
1) There are likely hardware (i.e., non-software) limits in place that sufficiently overcome the faults of software in most cases.
2) Deaths probably get attributed to hardware issues (no limiter in place to keep the software from going there) rather than being reported as software faults.
Also, don’t forget that while something may keep someone from being killed, there is also the issue of injury, regardless of degree. How much do software-related faults contribute to injuries? Probably even harder to tell.