Reading 05: Engineering Disasters

The Therac-25 incident, as described here, exposed six patients to massive doses of radiation, killing four of them and leaving the remaining two with “lifelong injuries.”  The machine was designed to administer radiation therapy to cancer patients in the hope of killing off the cancerous cells and curing the patient.  According to the same article, the investigation determined that the malfunction was due to two key factors: first, that the machine’s software “contained bugs which proved to be fatal,” and second, that the machine “relied on the controlling computer alone for safety.”  Further reading indicated that these design flaws existed because the manufacturer wanted to reduce the amount of manual preparation required, in order to make the machine simpler and faster for the hospital technicians operating it.

Because of the way the computer set up the machine, it was possible to administer radiation unintentionally if the operator performed the setup too quickly.  Because testing had been done slowly and methodically, this bug was never encountered during pre-production testing.  Additionally, previous models of the machine had hardware safeguards, in the form of safety fuses, that would have prevented this error from occurring.  According to the article and the results of the investigation, “safety-critical loads were placed upon a computer system that was not designed to control them. Timing analysis wasn’t performed. Unit testing never happened. Fault trees for both hardware and software were not created.”  It was the responsibility of the software engineers, as well as the systems engineers, to catch these bugs.
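To make the failure mode concrete, here is a minimal sketch of that kind of race condition, written in Python rather than the Therac-25’s actual control code (which was PDP-11 assembly).  The Machine class, the mode names, and the timing here are hypothetical illustrations of the general pattern, not the real software: the setup routine samples the prescription once, takes a while to reconfigure the hardware, and nothing re-checks the two before firing.

```python
import threading
import time

class Machine:
    def __init__(self):
        self.mode = "xray"         # mode shown on the operator's console
        self.hardware_mode = None  # mode the beam hardware is set up for

    def configure_hardware(self):
        requested = self.mode      # sampled once, at the start of setup
        time.sleep(2.0)            # simulates slow magnet repositioning
        self.hardware_mode = requested

    def fire_beam(self):
        # Software-only safety: trusts that setup matched the console.
        # A hardware interlock, like the safety fuses on earlier models,
        # would physically refuse to fire when the two disagree.
        print(f"console={self.mode}, hardware={self.hardware_mode}")

machine = Machine()
setup = threading.Thread(target=machine.configure_hardware)
setup.start()
time.sleep(0.5)
machine.mode = "electron"  # a fast operator edits *during* setup
setup.join()
machine.fire_beam()        # console says electron; hardware is still x-ray
```

A slow, methodical tester would always let setup finish before editing, so this path never runs; only a fast operator triggers the mismatch, which is consistent with why the bug survived pre-production testing.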

It would appear that the root causes of these accidents were a poorly implemented design compounded by inadequate testing and analysis of the system.

Software developers working on safety-critical systems face the challenge of designing for situations and timing constraints that typical computer systems cannot easily handle.  When additional cost or design constraints are placed on the system, the task becomes even more difficult for the software engineers responsible.

Software engineers should always approach these types of systems with care and full knowledge of what they are doing.  The actions and work of software engineers always carry consequences, but even more so when working on safety-critical systems.  Failure to perform one’s job to the highest ability could result in the loss of life, or at the very least serious harm to others.  The stakes are raised in these scenarios, and the people responsible should be fully aware of what is at stake.

I also believe that software engineers should be held liable for their actions.  That said, thorough investigations should be performed to determine the extent of an engineer’s responsibility in an incident.  For example, an incident caused by pure negligence or insufficient testing would be entirely the fault of the engineer responsible for those duties.  In other scenarios, however, the engineer may be only partly at fault, and an extensive investigation would be needed to determine who is liable.
