MIT Aero/Astro
System Safety and Software Engineering Research Papers

Older papers on the following topics are available here.

Newer publications on STAMP can be found by clicking here.




REQUIREMENTS SPECIFICATION AND ANALYSIS


Intent Specifications: An Approach to Building Human-Centered Specifications by Nancy Leveson, IEEE Trans. on Software Engineering, January 2000. (PostScript) (PDF )
This paper examines and proposes an approach to writing software specifications, based on research in systems theory, cognitive psychology, and human-machine interaction. The goal is to provide specifications that support human problem solving and the tasks that humans must perform in software development and evolution. A type of specification, called Intent Specifications, is constructed upon this underlying foundation.

Making Embedded Software Reuse Practical and Safe by Nancy Leveson and Kathryn Anne Weiss. Proceedings of Foundations of Software Engineering, November 2004. (pdf)

Reuse of application software has been limited and sometimes has led to accidents. This paper suggests some requirements for successful and safe application software reuse and demonstrates them using a case study on a real spacecraft.

Advanced System and Safety Engineering Environments by Nancy Leveson. This is an annotated PowerPoint presentation on SpecTRM and SpecTRM-RL. (Annotated PowerPoint Slides)

The notes plus the slides describe the SpecTRM tools and environment for building complex safety-critical systems.

Reusable Specification Components for Model-Driven Development by Kathryn Anne Weiss, Elwin C. Ong, and Nancy G. Leveson. Proceedings of the International Conference on System Engineering (INCOSE '03), July 2003. (pdf)

Modern, complex control systems for a specific application domain often display common system design architectures with similar subsystem functionality and interactions, making them suitable for representation by a reusable specification architecture. For example, every spacecraft requires attitude determination and control, power, thermal, communications, and propulsion subsystems. The similarities between these subsystems in most spacecraft can be exploited to create a model-driven system development environment in which generic reusable specifications and models can be tailored for the specific spacecraft design, executed and validated in a simulation environment, and then either manually or automatically transformed into software or hardware. Modifications to software and hardware during operations can be similarly made in the same controlled way, that is, starting from a model, validating the change, and finally implementing the change. The approach is illustrated using a spacecraft attitude determination and control subsystem.
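
To make the idea concrete, here is a rough sketch in Python, not the actual SpecTRM notation; the component, parameters, and control logic below are invented for illustration. A generic attitude determination and control component exposes tailoring parameters that each mission binds before the model is executed and validated:

    # Sketch of a reusable specification component, tailored per mission.
    # The component captures generic ADCS behavior; each spacecraft binds
    # mission-specific parameters before the model is executed and validated.
    from dataclasses import dataclass

    @dataclass
    class ADCSSpec:
        pointing_accuracy_deg: float    # required pointing accuracy
        actuators: tuple                # e.g., ("reaction_wheels", "thrusters")
        safe_mode_rate_deg_s: float     # max body rate before safing

        def command(self, est_error_deg: float, body_rate_deg_s: float) -> str:
            # Generic control logic shared by every tailored instance
            if body_rate_deg_s > self.safe_mode_rate_deg_s:
                return "ENTER_SAFE_MODE"
            if est_error_deg > self.pointing_accuracy_deg:
                return "SLEW_TO_TARGET"
            return "HOLD"

    # Tailoring the generic component for one hypothetical spacecraft:
    obs_sat = ADCSSpec(pointing_accuracy_deg=0.05,
                       actuators=("reaction_wheels",),
                       safe_mode_rate_deg_s=2.0)
    print(obs_sat.command(est_error_deg=0.2, body_rate_deg_s=0.1))  # SLEW_TO_TARGET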

Fault Protection in a Component-Based Spacecraft Architecture by Elwin C. Ong and Nancy G. Leveson. Proceedings of the International Conference on Space Mission Challenges for Information Technology, Pasadena, July 2003. (doc)

As spacecraft become more complex and autonomous, the need for reliable fault protection will become more prevalent. When coupled with the additional requirement of limiting cost, the task of implementing fault protection on spacecraft becomes extremely challenging. This paper describes how domain knowledge about spacecraft fault protection can be captured and stored in a reusable, component-based spacecraft architecture. The spacecraft-level fault protection strategy can then be created by composing generic component specifications, each with component-level fault protection included. The resulting design can be validated by formal analysis and simulation before any costly implementation begins. As spacecraft technology improves, new generic fault protection logic may be added, allowing active improvements to be made to the foundation.

Completeness in Formal Specification Language Design for Process Control Systems by Nancy G. Leveson. Proceedings of Formal Methods in Software Practice Conference, August 2000. (Postscript), (PDF).

This paper shows how the information required by the completeness criteria we defined for blackbox requirements specification can be embedded in the syntax of a formal specification language. It is a companion to the paper that follows, but it was written later and contains the current definition of the SpecTRM-RL requirements specification language.

On the Use of Visualization in Formal Requirements Specification by Nicolas Dulac, Thomas Viguier, Nancy Leveson, and Margaret-Anne Storey. International Conference on Requirements Engineering, Essen, September 2002. (pdf)
A limiting factor in the industrial acceptance of formal specifications is their readability, particularly for large, complex engineering systems. We hypothesize that multiple visualizations generated from a common model will improve the requirements creation, reviewing, and understanding process. Visual representations, when effective, provide cognitive support by highlighting the most relevant interactions and aspects of a specification for a particular use. In this paper, we propose a taxonomy and some preliminary principles for designing visual representations of formal specifications. The taxonomy and principles are illustrated by sample visualizations we created while trying to understand a formal specification of the MD-11 Flight Management System.

Investigating the Readability of State-Based Formal Requirements Specification Languages by Marc Zimmerman, Kristina Lundqvist, Nancy Leveson. International Conference on Software Engineering, Orlando, May 2002. (pdf)
The readability of formal requirements specification languages is hypothesized as a limiting factor in the acceptance of formal methods by the industrial community. An empirical study was conducted to determine how various factors of state-based requirements specification language design affect readability, using aerospace applications. Six factors were tested: the representation of the overall state machine structure, the expression of triggering conditions, the use of macros, the use of internal broadcast events, the use of hierarchies, and transition perspective (going-to or coming-from). Subjects included computer scientists as well as aerospace engineers in an effort to determine whether background affects notational preferences. Because so little previous experimentation on this topic exists on which to build hypotheses, the study was designed as a preliminary exploration of what factors are most important with respect to readability. It can serve as a starting point for more thorough and carefully controlled experimentation in specification language readability.

Reducing the Effects of Requirements Changes through System Design by Israel Navarro, Nancy Leveson, and Kristina Lundqvist, MIT SERL Technical Report, 2001. ( PDF)

The continuous stream of requirements changes that often takes place during software development can create major problems in the development process. This paper defines a concept we call semantic coupling that, along with features of intent specifications, can be used during system design to reduce the impact of changing requirements. The practicality of using the approach on real software is demonstrated using the intent specification of the control software for a NASA robot designed to service the heat-resistant tiles on the Space Shuttle.

Making Formal Methods Practical by Marc Zimmerman, Mario Rodriguez, Benjamin Ingram, Masafumi Katahira, Maxime de Villepin, Nancy Leveson. Digital Avionics Systems Conference, Oct. 2000. (Postscript), (PDF).

Despite their potential, formal methods have had difficulty gaining acceptance in the industrial sector. Some doubts are based on supposed impracticality or a long learning curve. Contributing to this skepticism is the fact that some types of formal methods have not yet been proven to handle systems of realistic complexity. To learn more about how to design formal specification languages that can be used for complex systems and require minimal training, we developed a formal specification from an English-language specification of a vertical flight control system similar to that found in the MD-11. This paper describes the lessons learned from this experience. A companion paper (below) describes how the model can be used in human-computer interaction analysis and pilot task analysis.

Designing Specification Languages for Process Control Systems: Lessons Learned and Steps to the Future, by Nancy G. Leveson, Mats Heimdahl, and Jon Damon Reese. Presented at SIGSOFT FOSE '99 (Foundations of Software Engineering), Toulouse, September 1999. (Postscript), (PDF).

Previously we defined a blackbox formal system modeling language called RSML (Requirements State Machine Language). The language was developed over several years while specifying the system requirements for a collision avoidance system for commercial passenger aircraft. During the language development, we received continual feedback and evaluation by FAA employees and industry representatives, which helped us to produce a specification language that is easily learned and used by application experts. Since the completion of the RSML project, we have continued our research on specification languages. This research is part of a larger effort to investigate the more general problem of providing tools to assist in developing embedded systems. Our latest experimental toolset is called SpecTRM (Specification Tools and Requirements Methodology), and the formal specification language is SpecTRM-RL (SpecTRM Requirements Language). This paper describes what we have learned from our use of RSML and how those lessons were applied to the design of SpecTRM-RL. We discuss our goals for SpecTRM-RL and the design features that support each of these goals.
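
One characteristic feature of these languages is the AND/OR table used to specify triggering conditions: each row is a condition, each column an alternative scenario, and a transition is taken when every row of some column matches. A minimal Python rendering of that evaluation rule (the table contents are invented for illustration):

    # Evaluate an AND/OR table: rows are named conditions, columns are
    # scenarios; each cell is True, False, or None (don't-care).
    # The table is satisfied if ALL rows of SOME column match the state.
    def and_or_table(rows, columns, state):
        for column in columns:
            if all(column[r] is None or column[r] == state[r] for r in rows):
                return True
        return False

    rows = ["altitude_below_threshold", "pilot_selected_descend"]
    columns = [
        {"altitude_below_threshold": True,  "pilot_selected_descend": None},
        {"altitude_below_threshold": False, "pilot_selected_descend": True},
    ]
    state = {"altitude_below_threshold": False, "pilot_selected_descend": True}
    print(and_or_table(rows, columns, state))  # True: the second column matches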

Completeness and Consistency in Hierarchical State-Based Requirements by Mats P.E. Heimdahl and Nancy Leveson. Published in IEEE Transactions on Software Engineering (May 1996). (PostScript) (PDF )

This paper describes automated methods for analyzing RSML specifications for completeness and consistency. Results are presented from the application of these methods to TCAS II.

Requirements Specification for Process-Control Systems by Nancy G. Leveson, Mats P.E. Heimdahl, Holly Hildreth, and Jon D. Reese. Published in IEEE Transactions on Software Engineering (Sept. 1994) (PostScript). (PDF )

Introduces RSML and the RSML requirements specification of TCAS II, an aircraft collision-avoidance system that motivated RSML's development.

An Intent Specification Model for a Robotic Software Control System, by Israel Navarro, Kristina Lundqvist, and Nancy Leveson, DASC '01. (PDF).

This paper shows a sample intent specification for an industrial robot designed to service the heat resistant tiles on the Space Shuttle.

Software Deviation Analysis: A "Safeware" Technique by Jon Damon Reese and Nancy G. Leveson. AIChe 31st Annual Loss Prevention Symposium, Houston, TX, March 1997. (PostScript) (PDF).

This paper describes one of the Safeware hazard analysis techniques, Software Deviation Analysis, that incorporates the beneficial features of HAZOP (such as guidewords, deviations, exploratory analysis, and a systems engineering approach) into an automated procedure that is capable of handling the complexity and logical nature of computer software.
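
A toy Python sketch of the underlying mechanics follows; the guidewords, control function, and hazard condition are invented, and the real technique operates on formal requirements models rather than code:

    # Toy deviation analysis: perturb an input per HAZOP-style guidewords
    # and check whether the software's output violates a hazard condition.
    def valve_command(pressure_reading):            # software under analysis
        return "OPEN_RELIEF" if pressure_reading > 100.0 else "HOLD"

    def deviations(nominal):
        # Guideword-style deviations of the measured value (illustrative)
        return {"nominal": nominal, "too_low": nominal * 0.5,
                "too_high": nominal * 1.5, "stuck_at_zero": 0.0}

    def hazardous(actual_pressure, cmd):
        return actual_pressure > 100.0 and cmd == "HOLD"

    actual = 120.0                                  # true plant pressure
    for name, reading in deviations(actual).items():
        cmd = valve_command(reading)
        if hazardous(actual, cmd):
            print(f"deviation '{name}' (reading={reading}) -> hazardous: {cmd}")
    # 'too_low' and 'stuck_at_zero' leave the relief valve closed: hazards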

Software Deviation Analysis by Jon Damon Reese and Nancy G. Leveson, International Conference on Software Engineering, Boston, 1997. (PDF).

A longer and more technically detailed paper on SDA than the one above.

Integrated Safety Analysis of Requirements Specifications, by Francesmary Modugno, Nancy G. Leveson, Jon D. Reese, Kurt Partridge, and Sean D. Sandys. Requirements Engineering '97. (Postscript). (PDF).

This paper describes the application of manual and automated safety analysis techniques to a prototype of an aircraft guidance system.

Software Requirements Analysis for Real-Time Process Control by Matt Jaffe, Nancy Leveson, Mats Heimdahl, and Bonnie Melhart. IEEE Trans. on Software Engineering, March 1991. (PDF)



SOFTWARE SYSTEM SAFETY


Safeware: System Safety and Computers by Nancy Leveson. Published by Addison Wesley (1995). (HTML Table of Contents)

This book examines past accidents and what is currently known about building safe electromechanical systems to see what lessons can be applied to new computer-controlled systems. Most accidents are not the result of unknown scientific principles but rather of a failure to apply well-known, standard engineering practices. In addition, accidents will not be prevented by technological fixes alone, but will require control of all aspects of the development and operation of the system. A methodology for building safety-critical systems is outlined.

Software Challenges in Achieving Space Safety by Nancy Leveson. Journal of the British Interplanetary Society, Vol. 62, 2009. (DOC)

Techniques developed for hardware reliability and safety do not work on software-intensive systems; software does not satisfy the assumptions underlying these techniques. The new problems and why the current approaches are not effective for complex, software-intensive systems are first described. Then a new approach to hazard analysis and safety-driven design is presented. Rather than being based on reliability theory, as most current safety engineering techniques are, the new approach builds on system and control theory.

A Systems-Theoretic Approach to Safety in Software-Intensive Systems by Nancy Leveson. IEEE Trans. on Dependable and Secure Computing, January 2005. (PDF)

Traditional accident models were devised to explain losses caused by failures of physical devices in relatively simple systems. They are less useful for explaining accidents in software-intensive systems and for non-technical aspects of safety such as organizational culture and human decision-making. This paper describes how systems theory can be used to form new accident models that better explain system accidents (accidents arising from the interactions among components rather than individual component failure), software-related accidents, and the role of human decision-making. Such models consider the social and technical aspects of systems as one integrated process and may be useful for other emergent system properties such as security. The loss of a Milstar satellite being launched by a Titan/Centaur launch vehicle is used as an illustration of the approach.

A New Approach to Hazard Analysis for Complex Systems by Nancy Leveson. Int. Conference of the System Safety Society, Ottawa, August 2003. (DOC)

This paper describes a new hazard analysis approach, called STPA (STAMP-based Analysis), based on a new model of accidents called STAMP. The paper briefly describes STPA and illustrates it with an aircraft collision avoidance system.
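
In the standard STPA formulation, a control action can be hazardous in four general ways: not provided when needed, provided when unsafe, provided with the wrong timing or order, or stopped too soon / applied too long. A schematic Python enumeration for a hypothetical collision-avoidance advisory (the control action and context below are invented):

    # Enumerate candidate unsafe control actions (UCAs) for one control
    # action, using the four STPA guide phrases.
    GUIDE_PHRASES = ("not provided", "provided when unsafe",
                     "provided with wrong timing or order",
                     "stopped too soon / applied too long")

    def candidate_ucas(control_action, context):
        return [f"'{control_action}' {phrase} [context: {context}]"
                for phrase in GUIDE_PHRASES]

    for uca in candidate_ucas("issue CLIMB advisory",
                              "intruder converging from above"):
        print(uca)   # each line is a candidate hazard to analyze further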

Model-Based Analysis of Socio-Technical Risk by Nancy Leveson. Technical Report, Engineering Systems Division, Massachusetts Institute of Technology, June 2002 (DOC)

In this report, a new type of hazard analysis, based on the STAMP model of accident causation, is described. STPA (STAMP-based Analysis) is illustrated by applying it to TCAS II, a complex aircraft collision avoidance system, and to a public water safety system in Canada. The TCAS II results are compared with a high-quality fault tree created by MITRE for the FAA. The STPA analysis was found to be more comprehensive and complete than the fault tree analysis. The integration of STPA, SpecTRM-RL system engineering tools, and system dynamics modeling creates the potential for a simulation and analysis environment to support and guide the initial technical and operational system design as well as organizational and management policy design. The results of STPA analysis can also be used to support organizational learning and performance monitoring throughout the system's life cycle so that degradation of safety and increases in risk can be detected before a catastrophe results.

An Approach to Design for Safety in Complex Systems by Nicolas Dulac and Nancy Leveson. Int. Conference on System Engineering (INCOSE '04), Toulouse, June 2004. (PDF)

Most traditional hazard analysis techniques rely on discrete failure events that do not adequately handle software-intensive systems or system accidents resulting from dysfunctional interactions between system components. This paper demonstrates a methodology where a hazard analysis based on the STAMP accident model is performed together with the system development process to design for safety in a complex system. Unlike traditional hazard analyses, this approach considers system accidents, organizational factors, and the dynamics of complex systems. The analysis is refined as the system design progresses and produces safety-related information to help system engineers in making design decisions for complex safety-critical systems. The preliminary design of a Space Shuttle Thermal Tile Processing System is used to demonstrate the approach.

Incorporating Safety Risk in Early System Architecture Trade Studies by Nicolas Dulac and Nancy Leveson. AIAA Journal of Spacecraft and Rockets, Vol. 46, No. 2, March-April 2009. (DOC)

Evaluating risk early in concept design is difficult due to the lack of information available at that early stage. This paper describes the approach we developed to perform a preliminary risk evaluation to use in the trade studies by MIT and Draper Laboratory for concept evaluation and refinement of the new NASA Space Exploration Initiative.

Demonstration of a Safety Analysis on a Complex System by N. Leveson, L. Alfaro, C. Alvarado, M. Brown, E.B. Hunt, M. Jaffe, S. Joslyn, D. Pinnel, J. Reese, J. Samarziya, S. Sandys, A. Shaw, Z. Zabinsky. Presented at the Software Engineering Laboratory Workshop, NASA Goddard, December 1997.

This paper describes a demonstration of the Safeware methodology on the Center-TRACON Automation System (CTAS) portion of the air traffic control system and procedures currently employed at the Dallas/Fort Worth TRACON. (Postscript) (PDF) The complete report can be found here: (Postscript) or (PDF).

Use of SpecTRM in Space Applications by Masafumi Katahira (NASDA) and Nancy Leveson. This paper was presented at the 19th International System Safety Conference, Huntsville, Alabama, September 2001. ( .doc (Word) ).

This paper provides an introduction to the application of SpecTRM (Specification Tools and Requirements Methodology) to safety-critical software in spacecraft controllers. The SpecTRM toolset supports modeling the behavior of safety-critical software and its operation while generating and maintaining significant safety information. We studied its applicability and effectiveness for safety-critical controllers on the International Space Station. Errors in the original requirements specifications of the Japanese Experimental Module (JEM) found during the modeling process are described.

A Safety and Human-Centered Approach to Developing New Air Traffic Management Tools by Nancy Leveson, Maxime de Villepin, Mirna Daouk, John Bellingham, Jayakanth Srinivasan, Natasha Neogi, and Ed Bachelder (MIT) and Nadine Pilon and Geraldine Flynn (Eurocontrol). This paper will be presented at ATM 2001, Albuquerque NM, December 2001. ( PDF ).

This paper describes a safety-driven, human-centered process for designing and integrating new components into an airspace management system. The general design of a conflict detection function currently being evaluated by Eurocontrol is being used as the testbed for the methodology, although the details differ somewhat. The development and evaluation approach proposed is based on the principle that critical properties must be designed into a system from the start. As a result, our methodology integrates safety analysis, functional decomposition and allocation, and human factors from the very beginning of the system development process. It also emphasizes using both formal and informal modeling to accumulate the information needed to make tradeoff decisions and ensure that desired system qualities are satisfied early in the design process when changes are easier and less costly. The formal modeling language was designed with readability as a primary criterion and therefore the models can act as an unambiguous communication medium among the developers and implementers. The methodology is supported by a new specification structuring approach, called Intent Specifications, that supports traceability and documentation of design rationale as the development process proceeds.

Integrated Safety Analysis of Requirements Specifications, by Francesmary Modugno, Nancy G. Leveson, Jon D. Reese, Kurt Partridge, and Sean D. Sandys. Requirements Engineering '97. (Postscript). (PDF).

This paper describes the application of manual and automated safety analysis techniques to a prototype of an aircraft guidance system.

System Safety in Computer-Controlled Automotive Systems, by Nancy G. Leveson, SAE Congress, March, 2000. (Postscript), (PDF).

An invited paper that summarizes the state of the art in software system safety and suggests some approaches possible for the automotive and other industries.

Software Deviation Analysis: A "Safeware" Technique by Jon Damon Reese and Nancy G. Leveson. AIChe 31st Annual Loss Prevention Symposium, Houston, TX, March 1997. (PostScript) (PDF).

This paper describes one of the Safeware hazard analysis techniques, Software Deviation Analysis, that incorporates the beneficial features of HAZOP (such as guidewords, deviations, exploratory analysis, and a systems engineering approach) into an automated procedure that is capable of handling the complexity and logical nature of computer software.

The Therac-25 Accidents by Nancy G. Leveson. (Postscript ) or (PDF).

This paper is an updated version of the original IEEE Computer (July 1993) article. It also appears in the appendix of my book.

The following papers are not currently available in electronic form:

Leveson, N.G. and P.R. Harvey. "Analyzing Software Safety," IEEE Transactions on Software Engineering, vol. SE-9, no. 5, 1983.

Leveson, N.G. and Stolzy, J.L. "Safety Analysis Using Petri Nets," IEEE Trans. on Software Engineering, Vol. SE-13, No. 3, March 1987, pp. 386-397.

Leveson, N.G. "Software Safety in Embedded Computer Systems," Communications of the ACM, February 1991.

Leveson, N.G., Cha, S.S., Shimeall, T.J. "Safety Verification of Ada Programs using Software Fault Trees," IEEE Software, July 1991.



SYSTEM SAFETY AND ACCIDENT MODELS


Modeling and Hazard Analysis using STPA by Takuto Ishimatsu, Nancy Leveson, John Thomas, Masa Katahira, Yuko Miyamoto, Haruka Nakao. Presented at the Conference of the International Association for the Advancement of Space Safety, Huntsville, Alabama, May 2010. ( DOC )
A joint research project between MIT and JAXA/JAMSS is investigating the application of a new hazard analysis technique, called STPA, to the system and software in the HTV. STPA is based on systems theory rather than reliability theory. It treats safety as a control problem rather than a failure problem. Traditional hazard analysis focuses on component failures, but software does not fail in this way. Software most often contributes to accidents by commanding the spacecraft into an unsafe state (e.g., turning off the descent engines prematurely) or by not issuing required commands. That makes the standard hazard analysis techniques of limited usefulness for software-intensive systems, and most spacecraft built today are software-intensive.

This paper describes the experimental application of STPA to the JAXA HTV (unmanned cargo transfer vehicle to the International Space Station). Because the HTV was originally developed using fault tree analysis and following the NASA standards for safety-critical systems, the results of our experimental application of STPA can be compared with these more traditional safety engineering approaches in terms of the problems identified and the resources required to use it.

Applying Systems Thinking to Analyze and Learn from Events by Nancy Leveson, presented at NeTWorK 2008: Event Analysis and Learning from Events, Berlin, August 2008. (DOC )
Why don't the approaches we use to learn from events, most of which go back decades and have been incrementally improved over time, work well in today's world? Maybe the answer can be found by reexamining the underlying assumptions and paradigms in safety and identifying any potential disconnects with the world as it exists today. While abstractions and simplifications are useful in dealing with complex systems and problems, those that are counter to reality can hinder us from making forward progress. Most of the new research in this field never questions these assumptions and paradigms. It is important to devote some effort to examining our foundations, which is what I try to do in this paper. Too many beliefs in accident analysis, starting with the assumption that analyzing events and learning from them is adequate, are accepted without question.


A Safety-Driven, Model-Based System Engineering Methodology, Part I by Margaret Stringfellow Herring, Brandon D. Owens, Nancy Leveson, Michel Ingham, and Kathryn Ann Weiss. MIT Technical Report, December 2007. (PDF )
The final report for a JPL grant to demonstrate a safety-driven, model-based system engineering methodology on a JPL spacecraft. In this methodology, safety is folded into and drives the design process rather than being conducted as a separate activity. The methodology integrates MIT's STAMP accident model and the hazard analysis method based on it (called STPA), intent specifications (a structured system engineering specification framework and model-based specification language), and JPL's State Analysis (a system modeling approach).

A Safety-Driven, Model-Based System Engineering Methodology, Part II: Application of the Methodology to an Outer Planet Exploration Mission by Brandon D. Owens, Margaret Stringfellow Herring, Nancy Leveson, Michel Ingham, and Kathryn Ann Weiss. MIT Technical Report, December 2007. (Word )
A sample intent specification created for an Outer Planets Explorer spacecraft as part of a safety-driven, model-based system engineering demonstration for JPL.

Application of a Safety-Driven Design Methodology to An Outer Planet Exploration Mission by Brandon D. Owens, Margaret Stringfellow Herring, Nicholas Dulac, Nancy Leveson, Michel Ingham, and Kathryn Ann Weiss. IEEE Aerospace Conference, Big Sky, Montana, March 2008. (PDF )
A conference paper summarizing the two JPL reports above, for readers who want an overall description without all the details and examples.

A Comparative Look at MBU Hazard Analysis Techniques by Brandon Owens and Nancy Leveson. 2006 MAPLD (Military and Aerospace Programmable Logic Device) International Conference, Washington, D.C., September 2006. (PDF )
The flux of radiation particles encountered by a spacecraft is a phenomenon that can largely be understood statistically. However, the same cannot be said for the interactions of these particles with the spacecraft, as they are far more challenging to grasp and guard against. The ultimate impact of a radiation particle's interaction with a spacecraft depends on factors that often extend beyond the purview of any subject matter expert and typically cannot be represented quantitatively in system-level trade studies without the acceptance of numerous assumptions. In this paper, many of the assumptions associated with the probabilistic assessment of the system-level effects of a specific type of radiation-induced hazard, the Multiple Bit Upset (MBU), are explored in the light of MBU events during the Gravity Probe B, Cassini, and X-ray Timing Explorer missions. These events highlight key problems in using probabilistic, quantitative analysis techniques for hazards in highly complex and unique systems such as spacecraft. As a result, a case is made for the use of system-level, qualitative techniques for both the identification of potential system-level hazards and the justification of responses to them in the system design.

Safety in Integrated Systems Health Engineering and Management by Nancy Leveson. NASA Ames Integrated System Health Engineering and Management Forum (ISHEM), Napa, November 2005. (DOC )
This paper describes the state of the art in system safety engineering and management along with new models of accident causation, based on systems theory, that may allow us to greatly expand the power of the techniques and tools we use. The new models consider hardware, software, humans, management decision-making, and organizational design as an integrated whole. New hazard analysis techniques based on these expanded models of causation provide a means for obtaining the information necessary to design safety into the system and to determine which are the most critical parameters to monitor during operations and how to respond to them. The paper first describes and contrasts the current system safety and reliability engineering approaches to safety and the traditional methods used in both these fields. It then outlines the new system-theoretic approach being developed in Europe and the U.S. and the application of the new approach to aerospace systems, including a recent risk analysis and health assessment of the NASA manned space program management structure and safety culture that used the new approach.

A New Accident Model for Engineering Safer Systems by Nancy Leveson. Safety Science, Vol. 42, No. 4, April 2004. (PDF )
A new model of accidents is proposed based on systems theory. Systems are viewed as interrelated components that are kept in a state of dynamic equilibrium by feedback loops of information and control. Accidents result from inadequate control or enforcement of safety-related constraints on the system. Instead of defining safety management in terms of preventing component failure events, it is defined as a continuous control task to impose the constraints necessary to limit system behavior to safe changes and adaptations. Accidents can be understood, using this model, in terms of why the controls that were in place did not prevent or detect maladaptive changes, that is, by identifying the safety constraints that were violated and determining why the controls were inadequate in enforcing them. This model provides a theoretical foundation for the introduction of unique new types of accident analysis, hazard analysis, design for safety, risk assessment techniques, and approaches to designing performance monitoring and safety metrics.
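
A minimal Python sketch of the control-loop view (everything below is invented for illustration): the controller enforces a safety constraint through its internal model of the controlled process, and the constraint can be violated without the control logic itself "failing" once missing feedback lets that model drift from the true process state:

    # The controller decides from its *process model*, not from the
    # process itself. When feedback stops updating the model, an
    # apparently correct control action violates the safety constraint.
    class Controller:
        def __init__(self):
            self.believed_altitude = 1000.0     # process model

        def update(self, feedback):
            if feedback is not None:            # feedback may be missing
                self.believed_altitude = feedback

        def command(self):
            # Safety constraint: never descend below 500 ft
            return "DESCEND" if self.believed_altitude > 500.0 else "LEVEL_OFF"

    true_altitude, ctrl = 1000.0, Controller()
    for feedback in [900.0, 700.0, None, None]:  # sensor feedback drops out
        ctrl.update(feedback)
        if ctrl.command() == "DESCEND":
            true_altitude -= 200.0
    print(true_altitude)  # 200.0: constraint violated although the control
                          # logic behaved exactly as specified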

Safety and Risk Driven Design in Complex Systems of Systems by Nancy Leveson and Nicolas Dulac. Presented at the 1st NASA/AIAA Space Exploration Conference, Orlando, February 2005. (DOC )
This paper describes STAMP briefly and shows (1) how it can be applied to accident/incident (root cause) analysis, using a Titan/Milstar loss, and (2) how a new STAMP-based hazard analysis technique called STPA works, using an industrial robot example.

Applying STAMP in Accident Analysis by Nancy Leveson, Mirna Daouk, Nicolas Dulac, and Karen Marais, Workshop on Investigation and Reporting of Incidents and Accidents (IRIA), September 2003. (PDF )
This paper shows how STAMP can be applied to accident analysis using three different models of the accident process and proposes a notation for describing this process. The models are illustrated using a case study of a water contamination accident in Walkerton, Canada.

The Analysis of a Friendly Fire Accident Using a Systems Model of Accidents. by Nancy Leveson. International Conference of the System Safety Society, 2002. (PDF )
An example of my new accident model applied to a friendly fire accident in the Iraqi No-Fly-Zone in 1994.

The Role of Software in Spacecraft Accidents by Nancy Leveson. This paper appeared in the AIAA Journal of Spacecraft and Rockets, Vol. 41, No. 4, July 2004. (PDF )
The first and most important step in solving any problem is understanding the problem well enough to create effective solutions. To this end, several software-related spacecraft accidents were studied to determine common systemic factors. Although the details in each accident were different, very similar factors related to flaws in the safety culture, the management and organization, and technical deficiencies were identified. These factors include complacency and discounting of software risk, diffusion of responsibility and authority, limited communication channels and poor information flow, inadequate system and software engineering (poor or missing specifications, unnecessary complexity and software functionality, software reuse without appropriate safety analysis, violation of basic safety engineering practices in the digital components), inadequate review activities, ineffective system safety engineering, flawed test and simulation environments, and inadequate human factors engineering. Each of these factors is discussed along with some recommendations on how to eliminate them in future projects.

Evaluating Accident Models using Recent Aerospace Accidents (Part 1: Event-Based Models) by Nancy Leveson (PDF )
A report written for NASA Ames and the NASA Software IV&V Facility evaluating common event-based accident models and identifying underlying systemic factors in 8 aerospace accidents. Warning: the report is 140 pages so you might want to look at it before printing it. There is an executive summary that summarizes the overall contents, and Chapter 4 summarizes what was learned about the accident models and also the common factors identified in the accidents. The paper listed immediately above summarizes the factors found in the spacecraft accidents.

An Analysis of Causation in Aerospace Accidents by Kathryn Weiss, Nancy Leveson, Kristina Lundqvist, Nida Farid, and Margaret Stringfellow. Presented at Space 2001, Albuquerque, New Mexico, August 2001. (DOC)
This paper describes the causal factors in the mission interruption of the SOHO (SOlar Heliospheric Observatory) spacecraft using the hierarchical model introduced in the NASA report listed above. The factors in this accident are similar to common factors found in other recent software-related aerospace losses.



ORGANIZATIONAL and CULTURAL ISSUES IN SAFETY


Demonstration of a New Dynamic Approach to Risk Analysis for NASA's Constellation Program by Nicolas Dulac, Brandon Owens, Nancy Leveson, Betty Barrett, John Carroll, Joel Cutcher-Gershenfeld, Stephen Friedenthal, Joseph Laracy, and Joseph Sussman. Final Report to the NASA Exploration Systems Mission Directorate Associate Administrator, March 2007. (PDF)
Effective risk management in the development of complex aerospace systems requires the balancing of multiple risk components including safety, cost, performance, and schedule. Safety considerations are especially critical during system development because it is very difficult to design or "inspect" safety into a system during operation. This report describes the results of an MIT Complex Systems Research Laboratory (CSRL) study conducted at the request of the NASA Exploration Systems Mission Directorate (ESMD) to evaluate the usefulness of a new model of accident causation (STAMP) and STAMP-based system dynamics models in the development of new spacecraft systems. In addition to fulfilling the specific needs of ESMD, the study is part of an on-going effort by the MIT CSRL to develop and refine techniques for modeling and treating organizational safety culture as a dynamic control problem.

Technical and Managerial Factors in the NASA Challenger and Columbia Losses: Looking Forward to the Future by Nancy Leveson, in Handelsman and Kleinman (editors), Controversies in Science and Technology (to appear), University of Wisconsin Press, 2007. (DOC)
This essay examines the technical and organizational factors leading to the Challenger and Columbia accidents and what we can learn from them. While accidents are often described in terms of a chain of directly related events leading to a loss, examining this event chain does not explain why the events themselves occurred. In fact, accidents are better conceived as complex processes involving indirect and non-linear interactions among people, societal and organizational structures, engineering activities, and physical system components. They are rarely the result of a chance occurrence of random events, but usually result from the migration of a system (organization) toward a state of high risk where almost any deviation will result in a loss. Understanding enough about the Challenger and Columbia accidents to prevent future ones, therefore, requires not only determining what was wrong at the time of the losses, but also why the high standards of the Apollo program deteriorated over time and allowed the conditions cited by the Rogers Commission as the root causes of the Challenger loss and why the fixes instituted after Challenger became ineffective over time, i.e., why the manned space program has a tendency to migrate to states of such high risk and poor decision-making processes that an accident becomes almost inevitable.

What System Safety Engineering can Learn from the Columbia Accident by Nancy Leveson and Joel Cutcher-Gershenfeld, Int. Conference of the System Safety Society, Providence Rhode Island, August 2004. (PDF )
Many of the dysfunctionalities in the system safety program at NASA contributing to the Columbia accident can be seen in other groups and industries. This paper summarizes some of the lessons we can all learn from this tragedy. While there were many factors involved in the loss of the Columbia Space Shuttle, this paper concentrates on the role of system safety engineering and what can be learned about effective (and ineffective) safety efforts.

Risk Analysis of the NASA Independent Technical Authority by Nancy Leveson and Nicholas Dulac with contributions by Joel Cutcher-Gershenfeld, John Carroll, Betty Barrett and Stephen Friedenthal. (DOC)
The application of STAMP and STPA to an organizational risk analysis.

Modeling, Analyzing, and Engineering NASA's Safety Culture by Nancy Leveson (with Nicolas Dulac, David Zipkin, Joel Cutcher-Gershenfeld, Betty Barrett, and John Carroll), Final Report of a Phase 1 NASA/USRA research grant (PDF )
This is the final report on Phase 1 (5 months) of a research grant on STAMP and system dynamics models. We used the NASA manned space program as our testbed.

Moving Beyond Normal Accidents and High Reliability Organizations: An Alternative Approach to Safety in Complex Systems by Nancy Leveson, Karen Marais, Nicolas Dulac, and John Carroll, (DOC ) to appear in Organizational Studies (Sage Publishers).
Organizational factors play a role in all accidents and are a critical part of understanding and preventing them. Two prominent sociological schools of thought have addressed the organizational aspects of safety: Normal Accident Theory and High Reliability Organizations (HRO). In this paper, we argue that the conclusions of HRO researchers are limited in their applicability and usefulness to complex, high-risk systems and that following some of the recommendations could actually contribute to accidents. Normal Accident Theory, on the other hand, does recognize the difficulties involved but is unnecessarily pessimistic about the possibility of effectively dealing with them. An alternative systems approach to safety is described.

Effectively Addressing NASA's Organizational and Safety Culture: Insights from System Safety and Engineering Systems by Nancy Leveson, Joel Cutcher-Gershenfeld, Betty Barrett, Alexander Brown, John Carroll, Nicolas Dulac, Lydia Fraile, Karen Marais. MIT ESD Symposium, March 2004 (Word)
This paper illustrates some aspects of the changes required for a realignment of social systems as recommended by the Columbia Accident Investigation Board (CAIB). The paper focuses on three aspects of social systems at NASA: organizational structure; organizational subsystems and social interaction processes (communication systems, leadership, and information systems); and capability and motivation. Issues of organizational vision, strategy, and culture are woven throughout the analysis.

Archetypes for Organizational Safety by Karen Marais and Nancy G. Leveson. Proceedings of the Workshop on Investigation and Reporting of Incidents and Accidents, September 2003. (pdf)
We propose a framework using system dynamics to model the dynamic behavior of organizations in accident analysis. Most current accident analysis techniques are event-based and do not adequately capture the dynamic complexity and non-linear interactions that characterize accidents in complex systems. In this paper, we propose a set of system safety archetypes that often lead to accidents. As accident analysis and investigation tools, the archetypes can be used to develop dynamic models that describe the systemic and organizational factors contributing to the accident. The archetypes help to clarify why safety-related decisions do not always result in the desired behavior, and how independent decisions in different parts of the organization can combine to impact safety.
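
To give a flavor of one such archetype, here is a crude, hypothetical system dynamics simulation in Python (the structure and coefficients are invented, not taken from the paper): safety effort erodes under constant performance pressure until an incident renews attention, producing a cyclic drift toward high risk:

    # Crude Euler integration of an "eroding safety" archetype.
    safety_effort, risk = 1.0, 0.0
    for month in range(48):
        safety_effort -= 0.04 * safety_effort   # pressure erodes effort
        risk += 0.1 * (1.0 - safety_effort)     # risk accumulates as effort fades
        if risk > 1.0:                          # incident occurs
            print(f"month {month}: incident, safety attention renewed")
            safety_effort, risk = 1.0, 0.0      # attention resets; cycle repeats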



HUMAN-MACHINE INTERACTION


Analyzing Software Specifications for Mode Confusion Potential, by Nancy G. Leveson, L. Denise Pinnel, Sean David Sandys, Shuichi Koga, Jon Damon Reese. Presented at the Workshop on Human Error and System Development, Glasgow, March 1997. (Postscript) (PDF).

Increased automation in complex systems has led to changes in the human controller's role and to new types of technology-induced human error. Attempts to mitigate these errors have primarily involved giving more authority to the automation, enhancing operator training, or changing the interface. While these responses may be reasonable under many circumstances, an alternative is to redesign the automation in ways that do not reduce necessary or desirable functionality or to change functionality where the tradeoffs are judged to be acceptable. This paper describes an approach to detecting error-prone automation features early in the development process while significant changes can still be made to the conceptual design of the system. The software requirements are modeled using a hierarchical state machine language and then analyzed (manually or with automated assistance) to identify violations of a set of design constraints associated with mode-confusion errors. The approach is illustrated with a model of the software controlling a NASA robot.

Designing Automation to Reduce Operator Errors by Nancy G. Leveson and Everett Palmer (NASA Ames Research Center). In the Proceedings of Systems, Man, and Cybernetics Conference, Oct. 1997 (PostScript) (PDF ).

Advanced automation has been accompanied, particularly in aircraft, by a proliferation of modes, where modes define mutually exclusive sets of system behavior. The new mode-rich systems provide flexibility and enhanced capabilities, but they also increase the need for and difficulty of maintaining mode awareness. A previous paper described some categories of potential design flaws that can lead to mode confusion errors and described an approach to finding these flaws by first modeling blackbox software behavior and then using analysis methods and tools to assist in searching the models for predictable error forms, i.e., for automation features that can contribute to operator mistakes. This paper shows an example of the approach for one particular feature, i.e., indirect mode changes, using an example from the MD-88 control logic. The particular indirect mode transition problem used in the example, called a "kill-the-capture bust," has been noted in many ASRS incident reports.
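
The automated check this implies is straightforward to sketch in Python (the transition data below are invented, not the MD-88 logic): flag every mode transition whose trigger is not a direct operator action:

    # Flag indirect mode transitions: mode changes whose triggering event
    # is not a direct operator action (a classic mode-confusion source).
    transitions = [
        {"src": "ALT_CAPTURE", "dst": "VERT_SPEED",  "trigger": "pilot_selects_vs"},
        {"src": "ALT_CAPTURE", "dst": "ALT_HOLD",    "trigger": "altitude_reached"},
        {"src": "VERT_SPEED",  "dst": "ALT_CAPTURE", "trigger": "arm_condition_met"},
    ]
    OPERATOR_ACTIONS = {"pilot_selects_vs"}

    for t in transitions:
        if t["trigger"] not in OPERATOR_ACTIONS:
            print(f"indirect: {t['src']} -> {t['dst']} on {t['trigger']}")
    # Each flagged transition is a candidate for mode-confusion review.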

Describing and Probing Complex System Behavior: A Graphical Approach by Edward Bachelder and Nancy Leveson. In the proceedings of the Aviation Safety Conference, Seattle, Sept. 2001. ( Word (.doc) )

Hands-on training and operation is generally considered the primary means that a user of a complex system will use to build a mental model of how that system works. However, accidents abound where a major contributing factor was user disorientation/misorientation with respect to the automation behavior, even when the operator was a seasoned user. This paper presents a compact graphical method that can be used to describe system operation, where the system may be composed of interacting automation and/or human entities. The fundamental goal of the model is to capture and present critical interactive aspects of a complex system in an integrated, intuitive fashion. This graphical approach is applied to an actual military helicopter system, using the onboard hydraulic leak detection/isolation system as a testbed. The helicopter Flight Manual is used to construct the system model, whose components include: logical structure (waiting and checking states, transitional events, and conditions), human/automation cross communication (messages, information sources), and automation action and associated action limits. Using this model, examples of the following types of mode confusion are identified in the military helicopter case study: (1) unintended side effects, (2) indirect mode transitions, (3) inconsistent behavior, (4) ambiguous interfaces, and (5) lack of appropriate feedback. The model also facilitates analysis and revision of emergency procedures, which is demonstrated using an actual set of procedures.

Modeling Controller Tasks for Safety Analysis, by Molly Brown and Nancy G. Leveson. Presented at the Workshop on Human Error and System Development, Seattle, April 1998. (Postscript) (PDF 3.0).

As control systems become more complex, the use of automated control has increased. At the same time, the role of the human operator has changed from primary system controller to supervisor or monitor. Safe design of the human-computer interaction becomes more difficult.

In this paper, we present a visual task modeling language that can be used by system designers to model human-computer interactions. The visual models can be translated into SpecTRM-RL, a blackbox specification language for modeling the automated portion of the control system. The SpecTRM-RL suite of analysis tools allows the designer to perform formal and informal safety analyses on the task model in isolation or integrated with the rest of the modeled system.

Identifying Mode Confusion Potential in Software Design by Mario Rodriguez, Marc Zimmerman, Masafumi Katahira, Maxime de Villepin, Benjamin Ingram, and Nancy Leveson. Digital Avionics Systems Conference, October 2000. (Postscript), (PDF).

While automation has eliminated many types of operator error, it has also created new types of technology-induced human error. This paper shows how a formal model of an FMS similar to an MD-11 can be used to evaluate human factors aspects of the automation design.

An Approach to Human-Centered Design, by Mirna Daouk and Nancy G. Leveson. Presented at the Workshop on Human Error and System Development, Linkoping, Sweden, June 2001. (.doc)

Human-automation interactions are changing in nature, and new sources of errors and hazards are being introduced. The need for reducing human errors without sacrificing the benefits of computers has led to the idea of human-centered system design; little work, however, has been done as to how one would achieve this goal. This paper provides a methodology for human-centered design of systems including both humans and automation. The proposed methodology integrates task allocation, task analysis, simulations, human factors experiments, formal models, and several safety, usability, and performance analyses into Intent Specifications. An air traffic control conflict detection tool, MTCD, is used to illustrate the methodology.



SOFTWARE FAULT TOLERANCE


An Experimental Evaluation of the Assumption of Independence in Multi-Version Programming, by John Knight and Nancy Leveson, IEEE Transactions on Software Engineering, Vol. SE-12, No. 1, January 1986, pp. 96-109. (PDF) (Sorry, but this paper is so old, I have only a copy that was converted from an old typesetting language.)

Our original paper that got us in such hot water for the next ten years, until everyone who tried to show we were wrong got the same results and grudgingly admitted we were right. Unfortunately, the same idea keeps popping up again like a bad penny among people who do not bother to learn anything about what has been done in the past.
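
The statistical core is easy to restate with hypothetical numbers: under the independence assumption, the probability that two versions fail on the same input is the product of their individual failure probabilities, and the experiment observed coincident failures far above that product. A short Python check (all numbers invented for illustration):

    # Under independence, P(two versions fail together) = p1 * p2.
    p1, p2 = 0.001, 0.001            # hypothetical individual failure rates
    expected_coincident = p1 * p2    # 1e-06 if failures were independent
    observed_coincident = 2e-4       # hypothetical observed coincident rate
    print(observed_coincident / expected_coincident)  # 200x the prediction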

A Reply to the Criticisms of the Knight and Leveson Experiment by John Knight and Nancy Leveson ACM Software Engineering Notes, January 1990 ( PDF ).

After years of ridiculous and mostly untrue statements about our original multi-version programming experiment (always in forums where we were unable to reply), we finally had had enough and decided to respond publicly. If nothing else, writing this paper had a cathartic effect.

Analysis of Faults in an N-Version Software Experiment by Susan Brilliant, John Knight, and Nancy Leveson. IEEE Trans. on Software Engineering, Vol. SE-16, No. 2, February 1990. (PDF)

More details about the actual errors found in the multiple version programs and an explanation of why they caused correlated failures.

The Consistent Comparison Problem in N-Version Programming by Susan Brilliant, John Knight, and Nancy Leveson. IEEE Trans. on Software Engineering, Vol. SE-15, No. 11, November 1989 ( PDF ).

During the multi-version programming experimentation, we identified a problem we called the Consistent Comparison Problem. In this paper we showed that when versions make comparisons involving the results of finite-precision calculations, it is impossible to guarantee the consistency of their results. Correct versions may therefore arrive at completely different outputs for an application that does not apparently have multiple correct solutions. If this problem is not dealt with explicitly, an N-version system may be unable to reach a consensus even when none of its component versions fail. We discuss potential solutions, none of which is entirely satisfactory.
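
The phenomenon is easy to reproduce in any finite-precision arithmetic. In this small Python fragment (a contrived computation, not one from the experiment), two algebraically equivalent versions of the same comparison reach different decisions:

    # Consistent Comparison Problem in miniature: two correct versions
    # compute the same sum in different orders, compare it to the same
    # threshold, and disagree because of floating-point rounding.
    THRESHOLD = 0.6

    def version_a(a, b, c):
        return (a + b + c) > THRESHOLD     # evaluates (a + b) + c

    def version_b(a, b, c):
        return (c + b + a) > THRESHOLD     # evaluates (c + b) + a

    a, b, c = 0.1, 0.2, 0.3
    print(version_a(a, b, c))  # True:  sums to 0.6000000000000001
    print(version_b(a, b, c))  # False: sums to exactly the double 0.6
    # A voter sees disagreement even though neither version is faulty.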

The Use of Self Checks and Voting in Software Error Detection: An Empirical Study by Nancy Leveson, Stephen Cha, John Knight, and Timothy Shimeall. IEEE Trans. on Software Engineering, Vol. SE-16, No. 4, April 1990. (PDF)

While we were on a roll, we decided to compare the use of self-checks (assertions) and voting (n-versions).
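
For readers unfamiliar with the two mechanisms, here is a toy Python contrast (invented code, not taken from the study): voting compares the outputs of independently written versions, while a self-check asserts a property that a single version's result must satisfy:

    # (1) N-version voting: run independent versions, take the majority.
    # (2) Self-check: an acceptance test on one version's own result.
    from collections import Counter

    def vote(*results):
        value, count = Counter(results).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority: versions disagree")
        return value

    def sqrt_with_self_check(x):
        root = x ** 0.5
        assert abs(root * root - x) < 1e-9, "self-check failed"
        return root

    print(vote(4.0, 4.0, 4.1))        # 4.0: majority masks one bad version
    print(sqrt_with_self_check(2.0))  # 1.414...: passes its own check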

An Empirical Comparison of Software Fault Tolerance and Fault Elimination by Timothy Shimeall and Nancy Leveson IEEE Trans. on Software Engineering, Vol. SE-17, No. 2, February 1991, pp. 173-183 ( PDF ).

Before bowing out gracefully (and bloodied) from the software fault-tolerance community and taking a break from running experiments, I decided to try one more. This paper compares the effectiveness of two software fault tolerance techniques (embedded self-checks and multi-version programming) with some common fault elimination techniques.



MISCELLANEOUS


High-Pressure Steam Engines and Computer Software by Nancy Leveson. Presented as a keynote address at the International Conference on Software Engineering in Melbourne, Australia, 1992, and published in IEEE Computer, October 1994. (PostScript) (PDF).

A comparison between the history of steam engine technology and software technology and what we can learn from the mistakes made with steam engines.

An Empirical Evaluation of the MC/DC Coverage Criterion on the HETE-2 Satellite Software, by Arnaud Dupuy (Alcatel) and Nancy Leveson (MIT), Digital Avionics Systems Conference (DASC), October 2000. (Postscript), (PDF).

In order to be certified by the FAA, airborne software must comply with the DO-178B standard. For the unit testing of safety-critical software, this standard requires the testing process to meet a source code coverage criterion called Modified Condition/Decision Coverage. This part of the standard is controversial in the aviation community, partially because of perceived high cost and low effectiveness. Arguments have been made that the criterion is unrelated to the safety of the software and does not find errors that are not detected by functional testing. In this paper, we present the results of an empirical study that compared functional testing and functional testing augmented with test cases to satisfy MC/DC coverage. The evaluation was performed during the testing of the attitude control software for the HETE-2 (High Energy Transient Explorer) scientific satellite (since that time, the software has been modified).
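
As a small, hypothetical illustration of what the criterion demands: for a decision such as (a and b) or c, MC/DC requires test cases showing that each condition independently flips the outcome, which takes roughly n+1 tests for n conditions:

    # MC/DC for the decision (a and b) or c: every condition must be
    # shown to independently affect the outcome.
    decision = lambda a, b, c: (a and b) or c

    mcdc_tests = [
        (True,  True,  False),  # -> True
        (False, True,  False),  # -> False (with test 1: 'a' flips outcome)
        (True,  False, False),  # -> False (with test 1: 'b' flips outcome)
        (True,  False, True),   # -> True  (with test 3: 'c' flips outcome)
    ]
    for t in mcdc_tests:
        print(t, "->", decision(*t))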

Baker Panel Report on Texas City Accident
