REQUIREMENTS SPECIFICATION AND ANALYSIS
Intent Specifications: An Approach to Building Human-Centered
Specifications by Nancy Leveson, IEEE Trans. on Software
Engineering, January 2000.
(PostScript)
(PDF )
This paper proposes an approach to writing software
specifications, based on research in systems theory, cognitive
psychology, and human-machine interaction. The goal is to provide
specifications that support human problem solving and the tasks
that humans must perform in software development and evolution. A
type of specification, called Intent Specifications, is
constructed upon this underlying foundation.
Making Embedded Software Reuse Practical and Safe
Nancy Leveson and Kathryn Anne Weiss. Proceedings of
Foundations of Software Engineering, November 2004.
(PDF)
Reuse of application software has been limited and sometimes
has led to accidents. This paper suggests some requirements for
successful and safe application software reuse and demonstrates
them using a case study on a real spacecraft.
Advanced System and Safety Engineering Environments
by Nancy Leveson. This is an annotated PowerPoint
presentation on SpecTRM and SpecTRM-RL.
(Annotated PowerPoint Slides)
The notes plus the slides describe the SpecTRM tools and
environment for building complex safety-critical systems.
Reusable Specification Components for Model-Driven
Development by Kathryn Anne Weiss, Elwin C. Ong, and
Nancy G. Leveson. Proceedings of the International Conference
on System Engineering (INCOSE '03), July 2003.
(PDF)
Modern, complex control systems for a specific application
domain often display common system design architectures with similar
subsystem functionality and interactions, making them suitable for
representation by a reusable specification architecture. For example,
every spacecraft requires attitude determination and control, power,
thermal, communications, and propulsion subsystems. The similarities
between these subsystems in most spacecraft can be exploited to create
a model-driven system development environment in which generic reusable
specifications and models can be tailored for the specific spacecraft
design, executed and validated in a simulation environment, and then
either manually or automatically transformed into software or hardware.
Modifications to software and hardware during operations can be
similarly made in the same controlled way, that is, starting from a
model, validating the change, and finally implementing the change.
The approach is illustrated using a spacecraft attitude determination
and control subsystem.
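The tailoring idea can be sketched in a few lines of code. The fragment
below is a rough illustration only, not the paper's SpecTRM notation: a
generic attitude-control specification component whose mission-specific
parameters are filled in and then checked against generic constraints
before simulation. All names, fields, and constraints are invented for
this example.

    from dataclasses import dataclass

    @dataclass
    class AttitudeControlSpec:
        """Generic attitude determination and control (ADCS) component."""
        pointing_accuracy_deg: float   # mission-specific requirement
        actuator_type: str             # e.g., "reaction_wheel" or "thruster"
        sensor_suite: tuple            # e.g., ("star_tracker", "gyro")

        def validate(self):
            """Check the tailored values against generic, reusable constraints."""
            issues = []
            if self.pointing_accuracy_deg <= 0:
                issues.append("pointing accuracy must be positive")
            if self.pointing_accuracy_deg < 0.5 and "star_tracker" not in self.sensor_suite:
                issues.append("fine pointing requires a star tracker")
            return issues

    # Tailor the generic component for one (hypothetical) spacecraft design.
    spec = AttitudeControlSpec(0.1, "reaction_wheel", ("star_tracker", "gyro"))
    assert spec.validate() == []       # ready for simulation and validation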
Fault Protection in a Component-Based Spacecraft Architecture
by Elwin C. Ong and Nancy G. Leveson. Proceedings of the
International Conference on Space Mission Challenges for
Information Technology, Pasadena, July 2003.
(DOC)
As spacecraft become more complex and autonomous, the need for
reliable fault protection will become more prevalent. When coupled
with the additional requirement of limiting cost, the task of
implementing fault protection on spacecraft becomes extremely
challenging. This paper describes how domain knowledge about
spacecraft fault protection can be captured and stored in a reusable,
component-based spacecraft architecture. The spacecraft-level fault
protection strategy can then be created by composing generic component
specifications, each with component-level fault protection included.
The resulting design can be validated by formal analysis and
simulation before any costly implementation begins. As spacecraft
technology improves, new generic fault protection logic may be added,
allowing active improvements to be made to the foundation.
Completeness in Formal Specification Language Design for
Process Control Systems by Nancy G. Leveson. Proceedings
of the Formal Methods in Software Practice Conference, August 2000.
(Postscript),
(PDF).
This paper shows how the information required by the completeness
criteria we defined for blackbox requirements specification can
be embedded in the syntax of a formal specification language.
This paper is a companion paper for the one that follows, but this
one was written later and contains the current definition of
the SpecTRM-RL requirements specification language.
On the Use of Visualization in Formal Requirements
Specification by Nicolas Dulac, Thomas Viguier, Nancy
Leveson, and Margaret-Anne Storey. International
Conference on Requirements Engineering, Essen, September 2002.
(PDF)
A limiting factor in the industrial acceptance of formal
specifications is their readability, particularly for large,
complex engineering systems. We hypothesize that multiple
visualizations generated from a common model will
improve the requirements creation, reviewing, and understanding
process. Visual representations, when effective, provide cognitive
support by highlighting the most relevant interactions and aspects
of a specification for a particular use. In this paper, we propose
a taxonomy and some preliminary principles for designing visual
representations of formal specifications. The taxonomy and
principles are illustrated by sample visualizations we created
while trying to understand a formal specification of the MD-11
Flight Management System.
Investigating the Readability of State-Based Formal
Requirements Specification Languages by Marc Zimmerman, Kristina
Lundqvist, Nancy Leveson. International Conference on Software
Engineering, Orlando, May 2002.
(PDF)
The readability of formal requirements specification languages
is hypothesized as a limiting factor in the acceptance of formal
methods by the industrial community. An empirical study was
conducted to determine how various factors of state-based
requirements specification language design affect readability
using aerospace applications. Six factors were tested in all,
including the representation of the overall state machine
structure, the expression of triggering conditions, the use of
macros, the use of internal broadcast events, the use of
hierarchies, and transition perspective (going-to or coming-from).
Subjects included computer scientists as well as aerospace
engineers in an effort to determine whether background affects
notational preferences. Because so little previous experimentation
on this topic exists on which to build hypotheses, the study was
designed as a preliminary exploration of what factors are most
important with respect to readability. It can serve as a starting
point for more thorough and carefully controlled experimentation
in specification language readability.
Reducing the Effects of Requirements Changes through System
Design by Israel Navarro, Nancy Leveson, and Kristina
Lundqvist, MIT SERL Technical Report, 2001.
(PDF)
The continuous stream of requirements changes that often takes place
during software development can create major problems in the development
process. This paper defines a concept we call semantic coupling
that, along with features of intent specifications, can be used during
system design to reduce the impact of changing requirements. The
practicality of using the approach on real software is demonstrated using
the intent specification of the control software for a NASA robot designed
to service the heat resistant tiles on the Space Shuttle.
Making Formal Methods Practical by Marc Zimmerman, Mario
Rodriguez, Benjamin Ingram, Masafumi Katahira, Maxime de Villepin,
Nancy Leveson. Digital Avionics Systems Conference, Oct. 2000.
(Postscript),
(PDF).
Despite their potential, formal methods have had difficulty
gaining acceptance in the industrial sector. Some doubts are
based on their supposed impracticality or long learning curve.
Contributing to this skepticism is the fact that some types of
formal methods have not yet been proven to handle systems of
realistic complexity.
To learn more about how to design formal specification languages
that can be used for complex systems and require minimal training,
we developed a formal specification of an English language
specification of a vertical flight control system similar to that
found in the MD-11. This paper describes the lessons learned
from this experience. A companion paper (below) describes how
the model can be used in human-computer interaction and analysis
and pilot task analysis.
Designing Specification Languages for Process Control Systems:
Lessons Learned and Steps to the Future, by
Nancy G. Leveson, Mats Heimdahl, and Jon Damon Reese.
Presented at SIGSOFT FOSE '99 (Foundations of Software
Engineering), Toulouse, September 1999.
(Postscript),
(PDF).
Previously we defined a blackbox formal system modeling
language called RSML (Requirements State Machine Language).
The language was developed over several years while specifying
the system requirements for a collision avoidance system for
commercial passenger aircraft. During the language development,
we received continual feedback and evaluation by FAA employees
and industry representatives, which helped us to produce a
specification language that is easily learned and used by
application experts. Since the completion of the RSML project,
we have continued our research on specification languages. This
research is part of a larger effort to investigate the more
general problem of providing tools to assist in developing
embedded systems. Our latest experimental toolset is called
SpecTRM (Specification Tools and Requirements Methodology), and
the formal specification language is SpecTRM-RL (SpecTRM
Requirements Language). This paper describes what we have learned
from our use of RSML and how those lessons were applied to the
design of SpecTRM-RL. We discuss our goals for SpecTRM-RL and
the design features that support each of these goals.
Completeness and Consistency in Hierarchical State-Based
Requirements by Mats P.E. Heimdahl and Nancy Leveson.
Published in IEEE Transactions on Software Engineering
(May 1996). (PostScript)
(PDF )
This paper describes automated methods for analyzing
RSML specifications for completeness and consistency. Results
are presented from the application of these methods to TCAS
II.
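Concretely, for the transitions out of a single state, completeness
requires that some guarding condition be true for every possible input
combination, and consistency requires that at most one be true. Below is
a minimal sketch of both checks, using invented guards over a two-bit
input space (real RSML guards are AND/OR tables over many inputs); it
flags one nondeterministic input combination and one with no defined
behavior.

    from itertools import product

    # Hypothetical guards on the transitions out of one state, written as
    # predicates over two boolean inputs.
    guards = {
        "to_CLIMB":   lambda alt_low, cmd_up: alt_low and cmd_up,
        "to_HOLD":    lambda alt_low, cmd_up: not alt_low,
        "to_DESCEND": lambda alt_low, cmd_up: (not alt_low) and cmd_up,
    }

    for inputs in product([False, True], repeat=2):
        enabled = [name for name, g in guards.items() if g(*inputs)]
        if not enabled:            # completeness violation: no behavior defined
            print(f"incomplete: no transition enabled for inputs {inputs}")
        elif len(enabled) > 1:     # consistency violation: nondeterminism
            print(f"inconsistent: {enabled} simultaneously enabled for {inputs}")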
Requirements Specification for Process-Control Systems
by Nancy G. Leveson, Mats P.E. Heimdahl, Holly Hildreth, and
Jon D. Reese. Published in IEEE Transactions on Software
Engineering (Sept. 1994)
(PostScript).
(PDF )
Introduces RSML and the RSML requirements specification of
TCAS II, an aircraft collision-avoidance
system that motivated RSML's development.
An Intent Specification Model for a Robotic Software Control
System , by Israel Navarro, Kristina Lundqvist, and Nancy
Leveson, DASC '01.
(PDF).
This paper shows a sample intent specification for an industrial
robot designed to service the heat resistant tiles on the Space
Shuttle.
Software Deviation Analysis: A ``Safeware'' Technique
by Jon Damon Reese and Nancy G. Leveson. AIChe 31st Annual
Loss Prevention Symposium, Houston, TX, March 1997.
(PostScript)
(PDF).
This paper describes
one of the Safeware hazard analysis techniques, Software Deviation
Analysis, that incorporates the beneficial features of HAZOP
(such as guidewords, deviations, exploratory analysis, and a systems
engineering approach) into an automated procedure that is capable
of handling the complexity and logical nature of computer software.
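The guideword idea can be suggested with a toy example. SDA itself
operates on formal requirements models rather than code, and the
deviations, controller, and hazard criterion below are all invented for
illustration: each input deviation is applied and the control logic is
checked for a hazardous output.

    # HAZOP-style guidewords applied to one software input.
    GUIDEWORDS = {
        "omitted":  lambda v: None,       # the reading never arrives
        "too_low":  lambda v: v * 0.5,    # sensor under-reports
        "too_high": lambda v: v * 2.0,    # sensor over-reports
    }

    def valve_command(pressure):
        """Simplified controller: open the relief valve above 100 psi."""
        if pressure is None:
            return "hold"                 # no reading: keep the last command
        return "open" if pressure > 100.0 else "closed"

    actual = 120.0                        # the plant really is over-pressure
    for word, deviate in GUIDEWORDS.items():
        cmd = valve_command(deviate(actual))
        print(f"{word:8s} -> command={cmd} hazardous={cmd != 'open'}")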
Software Deviation Analysis
by Jon Damon Reese and Nancy G. Leveson,
International Conference on Software Engineering, Boston, 1997.
(PDF).
A longer and more technically detailed paper on SDA
than the one above.
Integrated Safety Analysis of Requirements
Specifications, by Francesmary Modugno, Nancy G. Leveson,
Jon D. Reese, Kurt Partridge, and Sean D. Sandys.
Requirements Engineering '97.
(Postscript).
(PDF).
This paper describes the application of manual and automated
safety analysis techniques to a prototype of an aircraft
guidance system.
Software Requirements Analysis for Real-Time Process
Control by Matt Jaffe, Nancy Leveson, Mats Heimdahl,
and Bonnie Melhart.
IEEE Trans. on Software Engineering, March 1991.
(PDF)
SOFTWARE SYSTEM SAFETY
- Safeware: System Safety and Computers by Nancy
Leveson. Published by Addison Wesley (1995). (HTML Table of Contents)
This book examines past accidents and what is currently known
about building safe electromechanical systems to see what
lessons can be applied to new computer-controlled systems.
Most accidents are not the result of unknown scientific
principles but rather of a failure to apply well-known,
standard engineering practices. In addition, accidents will
not be prevented by technological fixes alone, but will
require control of all aspects of the development and
operation of the system. A methodology for building
safety-critical systems is outlined.
- Software Challenges in Achieving Space Safety
by Nancy Leveson. Journal of the British
Interplanetary Society, Vol. 62, 2009.
(DOC)
Techniques developed for hardware reliability and
safety do not work on software-intensive systems; software does
not satisfy the assumptions underlying these techniques. The paper
first describes the new problems and why the current approaches are
not effective for complex, software-intensive systems.
Then a new approach to hazard analysis and safety-driven design
is presented. Rather than being based on reliability theory, as
most current safety engineering techniques are, the new approach
builds on system and control theory.
- A Systems-Theoretic Approach to Safety in Software-Intensive
Systems by Nancy Leveson. IEEE Trans. on Dependable and
Secure Computing, January 2005. (PDF)
Traditional accident models were devised to explain losses
caused by failures of physical devices in relatively simple
systems. They are less useful for explaining accidents in
software-intensive systems and for non-technical aspects of
safety such as organizational culture and human decision-making.
This paper describes how systems theory can be used to form
new accident models that better explain system accidents
(accidents arising from the interactions among components rather
than individual component failure), software-related accidents,
and the role of human decision-making. Such models consider the
social and technical aspects of systems as one integrated process
and may be useful for other emergent system properties such as
security. The loss of a Milstar satellite being launched by
a Titan/Centaur launch vehicle is used as an illustration of
the approach.
- A New Approach to Hazard Analysis for Complex Systems
by Nancy Leveson. Int. Conference of the System Safety
Society, Ottawa, August 2003.
(DOC)
This paper describes a new hazard analysis approach, called
STPA (STAMP-based Analysis), based on a new model of accidents
called STAMP. The paper briefly describes STPA and illustrates
it with an aircraft collision avoidance system.
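A minimal sketch of the underlying framing, using an invented lander
example rather than the paper's collision avoidance system: an unsafe
control action (UCA) is a control command issued in a context that
violates a safety constraint, and the analysis enumerates such
command/context pairs. Names and values here are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class ProcessState:
        """Snapshot of the controlled process (invented for illustration)."""
        altitude_m: float
        engines_on: bool

    def uca_engines_off_too_soon(command, state):
        """UCA: 'engines_off' provided while the vehicle is still descending,
        i.e., the premature-shutdown scenario mentioned elsewhere on this page."""
        return command == "engines_off" and state.altitude_m > 0.0

    # An analyst would build such checks for every control action and
    # context; here we test one command against two contexts.
    assert uca_engines_off_too_soon("engines_off", ProcessState(40.0, True))
    assert not uca_engines_off_too_soon("engines_off", ProcessState(0.0, True))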
- Model-Based Analysis of Socio-Technical Risk
by Nancy Leveson. Technical Report, Engineering Systems
Division, Massachusetts Institute of Technology, June 2002
(DOC)
In this report, a new type of hazard analysis, based on the STAMP
model of accident causation, is described. STPA (STAMP-based
Analysis) is illustrated by applying it to TCAS II, a complex
aircraft collision avoidance system, and to a public water
safety system in Canada. The TCAS II results are compared with
a high-quality fault tree created by MITRE for the FAA. The STPA
analysis was found to be more comprehensive and complete than the
fault tree analysis. The integration of STPA, SpecTRM-RL system
engineering tools, and system dynamics modeling creates the
potential for a simulation and analysis environment to support
and guide the initial technical and operational system design
as well as organizational and management policy design. The
results of STPA analysis can also be used to support
organizational learning and performance monitoring throughout
the system's life cycle so that degradation of safety and
increases in risk can be detected before a catastrophe results.
- An Approach to Design for Safety in Complex Systems
by Nicolas Dulac and Nancy Leveson. Int. Conference
on System Engineering (INCOSE '04), Toulouse, June 2004.
(PDF)
Most traditional hazard analysis techniques rely on discrete
failure events that do not adequately handle software-intensive
systems or system accidents resulting from dysfunctional
interactions between system components. This paper demonstrates
a methodology where a hazard analysis based on the STAMP accident
model is performed together with the system development process
to design for safety in a complex system. Unlike traditional
hazard analyses, this approach considers system accidents,
organizational factors, and the dynamics of complex systems.
The analysis is refined as the system design progresses and
produces safety-related information to help system engineers
in making design decisions for complex safety-critical systems.
The preliminary design of a Space Shuttle Thermal Tile
Processing System is used to demonstrate the approach.
- Incorporating Safety Risk in Early System Architecture Trade
Studies by Nicolas Dulac and Nancy Leveson. AIAA Journal of
Spacecraft and Rockets, Vol. 46, No. 2, March-April 2009.
(DOC)
Evaluating risk early in concept design is difficult due to the
lack of information available at that early stage. This paper
describes the approach we developed to perform a preliminary risk
evaluation to use in the trade studies by MIT and Draper Laboratory
for concept evaluation and refinement of the new NASA Space
Exploration Initiative.
- Demonstration of a Safety Analysis on a Complex System
by N. Leveson, L. Alfaro, C. Alvarado, M. Brown, E.B. Hunt,
M. Jaffe, S. Joslyn, D. Pinnel, J. Reese, J. Samarziya,
S. Sandys, A. Shaw, Z. Zabinsky. Presented at the Software
Engineering Laboratory Workshop, NASA Goddard, December 1997.
This paper describes a demonstration of the Safeware
methodology on the Center-TRACON Automation System (CTAS) portion
of the air traffic control system and procedures currently
employed at the Dallas/Fort Worth TRACON
(Postscript)
(PDF ).
The complete report can be found here:
(Postscript) or
(PDF).
- Use of SpecTRM in Space Applications
by Masafumi Katahira (NASDA) and Nancy Leveson. This
paper was presented at the 19th International System Safety
Conference, Huntsville, Alabama, September 2001.
(.doc (Word)).
This paper provides an introduction to the application of
SpecTRM (Specification Tools and Requirements Methodology) to
safety-critical software in spacecraft controllers. The
SpecTRM toolset includes modeling of the behavior of
safety-critical software and its operation, while generating and
maintaining significant safety information. We studied its
applicability and effectiveness for safety-critical controllers
on the International Space Station. Errors in the original
requirements specifications of the Japanese Experimental Module
(JEM) found during the modeling process are described.
- A Safety and Human-Centered Approach to Developing New
Air Traffic Management Tools by Nancy Leveson, Maxime de
Villepin, Mirna Daouk, John Bellingham, Jayakanth Srinivasan,
Natasha Neogi, and Ed Bachelder (MIT) and Nadine Pilon and
Geraldine Flynn (Eurocontrol). This paper will be presented
at ATM 2001, Albuquerque NM, December 2001.
( PDF ).
This paper describes a safety-driven, human-centered process
for designing and integrating new components into an airspace
management system. The general design of a conflict detection
function currently being evaluated by Eurocontrol is being used as
the testbed for the methodology, although the details differ
somewhat. The development and evaluation approach proposed is
based on the principle that critical properties must be designed
into a system from the start. As a result, our methodology
integrates safety analysis, functional decomposition and allocation,
and human factors from the very beginning of the system development
process. It also emphasizes using both formal and informal
modeling to accumulate the information needed to make tradeoff
decisions and ensure that desired system qualities are satisfied
early in the design process when changes are easier and less
costly. The formal modeling language was designed with readability
as a primary criterion and therefore the models can act as an
unambiguous communication medium among the developers and
implementers. The methodology is supported by a new specification
structuring approach, called Intent Specifications, that supports
traceability and documentation of design rationale as the
development process proceeds.
- Integrated Safety Analysis of Requirements
Specifications, by Francesmary Modugno, Nancy G. Leveson,
Jon D. Reese, Kurt Partridge, and Sean D. Sandys.
Requirements Engineering '97.
(Postscript).
(PDF).
This paper describes the application of manual and automated
safety analysis techniques to a prototype of an aircraft
guidance system.
-
System Safety in Computer-Controlled Automotive Systems, by
Nancy G. Leveson, SAE Congress, March, 2000.
(Postscript),
(PDF).
An invited paper that summarizes the state of the art in software
system safety and suggests some approaches possible for the
automotive and other industries.
- Software Deviation Analysis: A ``Safeware'' Technique
by Jon Damon Reese and Nancy G. Leveson. AIChe 31st Annual Loss
Prevention Symposium, Houston, TX, March 1997.
(PostScript)
(PDF).
This paper describes
one of the Safeware hazard analysis techniques, Software Deviation
Analysis, that incorporates the beneficial features of HAZOP
(such as guidewords, deviations, exploratory analysis, and a systems
engineering approach) into an automated procedure that is capable
of handling the complexity and logical nature of computer software.
- The Therac-25 Accidents by Nancy G. Leveson.
(Postscript ) or
(PDF).
This paper is an updated version of the original IEEE
Computer (July 1993) article. It also appears in the appendix
of my book.
- The following papers are not currently available in
electronic form:
Leveson, N.G. and P.R. Harvey. ``Analyzing Software Safety,''
IEEE Transactions on Software Engineering, vol. SE-9,
no. 5, 1983.
Leveson, N.G. and Stolzy, J.L. ``Safety Analysis Using
Petri Nets,'' IEEE Trans. on Software Engineering,
Vol. SE-13, No. 3, March 1987, pp. 386-397.
Leveson, N.G. ``Software Safety in Embedded Computer
Systems,'' Communications of the ACM, February, 1991.
Leveson, N.G., Cha, S.S., Shimeall, T.J. ``Safety
Verification of Ada Programs using Software Fault Trees,''
IEEE Software, July 1991.
SYSTEM SAFETY AND ACCIDENT MODELS
Modeling and Hazard Analysis using STPA
by Takuto Ishimatsu, Nancy Leveson, John Thomas, Masa Katahira,
Yuko Miyamoto, Haruka Nakao. Presented at the Conference of the
International Association for the Advancement of Space Safety,
Huntsville, Alabama, May 2010.
( DOC )
A joint research project between MIT and JAXA/JAMSS is investigating
the application of a new hazard analysis technique, called STPA, to
the system and software in the HTV. STPA is based on systems theory
rather than reliability theory. It treats safety as a control problem
rather than a failure problem. Traditional hazard analysis focuses
on component failures but software does not fail in this way. Software
most often contributes to accidents by commanding the spacecraft into
an unsafe state (e.g., turning off the descent engines prematurely)
or by not issuing required commands. That makes the standard hazard
analysis techniques of limited usefulness for software-intensive
systems, which describes most spacecraft built today.
This paper describes the experimental application of STPA to the
JAXA HTV (unmanned cargo transfer vehicle to the International
Space Station). Because the HTV was originally developed using fault
tree analysis and following the NASA standards for safety-critical
systems, the results of our experimental application of STPA can
be compared with these more traditional safety engineering
approaches in terms of the problems identified and the resources
required to use it.
Applying Systems Thinking to Analyze and Learn from Events
by Nancy Leveson, presented at NeTWorK 2008: Event Analysis and
Learning from Events, Berlin, August 2008.
(DOC )
Why don't the approaches we use to learn from events, most of which
go back for decades and have been incrementally improved over time,
work well in today's world? Maybe the answer can be found by
reexamining the underlying assumptions and paradigms in safety and
identifying any potential disconnects with the world as it exists
today. While abstractions and simplifications are useful in dealing
with complex systems and problems, those that are counter to reality
can hinder us from making forward progress. Most of the new research
in this field never questions these assumptions and paradigms. It is
important to devote some effort to examining our foundations, which
is what I try to do in this paper. There are too many beliefs in
accident analysis---starting with the assumption that analyzing
events and learning from them is adequate---that are accepted
without question.
A Safety-Driven, Model-Based System Engineering
Methodology, Part I by Margaret Stringfellow Herring,
Brandon D. Owens, Nancy Leveson, Michel Ingham, and Kathryn
Ann Weiss. MIT Technical Report, December 2007.
(PDF )
The final report for a JPL grant to demonstrate a safety-driven,
model-based system engineering methodology on a JPL spacecraft.
In this methodology, safety is folded into and drives the design
process rather than being conducted as a separate activity. The
methodology integrates MIT's STAMP accident model and the
hazard analysis method based on it (called STPA), intent
specifications (a structured system engineering specification
framework and model-based specification language), and JPL's
State Analysis (a system modeling approach).
A Safety-Driven, Model-Based System Engineering
Methodology, Part II: Application of the Methodology to an
Outer Planet Exploration Mission by Brandon D. Owens,
Margaret Stringfellow Herring, Nancy Leveson, Michel Ingham,
and Kathryn Ann Weiss. MIT Technical Report, December 2007.
(Word )
A sample intent specification created for an Outer Planets
Explorer spacecraft as part of a safety-driven, model-based
system engineering demonstration for JPL.
Application of a Safety-Driven Design Methodology to
An Outer Planet Exploration Mission by Brandon D. Owens,
Margaret Stringfellow Herring, Nicolas Dulac, Nancy Leveson,
Michel Ingham, and Kathryn Ann Weiss. IEEE Aerospace Conference,
Big Sky, Montana, March 2008.
(PDF )
A conference paper summarizing the two JPL reports above, for
readers who want an overall description without all the
details and examples.
A Comparative Look at MBU Hazard Analysis Techniques
by Brandon Owens and Nancy Leveson. 2006 MAPLD (Military and
Aerospace Programmable Logic Device) International Conference,
Washington, D.C., September 2006.
(PDF )
The flux of radiation particles encountered by a spacecraft is a
phenomenon that can largely be understood statistically. However, the
same cannot be said for the interactions of these particles with the
spacecraft as they are far more challenging to grasp and guard against.
The ultimate impact of a radiation particle's interaction with a spacecraft
depends on factors that often extend beyond the purview of any subject
matter expert and typically cannot be represented quantitatively in
system-level trade studies without the acceptance of numerous assumptions.
In this paper, many of the assumptions associated with the probabilistic
assessment of the system-level effects of a specific type of
radiation-induced hazard, a Multiple Bit Upset (MBU), are explored in
the light of MBU events during the Gravity Probe B, Cassini, and X-ray
Timing Explorer missions. These events highlight key problems in using
probabilistic, quantitative analysis techniques for hazards in highly
complex and unique systems such as spacecraft. As a result, a case is
made for the use of system-level, qualitative techniques for both the
identification of potential system-level hazards and the justification
of responses to them in the system design.
Safety in Integrated Systems Health Engineering and
Management by Nancy Leveson. NASA Ames Integrated System
Health Engineering and Management Forum (ISHEM), Napa, November 2005.
(DOC )
This paper describes the state of the art in system safety engineering
and management along with new models of accident causation, based on systems
theory, that may allow us to greatly expand the power of the techniques and
tools we use. The new models consider hardware, software, humans, management
decision-making, and organizational design as an integrated whole. New
hazard analysis techniques based on these expanded models of causation provide
a means for obtaining the information necessary to design safety into the
system and to determine which are the most critical parameters to monitor
during operations and how to respond to them. The paper first describes and
contrasts the current system safety and reliability engineering approaches
to safety and the traditional methods used in both these fields. It then
outlines the new system-theoretic approach being developed in Europe and
the U.S. and the application of the new approach to aerospace systems,
including a recent risk analysis and health assessment of the NASA manned
space program management structure and safety culture that used the new
approach.
A New Accident Model for Engineering Safer Systems
by Nancy Leveson. Safety Science, Vol. 42, No. 4,
April 2004.
(PDF )
A new model of accidents is proposed based on systems theory.
Systems are viewed as interrelated components that are kept in a state of
dynamic equilibrium by feedback loops of information and control. Accidents
result from inadequate control or enforcement of safety-related constraints
on the system. Instead of defining safety management in terms of preventing
component failure events, it is defined as a continuous control task to
impose the constraints necessary to limit system behavior to safe changes
and adaptations. Accidents can be understood, using this model, in terms
of why the controls that were in place did not prevent or detect maladaptive
changes, that is, by identifying the safety constraints that were violated
and determining why the controls were inadequate in enforcing them.
This model provides a theoretical foundation for the introduction of
unique new types of accident analysis, hazard analysis, design for safety,
risk assessment techniques, and approaches to designing performance
monitoring and safety metrics.
Safety and Risk Driven Design in Complex Systems of
Systems by Nancy Leveson and Nicolas Dulac. Presented at the
1st NASA/AIAA Space Exploration Conference, Orlando, February 2005.
(DOC )
This paper briefly describes STAMP and shows (1) how it can be
applied to accident/incident (root cause) analysis, using the
Titan/Milstar loss, and (2) how a new STAMP-based hazard analysis
technique called STPA works, using an industrial robot example.
Applying STAMP in Accident Analysis
by Nancy Leveson, Mirna Daouk, Nicolas Dulac, and
Karen Marais,
Workshop on Investigation and Reporting of Incidents and
Accidents (IRIA), September 2003.
(PDF )
This paper shows how STAMP can be applied to accident
analysis using three different models of the accident
process and proposes a notation for describing this process.
The models are illustrated using a case study of a water
contamination accident in Walkerton, Canada.
The Analysis of a Friendly Fire Accident Using a Systems
Model of Accidents.
by Nancy Leveson.
International Conference of the System Safety Society, 2002.
(PDF )
An example of my new accident model applied to a friendly
fire accident in the Iraqi No-Fly Zone in 1994.
The Role of Software in Spacecraft Accidents
by Nancy Leveson. This paper appeared in the AIAA Journal
of Spacecraft and Rockets, Vol. 41, No. 4, July 2004.
(PDF )
The first and most important step in solving any problem is
understanding the problem well enough to create effective solutions.
To this end, several software-related spacecraft accidents were studied
to determine common systemic factors. Although the details in each
accident were different, very similar factors related to flaws in the
safety culture, the management and organization, and technical
deficiencies were identified. These factors include complacency and
discounting of software risk, diffusion of responsibility and authority,
limited communication channels and poor information flow, inadequate
system and software engineering (poor or missing specifications,
unnecessary complexity and software functionality, software reuse without
appropriate safety analysis, violation of basic safety engineering
practices in the digital components), inadequate review activities,
ineffective system safety engineering, flawed test and simulation
environments, and inadequate human factors engineering. Each of
these factors is discussed along with some recommendations on how to
eliminate them in future projects.
Evaluating Accident Models using Recent Aerospace
Accidents (Part 1: Event-Based Models) by Nancy Leveson
(PDF )
A report written for NASA Ames and the NASA Software IV&V Facility
evaluating common event-based accident models and identifying
underlying systemic factors in 8 aerospace accidents. Warning:
the report is 140 pages so you might want to look at it before
printing it. There is an executive summary that summarizes the
overall contents, and Chapter 4 summarizes what was learned about
the accident models and also the common factors identified in the
accidents. The paper listed immediately above summarizes the
factors found in the spacecraft accidents.
An Analysis of Causation in Aerospace Accidents
by Kathryn Weiss, Nancy Leveson, Kristina Lundqvist, Nida
Farid, and Margaret Stringfellow. Presented at Space 2001,
Albuquerque, New Mexico, August 2001.
( DOC )
This paper describes the causal factors in the mission interruption
of the SOHO (SOlar Heliospheric Observatory) spacecraft using the
hierarchical model introduced in the NASA report listed above.
The factors in this accident are similar to common factors found
in other recent software-related aerospace losses.
ORGANIZATIONAL and CULTURAL ISSUES IN SAFETY
Demonstration of a New Dynamic Approach to Risk Analysis for
NASA's Constellation Program by Nicolas Dulac, Brandon Owens, Nancy
Leveson, Betty Barrett, John Carroll, Joel Cutcher-Gershenfeld, Stephen
Friedenthal, Joseph Laracy, and Joseph Sussman.
Final Report to the NASA Exploration Systems Mission Directorate
Associate Administrator, March 2007.
(PDF)
Effective risk management in the development of complex aerospace
systems requires the balancing of multiple risk components including
safety, cost, performance, and schedule. Safety considerations are
especially critical during system development because it is very
difficult to design or "inspect" safety into a system during operation.
This report describes the results of an MIT Complex Systems Research
Laboratory (CSRL) study conducted at the request of the NASA Exploration
Systems Mission Directorate (ESMD) to evaluate the usefulness of a new
model of accident causation (STAMP) and STAMP-based system dynamics
models in the development of new spacecraft systems. In addition to
fulfilling the specific needs of ESMD, the study is part of an on-going
effort by the MIT CSRL to develop and refine techniques for modeling
and treating organizational safety culture as a dynamic control problem.
Technical and Managerial Factors in the NASA Challenger
and Columbia Losses: Looking Forward to the Future
by Nancy Leveson, in Handelsman and Kleinman (editors),
Controversies in Science and Technology (to appear),
University of Wisconsin Press, 2007.
(DOC )
This essay examines the technical and organizational factors leading to
the Challenger and Columbia accidents and what we can learn from them.
While accidents are often described in terms of a chain of directly
related events leading to a loss, examining this event chain does not
explain why the events themselves occurred. In fact, accidents are
better conceived as complex processes involving indirect and non-linear
interactions among people, societal and organizational structures,
engineering activities, and physical system components. They are rarely
the result of a chance occurrence of random events, but usually result
from the migration of a system (organization) toward a state of high
risk where almost any deviation will result in a loss. Understanding
enough about the Challenger and Columbia accidents to prevent future
ones, therefore, requires not only determining what was wrong at the
time of the losses, but also why the high standards of the Apollo
program deteriorated over time and allowed the conditions cited by
the Rogers Commission as the root causes of the Challenger loss and
why the fixes instituted after Challenger became ineffective over
time, i.e., why the manned space program has a tendency to migrate
to states of such high risk and poor decision-making processes that
an accident becomes almost inevitable.
What System Safety Engineering can Learn from the Columbia Accident
by Nancy Leveson and Joel Cutcher-Gershenfeld, Int. Conference of
the System Safety Society, Providence Rhode Island, August 2004.
(PDF )
Many of the dysfunctionalities in the system safety program at
NASA contributing to the Columbia accident can be seen in other
groups and industries. This paper summarizes some of the lessons
we can all learn from this tragedy. While there were many factors
involved in the loss of the Columbia Space Shuttle, this paper
concentrates on the role of system safety engineering and what can
be learned about effective (and ineffective) safety efforts.
Risk Analysis of the NASA Independent Technical Authority
by Nancy Leveson and Nicolas Dulac with contributions by Joel
Cutcher-Gershenfeld, John Carroll, Betty Barrett and Stephen
Friedenthal.
(DOC )
The application of STAMP and STPA to an organizational risk analysis.
Modeling, Analyzing, and Engineering NASA's Safety Culture
by Nancy Leveson (with Nicolas Dulac, David Zipkin,
Joel Cutcher-Gershenfeld, Betty Barrett, and John Carroll),
Final Report of a Phase 1 NASA/USRA research grant
(PDF )
This is the final report on Phase 1 (5 months) of a research grant
on modeling and analyzing NASA's safety culture using STAMP and
system dynamics models. We used the NASA manned space program as our
testbed.
Moving Beyond Normal Accidents and High Reliability
Organizations: An Alternative Approach to Safety in Complex Systems
by Nancy Leveson, Karen Marais, Nicolas Dulac, and John
Carroll, to appear in Organization Studies (Sage Publishers).
(DOC)
Organizational factors play a role in all accidents and are a
critical part of understanding and preventing them. Two prominent
sociological schools of thought have addressed the organizational
aspects of safety: Normal Accident Theory and High Reliability
Organizations (HRO). In this paper, we argue that the conclusions
of HRO researchers are limited in their applicability and usefulness
to complex, high-risk systems and that following some of their
recommendations could actually contribute to accidents. Normal
Accident Theory, on the other hand, does recognize the difficulties
involved but is unnecessarily pessimistic about the possibility
of effectively dealing with them. An alternative systems
approach to safety is described.
Effectively Addressing NASA's Organizational and Safety Culture:
Insights from System Safety and Engineering Systems by
Nancy Leveson, Joel Cutcher-Gershenfeld,
Betty Barrett, Alexander Brown, John Carroll, Nicolas Dulac,
Lydia Fraile, Karen Marais. MIT ESD Symposium, March 2004
(Word)
This paper illustrates some aspects of the changes required for
a realignment of social systems as recommended by the Columbia
Accident Investigation Board (CAIB). The paper focuses on three
aspects of social systems at NASA: organizational structure;
organizational subsystems and social interaction processes
(communication systems, leadership, and information systems);
and capability and motivation. Issues of organizational vision,
strategy, and culture are woven throughout the analysis.
Archetypes for Organizational Safety by Karen Marais
and Nancy G. Leveson. Proceedings of the Workshop on
Investigation and Reporting of Incidents and Accidents,
September 2003.
(PDF)
We propose a framework using system dynamics to model the
dynamic behavior of organizations in accident analysis. Most
current accident analysis techniques are event-based and do not
adequately capture the dynamic complexity and non-linear
interactions that characterize accidents in complex systems. In
this paper, we propose a set of system safety archetypes that
often lead to accidents. As accident analysis and investigation
tools, the archetypes can be used to develop dynamic models that
describe the systemic and organizational factors contributing
to the accident. The archetypes help to clarify why
safety-related decisions do not always result in the desired
behavior, and how independent decisions in different parts of
the organization can combine to impact safety.
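One way to make the archetype idea concrete is a toy system dynamics
simulation, sketched below with entirely invented coefficients: steady
performance pressure erodes safety effort, eroded effort lets risk drift
upward, and incidents trigger short-lived bursts of attention, producing
the cyclic drift toward high risk that the archetypes describe. Both
state variables are abstract indices, not probabilities.

    import random

    random.seed(1)
    effort, risk = 1.0, 0.1
    for month in range(60):
        effort -= 0.02 * effort              # schedule/cost pressure erodes effort
        risk += 0.05 * (1.0 - effort)        # eroded effort lets risk accumulate
        if random.random() < 0.1 * risk:     # incidents become more likely...
            effort = min(1.0, effort + 0.5)  # ...and trigger a burst of attention
            risk *= 0.7                      # some hazards get fixed, then forgotten
        if month % 12 == 0:
            print(f"year {month // 12}: effort={effort:.2f} risk={risk:.2f}")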
HUMAN-MACHINE INTERACTION
- Analyzing Software Specifications for Mode Confusion
Potential, by Nancy G. Leveson, L. Denise Pinnel, Sean
David Sandys, Shuichi Koga, Jon Damon Reese. Presented at
the Workshop on Human Error and System Development, Glasgow,
March 1997. (Postscript)
(PDF ).
Increased automation in complex systems has led to changes
in the human controller's role and to new types of
technology-induced human error. Attempts to mitigate these
errors have primarily involved giving more authority to the
automation, enhancing operator training, or changing the
interface. While these responses may be reasonable under many
circumstances, an alternative is to redesign the automation in
ways that do not reduce necessary or desirable functionality
or to change functionality where the tradeoffs are judged to be
acceptable. This paper describes an approach to detecting
error-prone automation features early in the development process
while significant changes can still be made to the conceptual
design of the system. The software requirements are modeled using
a hierarchical state machine language and then analyzed (manually
or with automated assistance) to identify violations of a set
of design constraints associated with mode-confusion errors.
The approach is illustrated with a model of the software
controlling a NASA robot.
- Designing Automation to Reduce Operator Errors
by Nancy G. Leveson and Everett Palmer (NASA Ames Research
Center). In the Proceedings of Systems, Man, and Cybernetics
Conference, Oct. 1997 (PostScript)
(PDF ).
Advanced automation has been accompanied, particularly in
aircraft, with a proliferation of modes, where modes define
mutually exclusive sets of system behavior. The new mode-rich
systems provide flexibility and enhanced capabilities, but they
also increase the need for and difficulty of maintaining mode
awareness. A previous paper described some categories of
potential design flaws that can lead to mode confusion errors
and described an approach to finding these flaws by first modeling
blackbox software behavior and then using analysis methods and
tools to assist in searching the models for predictable error
forms, i.e., for automation features that can contribute to
operator mistakes. This paper shows an example of the approach
for one particular feature, i.e., indirect mode changes, using
an example from the MD-88 control logic. The particular indirect
mode transition problem used in the example, called a
``kill-the-capture bust'' has been noted in many ASRS incident
reports.
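The flavor of an indirect mode change is easy to reproduce in toy mode
logic (invented here, not the actual MD-88 logic): a crew action aimed
at one mode silently disarms another, and that unannunciated side effect
is exactly what crews tend to miss.

    class Autopilot:
        """Toy mode logic illustrating an indirect mode change."""
        def __init__(self):
            self.pitch_mode = "VERT_SPD"
            self.altitude_capture_armed = True    # should level off at the target

        def select_pitch_mode(self, mode):
            self.pitch_mode = mode
            # Side effect of an apparently unrelated action: selecting a new
            # pitch mode also drops the altitude-capture arming.
            self.altitude_capture_armed = False

    ap = Autopilot()
    ap.select_pitch_mode("IAS_HOLD")              # crew intends only a speed change
    assert not ap.altitude_capture_armed          # the capture is silently gone,
                                                  # and the aircraft busts its altitude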
- Describing and Probing Complex System Behavior: A Graphical
Approach by Edward Bachelder and Nancy Leveson. In the
proceedings of the Aviation Safety Conference, Seattle, Sept. 2001.
( Word (.doc) )
Hands-on training and operation is generally considered the
primary means that a user of a complex system will use to build a
mental model of how that system works. However, accidents abound
where a major contributing factor was user
disorientation/misorientation with respect to the automation behavior,
even when the operator was a seasoned user. This paper presents a
compact graphical method that can be used to describe system operation,
where the system may be composed of interacting automation and/or
human entities. The fundamental goal of the model is to capture and
present critical interactive aspects of a complex system in an
integrated, intuitive fashion. This graphical approach is applied to
an actual military helicopter system, using the onboard hydraulic
leak detection/isolation system as a testbed. The helicopter Flight
Manual is used to construct the system model, whose components include:
logical structure (waiting and checking states, transitional events,
and conditions), human/automation cross communication (messages,
information sources), and automation action and associated action
limits. Using this model, examples of the following types of mode
confusion are identified in the military helicopter case study:
(1) unintended side effects, (2) indirect mode transitions,
(3) inconsistent behavior, (4) ambiguous interfaces, and (5) lack of
appropriate feedback. The model also facilitates analysis and
revision of emergency procedures, which is demonstrated using an
actual set of procedures.
- Modeling Controller Tasks for Safety Analysis, by
Molly Brown and Nancy G. Leveson. Presented at
the Workshop on Human Error and System Development, Seattle,
April 1998. (Postscript) (PDF).
As control systems become more complex, the use of
automated control has increased. At the same time, the role of the
human operator has changed from primary system controller to
supervisor or monitor. Safe design of the human--computer interaction
becomes more difficult.
In this paper, we present a visual task modeling language that can be
used by system designers to model human--computer interactions. The
visual models can be translated into SpecTRM-RL, a blackbox
specification language for modeling the automated portion of the
control system. The SpecTRM-RL suite of analysis tools allows the
designer to perform formal and informal safety analyses on the task
model in isolation or integrated with the rest of the modeled
system.
-
Identifying Mode Confusion Potential in Software Design
by Mario Rodriguez, Marc Zimmerman, Masafumi Katahira, Maxime de
Villepin, Benjamin Ingram, and Nancy Leveson. Digital Avionics
Systems Conference, October 2000.
(Postscript),
(PDF).
While automation has eliminated many types of operator error,
it has also created new types of technology-induced human error.
This paper shows how a formal model of an FMS similar to an MD-11
can be used to evaluate human factors aspects of the automation
design.
- An Approach to Human-Centered Design , by
Mirna Daouk and Nancy G. Leveson. Presented at
the Workshop on Human Error and System Development, Linkoping,
Sweden, June 2001.
(.doc)
Human-automation interactions are changing in nature, and new
sources of errors and hazards are being introduced. The need
for reducing human errors without sacrificing the benefits of
computers has led to the idea of human-centered system design;
little work, however, has been done as to how one would achieve
this goal. This paper provides a methodology for human-centered
design of systems including both humans and automation. The
proposed methodology integrates task allocation, task analysis,
simulations, human factors experiments, formal models, and
several safety, usability, and performance analyses into
Intent Specifications. An air traffic control conflict
detection tool, MTCD, is used to illustrate the methodology.
SOFTWARE FAULT TOLERANCE
- An Experimental Evaluation of the Assumption of
Independence in Multi-Version Programming,
by John Knight and Nancy Leveson,
IEEE Transactions on Software Engineering, Vol. SE-12,
No. 1, January 1986, pp. 96-109
(PDF) (sorry, but this paper is so old, I have only
a copy that was converted from an old typesetting language).
Our original paper, which got us in such hot water for the
next ten years until everyone who tried to show we were wrong
got the same results and grudgingly admitted we were right.
Unfortunately, the same idea keeps popping up again like a
bad penny among people who do not bother to learn anything
about what has been done in the past.
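The arithmetic at stake is simple: if two versions each fail with
probability p and their failures were independent, both would fail on
the same input with probability p*p. The quick simulation below uses an
invented failure model (not the experiment's data) to show how a small
population of inputs that are hard for every version inflates the joint
failure rate well beyond that prediction.

    import random

    random.seed(0)
    p = 0.01                                      # each version's failure rate
    trials = 200_000
    both_fail = 0
    for _ in range(trials):
        hard_input = random.random() < 0.005      # 0.5% of inputs are "hard"
        fail_prob = 0.5 if hard_input else 0.0075 # both versions struggle on them
        a = random.random() < fail_prob           # version A fails?
        b = random.random() < fail_prob           # version B fails?
        both_fail += a and b

    print(f"independence predicts {p*p:.2e}, simulated {both_fail/trials:.2e}")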
- A Reply to the Criticisms of the Knight and Leveson
Experiment by John Knight and Nancy Leveson ACM
Software Engineering Notes, January 1990
( PDF ).
After years of ridiculous and mostly untrue statements
about our original multi-version programming experiment
(always in forums where we were unable to reply), we finally
had had enough and decided to respond publicly. If nothing
else, writing this paper had a cathartic effect.
- Analysis of Faults in an N-Version Software Experiment
by Susan Brilliant, John Knight, and Nancy Leveson.
IEEE Trans. on Software Engineering, Vol. SE-16, No. 2,
February 1990
(PDF).
More details about the actual errors found in the multiple
version programs and an explanation of why they caused
correlated failures.
- The Consistent Comparison Problem in N-Version
Programming by Susan Brilliant, John Knight, and
Nancy Leveson.
IEEE Trans. on Software Engineering, Vol. SE-15,
No. 11, November 1989
(PDF).
During the multi-version programming experimentation,
we identified a problem we called the Consistent Comparison
Problem. In this paper we showed that when versions make
comparisons involving the results of finite-precision
calculations, it is impossible to guarantee the consistency
of their results. Correct versions may therefore arrive at
completely different outputs for an application that does
not apparently have multiple correct solutions. If this
problem is not dealt with explicitly, an N-version system
may be unable to reach a consensus even when none of its
component versions fail. We discuss potential solutions,
none of which is entirely satisfactory.
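The effect is easy to demonstrate. The two "versions" below compute the
same sum in different, equally correct orders and compare it with the
same threshold; near a rounding boundary they disagree, so a voter would
see divergent answers from two non-failed programs. The inputs are
chosen to sit exactly on such a boundary.

    THRESHOLD = 0.6

    def version_a(a, b, c):
        return (a + b) + c > THRESHOLD    # sums left to right

    def version_b(a, b, c):
        return a + (b + c) > THRESHOLD    # same math, different association

    # 0.1 + 0.2 + 0.3 rounds to 0.6000000000000001 one way and to exactly
    # 0.6 the other, so two correct versions cross the threshold differently.
    print(version_a(0.1, 0.2, 0.3))       # True
    print(version_b(0.1, 0.2, 0.3))       # False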
- The Use of Self Checks and Voting in Software Error
Detection: An Empirical Study by Nancy Leveson, Stephen
Cha, John Knight, and Timothy Shimeall. IEEE Trans. on
Software Engineering, Vol. SE-16, No. 4, April 1990
(PDF).
While we were on a roll, we decided to compare the
use of self-checks (assertions) and voting (n-versions).
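The two mechanisms can be contrasted in a few lines. The square-root
task, the seeded fault, and the tolerances below are invented; the point
is only that a self-check tests a property the answer itself must
satisfy, while a voter compares independently produced answers.

    def sqrt_v1(x): return x ** 0.5
    def sqrt_v2(x): return x ** 0.5 + (0.1 if x > 100 else 0.0)   # seeded fault
    def sqrt_v3(x): return x ** 0.5

    def self_check(x, y, tol=1e-9):
        """Assertion-style check: does y actually behave like sqrt(x)?"""
        return abs(y * y - x) <= tol * max(1.0, x)

    def vote(a, b, c, tol=1e-9):
        """2-out-of-3 voter over the versions' answers."""
        if abs(a - b) <= tol or abs(a - c) <= tol:
            return a
        if abs(b - c) <= tol:
            return b
        raise RuntimeError("no majority")

    x = 400.0
    answers = [f(x) for f in (sqrt_v1, sqrt_v2, sqrt_v3)]
    print([self_check(x, y) for y in answers])    # [True, False, True]
    print(vote(*answers))                         # 20.0: the fault is outvoted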
- An Empirical Comparison of Software Fault Tolerance and
Fault Elimination by Timothy Shimeall and Nancy Leveson.
IEEE Trans. on Software Engineering, Vol. SE-17, No. 2,
February 1991, pp. 173-183.
(PDF).
Before bowing out gracefully (and bloodied) from the
software fault-tolerance community and taking a break from running
experiments, I decided to try one more. This paper compares
the effectiveness of two software fault tolerance techniques
(embedded self-checks and multi-version programming) with some
common fault elimination techniques.
MISCELLANEOUS
- High-Pressure Steam Engines and Computer Software by
Nancy Leveson. Presented as a keynote address at the
International Conference on Software Engineering in Melbourne,
Australia, 1992, and published in IEEE Computer, October 1994.
(PostScript)
(PDF ).
A comparison between the history of steam engine technology
and software technology and what we can learn from the mistakes
made with steam engines.
-
An Empirical Evaluation of the MC/DC Coverage Criterion on the
HETE-2 Satellite Software, by Arnaud Dupuy (Alcatel) and
Nancy Leveson (MIT), Digital Avionics Systems Conference (DASC),
October 2000.
(Postscript),
(PDF).
In order to be certified by the FAA, airborne software must
comply with the DO-178B standard. For the unit testing of safety-critical
software, this standard requires the testing process to meet a source code
coverage criterion called Modified Condition/Decision Coverage. This
part of the standard is controversial in the aviation community, partially
because of perceived high cost and low effectiveness. Arguments have
been made that the criterion is unrelated to the safety of the software
and does not find errors that are not detected by functional testing.
In this paper, we present the results of an empirical study that compared
functional testing and functional testing augmented with test cases to
satisfy MC/DC coverage. The evaluation was performed during the testing
of the attitude control software for the HETE-2 (High Energy Transient
Explorer) scientific satellite (since that time, the software has been
modified).
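As a tiny, invented example of what the criterion demands: for each
atomic condition there must be a pair of tests, differing only in that
condition, whose outcomes flip the decision. The guard below reaches
MC/DC with four of the eight possible input combinations.

    def decision(a, b, c):
        return a and (b or c)        # a made-up guard with three conditions

    # A minimal MC/DC test set: each condition has a pair of rows that
    # differ only in that condition and produce different outcomes.
    tests = [
        (True,  True,  False),   # -> True ; pairs with row 2 (a) and row 4 (b)
        (False, True,  False),   # -> False; shows a independently flips the outcome
        (True,  False, True),    # -> True ; pairs with row 4 to show c matters
        (True,  False, False),   # -> False; completes the pairs for b and for c
    ]
    for t in tests:
        print(t, "->", decision(*t))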
-
Baker Panel Report on Texas
City Accident