Beutter, B. R., Matessa, M., McCandless, J. W., McCann, R. S., Spirkovska, L., Liston, D., Hayashi, M., Huemer, V., Lachter, J., Ravinder, U., Elkins, S., Renema, F., Lawrence, R., & Hamilton, A. (2006). Evaluation of an Onboard Real-Time Fault Management Support System for Next-Generation Space Vehicles (Intelligent Spacecraft Interface Systems Lab Report). Moffett Field, CA: NASA Ames Research Center.
Executive Summary

Human-rated spacecraft contain very complex and often highly interconnected engineering systems that must perform to precise operational specifications in very harsh environments. Critical systems are instrumented with sensors that provide real-time numeric readings of operating parameters. If a predetermined number of consecutive sensor readings fall outside the range consistent with normal (nominal) system operations, crew and ground personnel are alerted to the problem and the cause must be identified. If the cause is determined to be a genuine system malfunction (rather than, for example, sensor failure), the appropriate recovery procedures must be accessed and executed. Because malfunctions in the more dynamic systems can pose an immediate threat to crew safety or mission success, the crew must work the procedures to restore critical system function as quickly and accurately as possible.
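The alerting rule described above (an alert once a predetermined number of consecutive readings fall outside the nominal range) can be sketched in a few lines. This is an illustrative sketch only, not the shuttle's or FAMSS's actual logic: the class name `BoundsMonitor`, the helper `first_alert_index`, and the limits and readings in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BoundsMonitor:
    """Hypothetical bounds checker for one sensor (names are assumptions)."""
    low: float                  # lower nominal limit
    high: float                 # upper nominal limit
    required_consecutive: int   # out-of-range readings needed to alert
    _streak: int = 0            # current run of out-of-range readings

    def update(self, reading: float) -> bool:
        """Feed one reading; return True while the alert condition holds."""
        if self.low <= reading <= self.high:
            self._streak = 0    # a nominal reading resets the run
        else:
            self._streak += 1
        return self._streak >= self.required_consecutive


def first_alert_index(readings, low, high, required_consecutive):
    """Return the index of the reading that first triggers an alert, or None."""
    monitor = BoundsMonitor(low, high, required_consecutive)
    for index, reading in enumerate(readings):
        if monitor.update(reading):
            return index
    return None
```

With made-up limits of 0 to 100 and a required run of three, `first_alert_index([50, 120, 60, 130, 140, 150], 0, 100, 3)` returns 5: the isolated excursion at index 1 is reset by the nominal reading that follows it, and the alert fires only on the third consecutive out-of-range value.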

Real-time fault management – the process of detecting, isolating and recovering from systems malfunctions – is one of the biggest operational challenges facing the crews of today’s space shuttles. The shuttles’ caution & warning (C&W) systems primarily use bounds checking methods to determine off-nominal performance; the crew is not aware of the potential existence of a malfunction until a threshold is reached. Moreover, off-nominal performance in one component often leads to a cascade of off-nominal performance in interconnected components, presenting the crew with a potentially large set of C&W events that they must associate with the signature of a single fault. This root-cause determination is further complicated by cockpit avionics and display limitations. Only a fraction of the sensed data is available on cockpit displays and an even smaller fraction can be viewed at once. To make matters worse, the display formats themselves are often poorly organized and highly cluttered, taking the form of closely-spaced matrices of digital data that may require considerable mental translation to understand the current operational status or functional mode of a system.

Once the root cause is determined, operational challenges continue through the isolation and recovery activities. Malfunction-recovery procedures are only available in paper checklists. Beyond the purely psychomotor issues of accessing the checklists when crewmembers are fully suited and restrained in a vibrating vehicle, checklist navigation is inherently complex. The crewmember must locate the correct procedure for the root cause, navigate through the checklist steps by deciphering specialized symbols, abbreviations, boundary delimiters and spatial configurations, evaluate logical expressions by referring to other cockpit instruments and displays, perform mode reconfigurations by finding and toggling the correct switches from the hundreds of manual switches that populate the interior, and ensure that all steps are completed accurately and that the resulting system state is as expected.
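The cascade problem described above, where one fault produces C&W events in several interconnected components, can be illustrated with a deliberately simplified sketch. It assumes a known, acyclic map of which component feeds which, and treats an alarmed component as a root-cause candidate only if no other alarmed component lies upstream of it. The function name, the component names and the heuristic itself are assumptions made for illustration; they are not the diagnostic method used by the shuttle C&W system or by any ISHM technology.

```python
def root_cause_candidates(feeds, alarmed):
    """Return alarmed components not downstream of another alarmed component.

    feeds:   dict mapping a component to the components it feeds directly
             (assumed acyclic; hypothetical data structure).
    alarmed: iterable of components currently showing C&W events.
    """
    alarmed = set(alarmed)
    explained = set()  # alarmed components reachable from another alarmed one
    for upstream in alarmed:
        stack = list(feeds.get(upstream, []))
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in alarmed:
                explained.add(node)
            stack.extend(feeds.get(node, []))
    return sorted(alarmed - explained)


# Hypothetical component map, for illustration only.
example_feeds = {
    "helium_tank": ["regulator"],
    "regulator": ["engine_valve"],
    "engine_valve": [],
    "fuel_cell": ["bus_a"],
    "bus_a": [],
}
```

Under these assumptions, alarms on `regulator` and `engine_valve` reduce to the single candidate `regulator`, because the valve alarm is consistent with propagation from the regulator. A real diagnostic system would also have to weigh sensor failures and coincident independent faults, which this sketch ignores.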

Designers of next-generation crewed exploration space vehicles have three decades of technology advances at their disposal to reduce fault management difficulty and streamline fault management operations. Integrated System Health Management (ISHM) technologies can facilitate the process of detecting and isolating faults. Some of these technologies have already been incorporated in a prototype Enhanced C&W system for shuttle. Advanced navigation schemes for electronic checklists can facilitate the process of executing recovery procedures. Lastly, Human Factors and Human-Computer Interaction technologies can facilitate the process of organizing needed information and presenting it so that it better supports the crew’s fault management tasks. As part of a shuttle cockpit avionics upgrade (CAU) program, human factors researchers and shuttle crewmembers have already developed prototype displays incorporating some of these display improvement techniques.

In this report, we describe a concept that integrates these technologies into a FAult Management Support System (FAMSS) that assists the crew with all aspects of real-time fault management, from fault detection and crew alerting through fault isolation and recovery activities. The FAMSS concept specifies an intermediate level of crew-FAMSS functional allocation and user interfaces to enable and support that allocation. FAMSS automatically performs root-cause analyses, evaluates checklist logical expressions and makes switch throws. The crew maintains overall authority and control over the fault management process because FAMSS does not execute any procedure until a crewmember gives it permission to do so. This proposed functional allocation is enabled and supported by a FAMSS user interface, the Fault Management Display, which combines C&W and electronic checklist interface design features with CAU display format principles. The Fault Management Display is divided into two sections: a localized system schematic and an area for written (text-based) fault management procedures. Where possible, procedure information is coded into the system schematic, providing a graphics-based (as well as text-based) depiction of the procedure to assist the crewmember in understanding the required system reconfigurations and their effect on system function.
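The functional allocation described above (automation evaluates checklist logical expressions and makes switch throws, but only after a crewmember grants permission) can be sketched as a permission gate around procedure execution. Everything in this sketch is hypothetical: the `Step` and `Procedure` types, the `run_procedure` function, and the idea of modelling crew permission as a callback are illustrative assumptions, not the FAMSS implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One checklist step (hypothetical structure)."""
    description: str
    condition: Callable[[Dict[str, float]], bool]  # checklist logical expression
    switch: str                                    # switch to throw if condition holds
    position: str                                  # target switch position


@dataclass
class Procedure:
    name: str
    steps: List[Step]


def run_procedure(procedure, telemetry, switches, crew_approves):
    """Execute a procedure only if the crew approves it; return a log of actions.

    telemetry:     dict of current sensor values used by step conditions.
    switches:      dict of switch positions, modified in place.
    crew_approves: callback taking the procedure name and returning True/False.
    """
    if not crew_approves(procedure.name):
        # Without permission, nothing is evaluated and no switch is thrown.
        return [f"{procedure.name}: awaiting crew permission"]
    log = []
    for step in procedure.steps:
        if step.condition(telemetry):
            switches[step.switch] = step.position
            log.append(f"{step.description}: set {step.switch} to {step.position}")
        else:
            log.append(f"{step.description}: condition not met, skipped")
    return log
```

The design point the sketch tries to capture is that the crew's approval is the single entry point to any automated action, so the automation does the evaluating and switching while authority stays with the crewmember.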

We recently completed an extensive empirical evaluation of FAMSS in the Intelligent Spacecraft Interface Systems (ISIS) laboratory at NASA Ames Research Center. Fourteen highly experienced commercial airline pilots assumed the role of spacecraft operator during the launch and ascent phase of eight spacecraft missions in a part-task (single-operator with no ground support) reconfigurable cockpit simulator. The baseline condition for the evaluation combined the shuttle C&W system, a CAU display suite, paper checklists, and manual switch throws. The FAMSS condition automated switch throws and root-cause determinations, removed the need to consult paper checklists (by providing checklist steps on the Fault Management Display), and added the Fault Management Display to the CAU display suite. The evaluation methodology combined the standard suite of human performance measurement tools – accuracy and latency performance measurements, and situation awareness and workload questionnaires – with two infrequently used methods: eye movement analyses and predictive modeling of human performance.

The variety of evaluation techniques revealed many FAMSS-related empirical benefits to onboard fault management. Working malfunctions in conjunction with FAMSS assistance improved malfunction resolution accuracy by 43% and reduced malfunction resolution time by 54%. FAMSS reduced or eliminated fault management errors in a wide variety of fault management activities, including root-cause determinations of clusters of C&W events, reading and navigating through the checklists, and manually throwing switches. Similarly, determining root cause and navigating to the appropriate recovery procedure took much longer when FAMSS was not available.

FAMSS also greatly reduced participants’ subjective perception of workload. Participants rated their workload on off-nominal (malfunction-containing) runs as 27% to 37% lower in the FAMSS condition than in the baseline condition, with perceived greater benefits of FAMSS’ automation in higher-complexity fault management situations.

If not carefully designed, automation can lead to significant decreases in situation awareness. The FAMSS concept specifies an intermediate level of automation to alleviate this potential problem, and the results indicate this goal was met. Objective situation awareness questions showed that participants’ understanding of the environment was approximately the same in the baseline and FAMSS conditions. Subjective situation awareness results were stronger, with ratings indicating that participants’ perceived ability to diagnose and resolve the malfunctions actually increased.

A secondary goal of the FAMSS evaluation was to determine the benefits and deficits of an integrated evaluation methodology that blended eye movement and predictive modeling methods with analyses of traditional human performance metrics such as response time and accuracy. Eye movement data augmented the data collected by traditional means in various ways. Eye movement tracking enabled us to gather independent evidence that helped clarify or deepen our understanding of how participants utilized critical features of the Fault Management Display. In particular, eye movements showed that participants generally cross-checked schematic and text-based representations of procedures on the Fault Management Display. This suggests that participants found the embedded graphical depiction of procedure steps beneficial. Further, analyses of eye movements showed that participants returned to their normal methodical instrument scan more quickly when FAMSS provided fault management assistance. This corroborates the improvements suggested by subjective situation awareness ratings.

In addition to assessing the many benefits of the FAMSS concept, the evaluation revealed two potential drawbacks of the FAMSS interface. First, FAMSS provided little information on failure impacts. More explicit information would reduce the risk of mistaking a propagated “daughter” fault for a bona fide fault. Second, the interaction between crewmember and FAMSS could be clarified in the case of multiple malfunctions. Some of the participants expected FAMSS to switch automatically to the next malfunction once the current malfunction procedure was completed. This is not a feature of the current concept. Pending tasks need to be more clearly depicted and perhaps reminders provided.

The shuttle operations paradigm has been refined over 25 years of flight. Each task that the crew is required to accomplish onboard is developed, perfected, and practiced many times before flight. Simultaneously, ground controllers also learn, practice and perfect their tasks of systems monitoring and failure diagnosis. Though it may be desirable to reduce training time or introduce automation to lessen crew and ground controller workload, for the most part, the paradigm works well and leads to successful missions. Nevertheless, the circumstances of next-generation vehicle missions will require the crew to operate their vehicles in a more autonomous (independent) mode than they do today. A fault management support system could provide invaluable assistance under these conditions.

