Deliberative, search-based mitigation strategies for model-based software health management


Rising software complexity in aerospace systems makes them very difficult to analyze and prepare for all possible fault scenarios at design time; therefore, classical run-time fault tolerance techniques such as self-checking pairs and triple modular redundancy are used. However, several recent incidents have made it clear that existing software fault tolerance techniques alone are not sufficient. To improve system dependability, simpler, yet formally specified and verified run-time monitoring, diagnosis, and fault mitigation capabilities are needed. Such architectures are already in use for managing the health of vehicles and systems. Software health management is the application of these techniques to software systems. In this paper, we briefly describe the software health management techniques and architecture developed by our research group. The foundation of the architecture is a real-time component framework (built upon ARINC-653 platform services) that defines a model of computation for software components. Dedicated architectural elements: the Component Level Health Manager (CLHM) and System Level Health Manager (SLHM) provide the health management services: anomaly detection, fault source isolation, and fault mitigation. The SLHM includes a diagnosis engine that (1) uses a Timed Failure Propagation Graph (TFPG) model derived from the component assembly model, (2) reasons about cascading fault effects in the system, and (3) isolates the fault source component(s). Thereafter, the appropriate system-level mitigation action is taken. The main focus of this article is the description of the fault mitigation architecture that uses goal-based deliberative reasoning to determine the best mitigation actions for recovering the system from the identified failure mode.