Application of software health management techniques

Abstract

The growing complexity of software used in large-scale, safety critical cyber-physical systems makes it increasingly difficult to expose and hence correct all potential defects. There is a need to augment the existing fault tolerance methodologies with new approaches that address latent software defects exposed at runtime. This paper describes an approach that borrows and adapts traditional textquoteleftSystem Health Managementtextquoteright techniques to improve software dependability through simple formal specification of runtime monitoring, diagnosis, and mitigation strategies. The two-level approach to health management at the component and system level is demonstrated on a simulated case study of an Air Data Inertial Reference Unit (ADIRU). An ADIRU was categorized as the primary failure source for the in-flight upset caused in the Malaysian Air flight 124 over Perth, Australia in 2005.

Publication
2011 ICSE Symposium on Software Engineering for Adaptive and Self-Managing Systems, SEAMS 2011, Waikiki, Honolulu , HI, USA, May 23-24, 2011