Scientific Computing Autonomic Reliability Framework

Abstract

Large scientific computing clusters require a distributed dependability subsystem that provides fault isolation and recovery and can learn and predict failures, in order to improve the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for the large computing clusters used in the Lattice Quantum Chromodynamics project at Fermilab.

Publication
Fourth International Conference on e-Science, e-Science 2008, 7-12 December 2008, Indianapolis, IN, USA