Scientific Computing Autonomic Reliability Framework

Abhishek Dubey, Sandeep Neema, Jim Kowalkowski, Amitoj Singh

January, 2008

Abstract

Large scientific computing clusters require a distributed dependability subsystem that provides fault isolation and recovery and can learn from and predict failures in order to improve the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for the large computing clusters used in the Lattice Quantum Chromodynamics project at Fermilab.

Type

Conference paper

Publication

Fourth International Conference on e-Science, e-Science 2008, 7-12 December 2008, Indianapolis, IN, USA