Beyond Monitoring: Proactive Server Preservation in an HPC Environment
Advisor: Harris, Frederick C.
Computer Science and Engineering
Monitoring has long been the challenge of a server administrator. Disk health, system load, network congestion, and environmental conditions like temperature can all be tied into monitoring systems. Monitoring systems vary in scope and capabilities, and many can fire off alerts for just about any configuration. The sysadmin then has the responsibility of weighing the alert and deciding if and when to act. In a High Performance Computing (HPC) environment, some of these failures can have a ripple effect, affecting a larger area than the physical problem. Furthermore, some temperature and load swings can be more drastic in an HPC environment than they would be otherwise. Because of this, a timely, measured response is critical. When a timely response is not possible, conditions can escalate rapidly in an HPC environment, leading to component failure. In this situation, an intelligent, automatic, measured response is critical. Here we present such a system: a novel approach to server monitoring that uses integrated server hardware operating independently of the operating system and is capable not only of monitoring temperatures, but also of automatically responding to temperature events. Our proactive response system leverages standard HPC software and integrated server hardware. It is designed to respond intelligently to temperature events from a High Performance Computing perspective, looking at both compute jobs and server hardware.