

Fine-grained Automated Failure Management on Extreme-Scale GPU Accelerated Systems by Balazs Gerofi



Failures in leadership-class accelerated HPC and AI systems have become increasingly common, and their frequency is expected to rise as these systems continue to scale. With hundreds of thousands of field-replaceable parts in such systems, automated failure management is essential. This talk introduces StabilityDB, a failure management automation framework that uses real-time data analytics to drive failure servicing and maintenance on a per-failure-mode basis. This approach keeps compute node downtime minimal and overall system availability high. We will provide an architectural overview of StabilityDB and present statistics on the failure characteristics that guide our automation policies. StabilityDB has been deployed on the Aurora supercomputer at Argonne National Laboratory, a system with 63,744 GPUs, where it contributes to efficient operation.
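To make the idea of servicing failures per failure mode more concrete, here is a minimal Python sketch. It is not StabilityDB's design or API, which the abstract does not describe. The failure mode names, actions, thresholds, and the `Policy` and `FailureManager` classes are all hypothetical and chosen only for illustration. The sketch assumes each failure mode has its own policy: a cheap first action, and an escalated action once the same mode has recurred on the same node a set number of times.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-failure-mode policy (not from StabilityDB)."""
    first_action: str       # cheap remediation tried first
    escalated_action: str   # remediation used once the threshold is reached
    escalate_after: int     # occurrences on one node that trigger escalation


# Illustrative failure modes and actions; all names are assumptions.
POLICIES = {
    "gpu_memory_error": Policy("reset_gpu", "replace_gpu", 3),
    "link_flap": Policy("retrain_link", "replace_cable", 2),
}
DEFAULT_POLICY = Policy("drain_and_reboot", "open_ticket", 2)


class FailureManager:
    """Counts failures per (node, mode) and picks an action per mode."""

    def __init__(self, policies, default):
        self.policies = policies
        self.default = default
        self.counts = defaultdict(int)  # (node, mode) -> occurrences seen

    def handle(self, node, mode):
        """Record one failure event and return the action to take."""
        self.counts[(node, mode)] += 1
        policy = self.policies.get(mode, self.default)
        if self.counts[(node, mode)] >= policy.escalate_after:
            return policy.escalated_action
        return policy.first_action


if __name__ == "__main__":
    mgr = FailureManager(POLICIES, DEFAULT_POLICY)
    events = [
        ("node0001", "link_flap"),
        ("node0001", "link_flap"),
        ("node0002", "gpu_memory_error"),
    ]
    for node, mode in events:
        print(node, mode, "->", mgr.handle(node, mode))
```

Keeping the policy table separate from the dispatch logic means a new failure mode only needs a new table entry. A real system would presumably derive thresholds from observed failure statistics rather than from fixed constants like these.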


Author

MQ
