CSE Speaker Series – Dr. Song Fu
Dr. Song Fu from UNT will give a talk, "Exploiting Disk Performance Signatures for Cost-Effective Management of Large-Scale Storage Systems," in Cramer 221 at 11:00 am.
ABSTRACT
The huge number of nodes, the overwhelming complexity of interactions among system components, and highly dynamic application and system behaviors make the management of high-end computers extremely challenging. Component failures and critical errors, as well as their impact on system performance and operating costs, are becoming an increasingly important concern to system designers and operators. Although hard drives are reliable in general, they are believed to be the most commonly replaced hardware components. It is reported that 78% of all hardware replacements in data centers were for hard drives. Moreover, with the increased capacity of individual drives and of entire systems, block- and sector-level failures, such as latent sector errors and silent data corruption, can no longer be ignored. Existing disk failure management approaches are mostly reactive (replacement, and possibly diagnosis, only after disk drives have failed) and incur high overhead (disk rebuilds); they do not provide a cost-effective solution for managing large-scale production storage systems. To overcome these problems, we design, develop, and evaluate novel disk health analysis and failure prediction technologies and tools for production storage systems. Specifically, we address a series of crucial issues: systematic mechanisms for online disk health probing, failure categorization, and modeling to discover types of disk failures and derive disk performance signatures; innovative disk failure prediction techniques that accurately forecast when disk failures of each discovered type will occur by using disk performance signatures to track performance degradation; and proactive data rescue and preventive disk reliability enhancement with easy-to-use APIs for storage users and developers. This research enables a deep understanding of storage health and reliability and provides natural, cost-effective, low-overhead support for data protection.