Motivation

Whereas even a few years ago a terabyte was seen as a large amount of data, today individual applications can generate petabytes of data per second. The tremendous advances in low-cost, high-capacity magnetic hard disk drives (HDDs) and relatively new flash-based solid state drives (SSDs) have been among the key factors supporting big data and the computing and storage services that modern society relies on. Datacenter owners run mission-critical workloads that depend heavily on HDD- and SSD-based storage systems, and they must guarantee quality of service to their customers.

However, disk drives are reported to be the most commonly replaced hardware components. The annualized failure rate (AFR) of disk drives has been reported to reach 15%, with 2-4% common for enterprise-class drives and 8-9% for consumer-grade drives. A modern datacenter usually has tens to hundreds of thousands of disk drives installed. At such a scale, disk failures are common, with tens of instances every day, not to mention the larger number of logical failures that make disk drives inaccessible. In production datacenters, hard drives are reported to account for 78% of all hardware replacements. Storage downtime and data loss cost enterprises $1.7 trillion per year.
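As a rough check on the scale argument above, the expected number of drive failures per day follows from the fleet size and the AFR. The sketch below assumes failures are spread uniformly over a 365-day year; the fleet sizes and the function name are illustrative and are not taken from any measured deployment.

```python
def expected_daily_failures(num_drives: int, afr: float) -> float:
    """Expected drive failures per day, assuming failures are spread
    uniformly over a 365-day year (a simplifying assumption)."""
    if num_drives < 0 or not 0.0 <= afr <= 1.0:
        raise ValueError("num_drives must be >= 0 and afr must be in [0, 1]")
    return num_drives * afr / 365.0


if __name__ == "__main__":
    # Illustrative fleet sizes; AFR values come from the ranges quoted above.
    for drives, afr in [(100_000, 0.02), (100_000, 0.04), (200_000, 0.09)]:
        rate = expected_daily_failures(drives, afr)
        print(f"{drives:>7} drives at {afr:.0%} AFR: about {rate:.1f} failures/day")
```

For example, 100,000 drives at a 4% AFR give roughly 11 expected failures per day, which is consistent with the "tens of instances every day" figure for larger or less reliable fleets.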

The objective of this project is to gain a deep understanding of real-world storage reliability and to develop a cost-effective data and storage resource management system that enhances reliability.

Our Approach

We propose Wizard, a novel architecture that uses disk performance signatures to manage storage resources automatically and systematically, with the goal of assuring reliability in large-scale production storage environments. The development of Wizard is based on a deep and comprehensive characterization of runtime I/O workloads and disk health data collected from several leadership-class production datacenters and made available to this project. These datacenters run diverse application workloads (web services, e-commerce, and high-performance computing) and have different storage architectures (consumer-grade HDDs, enterprise-class HDDs, SMR HDDs, and SSDs). We treat disk health records as first-class data. Using advanced machine learning techniques, we discover the categories and types of disk failures, quantify disk performance degradation with performance signatures, and forecast when future failures will occur. In this way, Wizard manages heterogeneous disk devices under diverse storage workloads in a consistent and cost-effective manner.

Moreover, we incorporate proactive disk and data protection as the next natural step in the storage resource management architecture. Reactive methods recover data through disk rebuilds after a failure. A proactive approach instead migrates data from an unhealthy storage device to a healthy one before the disk fails, which can dramatically reduce both the risk of data loss and the overhead of disk rebuilds. In addition to efficient data rescue, we propose a factor-aware resource scheduling approach in Wizard that extends disk lifetime by distributing storage workloads and other resources among disk drives according to their health stages. Wizard also provides a set of APIs that allow storage users and developers to customize data protection and disk health control for flexible, reliable storage management.
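To make the two ideas above concrete, the following is a minimal sketch and not Wizard's actual interface. It assumes each drive carries a predicted time to failure in days (a hypothetical output of the failure forecaster), flags drives below a threshold for proactive migration, and splits new workload across the remaining drives in proportion to their predicted remaining life. All names, the 14-day threshold, and the weighting rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Drive:
    drive_id: str
    predicted_days_to_failure: float  # hypothetical forecaster output


def plan(drives: List[Drive],
         migrate_below_days: float = 14.0) -> Tuple[List[str], Dict[str, float]]:
    """Return (IDs of drives to migrate data off, workload share per healthy drive)."""
    to_migrate = [d.drive_id for d in drives
                  if d.predicted_days_to_failure < migrate_below_days]
    healthy = [d for d in drives
               if d.predicted_days_to_failure >= migrate_below_days]
    total = sum(d.predicted_days_to_failure for d in healthy)
    # Weight new workload by predicted remaining life so that drives
    # closer to failure receive proportionally less traffic.
    if total > 0:
        shares = {d.drive_id: d.predicted_days_to_failure / total for d in healthy}
    else:
        shares = {}
    return to_migrate, shares
```

With three drives predicted to fail in 5, 100, and 300 days, this sketch marks the first for migration and assigns the other two 25% and 75% of new workload. A production scheduler would also need to account for capacity, current load, and prediction uncertainty, none of which are modeled here.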

Collaboration

  
  • HGST
  • MI-OSiRIS
  • NetEase

People

Publication

  • Biao Xu, Zujie Ren, Weisong Shi, Yongjian Ren, Feng Cao and Jiangbin Lin, iGen: A Realistic Request Generator for Cloud File Systems Benchmarking, in Proceedings of IEEE CLOUD 2016, June 27-July 2, 2016, San Francisco, USA.
  • Qiang Guan and Song Fu, Autonomic Failure Identification and Diagnosis for Building Dependable Computing Systems, Proc. of ACM/IEEE Supercomputing Conference (SC'13), November 2013.

Software
