Whereas even a few years ago a terabyte was seen as a large amount of data, today an individual application can generate petabytes of data per second. The tremendous advances in low-cost, high-capacity magnetic hard disk drives (HDDs) and relatively new flash-based solid state drives (SSDs) have been among the key factors supporting big data and the computing and storage services that modern society deeply relies on. Datacenter owners run mission-critical workloads and must guarantee quality of service to their customers, and both depend heavily on HDD- and SSD-based storage systems.
However, disk drives are reported to be the most commonly replaced hardware components. The annualized failure rate (AFR) of disk drives has been reported to reach 15%, with 2-4% common for enterprise-class drives and 8-9% for consumer-grade drives. A modern datacenter usually has tens to hundreds of thousands of disk drives installed. At such a scale, disk failures are common, with tens of instances every day, not to mention the larger number of logical failures that make disk drives inaccessible. It is reported that 78% of all hardware replacements in production datacenters were for hard drives. Storage downtime and data loss cost enterprises $1.7 trillion per year.
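The link between AFR and daily failure counts can be checked with a quick back-of-the-envelope calculation. The sketch below is illustrative only: the fleet sizes are assumed values within the "tens to hundreds of thousands" range quoted above, and the function name `expected_daily_failures` is hypothetical. It also assumes failures are spread evenly over the year, which real fleets do not strictly follow.

```python
def expected_daily_failures(num_drives: int, afr: float) -> float:
    """Expected drive failures per day for a fleet.

    Assumes failures are spread uniformly over a 365-day year,
    which is a simplification.
    """
    return num_drives * afr / 365.0


if __name__ == "__main__":
    # Fleet sizes are assumptions chosen for illustration.
    # AFR values are taken from the ranges quoted in the text.
    for drives in (50_000, 100_000, 300_000):
        for afr in (0.02, 0.04, 0.09):
            per_day = expected_daily_failures(drives, afr)
            print(f"{drives:>7} drives, AFR {afr:.0%}: {per_day:5.1f} failures/day")
```

Under these assumptions, a fleet of 100,000 drives at a 4% AFR loses roughly 11 drives per day, which is consistent with the "tens of instances every day" figure for larger fleets or higher failure rates.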
The objective of this project is to achieve a deep understanding of real-world storage reliability and to develop a cost-effective data and storage resource management system for reliability enhancement.
We propose Wizard, a novel architecture that explores disk performance signatures for automated and systematic management of storage resources, with the goal of reliability assurance in large-scale production storage environments. The development of Wizard is based on a deep and comprehensive characterization of runtime I/O workloads and disk health data collected from several leadership-class production datacenters that are available for this project. These datacenters run diverse application workloads (web services, e-commerce, and high-performance computing) and have different storage architectures (consumer-grade HDDs, enterprise-class HDDs, SMR HDDs, and SSDs). We treat disk health records as first-class data. By extensively exploring advanced machine learning technologies, we discover the categories and types of disk failures, quantify disk performance degradation processes with performance signatures, and forecast when future failures will occur. In this way, Wizard manages heterogeneous disk devices under diverse storage workloads in a consistent and cost-effective manner.
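As a rough illustration of what forecasting a failure time from a degradation trend can look like, the sketch below fits a straight line to a monitored metric and extrapolates to a failure threshold. This is not Wizard's method, which relies on machine learning over real disk health data. The metric, the threshold, and the names `fit_line` and `forecast_crossing` are all hypothetical and chosen only for this example.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_crossing(days: Sequence[float],
                      metric: Sequence[float],
                      threshold: float) -> Optional[float]:
    """Day on which the fitted trend reaches `threshold`.

    Returns None when the metric is not trending upward, in which
    case this simple model predicts no crossing.
    """
    slope, intercept = fit_line(days, metric)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


if __name__ == "__main__":
    # Hypothetical daily average I/O latency (ms) for one drive.
    days = [0, 1, 2, 3, 4]
    latency_ms = [10, 12, 14, 16, 18]
    # Hypothetical latency level treated as "failed" for this sketch.
    print(forecast_crossing(days, latency_ms, threshold=30))
```

A linear trend is the simplest possible degradation model; real degradation processes are often nonlinear and differ by failure category, which is why a learned signature per failure type is more appropriate in practice.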
Moreover, we incorporate proactive disk and data protection as the next natural step in the storage resource management architecture. Compared with reactive data recovery through disk rebuilds, a proactive approach migrates data from an unhealthy storage device to a healthy one before the disk fails, which can dramatically reduce both the risk of data loss and the overhead of disk rebuilds. In addition to efficient data rescue, we propose a factor-aware resource scheduling approach in Wizard that extends disk lifetime by smartly distributing storage workloads and other resources among disk drives at different health stages. Wizard also provides a set of APIs that allow storage users and developers to customize data protection and disk health control for flexible, reliable storage management.
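The two ideas in this paragraph, proactive migration and health-aware workload distribution, can be sketched together in a few lines. The example below is a minimal illustration under stated assumptions and does not represent Wizard's scheduler or its APIs: the `Disk` class, the health score in [0, 1], the `migrate_below` cutoff, and the proportional sharing rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Disk:
    name: str
    health: float  # hypothetical score in [0, 1]; 1.0 means fully healthy


def plan(disks: List[Disk],
         total_iops: float,
         migrate_below: float = 0.3) -> Tuple[List[str], Dict[str, float]]:
    """Split disks into 'migrate now' and 'keep serving' groups.

    Disks whose health is below `migrate_below` are marked for
    proactive data migration and receive no new workload. The
    remaining disks share `total_iops` in proportion to their
    health, so weaker disks receive a lighter load.
    """
    to_migrate = [d.name for d in disks if d.health < migrate_below]
    serving = [d for d in disks if d.health >= migrate_below]
    total_health = sum(d.health for d in serving)
    if total_health <= 0:
        return to_migrate, {}
    shares = {d.name: total_iops * d.health / total_health for d in serving}
    return to_migrate, shares


if __name__ == "__main__":
    fleet = [Disk("sda", 1.0), Disk("sdb", 0.5), Disk("sdc", 0.1)]
    migrate, shares = plan(fleet, total_iops=300)
    print("migrate data off:", migrate)
    print("workload shares:", shares)
```

Proportional sharing is only one possible policy. A production scheduler would also need to account for capacity, data placement constraints, and the cost of the migration traffic itself.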
- Biao Xu, Zujie Ren, Weisong Shi, Yongjian Ren, Feng Cao and Jiangbin Lin, iGen: A Realistic Request Generator for Cloud File Systems Benchmarking, in Proceedings of IEEE CLOUD 2016, June 27-July 2, 2016, San Francisco, USA.
- Song Huang, Song Fu, Quan Zhang and Weisong Shi, Characterizing Disk Failures with Quantified Disk Degradation Signatures: An Early Experience, in Proceedings of 2015 IEEE International Symposium on Workload Characterization (IISWC), Atlanta, GA, Oct 4-6, 2015.
- Qiang Guan and Song Fu, Autonomic Failure Identification and Diagnosis for Building Dependable Computing Systems, Proc. of ACM/IEEE Supercomputing Conference (SC'13), November 2013.
- Zujie Ren, Xianghua Xu, Jian Wan, Weisong Shi and Min Zhou, Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao, 2012 IEEE International Symposium on Workload Characterization (IISWC), November 4-6, 2012, San Diego, USA. Best Paper Award.