Proactively Handling Failures in Extreme-Scale Big Data Storage: A Data Driven Approach
With the rise of Big Data, the performance and robustness of storage systems have become ever more important. Today's Big Data enterprise storage systems operate at petabyte scale and comprise thousands to tens of thousands of data drives organized in a hierarchical architecture. These systems usually host many applications concurrently and perform continuous data collection and analysis. Such complex hardware and software stacks, under intensive usage, can easily produce hundreds of critical storage failures per day, which poses significant challenges to the robustness of these systems. The state-of-the-practice solution reactively moves data from failed disks to spare disks, which usually leads to severe performance degradation and unrecoverable data loss. This thesis aims to develop a proactive methodology that predicts critical storage failures in advance, so that the backup process can start before a failure occurs, reducing both the performance impact and the risk of data loss. Because modern Big Data storage systems experience hundreds of critical failures with complex dependencies among them, we take a data-driven approach and characterize an extreme-scale storage system composed of 8 storage clusters with 462,578 data drives over a one-year period, amounting to 857,183,442 data drive hours. Considering the lightweight requirements of practical systems and the sparsity of the data, we opt for a statistical approach rather than high-overhead machine-learning-based methods, which usually require tremendous computing power and a large number of training samples. The proposed spatial-temporal prediction methodology builds on the characterization findings, especially the spatial-temporal dependencies within each failure type and across failure types.
Extensive evaluation suggests that the proposed prediction methodology outperforms state-of-the-art Markov-chain-based methods by more than 20% and adapts swiftly to dynamic environments.
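To make the idea of a lightweight statistical predictor concrete, the following is a minimal sketch, not the methodology developed in the thesis. It assumes a hypothetical event log of (time in hours, location, failure type) tuples and estimates, from simple counts, how often a failure of one type is followed by a failure of another type at the same location within a time window. The function name, the tuple layout, and the window parameter are all illustrative assumptions.

```python
from collections import defaultdict


def follow_up_probabilities(events, window_hours):
    """Estimate how often a failure is followed by another nearby failure.

    events: iterable of (time_hours, location, failure_type) tuples
            (hypothetical log format, assumed for this sketch).
    Returns {(type_a, type_b): fraction of type_a events that were followed
    by at least one type_b event at the same location within window_hours}.
    """
    # Spatial grouping: only events at the same location are compared.
    by_location = defaultdict(list)
    for time, location, ftype in events:
        by_location[location].append((time, ftype))

    trigger_counts = defaultdict(int)   # occurrences of each failure type
    followed_counts = defaultdict(int)  # (a, b): a-events followed by b

    for history in by_location.values():
        history.sort()  # chronological order within one location
        for i, (t_a, type_a) in enumerate(history):
            trigger_counts[type_a] += 1
            seen = set()
            # Temporal window: scan later events until the window closes.
            for t_b, type_b in history[i + 1:]:
                if t_b - t_a > window_hours:
                    break
                seen.add(type_b)
            for type_b in seen:
                followed_counts[(type_a, type_b)] += 1

    return {
        pair: count / trigger_counts[pair[0]]
        for pair, count in followed_counts.items()
    }
```

A pair with a high estimated fraction could serve as an early-warning signal: when the first failure type is observed at a location, backup of the affected drives could be started before the second occurs. The approach needs only counting and sorting, which is the kind of low overhead the abstract contrasts with training-heavy machine learning methods.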