Probabilistic Approach to Avoid Uncorrectable Bit Errors in Storage Systems
Author: Bhuiyan, Masudul Hasan Masud
Silent data corruption in storage systems poses a significant risk to the integrity of data. While error correction codes (ECC) can recover the majority of errors, a non-negligible portion escapes ECC; these are referred to as uncorrectable errors. As the scale of storage systems increases, the mean time between uncorrectable errors is reduced from months to hours, necessitating efficient ways to detect and handle them. In this thesis, we propose prediction models for uncorrectable errors by analyzing 150M daily SMART logs from 143K hard drives collected over a period of five years. The models achieve up to 97% accuracy in uncorrectable bit error prediction while keeping false positive rates below 3%. We further introduce two use cases that apply these highly accurate error prediction models to (i) mitigate the I/O overhead of file transfer integrity verification on file systems and (ii) reduce the amount of I/O that is processed by disks with uncorrectable errors. Evaluation results show that running integrity verification only for disks with high error probability allows up to a 97% decrease in the I/O overhead of file transfers while avoiding more than 90% of uncorrectable errors. Moreover, diverting I/O operations from high-risk disks to low-risk disks can reduce the amount of data exposed to an uncorrectable error by 80% while keeping the overhead on low-risk disks below 5%.
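The first use case, verifying integrity only on disks predicted to be at high risk, can be illustrated with a minimal sketch. This is not the thesis implementation: the threshold value, the function names, and the dictionary standing in for a destination disk are all hypothetical, and the error probability is assumed to come from some prediction model trained on SMART logs.

```python
import hashlib

# Hypothetical cut-off; the thesis does not state a specific threshold here.
RISK_THRESHOLD = 0.5


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def needs_verification(error_probability: float,
                       threshold: float = RISK_THRESHOLD) -> bool:
    """Decide whether a disk is risky enough to justify a read-back check."""
    return error_probability >= threshold


def transfer(data: bytes, dest_disk: dict, error_probability: float) -> bool:
    """Write data to dest_disk and verify it only for high-risk disks.

    dest_disk is a dict standing in for a real disk (an assumption made
    for illustration). Returns True if verification was performed.
    """
    dest_disk["blob"] = data
    if not needs_verification(error_probability):
        # Low-risk disk: skip the extra read and checksum, saving I/O.
        return False
    # High-risk disk: read back and compare checksums.
    if sha256_of(dest_disk["blob"]) != sha256_of(data):
        raise IOError("checksum mismatch after transfer")
    return True
```

Under these assumptions, `transfer(payload, disk, 0.9)` performs the read-back check, while `transfer(payload, disk, 0.1)` skips it. The I/O savings reported in the abstract would depend on how few disks fall above the chosen threshold.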