Paper Title

On Failure Diagnosis of the Storage Stack

Authors

Duo Zhang, Om Rameshwar Gatla, Runzhou Han, Mai Zheng

Abstract

Diagnosing storage system failures is challenging even for professionals. One example is the "When Solid State Drives Are Not That Solid" incident that occurred at the Algolia data center, where Samsung SSDs were mistakenly blamed for failures caused by a Linux kernel bug. As system complexity keeps increasing, such obscure failures will likely occur more often. As one step toward addressing this challenge, we present our ongoing effort, called X-Ray. Unlike traditional methods that focus on either the software or the hardware, X-Ray leverages virtualization to collect events across layers and correlates them to generate a correlation tree. Moreover, by applying simple rules, X-Ray can highlight critical nodes automatically. Preliminary results based on 5 failure cases show that X-Ray can effectively narrow down the search space for failures.
