Paper Title
A Distributed System-level Diagnosis Model for the Implementation of Unreliable Failure Detectors
Authors
Abstract
Reliable systems require effective monitoring techniques for fault identification. System-level diagnosis was originally proposed in the 1960s as a test-based approach to monitor and identify faulty components of a general system. Over the last decades, several diagnosis models and strategies have been proposed, based on different fault models, and applied to a wide variety of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction to enable consensus in asynchronous systems subject to crash faults. Since then, failure detectors have become the \textit{de facto} standard for monitoring distributed systems. The purpose of the present work is to fill a conceptual gap by presenting a distributed diagnosis model that is consistent with unreliable failure detectors. Results are presented for the number of tests/monitoring messages required, the latency of event detection, and completeness and accuracy. Three different failure detectors compliant with the proposed model are presented, including vRing and vCube, which provide scalable alternatives to the traditional all-monitor-all strategy adopted by most existing failure detectors.