Title
RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk
Authors
Abstract
Failures and anomalies in large-scale software systems are unavoidable. When an issue is detected, operators need to identify its location quickly and correctly to enable a swift repair. In this work, we consider the problem of identifying the root cause set that best explains an anomaly in multi-dimensional time series with categorical attributes. The main challenge is the huge search space: even for a small number of attributes with small value sets, the number of theoretical combinations is too large to brute-force. Previous approaches have therefore focused on reducing the search space, but they all suffer from various issues: they require extensive manual parameter tuning, are too slow to be practical, or are incapable of finding more complex root causes. We propose RiskLoc to solve the problem of multi-dimensional root cause localization. RiskLoc applies a 2-way partitioning scheme and assigns element weights that increase linearly with the distance from the partitioning point. Each element is assigned a risk score that integrates two factors: 1) its weighted proportion within the abnormal partition, and 2) the relative change in the deviation score, adjusted for the ripple effect property. Extensive experiments on multiple datasets verify the effectiveness and efficiency of RiskLoc, and for a comprehensive evaluation, we introduce three synthetically generated datasets that complement existing ones. We demonstrate that RiskLoc consistently outperforms state-of-the-art baselines, especially in more challenging root cause scenarios, with F1-score gains of up to 57% over the second-best approach at comparable running times.
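The abstract describes the weighting and risk scoring only at a high level. The Python sketch below illustrates the general idea under explicit assumptions that do not come from the abstract: each leaf gets a deviation score d = 2(f - v)/(f + v) from its forecast f and real value v; a user-supplied `cutoff` t acts as the partitioning point; each leaf gets weight |d - t|, which grows linearly with its distance from that point; and an element's risk is simply its weighted abnormal proportion plus its aggregate relative change. The names `Leaf`, `risk_scores`, `top_element` and `cutoff` are hypothetical. The sketch does not reproduce the paper's partitioning rule or its ripple-effect adjustment, and it enumerates every element exhaustively, which is exactly the cost that RiskLoc is designed to avoid.

```python
from dataclasses import dataclass
from itertools import product

WILDCARD = "*"  # matches any value of an attribute


@dataclass(frozen=True)
class Leaf:
    values: tuple  # one categorical value per attribute, e.g. ("a1", "b2")
    real: float  # observed value v
    forecast: float  # forecast (expected) value f


def deviation(leaf):
    """Assumed deviation score 2(f - v) / (f + v); positive when the real value drops."""
    total = leaf.forecast + leaf.real
    return 0.0 if total == 0 else 2.0 * (leaf.forecast - leaf.real) / total


def matches(element, leaf):
    """True if the leaf falls under the element (wildcards match anything)."""
    return all(e == WILDCARD or e == v for e, v in zip(element, leaf.values))


def risk_scores(leaves, cutoff):
    """Score every element except the all-wildcard root.

    Hypothetical simplification: risk = weighted abnormal proportion
    + aggregate relative change. No ripple-effect adjustment is applied.
    """
    n_attrs = len(leaves[0].values)
    dev = [deviation(leaf) for leaf in leaves]
    # Assumed 2-way partition: leaves with deviation above the cutoff are abnormal.
    abnormal = [d > cutoff for d in dev]
    # Assumed weight: grows linearly with the distance from the partitioning point.
    weight = [abs(d - cutoff) for d in dev]
    domains = [sorted({leaf.values[i] for leaf in leaves}) + [WILDCARD]
               for i in range(n_attrs)]
    scores = {}
    for element in product(*domains):
        if all(e == WILDCARD for e in element):
            continue  # the root covers every leaf and is never a useful answer
        covered = [i for i, leaf in enumerate(leaves) if matches(element, leaf)]
        if not covered:
            continue
        total_w = sum(weight[i] for i in covered)
        abnormal_w = sum(weight[i] for i in covered if abnormal[i])
        proportion = abnormal_w / total_w if total_w > 0 else 0.0
        f = sum(leaves[i].forecast for i in covered)
        v = sum(leaves[i].real for i in covered)
        rel_change = (f - v) / f if f > 0 else 0.0
        scores[element] = proportion + rel_change
    return scores


def top_element(scores):
    # Break ties in favour of the more general element (more wildcards).
    return max(scores, key=lambda e: (scores[e], e.count(WILDCARD)))


if __name__ == "__main__":
    # Toy data: every leaf under a1 drops to half of its forecast.
    leaves = [Leaf((a, b), real=50.0 if a == "a1" else 100.0, forecast=100.0)
              for a, b in product(["a1", "a2", "a3"], ["b1", "b2"])]
    print(top_element(risk_scores(leaves, cutoff=0.2)))  # ('a1', '*')
```

In this toy case, ("a1", "*"), ("a1", "b1") and ("a1", "b2") all reach the same score of 1.5. The tie-break prefers the general element, which reflects the usual goal of reporting the most concise root cause set. A real implementation would also need to handle multi-element root cause sets and prune the search space.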