Paper title
Spectrum-Based Log Diagnosis
Paper authors
Paper abstract
We present and evaluate Spectrum-Based Log Diagnosis (SBLD), a method to help developers quickly diagnose problems found in complex integration and deployment runs. Inspired by Spectrum-Based Fault Localization, SBLD leverages the differences in event occurrences between logs for failing and passing runs to highlight events that are more strongly associated with failing runs. Using data provided by our industrial partner, we empirically investigate the following questions: (i) How well does SBLD reduce the effort needed to identify all failure-relevant events in the log for a failing run? (ii) How is the performance of SBLD affected by the available data? (iii) How does SBLD compare to searching for simple textual patterns that often occur in failure-relevant events? We answer (i) and (ii) using summary statistics and heatmap visualizations, and for (iii) we compare three configurations of SBLD (with minimum, median and maximum data, respectively) against a textual search using Wilcoxon signed-rank tests and the Vargha-Delaney measure of stochastic superiority. Our evaluation shows that (i) SBLD achieves a significant effort reduction for the dataset used, and (ii) SBLD benefits from additional logs for passing runs in general, and it benefits from additional logs for failing runs when there is a proportional amount of logs for passing runs in the data. Finally, (iii) SBLD and textual search are roughly equally effective at effort reduction, while textual search has slightly better recall. We investigate the cause and discuss how it stems from the characteristics of a specific part of our data. We conclude that SBLD shows promise as a method for diagnosing failing runs and that its performance is positively affected by additional data, but that it does not outperform textual search on the dataset considered. Future work includes investigating SBLD's generalizability to additional datasets.
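The core idea in the abstract, ranking events by how differently they occur in failing and passing logs, can be illustrated with a minimal sketch. The abstract does not state which scoring formula SBLD uses, so the sketch below substitutes the Ochiai coefficient, a common choice in Spectrum-Based Fault Localization. The function name `rank_events`, the representation of a log as a list of event strings, and the sample events are all hypothetical.

```python
import math
from collections import Counter


def rank_events(failing_logs, passing_logs):
    """Rank events from failing logs by an Ochiai-style score.

    Assumption: each log is a list of event strings. The Ochiai
    coefficient is a stand-in here, not necessarily SBLD's own score.
    """
    n_fail = len(failing_logs)
    # Count in how many logs each event occurs (once per log).
    fail_counts = Counter(e for log in failing_logs for e in set(log))
    pass_counts = Counter(e for log in passing_logs for e in set(log))

    scores = {}
    for event, ef in fail_counts.items():
        ep = pass_counts.get(event, 0)
        # Ochiai: ef / sqrt(total_failing * (ef + ep))
        scores[event] = ef / math.sqrt(n_fail * (ef + ep))

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample data: two failing runs and two passing runs.
failing = [["start", "timeout", "error"], ["start", "error"]]
passing = [["start", "done"], ["start", "timeout", "done"]]
ranked = rank_events(failing, passing)
for event, score in ranked:
    print(f"{event}: {score:.3f}")
```

In this toy data, `error` occurs in every failing log and in no passing log, so it ranks first, while `start` occurs in all logs and scores lower. Events that appear only in passing logs are not ranked at all, since the goal is to prioritize lines within a failing log.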