Paper Title

DeCorus: Hierarchical Multivariate Anomaly Detection at Cloud-Scale

Authors

Bruno Wassermann, David Ohana, Ronen Schaffer, Robert Shahla, Elliot K. Kolodner, Eran Raichstein, Michal Malka

Abstract

Multivariate anomaly detection can be used to identify outages within large volumes of telemetry data for computing systems. However, developing an efficient anomaly detector that can provide users with relevant information is a challenging problem. We introduce DeCorus, our approach to hierarchical multivariate anomaly detection. DeCorus is a statistical multivariate anomaly detector that achieves linear complexity. It extends standard statistical techniques to improve their ability to find relevant anomalies within noisy signals, and it makes use of types of domain knowledge that system operators commonly possess to compute system-level anomaly scores. We describe the implementation of DeCorus as an online log anomaly detection tool for network device syslog messages, deployed at a cloud service provider. We use real-world data sets that consist of 1.5 billion network device syslog messages and hundreds of incident tickets to characterize the performance of DeCorus and to compare its ability to detect incidents with that of five alternative anomaly detectors. While DeCorus outperforms the other anomaly detectors, all of them are challenged by our data set. We share how DeCorus provides value in the field and how we plan to improve its incident detection accuracy.
