Title
Efficient Hierarchical Clustering for Classification and Anomaly Detection
Authors
Abstract
We address the problem of large-scale, real-time classification of content posted on social networks, along with the need to rapidly identify novel spam types. Obtaining manual labels for user-generated content through editorial labeling and taxonomy development lags behind the rate at which new content types need to be classified. We propose a class of hierarchical clustering algorithms that can be used both for efficient, scalable real-time multiclass classification and for detecting new anomalies in user-generated content. Our methods have low query time and linear space usage, and come with theoretical guarantees with respect to a specific hierarchical clustering cost function (Dasgupta, 2016). We compare our solutions against a range of classification techniques and demonstrate excellent empirical performance.
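For readers unfamiliar with the cost function cited above, the following is a minimal sketch of how Dasgupta's (2016) hierarchical clustering cost can be evaluated for a given tree: each pair of points contributes its similarity multiplied by the number of leaves under the pair's lowest common ancestor. It is not the paper's algorithm. The nested-tuple tree encoding, the names `leaves`, `dasgupta_cost`, `weight`, and `SIM`, and the toy similarity values are all assumptions made for illustration.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf labels under a tree encoded as nested tuples.

    Assumption: internal nodes are tuples and leaves are non-tuple labels.
    """
    if not isinstance(tree, tuple):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def dasgupta_cost(tree, weight):
    """Sum over leaf pairs of weight(i, j) * |leaves under their lowest common ancestor|."""
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(child) for child in tree]
    n_here = sum(len(group) for group in child_leaves)
    cost = 0.0
    # Pairs separated at this node have this node as their lowest common ancestor.
    for left, right in combinations(child_leaves, 2):
        for i in left:
            for j in right:
                cost += weight(i, j) * n_here
    # Pairs that stay together inside one child are charged further down the tree.
    for child in tree:
        cost += dasgupta_cost(child, weight)
    return cost


# Hypothetical toy similarities: points 0 and 1 are similar, as are 2 and 3.
SIM = {frozenset((0, 1)): 1.0, frozenset((2, 3)): 1.0}


def weight(i, j):
    """Symmetric similarity lookup; unlisted pairs get a small default of 0.1."""
    return SIM.get(frozenset((i, j)), 0.1)


good_tree = ((0, 1), (2, 3))  # merges similar points first
bad_tree = ((0, 2), (1, 3))   # separates similar points at the root
print(dasgupta_cost(good_tree, weight), dasgupta_cost(bad_tree, weight))
```

Under this toy similarity, the tree that keeps similar points together until low in the hierarchy has the smaller cost, which is the behavior the cost function is meant to reward.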