Parameterizing Kterm Hashing

Paper Title

Parameterizing Kterm Hashing

Authors

Wurzer, Dominik; Qin, Yumeng

Abstract

Kterm Hashing provides an innovative approach to novelty detection on massive data streams. Previous research focused on maximizing the efficiency of Kterm Hashing and succeeded in scaling First Story Detection to Twitter-size data streams without sacrificing detection accuracy. In this paper, we focus on improving the effectiveness of Kterm Hashing. Traditionally, all kterms are considered equally important when calculating a document's degree of novelty with respect to the past. We believe that certain kterms are more important than others and hypothesize that uniform kterm weights are sub-optimal for determining novelty in data streams. To validate our hypothesis, we parameterize Kterm Hashing by assigning weights to kterms based on their characteristics. Our experiments apply Kterm Hashing in a First Story Detection setting and reveal that parameterized Kterm Hashing can surpass state-of-the-art detection accuracy and significantly outperform the uniformly weighted approach.
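
The contrast between uniform and weighted kterms can be made concrete with a small sketch. The abstract does not give the hashing scheme, the novelty formula, or the kterm characteristics used for weighting, so everything below is an assumption made for illustration: kterms are taken to be all combinations of up to k distinct terms in a document, they are hashed into a fixed-size memory, and novelty is the weighted share of kterms not seen before. The class name `KtermNovelty`, the helpers `kterms` and `slot`, and the length-based weighting are hypothetical and are not taken from the paper.

```python
import hashlib
from itertools import combinations


def kterms(tokens, k=2):
    """Yield every combination of up to k distinct terms (assumed kterm definition)."""
    terms = sorted(set(tokens))  # sorting makes a kterm independent of word order
    for size in range(1, k + 1):
        yield from combinations(terms, size)


def slot(kterm, num_slots):
    """Map a kterm to a position in the fixed-size memory (hypothetical hash choice)."""
    digest = hashlib.md5(" ".join(kterm).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slots


class KtermNovelty:
    """Illustrative novelty scorer; not the authors' implementation."""

    def __init__(self, num_slots=1 << 20, k=2, weights=None):
        self.seen = bytearray(num_slots)  # one byte per slot, for simplicity
        self.num_slots = num_slots
        self.k = k
        # Hypothetical parameterization: one weight per kterm length.
        # An empty mapping gives every kterm weight 1.0 (the uniform baseline).
        self.weights = weights or {}

    def weight(self, kterm):
        return self.weights.get(len(kterm), 1.0)

    def score_and_update(self, tokens):
        """Return the weighted fraction of unseen kterms, then remember them all."""
        slots = [(slot(kt, self.num_slots), self.weight(kt))
                 for kt in kterms(tokens, self.k)]
        total = sum(w for _, w in slots)
        unseen = sum(w for i, w in slots if not self.seen[i])
        for i, _ in slots:  # update only after scoring against the past
            self.seen[i] = 1
        return unseen / total if total else 0.0


uniform = KtermNovelty()
first = uniform.score_and_update(["earthquake", "hits", "city"])
repeat = uniform.score_and_update(["earthquake", "hits", "city"])
partial = uniform.score_and_update(["earthquake", "hits", "town"])
print(first, repeat, partial)
```

Passing a mapping such as `weights={1: 1.0, 2: 2.0}` makes unseen term pairs count more than unseen single terms, which is one simple way to move away from uniform weighting. Hash collisions can cause an unseen kterm to be treated as seen, so the memory size is a trade-off between space and accuracy.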
