Paper title
Towards Semantic Noise Cleansing of Categorical Data based on Semantic Infusion
Authors
Abstract
Semantic noise significantly affects text analytics in domain-specific industries. It impedes text understanding, which is of prime importance in critical decision-making tasks. In this work, we formalize semantic noise as a sequence of terms that do not contribute to the narrative of the text. We look beyond the standard notion of statistics-based stop words and consider the semantics of terms to exclude semantic noise. We present a novel Semantic Infusion technique that associates metadata with categorical corpus text, and we demonstrate its near-lossless nature. Based on this technique, we propose an unsupervised text-preprocessing framework that filters semantic noise using the context of the terms. Finally, we present evaluation results for the proposed framework on a web forum dataset from the automobile domain.