CompText: Visualizing, Comparing & Understanding Text Corpus

Paper Title

CompText: Visualizing, Comparing & Understanding Text Corpus

Authors

Varshney, Suvi, Jas, Divjeet Singh

Abstract

A common practice in Natural Language Processing (NLP) is to visualize a text corpus so that readers can grasp its central ideas and key points without reading the entire text. For a long time, researchers focused on extracting topics from the text and visualizing them according to their relative significance in the corpus. More recently, researchers have proposed more complex systems that expose not only the topics of a corpus but also the words closely related to each topic, giving users a holistic view. These detailed visualizations spawned research on comparing text corpora based on their visualizations. Topics are often compared to identify the differences between corpora. However, to capture richer semantics across corpora, researchers have started to compare texts based on the sentiment of the topics they contain. By comparing the words that carry the most weight, we can get an idea of the important topics in a corpus. Several existing text comparison methods compare topics rather than sentiments, but we believe that focusing on sentiment-carrying words gives a better comparison of two corpora. Only sentiment can convey the real feeling of a text; topics without sentiment are just nouns. We therefore aim to differentiate corpora with a focus on sentiment, as opposed to comparing all the words that appear in the two corpora. The rationale is that two corpora do not share many identical words for side-by-side comparison, so comparing the sentiment words gives us an idea of how each corpus appeals to the reader's emotions. We also argue that the entropy (the unexpectedness) and divergence of topics should be taken into account, since they help identify key pivot points and the importance of certain topics in the corpus alongside relative sentiment.
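The idea of comparing corpora through their sentiment-carrying words, together with entropy and divergence, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names no specific lexicon or divergence measure, so the tiny `SENTIMENT_WORDS` set, the whitespace tokenizer, and the choice of Jensen-Shannon divergence are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (an assumption); a real system would use a
# full sentiment lexicon or a trained classifier.
SENTIMENT_WORDS = {"good", "great", "love", "bad", "poor", "hate"}


def sentiment_distribution(text):
    """Relative frequency of each lexicon word among the sentiment words in text."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t in SENTIMENT_WORDS)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def entropy(p):
    """Shannon entropy in bits of a distribution given as {word: probability}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    vocab = set(p) | set(q)
    # Mixture distribution over the union of both vocabularies.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mixture(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


corpus_a = "Great food, great service. I love it!"
corpus_b = "Bad food and poor service. I hate it."

dist_a = sentiment_distribution(corpus_a)
dist_b = sentiment_distribution(corpus_b)

print("Sentiment words in A:", dist_a)
print("Sentiment words in B:", dist_b)
print("Entropy of A (bits):", entropy(dist_a))
print("Entropy of B (bits):", entropy(dist_b))
print("JS divergence (bits):", js_divergence(dist_a, dist_b))
```

Both toy corpora discuss the same topics (food, service), so a topic-only comparison would treat them as similar, while the sentiment-word distributions share no words at all and the divergence reaches its maximum.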
