Paper Title
Towards Automatic Construction of Filipino WordNet: Word Sense Induction and Synset Induction Using Sentence Embeddings
Authors
Abstract
Wordnets are indispensable tools for various natural language processing applications. Unfortunately, wordnets become outdated, and producing or updating them can be slow and costly in terms of time and resources. This problem intensifies for low-resource languages. This study proposes a method for word sense induction and synset induction that uses only two linguistic resources: an unlabeled corpus and a language model based on sentence embeddings. The resulting sense inventory and synonym sets can be used to create a wordnet automatically. We applied this method to a corpus of Filipino text. The sense inventory and synsets were evaluated by matching them against the sense inventory of the machine-translated Princeton WordNet and by comparing the synsets to the Filipino WordNet. This study empirically shows that 30% of the induced word senses are valid and 40% of the induced synsets are valid, of which 20% are novel synsets.
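To make the idea of word sense induction from an unlabeled corpus concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name the embedding model or the clustering algorithm, so everything here is an assumption for illustration: `embed` is a bag-of-words stand-in for a real sentence-embedding model, `induce_senses` uses a simple greedy cosine-similarity clustering, the `threshold` value is arbitrary, and the example sentences are English toy data rather than Filipino text.

```python
import math
from collections import Counter

# Hypothetical stopword list for the toy example only.
STOPWORDS = {"the", "a", "was", "we", "from"}

def embed(sentence, target):
    """Stand-in for a sentence-embedding model (assumption):
    a bag-of-words count vector over the context words."""
    words = sentence.lower().split()
    return Counter(w for w in words if w != target and w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def induce_senses(sentences, target, threshold=0.3):
    """Greedy clustering (assumption): each sentence joins the most similar
    existing cluster if the similarity reaches the threshold; otherwise it
    starts a new cluster. Each cluster is treated as one induced sense."""
    clusters = []  # each item: {"centroid": Counter, "members": [str]}
    for sentence in sentences:
        vec = embed(sentence, target)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [sentence]})
        else:
            best["centroid"].update(vec)  # accumulate counts into the centroid
            best["members"].append(sentence)
    return [cluster["members"] for cluster in clusters]

SENTENCES = [
    "the bank approved the loan",
    "the bank raised loan interest",
    "the river bank was muddy",
    "we fished from the muddy river bank",
]

senses = induce_senses(SENTENCES, "bank")
for i, members in enumerate(senses, start=1):
    print(f"sense {i}: {members}")
```

In this toy run the two financial contexts and the two river contexts end up in separate clusters. A realistic pipeline would presumably replace `embed` with dense vectors from a sentence-embedding model and could then group induced senses of different words into candidate synsets by comparing their cluster centroids.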