Vietnamese Capitalization and Punctuation Recovery Models

Paper title

Vietnamese Capitalization and Punctuation Recovery Models

Authors

Uyen, Hoang Thi Thu; Tu, Nguyen Anh; Huy, Ta Duc

Abstract

Despite the rise of performant methods in Automatic Speech Recognition (ASR), these methods do not ensure proper casing and punctuation in their outputs. This problem significantly impairs comprehension for both Natural Language Processing (NLP) algorithms and human readers. Capitalization and punctuation restoration is therefore imperative in pre-processing pipelines for raw textual inputs. For low-resource languages like Vietnamese, public datasets for this task are scarce. In this paper, we contribute a public dataset for capitalization and punctuation recovery for Vietnamese, and propose a joint model for both tasks named JointCapPunc. Experimental results on the Vietnamese dataset show the effectiveness of our joint model compared to single models and previous joint learning models. We publicly release our dataset and the implementation of our model at https://github.com/anhtunguyen98/JointCapPunc
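The abstract does not say how the two tasks are encoded. One common framing, assumed here purely for illustration, treats both as token-level tagging: each lowercased, unpunctuated token gets one capitalization label and one punctuation label. The sketch below derives such labels from cased, punctuated text. The helper name make_labels and the label names are hypothetical and are not taken from the JointCapPunc repository.

```python
# Hypothetical label scheme: punctuation is attached to the token it follows.
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def make_labels(sentence):
    """Turn cased, punctuated text into (token, cap_label, punct_label) triples.

    The token is lowercased and stripped of trailing punctuation, which
    mimics raw ASR output; the two labels are what a model would predict.
    """
    triples = []
    for raw in sentence.split():
        word = raw.rstrip(",.?")
        if not word:
            continue  # a stray punctuation mark with no word attached
        trailing = raw[len(word):]
        # Only the first trailing mark is kept; "O" means no punctuation.
        punct = PUNCT.get(trailing[:1], "O")
        cap = "UPPER" if word[0].isupper() else "LOWER"
        triples.append((word.lower(), cap, punct))
    return triples


print(make_labels("Xin chào, bạn khỏe không?"))
# [('xin', 'UPPER', 'O'), ('chào', 'LOWER', 'COMMA'), ('bạn', 'LOWER', 'O'),
#  ('khỏe', 'LOWER', 'O'), ('không', 'LOWER', 'QUESTION')]
```

Under this assumed framing, a joint model would predict both labels for every token from a shared encoder, whereas single models would predict each label separately.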
