Paper title
indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages
Paper authors
Paper abstract
Automatic Speech Recognition (ASR) generates text that is usually devoid of punctuation. The absence of punctuation in text can affect readability. Also, downstream NLP tasks such as sentiment analysis and machine translation benefit greatly from punctuation and sentence boundary information. We present an approach for automatic punctuation of text using a pretrained IndicBERT model. Inverse text normalization is done with hand-written weighted finite state transducer (WFST) grammars. We have developed this tool for 11 Indic languages, namely Hindi, Tamil, Telugu, Kannada, Gujarati, Marathi, Odia, Bengali, Assamese, Malayalam, and Punjabi. All code and data are publicly available.
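The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the label set (`O`, `COMMA`, `PERIOD`, `QUESTION`) and the function names are hypothetical, the labels are supplied by hand instead of being predicted by IndicBERT, and a plain dictionary lookup stands in for the WFST grammars. English tokens are used only for readability; the framework itself targets Indic languages.

```python
# Hypothetical label set: the abstract does not list the actual tag inventory.
PUNCT = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}

# Toy spoken-form vocabulary; a dictionary lookup replaces the WFST grammars here.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def apply_punctuation(tokens, labels):
    """Attach the mark named by each per-token label to the end of its token."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(tok + PUNCT[lab] for tok, lab in zip(tokens, labels))


def inverse_normalize(tokens):
    """Rewrite spoken number words (0-9 and 20-99) as digit strings."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in TENS:
            value = TENS[tok]
            # Absorb a following non-zero unit word: "twenty three" -> 23.
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt in UNITS and UNITS[nxt] > 0:
                value += UNITS[nxt]
                i += 1
            out.append(str(value))
        elif tok in UNITS:
            out.append(str(UNITS[tok]))
        else:
            out.append(tok)
        i += 1
    return out


# Punctuation restoration framed as token classification (labels given by hand).
restored = apply_punctuation(
    ["hello", "how", "are", "you"],
    ["COMMA", "O", "O", "QUESTION"],
)

# Inverse text normalization on a spoken-form transcript.
normalized = inverse_normalize("i have twenty three apples".split())
```

In a real system the labels would come from a token-classification head on the pretrained model, and the number rules would be compiled per language as weighted transducers so that ambiguous readings can be ranked.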