Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic N-Gram Rule Generation for Spelling Normalization in Filipino

Paper title

Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic N-Gram Rule Generation for Spelling Normalization in Filipino

Authors

Lorenzo Jaime Yu Flores, Dragomir Radev

Abstract

With 84.75 million Filipinos online, the ability for models to process online text is crucial for developing Filipino NLP applications. To this end, spelling correction is a crucial preprocessing step for downstream processing. However, the lack of data prevents the use of language models for this task. In this paper, we propose an N-Gram + Damerau Levenshtein distance model with automatic rule extraction. We train the model on 300 samples, and show that despite limited training data, it achieves good performance and outperforms other deep learning approaches in terms of accuracy and edit distance. Moreover, the model (1) requires little compute power, (2) trains in little time, thus allowing for retraining, and (3) is easily interpretable, allowing for direct troubleshooting, highlighting the success of traditional approaches over more complex deep learning models in settings where data is unavailable.
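
To make the distance component named in the abstract concrete, here is a minimal sketch of the optimal string alignment distance, a common restricted form of Damerau-Levenshtein distance. It is not the authors' implementation: the function names `osa_distance` and `closest_word` and the tiny vocabulary are hypothetical, and the automatic N-gram rule extraction step is not shown.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and transpositions of
    two adjacent characters, each at cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ti" <-> "it".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def closest_word(word: str, vocabulary: list[str]) -> str:
    """Return the vocabulary entry nearest to `word` (ties broken alphabetically)."""
    return min(vocabulary, key=lambda v: (osa_distance(word, v), v))


# Hypothetical vocabulary, for illustration only.
vocabulary = ["ako", "ikaw", "dito"]
print(closest_word("dtio", vocabulary))
```

In a normalization pipeline of the kind the abstract describes, a lookup like this would presumably run after rule-based substitutions have produced candidate spellings; that ordering is an assumption here, not something the abstract states.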
