Paper Title
Vartani Spellcheck -- Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance
Authors
Abstract
Traditional Optical Character Recognition (OCR) systems that generate text in highly inflectional Indic languages such as Hindi tend to suffer from poor accuracy because of a large alphabet, compound characters, and the difficulty of segmenting characters within a word. Automatic spelling error detection and context-sensitive error correction can improve accuracy by post-processing the text these OCR systems generate. Most previously developed language models for Hindi spelling correction have been context-free. In this paper, we present Vartani Spellcheck, a context-sensitive approach to spelling correction of Hindi text that uses a state-of-the-art transformer, BERT, in conjunction with the Levenshtein distance algorithm, popularly known as edit distance. We use a lookup dictionary and context-based named entity recognition (NER) to detect possible spelling errors in the text. Our proposed technique has been tested on a large corpus of text generated by the widely used Tesseract OCR on the Hindi epic Ramayana. With an accuracy of 81%, the results show a significant improvement over some of the previously established context-sensitive error correction mechanisms for Hindi. We also explain how Vartani Spellcheck may be used for on-the-fly autocorrect suggestions during continuous typing in a text editor environment.
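The detect-then-correct idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`levenshtein`, `detect_errors`, `pick_correction`), the `max_distance` threshold, and the tie-breaking rule are all assumptions made for illustration. The candidate list is assumed to come from a masked language model such as BERT, ordered from most to least likely; the model call itself is left out so the example stays self-contained.

```python
from typing import Iterable, List, Optional, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings.

    Works on Unicode code points, so a Devanagari vowel sign (matra)
    counts as its own character.
    """
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def detect_errors(tokens: List[str], dictionary: Set[str],
                  named_entities: Set[str] = frozenset()) -> List[int]:
    """Return indices of tokens that are neither dictionary words nor
    named entities (hypothetical stand-in for lookup + NER detection)."""
    return [i for i, tok in enumerate(tokens)
            if tok not in dictionary and tok not in named_entities]


def pick_correction(wrong: str, candidates: Iterable[str],
                    max_distance: int = 2) -> Optional[str]:
    """Choose the candidate closest to the misspelled word.

    `candidates` is assumed to be ordered by contextual likelihood, so
    ties in edit distance keep the earlier (more likely) candidate.
    Returns None when no candidate is within `max_distance` edits.
    """
    best, best_distance = None, max_distance + 1
    for cand in candidates:
        dist = levenshtein(wrong, cand)
        if dist < best_distance:
            best, best_distance = cand, dist
    return best


# Usage: one OCR error in a two-token sentence.
tokens = ["राम", "रामायन"]
vocab = {"राम", "रामायण", "कथा"}
flagged = detect_errors(tokens, vocab)
context_candidates = ["कथा", "रामायण"]  # assumed masked-LM output
fixed = pick_correction(tokens[flagged[0]], context_candidates)
```

Combining the two signals this way means the language model proposes words that fit the context while the edit distance keeps only those that plausibly resemble what the OCR produced.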