Paper title
Large Scale Font Independent Urdu Text Recognition System
Paper authors
Paper abstract
OCR algorithms have improved significantly in performance recently, mainly due to the growing capabilities of artificial intelligence algorithms. However, this progress is not evenly distributed across languages. Urdu is among the languages that have received little attention, especially from a font-independent perspective. No automated system exists that can reliably recognize printed Urdu text in images and videos across different fonts. To help bridge this gap, we have developed Qaida, a large-scale data set with 256 fonts and a complete Urdu lexicon. We have also developed a Convolutional Neural Network (CNN) based classification model that can recognize Urdu ligatures with 84.2% accuracy. Moreover, we demonstrate that our recognition network can recognize text not only in the fonts it was trained on but also, reliably, in unseen (new) fonts. To this end, this paper makes the following contributions: (i) we introduce a large-scale, multi-font data set for printed Urdu text recognition; (ii) we design, train, and evaluate a CNN-based model for Urdu text recognition; (iii) we experiment with incremental learning methods to produce state-of-the-art results for Urdu text recognition. All experimental choices were thoroughly validated via detailed empirical analysis. We believe that this study can serve as a basis for further improving the performance of font-independent Urdu OCR systems.