Paper title
Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese
Paper authors
Paper abstract
We report on the results of \emph{Worldly~OCR}, a research and prototype-building project dedicated to developing new, more accurate image-to-text conversion software for several languages and writing systems. These include the cursive scripts Farsi and Pashto, as well as Latin cursive scripts. We also describe approaches geared towards Traditional Chinese, which is non-cursive but features an extremely large character set of 65,000 characters. Our methodology is based on Machine Learning, especially Deep Learning, and Data Science, and is directed towards vast quantities of original documents, exceeding a billion pages. The target audience of this paper is a general audience with an interest in Digital Humanities or in the retrieval of accurate full text and metadata from digital images.