Paper Title
Evaluation of Neural Network Classification Systems on Document Stream
Paper Authors
Paper Abstract
One major drawback of state-of-the-art neural network (NN)-based approaches to document classification is the large number of training samples required to obtain an efficient classification. The minimum required number is around one thousand annotated documents for each class. In many cases it is very difficult, if not impossible, to gather this number of samples in real industrial processes. In this paper, we analyse the efficiency of NN-based document classification systems in a sub-optimal training case, based on the situation of a company document stream. We evaluated three different approaches, one based on image content and two on textual content. The evaluation was divided into four parts: a reference case, to assess the performance of the system in the lab; two cases that each simulated a specific difficulty linked to document stream processing; and a realistic case that combined all of these difficulties. The realistic case highlighted a significant drop in the efficiency of NN-based document classification systems. Although they remain efficient for well-represented classes (with over-fitting of the system on those classes), they cannot appropriately handle less well-represented classes. NN-based document classification systems need to be adapted to resolve these two problems before they can be considered for use in a company document stream.