Paper Title
VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification
Paper Authors
Paper Abstract
Multimodal learning from document data has achieved great success lately, as it allows semantically meaningful features to be pre-trained as a prior for learnable downstream tasks. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering both intra- and inter-modality relationships. Instead of merging features from different modalities into a joint representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the joint representation space. Extensive experiments on public document classification datasets demonstrate the effectiveness and generality of our model on both low-scale and large-scale datasets.
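The abstract describes an alignment objective that pulls positive sample pairs together while contrasting negatives in a joint representation space. The sketch below shows one plausible instantiation of such an inter-modality alignment term as a symmetric InfoNCE-style loss between vision and language embeddings; it is an illustrative assumption, not the authors' exact implementation, and names such as `temperature` and the two-direction averaging are chosen for clarity.

```python
# Minimal sketch (assumed, not the paper's code) of an InfoNCE-style
# inter-modality contrastive loss between vision and language embeddings.
import torch
import torch.nn.functional as F


def inter_modality_contrastive_loss(vision_emb: torch.Tensor,
                                    language_emb: torch.Tensor,
                                    temperature: float = 0.07) -> torch.Tensor:
    """Matching (vision_i, language_i) pairs are positives; all other
    pairs in the batch act as negatives."""
    # L2-normalize so dot products become cosine similarities.
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(language_emb, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = v @ t.t() / temperature                      # (batch, batch)
    targets = torch.arange(v.size(0), device=v.device)    # diagonal = positives

    # Contrast in both directions (vision->language and language->vision).
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)


# Example usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    vision_emb = torch.randn(8, 256)
    language_emb = torch.randn(8, 256)
    print(inter_modality_contrastive_loss(vision_emb, language_emb))
```

An analogous intra-modality term could reuse the same loss with two augmented views of the same modality; how the intra- and inter-modality terms are weighted and combined is specified in the paper itself.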