Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention

Paper title

Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention

Authors

Varnith Chordia, Vijay Kumar BG

Abstract

Accurate and efficient product classification is important for e-commerce applications, as it enables downstream tasks such as recommendation, retrieval, and pricing. Items often contain both textual and visual information, and using both modalities usually outperforms classification with either modality alone. In this paper, we describe our methodology and results for the SIGIR eCom Rakuten Data Challenge. We employ a dual attention technique to model image-text relationships using pretrained language and image embeddings. While dual attention has been widely used for Visual Question Answering (VQA) tasks, ours is the first attempt to apply the concept to multimodal classification.
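To make the idea of attending jointly over text and image embeddings concrete, the following is a minimal sketch of one simple co-attention variant. It is not the authors' architecture, which the abstract does not specify. The function name `co_attention`, the max-pooled affinity scoring, and the assumption that text and image embeddings already share the same dimension (no learned projections) are all illustrative choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine vectors into one vector using the given weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def co_attention(text_vecs, image_vecs):
    """Hypothetical co-attention sketch (assumed, not from the paper).

    text_vecs:  one embedding per text token
    image_vecs: one embedding per image region
    Both are assumed to have the same dimensionality.
    """
    # affinity[i][j] = similarity between text token i and image region j
    affinity = [[dot(t, v) for v in image_vecs] for t in text_vecs]

    # Score each token by its best-matching region, and each region by its
    # best-matching token (max pooling over the affinity matrix).
    text_scores = [max(row) for row in affinity]
    image_scores = [
        max(affinity[i][j] for i in range(len(text_vecs)))
        for j in range(len(image_vecs))
    ]

    text_weights = softmax(text_scores)
    image_weights = softmax(image_scores)

    # Attention-weighted summaries of each modality, concatenated into one
    # fused feature vector that a classifier could consume.
    fused = weighted_sum(text_weights, text_vecs) + weighted_sum(
        image_weights, image_vecs
    )
    return fused, text_weights, image_weights


# Toy inputs: 2 text tokens and 3 image regions, all 2-dimensional.
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
image_vecs = [[1.0, 0.0], [2.0, 0.0], [0.0, 0.5]]
fused, text_weights, image_weights = co_attention(text_vecs, image_vecs)
```

In a real system the embeddings would come from pretrained language and image models and the fused vector would feed a trained classification layer; the toy vectors here only show how each modality's attention weights depend on the other.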
