Paper Title

FashionVQA: A Domain-Specific Visual Question Answering System

Authors

Min Wang, Ata Mahjoubfar, Anupama Joshi

Abstract

Humans apprehend the world through various sensory modalities, yet language is their predominant communication channel. Machine learning systems need to draw on the same multimodal richness to have informed discourses with humans in natural language; this is particularly true for systems specialized in visually-dense information, such as dialogue, recommendation, and search engines for clothing. To this end, we train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images. The key to the successful training of our VQA model is the automatic creation of a visual question-answering dataset with 168 million samples from item attributes of 207 thousand images using diverse templates. The sample generation employs a strategy that considers the difficulty of the question-answer pairs to emphasize challenging concepts. Contrary to the recent trends in using several datasets for pretraining the visual question answering models, we focused on keeping the dataset fixed while training various models from scratch to isolate the improvements from model architecture changes. We see that using the same transformer for encoding the question and decoding the answer, as in language models, achieves maximum accuracy, showing that visual language models (VLMs) make the best visual question answering systems for our dataset. The accuracy of the best model surpasses the human expert level, even when answering human-generated questions that are not confined to the template formats. Our approach for generating a large-scale multimodal domain-specific dataset provides a path for training specialized models capable of communicating in natural language. The training of such domain-expert models, e.g., our fashion VLM model, cannot rely solely on the large-scale general-purpose datasets collected from the web.
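The abstract describes generating question-answer pairs from item attributes using diverse templates, with sampling biased toward difficult concepts. A minimal sketch of that idea is below; the attribute names, templates, and difficulty weights are all hypothetical illustrations, not the paper's actual data or weighting scheme.

```python
import random

# Hypothetical item attributes, as might be extracted from a fashion catalog.
item = {"category": "dress", "color": "red", "sleeve_length": "long"}

# Illustrative question templates keyed by the attribute they ask about
# (invented for this sketch; the paper's templates are not given in the abstract).
TEMPLATES = {
    "color": [
        "What color is the {category}?",
        "Which color best describes this {category}?",
    ],
    "sleeve_length": [
        "What is the sleeve length of the {category}?",
    ],
}

# Assumed per-attribute difficulty weights: harder concepts get sampled more often,
# emphasizing challenging question-answer pairs as the abstract describes.
DIFFICULTY = {"color": 1.0, "sleeve_length": 2.5}

def generate_qa(item, rng=random):
    """Sample one question-answer pair for an item, biased toward difficult attributes."""
    attrs = [a for a in TEMPLATES if a in item]
    weights = [DIFFICULTY[a] for a in attrs]
    attr = rng.choices(attrs, weights=weights, k=1)[0]
    question = rng.choice(TEMPLATES[attr]).format(**item)
    answer = item[attr]
    return question, answer

q, a = generate_qa(item)
print(q, "->", a)
```

Run over 207 thousand images' attribute records with many such templates, a generator of this shape can scale to the 168 million samples reported in the abstract; the actual pipeline and difficulty estimation are not detailed there.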
