Paper Title

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

Paper Authors

Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu

Paper Abstract

We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embeddings in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision-and-language approaches. Pixel-BERT, which aligns semantics at the pixel and text levels, addresses the limitation of task-specific visual representations for vision-and-language tasks. It also relieves the cost of bounding box annotations and overcomes the imbalance between the semantic labels of visual tasks and language semantics. To provide a better representation for downstream tasks, we pre-train a universal end-to-end model with image and sentence pairs from the Visual Genome and MS-COCO datasets. We propose a random pixel sampling mechanism to enhance the robustness of the visual representation, and apply the Masked Language Model and Image-Text Matching as pre-training tasks. Extensive experiments on downstream tasks with our pre-trained model show that our approach achieves state-of-the-art results in downstream tasks, including Visual Question Answering (VQA), image-text retrieval, and Natural Language for Visual Reasoning for Real (NLVR). In particular, we boost the performance of a single model on the VQA task by 2.17 points compared with the SOTA under fair comparison.
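
The random pixel sampling mechanism mentioned in the abstract can be illustrated with a minimal PyTorch sketch: from the spatial feature map produced by the visual backbone, a random subset of pixel features is kept and fed to the transformer as visual tokens. The function name, tensor shapes, and the sample count of 100 below are illustrative assumptions, not the authors' exact implementation.

```python
import torch


def random_pixel_sampling(feature_map: torch.Tensor, num_samples: int = 100) -> torch.Tensor:
    """Randomly sample a subset of pixel features from a CNN feature map.

    feature_map: (batch, channels, height, width) output of a visual backbone.
    Returns: (batch, num_samples, channels) sampled pixel features.
    """
    b, c, h, w = feature_map.shape
    # Flatten the spatial dimensions: (batch, h*w, channels)
    pixels = feature_map.flatten(2).transpose(1, 2)
    num_samples = min(num_samples, h * w)
    sampled = []
    for i in range(b):
        # Draw an independent random subset of spatial positions for each image
        idx = torch.randperm(h * w, device=feature_map.device)[:num_samples]
        sampled.append(pixels[i, idx])
    return torch.stack(sampled, dim=0)


# Usage sketch: sample 100 pixel features from a dummy 2048-channel feature map
features = torch.randn(2, 2048, 12, 20)
visual_tokens = random_pixel_sampling(features, num_samples=100)
print(visual_tokens.shape)  # torch.Size([2, 100, 2048])
```

Sampling only a subset of pixel positions during pre-training both regularizes the visual representation and reduces the length of the visual token sequence passed to the transformer.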
