Paper Title
Self-supervised vision-language pretraining for Medical visual question answering
Paper Authors
Paper Abstract
Medical visual question answering (VQA) is the task of answering clinical questions about a given radiographic image. It is a challenging problem that requires a model to integrate both visual and linguistic information. To address the limited amount of training data available for medical VQA, the pretrain-finetune paradigm is widely used to improve model generalization. In this paper, we propose a self-supervised method that applies Masked image modeling, Masked language modeling, Image-text matching, and Image-text alignment via contrastive learning (M2I2) for pretraining on a medical image caption dataset, and fine-tunes on downstream medical VQA tasks. The proposed method achieves state-of-the-art performance on all three public medical VQA datasets. Our code and models are available at https://github.com/pengfeiliHEU/M2I2.
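To make the four pretraining objectives concrete, below is a minimal, hypothetical PyTorch sketch of how such a combined loss could be assembled. The toy encoders, projection heads, dimensions, and naive feature fusion are placeholder assumptions for illustration only, not the authors' architecture; the actual implementation is in the repository linked above.

```python
# Illustrative sketch of combining the four objectives named in the abstract:
# masked image modeling (MIM), masked language modeling (MLM), image-text
# matching (ITM), and image-text contrastive alignment (ITC). All modules
# below are simplified stand-ins, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLPretrainer(nn.Module):
    def __init__(self, dim=256, vocab_size=30522, patch_dim=768):
        super().__init__()
        self.img_proj = nn.Linear(patch_dim, dim)       # stand-in image encoder
        self.txt_embed = nn.Embedding(vocab_size, dim)  # stand-in text encoder
        self.mlm_head = nn.Linear(dim, vocab_size)      # predicts masked tokens
        self.mim_head = nn.Linear(dim, patch_dim)       # reconstructs masked patches
        self.itm_head = nn.Linear(dim, 2)               # matched vs. mismatched pair
        self.temp = 0.07                                # contrastive temperature

    def forward(self, patches, masked_patches, token_ids, masked_token_ids,
                mlm_labels, itm_labels):
        # ITC: align pooled image and text features with a symmetric InfoNCE loss.
        img_feat = F.normalize(self.img_proj(patches).mean(1), dim=-1)
        txt_feat = F.normalize(self.txt_embed(token_ids).mean(1), dim=-1)
        logits = img_feat @ txt_feat.t() / self.temp
        targets = torch.arange(len(logits))
        loss_itc = (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets)) / 2

        # MLM: recover masked text tokens (positions to ignore are labeled -100).
        mlm_logits = self.mlm_head(self.txt_embed(masked_token_ids))
        loss_mlm = F.cross_entropy(mlm_logits.flatten(0, 1),
                                   mlm_labels.flatten(), ignore_index=-100)

        # MIM: reconstruct the original patch features from the masked view.
        recon = self.mim_head(self.img_proj(masked_patches))
        loss_mim = F.mse_loss(recon, patches)

        # ITM: classify a fused image-text pair as matched or mismatched
        # (additive fusion here is a deliberate oversimplification).
        loss_itm = F.cross_entropy(self.itm_head(img_feat + txt_feat), itm_labels)

        return loss_itc + loss_mlm + loss_mim + loss_itm

if __name__ == "__main__":
    model = ToyVLPretrainer()
    B, N, L = 4, 16, 12  # batch size, patches per image, tokens per caption
    loss = model(
        patches=torch.randn(B, N, 768),
        masked_patches=torch.randn(B, N, 768),
        token_ids=torch.randint(0, 30522, (B, L)),
        masked_token_ids=torch.randint(0, 30522, (B, L)),
        mlm_labels=torch.randint(0, 30522, (B, L)),
        itm_labels=torch.randint(0, 2, (B,)),
    )
    loss.backward()  # all four objectives are optimized jointly
```

The single summed loss reflects the usual multi-task setup for this family of vision-language pretraining methods; how the paper weights or schedules the individual objectives is not stated in the abstract.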