Automated Testing of Image Captioning Systems

Paper Title

Automated Testing of Image Captioning Systems

Authors

Boxi Yu, Zhiqing Zhong, Xinran Qin, Jiayi Yao, Yuancheng Wang, Pinjia He

Abstract

Image captioning (IC) systems, which automatically generate a text description of the salient objects in an image (real or synthetic), have seen great progress over the past few years due to the development of deep neural networks. IC plays an indispensable role in human society, for example, labeling massive numbers of photos for scientific studies and assisting visually impaired people in perceiving the world. However, even top-notch IC systems, such as Microsoft Azure Cognitive Services and IBM Image Caption Generator, may return incorrect results, leading to the omission of important objects, deep misunderstanding, and threats to personal safety. To address this problem, we propose MetaIC, the first metamorphic testing approach to validate IC systems. Our core idea is that the object names should exhibit directional changes after object insertion. Specifically, MetaIC (1) extracts objects from existing images to construct an object corpus; (2) inserts an object into an image via novel object resizing and location tuning algorithms; and (3) reports image pairs whose captions do not exhibit differences in an expected way. In our evaluation, we use MetaIC to test one widely adopted image captioning API and five state-of-the-art (SOTA) image captioning models. Using 1,000 seeds, MetaIC successfully reports 16,825 erroneous issues with high precision (84.9%-98.4%). There are three kinds of errors: misclassification, omission, and incorrect quantity. We visualize the errors reported by MetaIC, which shows that a flexible overlapping setting facilitates IC testing by increasing and diversifying the reported errors. In addition, MetaIC can be further generalized to detect label errors in training datasets: it successfully detected 151 incorrect labels in MS COCO Caption, a standard dataset in image captioning.
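The core idea in the abstract (object names should change in a predictable direction after an object is inserted) can be illustrated with a minimal sketch. This is not MetaIC's implementation: the vocabulary `OBJECT_NAMES`, the number-word table, the naive plural handling, and the function names are all hypothetical simplifications, and the captions are assumed to have already been produced by some IC system for a seed image and for the same image with one object inserted.

```python
from collections import Counter

# Hypothetical object vocabulary and number words, for illustration only.
OBJECT_NAMES = {"dog", "cat", "person", "car", "bench", "umbrella"}
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3}


def count_objects(caption: str) -> Counter:
    """Count known object names in a caption.

    A number word directly before the object name sets its quantity;
    otherwise the quantity defaults to 1. Plurals are handled naively
    by stripping a trailing "s".
    """
    tokens = caption.lower().replace(".", " ").replace(",", " ").split()
    counts = Counter()
    for i, tok in enumerate(tokens):
        name = tok[:-1] if tok.endswith("s") and tok[:-1] in OBJECT_NAMES else tok
        if name in OBJECT_NAMES:
            qty = NUMBER_WORDS.get(tokens[i - 1], 1) if i > 0 else 1
            counts[name] += qty
    return counts


def violates_relation(seed_caption: str, new_caption: str, inserted: str) -> bool:
    """Return True if the caption pair should be reported as suspicious.

    Assumed relation: after inserting one object of class `inserted`,
    the new caption should mention exactly the seed caption's objects
    plus one more `inserted` object.
    """
    expected = count_objects(seed_caption)
    expected[inserted] += 1
    return count_objects(new_caption) != expected
```

Under this simplified relation, a caption pair is flagged when the inserted object is missing (omission), appears under a different name (misclassification), or appears with the wrong count (incorrect quantity). For example, `violates_relation("a dog on a bench", "three dogs on a bench", "dog")` flags the pair because two dogs were expected.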
