Paper Title
Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs
Paper Authors
Paper Abstract
Radiographs are a versatile diagnostic tool for the detection and assessment of pathologies, for treatment planning, or for navigation and localization purposes in clinical interventions. However, their interpretation and assessment by radiologists can be tedious and error-prone. Thus, a wide variety of deep learning methods have been proposed to support radiologists in interpreting radiographs. Mostly, these approaches rely on convolutional neural networks (CNNs) to extract features from images. Especially for the multi-label classification of pathologies on chest radiographs (Chest X-Rays, CXR), CNNs have proven to be well suited. In contrast, Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images and their interpretable local saliency maps, which could add value to clinical interventions. ViTs do not rely on convolutions but on patch-based self-attention, and in contrast to CNNs, no prior knowledge of local connectivity is built in. While this leads to increased capacity, ViTs typically require an excessive amount of training data, which represents a hurdle in the medical domain, as high costs are associated with collecting large medical data sets. In this work, we systematically compare the classification performance of ViTs and CNNs for different data set sizes and evaluate more data-efficient ViT variants (DeiT). Our results show that while the performance of ViTs and CNNs is on par, with a small benefit for ViTs, DeiTs outperform the former if a reasonably large data set is available for training.
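The abstract contrasts convolutional feature extraction with the patch-based self-attention used by ViTs/DeiTs for multi-label CXR classification. As a rough illustration only, and not the authors' implementation, the PyTorch sketch below shows how such a setup can look: the image is split into non-overlapping patches, encoded by a Transformer, and a linear head produces one independent logit per pathology. The layer sizes, the single-channel input, and the 14-label output are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class MiniViTMultiLabel(nn.Module):
    """Minimal ViT-style multi-label classifier (illustrative sketch only)."""

    def __init__(self, img_size=224, patch_size=16, embed_dim=384,
                 depth=6, num_heads=6, num_labels=14):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: split the image into patches and project each
        # patch to a token (implemented as a strided convolution).
        self.patch_embed = nn.Conv2d(1, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # One logit per pathology; labels are not mutually exclusive.
        self.head = nn.Linear(embed_dim, num_labels)

    def forward(self, x):                       # x: (B, 1, H, W) grayscale CXR
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)           # patch-based self-attention
        return self.head(tokens[:, 0])          # logits from the [CLS] token

model = MiniViTMultiLabel()
criterion = nn.BCEWithLogitsLoss()              # independent sigmoid per label
logits = model(torch.randn(2, 1, 224, 224))
loss = criterion(logits, torch.randint(0, 2, (2, 14)).float())
```

The multi-label setting shows up in the loss: BCEWithLogitsLoss treats each pathology as an independent binary decision rather than applying a softmax over mutually exclusive classes. In practice, a pretrained ViT or DeiT backbone from a model zoo would typically replace the small encoder sketched here.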