Paper Title
RoentGen: Vision-Language Foundation Model for Chest X-ray Generation
Paper Authors
Abstract
Multimodal models trained on large natural image-text pair datasets have exhibited astounding abilities in generating high-quality images. Medical imaging data is fundamentally different from natural images, and the language used to succinctly capture relevant details in medical data employs a different, narrow but semantically rich, domain-specific vocabulary. Not surprisingly, multimodal models trained on natural image-text pairs do not tend to generalize well to the medical domain. Developing generative imaging models that faithfully represent medical concepts while providing compositional diversity could mitigate the existing paucity of high-quality, annotated medical imaging datasets. In this work, we develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest X-rays (CXR) and their corresponding radiology (text) reports. We investigate the model's ability to generate high-fidelity, diverse synthetic CXRs conditioned on text prompts. We assess the model outputs quantitatively using image quality metrics, and have human domain experts evaluate image quality and text-image alignment. We present evidence that the resulting model (RoentGen) can create visually convincing, diverse synthetic CXR images, and that its output can be controlled to a new extent by using free-form text prompts, including radiology-specific language. Fine-tuning this model on a fixed training set and using it as a data augmentation method, we measure a 5% improvement for a classifier trained jointly on synthetic and real images, and a 3% improvement when trained on a larger but purely synthetic training set. Finally, we observe that this fine-tuning distills in-domain knowledge into the text encoder and can improve its representation of certain diseases, such as pneumothorax, by 25%.
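To make the adaptation step concrete, below is a minimal NumPy sketch of the text-conditioned denoising objective that latent diffusion fine-tuning optimizes: noise an image latent according to the forward diffusion schedule, predict that noise conditioned on a text embedding, and minimize the mean-squared error between true and predicted noise. All dimensions and the toy linear "denoiser" are illustrative assumptions; RoentGen's actual setup (a VAE latent space, a CLIP text encoder, and a UNet denoiser, as in Stable Diffusion) is far larger and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions standing in for the real model (illustrative only).
LATENT_DIM, TEXT_DIM, T = 16, 8, 1000

# Linear beta schedule and cumulative alpha products, as in DDPM.
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# Hypothetical stand-in denoiser: a fixed random linear map from the
# concatenated (noisy latent, text embedding) to a noise estimate.
W = rng.normal(scale=0.1, size=(LATENT_DIM + TEXT_DIM, LATENT_DIM))

def diffusion_loss(z0, text_emb):
    """One fine-tuning step of the latent diffusion objective:
    sample a timestep, noise the latent, predict the noise from the
    noisy latent plus the text conditioning, and return the MSE."""
    t = rng.integers(T)
    eps = rng.normal(size=z0.shape)
    # Forward process: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps
    z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = np.concatenate([z_t, text_emb]) @ W
    return float(np.mean((eps - eps_hat) ** 2))

# Dummy "CXR latent" and "radiology report embedding".
z0 = rng.normal(size=LATENT_DIM)
report_emb = rng.normal(size=TEXT_DIM)
loss = diffusion_loss(z0, report_emb)
print(loss)
```

In the paper's setting, gradients of this loss would update the denoiser (and, per the abstract's final observation, fine-tuning also distills domain knowledge into the text encoder); here the denoiser is frozen purely to keep the sketch self-contained.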