Title
Complex Scene Image Editing by Scene Graph Comprehension
Authors
Abstract
Conditional diffusion models have demonstrated impressive performance on various tasks like text-guided semantic image editing. Prior work requires image regions to be identified manually by human users or uses an object detector that only performs well for object-centric manipulations. For example, if an input image contains multiple objects with the same semantic meaning (such as a group of birds), object detectors may struggle to recognize and localize the target object, let alone accurately manipulate it. To address these challenges, we propose a two-stage method for achieving complex scene image editing by Scene Graph Comprehension (SGC-Net). In the first stage, we train a Region of Interest (RoI) prediction network that uses scene graphs and predicts the locations of the target objects. Unlike object detection methods based solely on object category, our method can accurately recognize the target object by comprehending the objects and their semantic relationships within a complex scene. The second stage uses a conditional diffusion model to edit the image based on our RoI predictions. We evaluate the effectiveness of our approach on the CLEVR and Visual Genome datasets. We report an 8-point improvement in SSIM on CLEVR, and our edited images were preferred by human users by 9-33% over prior work on Visual Genome, validating the effectiveness of our proposed method. Code is available at github.com/Zhongping-Zhang/SGC_Net.
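The two-stage pipeline above can be sketched as follows. This is a minimal, illustrative stand-in, not the SGC-Net implementation: the data structures, function names, and the rule-based "RoI prediction" are all assumptions for demonstration. Stage 1 resolves the target object's location from its relationship to a reference object (rather than from category alone), and Stage 2 is a placeholder for a diffusion model conditioned on that RoI.

```python
# Hypothetical sketch of a scene-graph-guided editing pipeline.
# All names and logic here are illustrative, not the authors' code.
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)

@dataclass
class Node:
    category: str
    bbox: Optional[Box] = None   # context objects have known boxes

@dataclass
class Edge:
    subject: int    # index of the object to edit
    predicate: str  # e.g. "left of"
    obj: int        # index of a reference object

def predict_roi(nodes: List[Node], edges: List[Edge],
                target: int, size: Tuple[int, int] = (40, 40)) -> Optional[Box]:
    """Stage 1 stand-in: locate the target via its semantic relationship
    to a reference object, disambiguating same-category objects."""
    w, h = size
    for e in edges:
        if e.subject != target or nodes[e.obj].bbox is None:
            continue
        rx, ry, rw, rh = nodes[e.obj].bbox
        if e.predicate == "left of":
            return (rx - w, ry, w, h)
        if e.predicate == "right of":
            return (rx + rw, ry, w, h)
    return None

def edit_image(image: List[List[int]], roi: Box, value: int) -> List[List[int]]:
    """Stage 2 stand-in: a real system conditions a diffusion model on the
    RoI; here we simply overwrite the region to show where the edit lands."""
    x, y, w, h = roi
    out = [row[:] for row in image]
    for j in range(max(0, y), min(len(out), y + h)):
        for i in range(max(0, x), min(len(out[0]), x + w)):
            out[j][i] = value
    return out

# Example: two "birds"; the target is the one "left of" the tree.
nodes = [Node("bird"),
         Node("bird", bbox=(200, 50, 40, 40)),
         Node("tree", bbox=(100, 40, 60, 120))]
edges = [Edge(subject=0, predicate="left of", obj=2)]
roi = predict_roi(nodes, edges, target=0)
print(roi)  # (60, 40, 40, 40)
```

The key point the sketch illustrates is that a plain category-based detector cannot tell the two birds apart, while the relationship edge ("left of" the tree) pins down which one to edit.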