Paper title
ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation
Authors
Abstract
Text-guided 3D shape generation remains challenging due to the absence of large paired text-shape datasets, the substantial semantic gap between the two modalities, and the structural complexity of 3D shapes. This paper presents a new framework called Image as Stepping Stone (ISS) for the task, introducing 2D images as a stepping stone to connect the two modalities and eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space-alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervision: it first maps the CLIP image feature to the detail-rich shape space of the SVR model, then maps the CLIP text feature to the shape space and optimizes the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel textures. Beyond existing works on 3D shape generation from text, our approach is general, creating shapes in a broad range of categories without requiring paired text-shape data. Experimental results show that our approach outperforms state-of-the-art methods and our baselines in terms of fidelity and consistency with the text. Our approach can also stylize the generated shapes with both realistic and fantasy structures and textures.
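The two alignment stages described in the abstract can be illustrated with a pair of toy objectives. This is only a minimal sketch under stated assumptions: the abstract does not give the loss functions, so the mean-squared-error regression for stage 1 and the cosine-based consistency term for stage 2 are assumed forms, and the function names (`stage1_alignment_loss`, `stage2_clip_consistency_loss`) are hypothetical. Features are plain Python lists standing in for CLIP and SVR feature vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def stage1_alignment_loss(mapped_feature, svr_feature):
    """Stage 1 (assumed form): mean squared error between the mapped CLIP
    image feature and the SVR model's shape-space feature for the same image."""
    return sum((m - s) ** 2 for m, s in zip(mapped_feature, svr_feature)) / len(svr_feature)


def stage2_clip_consistency_loss(text_feature, rendered_view_features):
    """Stage 2 (assumed form): one minus the mean cosine similarity between the
    CLIP text feature and the CLIP features of images rendered from the shape."""
    sims = [cosine_similarity(text_feature, v) for v in rendered_view_features]
    return 1.0 - sum(sims) / len(sims)


# Toy usage: two rendered views, one aligned with the text and one orthogonal to it.
text = [1.0, 0.0]
views = [[2.0, 0.0], [0.0, 3.0]]
loss = stage2_clip_consistency_loss(text, views)
```

In this toy setting the stage 2 loss falls toward zero as the rendered views move closer in direction to the text feature, which is the behavior the abstract describes as encouraging CLIP consistency between the input text and the rendered images.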