Paper Title
Text2Light: Zero-Shot Text-Driven HDR Panorama Generation
Paper Authors
Paper Abstract
High-quality HDRIs (High Dynamic Range Images), typically HDR panoramas, are one of the most popular ways to create photorealistic lighting and 360-degree reflections for 3D scenes in graphics. Given the difficulty of capturing HDRIs, a versatile and controllable generative model is highly desired, one where layman users can intuitively control the generation process. However, existing state-of-the-art methods still struggle to synthesize high-quality panoramas for complex scenes. In this work, we propose a zero-shot text-driven framework, Text2Light, to generate 4K+ resolution HDRIs without paired training data. Given a free-form text description of the scene, we synthesize the corresponding HDRI in two dedicated steps: 1) text-driven panorama generation in low dynamic range (LDR) and low resolution, and 2) super-resolution inverse tone mapping to scale up the LDR panorama in both resolution and dynamic range. Specifically, to achieve zero-shot text-driven panorama generation, we first build dual codebooks as the discrete representation for diverse environmental textures. Then, driven by the pre-trained CLIP model, a text-conditioned global sampler learns to sample holistic semantics from the global codebook according to the input text. Furthermore, a structure-aware local sampler learns to synthesize LDR panoramas patch-by-patch, guided by holistic semantics. To achieve super-resolution inverse tone mapping, we derive a continuous representation of 360-degree imaging from the LDR panorama as a set of structured latent codes anchored to the sphere. This continuous representation enables a versatile module to upscale the resolution and dynamic range simultaneously. Extensive experiments demonstrate the superior capability of Text2Light in generating high-quality HDR panoramas. In addition, we show the feasibility of our work in realistic rendering and immersive VR.
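The "structured latent codes anchored to the sphere" build on the standard equirectangular parameterization of 360-degree images, in which each panorama pixel corresponds to a unit direction on the sphere. The sketch below illustrates only this standard coordinate mapping; `equirect_to_sphere` is an illustrative helper, not code from the paper, and the paper's learned latent codes and upscaling module are not reproduced here.

```python
import math

def equirect_to_sphere(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit direction on the sphere.

    Illustrative convention (not from the paper): u in [0, width) spans
    longitude [-pi, pi), and v in [0, height) spans latitude from +pi/2
    (top row) down to -pi/2 (bottom row).
    """
    lon = (u / width) * 2.0 * math.pi - math.pi       # longitude
    lat = math.pi / 2.0 - (v / height) * math.pi      # latitude
    # Convert spherical angles to a Cartesian unit vector (y is "up").
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)
```

Because this mapping is continuous over the sphere, per-direction features (such as latent codes) can be queried at arbitrary resolution, which is what makes simultaneous resolution and dynamic-range upscaling possible in a single module.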