Paper Title

LAION-5B: An open large-scale dataset for training next generation image-text models

Paper Authors

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, Jenia Jitsev

Paper Abstract

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B - a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled with an openly available dataset of this scale. Additionally we provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content detection. Announcement page https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
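To illustrate the kind of CLIP-based filtering the abstract refers to, the sketch below scores an image-caption pair with CLIP embeddings and keeps it only above a similarity threshold. This is a minimal illustration, not the LAION-5B pipeline itself: the ViT-B-32 backbone, the open_clip library, and the 0.28 threshold are assumptions for demonstration.

```python
# Minimal sketch of CLIP-similarity filtering of image-text pairs.
# Assumptions (not the exact LAION-5B setup): open_clip with the OpenAI
# ViT-B-32 weights and an illustrative similarity threshold of 0.28.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def clip_similarity(image_path: str, caption: str) -> float:
    """Cosine similarity between the CLIP embeddings of an image and its caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    tokens = tokenizer([caption])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(tokens)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()


def keep_pair(image_path: str, caption: str, threshold: float = 0.28) -> bool:
    """Retain an image-text pair only if its CLIP similarity clears the threshold."""
    return clip_similarity(image_path, caption) >= threshold
```

Filtering web-crawled pairs this way discards captions that do not describe their images, which is what allows a dataset of this scale to be assembled from noisy data without manual labels.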
