Paper Title

MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model

Authors

Han Fu, Rui Wu, Chenghao Liu, Jianling Sun

Abstract

Nowadays, driven by the increasing concern on diet and health, food computing has attracted enormous attention from both industry and research community. One of the most popular research topics in this domain is Food Retrieval, due to its profound influence on health-oriented applications. In this paper, we focus on the task of cross-modal retrieval between food images and cooking recipes. We present Modality-Consistent Embedding Network (MCEN) that learns modality-invariant representations by projecting images and texts to the same embedding space. To capture the latent alignments between modalities, we incorporate stochastic latent variables to explicitly exploit the interactions between textual and visual features. Importantly, our method learns the cross-modal alignments during training but computes embeddings of different modalities independently at inference time for the sake of efficiency. Extensive experimental results clearly demonstrate that the proposed MCEN outperforms all existing approaches on the benchmark Recipe1M dataset and requires less computational cost.
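
The abstract describes a two-tower retrieval setup: an image encoder and a recipe encoder are trained jointly to project into a shared embedding space, but each tower runs independently at inference time so a database of recipe embeddings can be precomputed. The sketch below illustrates that general pattern under stated assumptions; the encoder architectures, the in-batch triplet loss, and the margin value are illustrative choices, not the paper's exact configuration, and the stochastic latent variables MCEN uses for cross-modal alignment are omitted here.

```python
# Minimal two-tower cross-modal retrieval sketch (assumed architecture, not MCEN's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTower(nn.Module):
    """Projects precomputed CNN image features into the shared embedding space."""
    def __init__(self, feat_dim=2048, embed_dim=1024):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, img_feats):
        return F.normalize(self.proj(img_feats), dim=-1)

class RecipeTower(nn.Module):
    """Encodes recipe tokens with a BiGRU and projects into the shared space."""
    def __init__(self, vocab_size=30000, embed_dim=1024, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.rnn = nn.GRU(300, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, token_ids):
        h, _ = self.rnn(self.embed(token_ids))           # (B, T, 2*hidden)
        return F.normalize(self.proj(h.mean(dim=1)), dim=-1)

def triplet_retrieval_loss(img_emb, rec_emb, margin=0.3):
    """Bidirectional triplet loss over in-batch negatives using cosine similarity."""
    sim = img_emb @ rec_emb.t()                          # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)                        # matched image-recipe pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    loss_i2r = F.relu(margin + sim - pos).masked_fill(mask, 0).mean()      # image -> recipe
    loss_r2i = F.relu(margin + sim.t() - pos).masked_fill(mask, 0).mean()  # recipe -> image
    return loss_i2r + loss_r2i

# At inference the towers are decoupled: recipe embeddings can be indexed offline
# and a query image only needs one forward pass through ImageTower.
```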
