Paper Title

Contrastive Multi-View Textual-Visual Encoding: Towards One Hundred Thousand-Scale One-Shot Logo Identification

Authors

Sharma, Nakul; Penamakuri, Abhirama S.; Mishra, Anand

Abstract

In this paper, we study the problem of identifying logos of business brands in natural scenes in an open-set one-shot setting. This problem setup is significantly more challenging than the traditionally studied 'closed-set' and 'large-scale training samples per category' logo recognition settings. We propose a novel multi-view textual-visual encoding framework that encodes the text appearing in logos as well as their graphical design to learn robust contrastive representations. These representations are jointly learned for multiple views of logos over a batch and therefore generalize well to unseen logos. We evaluate our proposed framework on three tasks: cropped logo verification, cropped logo identification, and end-to-end logo identification in natural scenes, and compare it against state-of-the-art methods. Further, the literature lacks a 'very-large-scale' collection of reference logo images that can facilitate the study of one hundred thousand-scale logo identification. To fill this gap, we introduce the Wikidata Reference Logo Dataset (WiRLD), containing logos for 100K business brands harvested from Wikidata. Our proposed framework achieves an area under the ROC curve of 91.3% on the QMUL-OpenLogo dataset for the verification task, and outperforms state-of-the-art methods by 9.1% and 2.6% on the one-shot logo identification task on the Toplogos-10 and FlickrLogos32 datasets, respectively. Further, we show that our method is more stable than other baselines even when the number of candidate logos is on a 100K scale.
