Paper Title

Multiview Contrastive Learning for Completely Blind Video Quality Assessment of User Generated Content

Paper Authors

Shankhanil Mitra, Rajiv Soundararajan

Paper Abstract

Completely blind video quality assessment (VQA) refers to a class of quality assessment methods that do not use any reference videos, human opinion scores or training videos from the target database to learn a quality model. The design of this class of methods is particularly important since it can allow for superior generalization in performance across various datasets. We consider the design of completely blind VQA for user generated content. While several deep feature extraction methods have been considered in supervised and weakly supervised settings, such approaches have not been studied in the context of completely blind VQA. We bridge this gap by presenting a self-supervised multiview contrastive learning framework to learn spatio-temporal quality representations. In particular, we capture the common information between frame differences and frames by treating them as a pair of views and similarly obtain the shared representations between frame differences and optical flow. The resulting features are then compared with a corpus of pristine natural video patches to predict the quality of the distorted video. Detailed experiments on multiple camera captured VQA datasets reveal the superior performance of our method over other features when evaluated without training on human scores.
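The abstract describes two technical steps: self-supervised contrastive learning over paired views (frames vs. frame differences, and frame differences vs. optical flow), followed by quality prediction through comparison with a pristine patch corpus. Below is a minimal PyTorch-style sketch of how such a pipeline could look. The toy encoder, the InfoNCE loss, the embedding size, and the NIQE-style Gaussian distance used for the pristine-corpus comparison are all illustrative assumptions for exposition, not the authors' actual architecture or training configuration.

```python
# Minimal sketch of two-view contrastive learning plus a pristine-corpus
# comparison; all design choices here are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallEncoder(nn.Module):
    """Toy patch encoder (illustrative; not the paper's architecture)."""
    def __init__(self, in_ch: int, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(z1, z2, tau: float = 0.1):
    """Symmetric InfoNCE: matching (view-1, view-2) embeddings from the
    same patch are positives; all other pairs in the batch are negatives."""
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def gaussian_distance(pristine_feats, test_feats):
    """NIQE-style comparison with a pristine corpus (an assumption of how
    the comparison might be realized): fit a Gaussian to each feature set
    and measure a Mahalanobis-like distance between them."""
    mu_p, mu_t = pristine_feats.mean(0), test_feats.mean(0)
    cov = (torch.cov(pristine_feats.t()) + torch.cov(test_feats.t())) / 2
    d = (mu_p - mu_t).unsqueeze(0)
    return (d @ torch.linalg.pinv(cov) @ d.t()).sqrt().item()

# One training step on the (frame, frame-difference) view pair; the
# (frame-difference, optical-flow) pair would be trained the same way.
frame_enc, diff_enc = SmallEncoder(3), SmallEncoder(3)
frames = torch.randn(16, 3, 64, 64)  # batch of frame patches
diffs = torch.randn(16, 3, 64, 64)   # corresponding frame differences
loss = info_nce(frame_enc(frames), diff_enc(diffs))
loss.backward()

# Inference: a larger distance from the pristine corpus implies worse quality.
with torch.no_grad():
    pristine = diff_enc(torch.randn(256, 3, 64, 64))  # pristine patch corpus
    test = diff_enc(torch.randn(64, 3, 64, 64))       # distorted video patches
print(gaussian_distance(pristine, test))
```

The InfoNCE objective maximizes a lower bound on the mutual information between the two views, which matches the abstract's goal of capturing the information common to frames and frame differences; the distance-based scoring needs no human opinion scores, which is what makes the method "completely blind."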
