Paper Title
Mutual Information Based Method for Unsupervised Disentanglement of Video Representation
Paper Authors
Paper Abstract
Video prediction is an interesting and challenging task of predicting future frames from a given set of context frames that belong to a video sequence. Video prediction models have found prospective applications in maneuver planning, healthcare, autonomous navigation, and simulation. One of the major challenges in future frame generation stems from the high-dimensional nature of visual data. In this work, we propose the Mutual Information Predictive Auto-Encoder (MIPAE) framework, which reduces the task of predicting high-dimensional video frames by factorising the video representation into content and easy-to-predict, low-dimensional pose latent variables. A standard LSTM network is used to predict these low-dimensional pose representations. The content and the predicted pose representations are decoded to generate future frames. Our approach leverages the temporal structure of the latent generative factors of a video and a novel mutual information loss to learn disentangled video representations. We also propose a metric based on the Mutual Information Gap (MIG) to quantitatively assess the effectiveness of disentanglement on the dSprites and MPI3D-real datasets. MIG scores corroborate the visual superiority of frames predicted by MIPAE. We also compare our method quantitatively using the evaluation metrics LPIPS, SSIM, and PSNR.
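To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of a MIPAE-style model: a content encoder, a low-dimensional pose encoder, an LSTM that rolls the pose latent forward in time, and a decoder that combines content with predicted pose. All module names, dimensions, and the decorrelation proxy used in place of the paper's mutual information loss are assumptions for illustration, not the authors' exact implementation.

```python
# Illustrative MIPAE-style sketch (assumed 3x64x64 frames, assumed dims/losses).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a frame to a flat latent vector."""
    def __init__(self, latent_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs a frame from concatenated content and pose latents."""
    def __init__(self, content_dim, pose_dim):
        super().__init__()
        self.fc = nn.Linear(content_dim + pose_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
    def forward(self, content, pose):
        h = self.fc(torch.cat([content, pose], dim=-1))
        return self.net(h.view(-1, 128, 8, 8))

content_dim, pose_dim = 128, 8      # pose is deliberately low dimensional
content_enc = Encoder(content_dim)  # time-invariant factors
pose_enc = Encoder(pose_dim)        # time-varying, easy-to-predict factors
decoder = Decoder(content_dim, pose_dim)
pose_predictor = nn.LSTM(pose_dim, 256, batch_first=True)
to_pose = nn.Linear(256, pose_dim)

# video: (batch, time, 3, 64, 64); first frames are context, rest are targets.
video = torch.rand(4, 10, 3, 64, 64)
context, future = video[:, :5], video[:, 5:]

content = content_enc(context[:, -1])  # content taken from one context frame
poses = torch.stack([pose_enc(context[:, t]) for t in range(5)], dim=1)

h, state = pose_predictor(poses)       # warm the LSTM up on context poses
pred_frames, pose_t = [], to_pose(h[:, -1])
for t in range(future.size(1)):        # autoregressive rollout in pose space
    pred_frames.append(decoder(content, pose_t))
    h, state = pose_predictor(pose_t.unsqueeze(1), state)
    pose_t = to_pose(h[:, -1])
pred = torch.stack(pred_frames, dim=1)

recon_loss = nn.functional.mse_loss(pred, future)
# The paper penalises mutual information between content and pose latents to
# enforce disentanglement; a simple cross-covariance penalty is used here as
# an assumed stand-in for that loss, not the paper's actual formulation.
c = content - content.mean(0)
p = pose_t - pose_t.mean(0)
mi_proxy = (c.T @ p).pow(2).mean()
loss = recon_loss + 0.1 * mi_proxy
loss.backward()
```

The key design point the sketch tries to capture is that only the small pose code is predicted forward in time, while the content code is computed once from the context and reused for every future frame.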