Paper Title
Action-Constrained Reinforcement Learning for Frame-Level Bit Allocation in HEVC/H.265 through Frank-Wolfe Policy Optimization
Authors
Abstract
This paper presents a reinforcement learning (RL) framework that leverages Frank-Wolfe policy optimization to address frame-level bit allocation for HEVC/H.265. Most previous RL-based approaches adopt a single-critic design, which weights the rewards for distortion minimization and rate regularization by an empirically chosen hyperparameter. More recently, a dual-critic design was proposed that updates the actor network by alternating between the rate and distortion critics. However, the convergence of its training is not guaranteed. To address this issue, we introduce Neural Frank-Wolfe Policy Optimization (NFWPO) to formulate frame-level bit allocation as an action-constrained RL problem. In this new framework, the rate critic specifies a feasible action set, and the distortion critic updates the actor network to maximize reconstruction quality while conforming to the action constraint. Experimental results show that, when trained to optimize the Video Multimethod Assessment Fusion (VMAF) metric, our NFWPO-based model outperforms both the single-critic and the dual-critic methods. It also demonstrates rate-distortion performance comparable to the two-pass average bit rate control of x265.
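The constrained update described in the abstract can be illustrated with a generic Frank-Wolfe iteration. The sketch below is a toy under stated assumptions, not the paper's implementation: the feasible set is a fixed box (standing in for whatever set a rate critic would specify), the objective is a hand-written concave function `quality` (standing in for a distortion critic), and all names (`lmo_box`, `frank_wolfe_step`, `best`, `lo`, `hi`) are hypothetical. No neural networks are involved.

```python
import numpy as np

def lmo_box(grad, lo, hi):
    """Linear maximization oracle over a box: argmax of <c, grad> for lo <= c <= hi."""
    return np.where(grad > 0, hi, lo)

def frank_wolfe_step(action, grad, lo, hi, step):
    """Move a feasible action toward the oracle's vertex.

    With 0 <= step <= 1 the result is a convex combination of two feasible
    points, so it stays inside the box without any projection.
    """
    target = lmo_box(grad, lo, hi)
    return action + step * (target - action)

# Toy stand-in for a distortion critic (assumption): a concave quality
# score whose unconstrained maximum is at `best`.
best = np.array([0.8, 0.3])

def quality(a):
    return -np.sum((a - best) ** 2)

def quality_grad(a):
    return -2.0 * (a - best)

# Toy stand-in for the feasible action set (assumption): a fixed box.
lo = np.array([0.0, 0.0])
hi = np.array([0.5, 0.5])

action = np.array([0.1, 0.1])  # feasible starting point
for k in range(200):
    step = 2.0 / (k + 2.0)  # classic diminishing Frank-Wolfe step size
    action = frank_wolfe_step(action, quality_grad(action), lo, hi, step)

print(action)
```

In this toy, the first coordinate is pushed to the upper bound because the unconstrained optimum lies outside the box, while the second coordinate settles near its unconstrained optimum. The point of the Frank-Wolfe form is that each iterate remains feasible by construction, which is the property an action-constrained formulation relies on.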