Paper title
Video captioning with stacked attention and semantic hard pull
Paper authors
Paper abstract
Video captioning, i.e., the task of generating captions from video sequences, bridges the Natural Language Processing and Computer Vision domains of computer science. Generating a semantically accurate description of a video is quite complex. Considering the complexity of the problem, the results obtained in recent research works are praiseworthy. However, there is plenty of scope for further investigation. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers: one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism by using two novel approaches: "stacked attention" and "spatial hard pull". As there are no exclusive metrics for evaluating video captioning models, we emphasize both quantitative and qualitative analysis of our model. Hence, we have used the BLEU scoring metric for quantitative analysis and have proposed a human evaluation metric for qualitative analysis, namely the Semantic Sensibility (SS) scoring metric. The SS score overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures.
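For readers unfamiliar with the BLEU metric named in the abstract, the following is a minimal sketch of sentence-level BLEU (clipped n-gram precision combined with a brevity penalty), written from the standard definition of the metric. It is not the authors' evaluation code: the function names, the uniform weights, the absence of smoothing, and the example captions are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing (assumed)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty uses the reference length closest to the candidate's.
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)


# Hypothetical captions, for illustration only.
reference = "a man is playing a guitar".split()
generated = "a man is playing a guitar on stage".split()
score = bleu(generated, [reference])
```

Because the score depends only on n-gram overlap, a caption can rank well while still misdescribing the video, which is the kind of shortcoming the abstract says the human-rated SS score is meant to address.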