Paper Title

Disentangling Homophemes in Lip Reading using Perplexity Analysis

Authors

Souheil Fenghour, Daqing Chen, Kun Guo, Perry Xiao

Abstract

Automated lip reading that uses visemes as a classification schema has achieved less success than approaches that use ASCII characters and words, largely because different words share identical visemes. The Generative Pre-Training transformer is an effective autoregressive language model used for many tasks in Natural Language Processing, including sentence prediction and text classification. This paper proposes a new application for this model in the context of lip reading, where it serves as a language model that converts visual speech in the form of visemes to language in the form of words and sentences. The network searches for optimal perplexity to perform the viseme-to-word mapping, and is thus a solution to the one-to-many mapping problem, whereby various words that sound different when spoken look identical. This paper proposes a method to tackle the one-to-many mapping problem when performing automated lip reading using solely visual cues in two separate scenarios: the first scenario is where the word boundary, that is, the beginning and the ending of a word, is unknown; and the second scenario is where the boundary is known. Sentences from the benchmark BBC dataset "Lip Reading Sentences in the Wild" (LRS2) are classified with a character error rate of 10.7% and a word error rate of 18.0%. The main contribution of this paper is a method of predicting words through perplexity analysis with an autoregressive language model when only visual cues are present.
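
To make the idea of perplexity-based disambiguation concrete, the following is a minimal sketch of the known-word-boundary scenario only. It is not the paper's implementation: the paper scores candidates with a Generative Pre-Training transformer, whereas this sketch swaps in a tiny add-one-smoothed bigram model so that it runs without any downloads. The corpus, the homopheme group ("bat", "mat", "pat", which share a bilabial onset that looks the same on the lips), and the function names `train_bigram`, `perplexity`, and `disambiguate` are all assumptions introduced for illustration.

```python
import math
from collections import Counter
from itertools import product

# Toy corpus standing in for a pretrained language model (assumption).
CORPUS = [
    "i want to buy a bat",
    "please pass the mat",
    "the cat sat on the mat",
    "he hit the ball with a bat",
    "pat the dog",
]

def train_bigram(sentences):
    """Count bigrams and their left contexts; return (contexts, bigrams, vocab_size)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])             # words seen as a left context
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
        vocab.update(tokens[1:])
    return contexts, bigrams, len(vocab)

def perplexity(words, model):
    """Perplexity of a word sequence under the add-one-smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + list(words)
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(words))

def disambiguate(candidates, model):
    """Pick one word per position so that the whole sentence has minimal perplexity.

    `candidates` holds, for each viseme-decoded position, the list of words
    that share that viseme sequence. Exhaustive search is used here for
    clarity; it grows exponentially with sentence length, so a real system
    would need a pruned search.
    """
    return min(product(*candidates), key=lambda ws: perplexity(ws, model))

model = train_bigram(CORPUS)
# The last position is ambiguous: all three words look alike on the lips.
candidates = [["he"], ["hit"], ["the"], ["ball"], ["with"], ["a"],
              ["bat", "mat", "pat"]]
best = disambiguate(candidates, model)
print(" ".join(best))
```

The sentence-level score is what resolves the ambiguity: each candidate word is judged by how well it fits its context rather than in isolation. The unknown-boundary scenario described in the abstract would additionally require searching over segmentations of the viseme stream, which this sketch does not attempt.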
