Paper title
Evaluating Automatically Generated Phoneme Captions for Images
Paper authors
Paper abstract
Image2Speech is the relatively new task of generating a spoken description of an image. This paper investigates how this task should be evaluated. To this end, an Image2Speech system was first implemented that generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions were converted into sentences of words, and human evaluators rated how well each caption described its image. Finally, several objective metric scores of the results were correlated with these human ratings. Although BLEU4 does not perfectly correlate with human ratings, it obtained the highest correlation among the investigated metrics and is the best currently existing metric for the Image2Speech task. Current metrics are limited by the fact that they assume their input to be words. A more appropriate metric for the Image2Speech task should instead assume its input to be parts of words, i.e. phonemes.
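The evaluation idea in the abstract (score each generated caption with an objective metric, then correlate those scores with human ratings) can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a single reference caption per image, treats each phoneme as one token, applies add-one smoothing to every n-gram order, and uses Pearson correlation. The function names, the ARPAbet-style phoneme strings, and the ratings are all hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 over phoneme tokens with a single reference.

    Add-one smoothing is applied to every n-gram order. This is an
    assumption of this sketch, chosen so that short captions do not
    collapse to a score of zero.
    """
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        log_precision_sum += math.log((matches + 1) / (total + 1))
    # The brevity penalty punishes candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_precision_sum / 4)


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: (generated phonemes, reference phonemes, human rating).
samples = [
    ("AH D AO G R AH N Z", "AH D AO G R AH N Z", 4.0),
    ("AH D AO G S IH T S", "AH D AO G R AH N Z", 3.0),
    ("AH K AE T S L IY P S", "AH D AO G R AH N Z", 1.0),
]

metric_scores = [bleu4(gen.split(), ref.split()) for gen, ref, _ in samples]
human_ratings = [rating for _, _, rating in samples]
correlation = pearson(metric_scores, human_ratings)
print(f"BLEU-4 scores: {metric_scores}")
print(f"Correlation with human ratings: {correlation:.3f}")
```

Because the tokens here are phonemes rather than words, a 4-gram spans roughly one short word instead of a four-word phrase. This is one way to see the abstract's point that word-based metrics carry assumptions that do not transfer directly to phoneme captions.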