Paper Title

Computational analyses of the topics, sentiments, literariness, creativity and beauty of texts in a large Corpus of English Literature

Authors

Arthur M. Jacobs and Annette Kinder

Abstract

The Gutenberg Literary English Corpus (GLEC, Jacobs, 2018a) provides a rich source of textual data for research in digital humanities, computational linguistics or neurocognitive poetics. In this study we address differences among the different literature categories in GLEC, as well as differences between authors. We report the results of three studies providing i) topic and sentiment analyses for six text categories of GLEC (i.e., children and youth, essays, novels, plays, poems, stories) and its >100 authors, ii) novel measures of semantic complexity as indices of the literariness, creativity and book beauty of the works in GLEC (e.g., Jane Austen's six novels), and iii) two experiments on text classification and authorship recognition using novel features of semantic complexity. The data on two novel measures estimating a text's literariness, intratextual variance and stepwise distance (van Cranenburgh et al., 2019) revealed that plays are the most literary texts in GLEC, followed by poems and novels. Computation of a novel index of text creativity (Gray et al., 2016) revealed poems and plays as the most creative categories with the most creative authors all being poets (Milton, Pope, Keats, Byron, or Wordsworth). We also computed a novel index of perceived beauty of verbal art (Kintsch, 2012) for the works in GLEC and predict that Emma is the theoretically most beautiful of Austen's novels. Finally, we demonstrate that these novel measures of semantic complexity are important features for text classification and authorship recognition with overall predictive accuracies in the range of .75 to .97. Our data pave the way for future computational and empirical studies of literature or experiments in reading psychology and offer multiple baselines and benchmarks for analysing and validating other book corpora.
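The two literariness measures named in the abstract, intratextual variance and stepwise distance, can be illustrated with a minimal sketch. It assumes that a text has already been split into consecutive chunks and that each chunk is represented by a semantic vector. It also assumes that intratextual variance is the mean distance of the chunk vectors to their centroid and that stepwise distance is the mean distance between successive chunks, in the spirit of van Cranenburgh et al. (2019). The function names, the Euclidean metric, and the toy 2-D vectors are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(chunks: Sequence[Vector]) -> List[float]:
    """Component-wise mean of all chunk vectors."""
    n = len(chunks)
    dim = len(chunks[0])
    return [sum(v[i] for v in chunks) / n for i in range(dim)]


def intratextual_variance(chunks: Sequence[Vector]) -> float:
    """Mean distance of each chunk vector to the centroid of the text."""
    if not chunks:
        raise ValueError("need at least one chunk vector")
    c = centroid(chunks)
    return sum(euclidean(v, c) for v in chunks) / len(chunks)


def stepwise_distance(chunks: Sequence[Vector]) -> float:
    """Mean distance between each chunk vector and the one that follows it."""
    if len(chunks) < 2:
        raise ValueError("need at least two chunk vectors")
    steps = [euclidean(a, b) for a, b in zip(chunks, chunks[1:])]
    return sum(steps) / len(steps)


# Hypothetical chunk vectors for one short text (not data from GLEC).
toy_chunks = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
variance = intratextual_variance(toy_chunks)
stepwise = stepwise_distance(toy_chunks)
```

Under these assumptions, a text whose chunks wander far from their common centre scores high on the first measure, while a text that shifts sharply from one chunk to the next scores high on the second.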
