Paper title

NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese

Authors

Leal, Sidney Evaldo; Duran, Magali Sanches; Scarton, Carolina Evaristo; Hartmann, Nathan Siegle; Aluísio, Sandra Maria

Abstract

This paper presents and makes publicly available NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). These metrics are relevant for descriptive analysis and the creation of computational models and can be used to extract information from various linguistic levels of written and spoken language. The metrics in NILC-Metrix were developed during the last 13 years, starting in 2008 with Coh-Metrix-Port, a tool developed within the scope of the PorSimples project. Coh-Metrix-Port adapted to BP some metrics from the Coh-Metrix tool, which computes metrics related to the cohesion and coherence of texts in English. After the end of PorSimples in 2010, new metrics were added to the initial 48 metrics of Coh-Metrix-Port. Given the large number of metrics, we present them following an organisation similar to that of Coh-Metrix v3.0 to facilitate comparisons between metrics in Portuguese and English. In this paper, we illustrate the potential of NILC-Metrix by presenting three applications: (i) a descriptive analysis of the differences between children's film subtitles and texts written for Elementary School I and II (Final Years); (ii) a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; (iii) a complexity prediction model for school grades, using transcripts of children's story narratives told by teenagers. For each application, we evaluate which groups of metrics are more discriminative, showing their contribution to each task.
