Paper Title

ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification

Authors

Kai North, Marcos Zampieri, Tharindu Ranasinghe

Abstract

Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g., individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.
