Paper Title

CrudeOilNews: An Annotated Crude Oil News Corpus for Event Extraction

Authors

Lee, Meisin; Soon, Lay-Ki; Siew, Eu-Gene; Sugianto, Ly Fie

Abstract

In this paper, we present CrudeOilNews, a corpus of English crude oil news for event extraction. It is the first of its kind for commodity news and serves to contribute towards resource building for economic and financial text mining. This paper describes the data collection process, the annotation methodology, and the event typology used in producing the corpus. First, a seed set of 175 news articles was manually annotated, of which a subset of 25 articles was used as the adjudicated reference test set for inter-annotator and system evaluation. Agreement was generally substantial and annotator performance was adequate, indicating that the annotation scheme produces consistent event annotations of high quality. Subsequently, the dataset was expanded through (1) data augmentation and (2) human-in-the-loop active learning. The resulting corpus has 425 news articles with approximately 11k annotated events. As part of the active learning process, the corpus was used to train basic event extraction models for machine labeling; the resulting models also serve as validation and as a pilot study demonstrating the use of the corpus for machine learning. The annotated corpus is made available for academic research purposes at https://github.com/meisin/CrudeOilNews-Corpus.
