Paper Title
The Causal News Corpus: Annotating Causal Relations in Event Sentences from News
Authors
Abstract
Despite the importance of understanding causality, corpora addressing causal relations are limited. There is a discrepancy between existing annotation guidelines for event causality and conventional causality corpora that focus more on linguistics. Many guidelines restrict themselves to explicit relations or clause-based arguments. Therefore, we propose an annotation schema for event causality that addresses these concerns. We annotated 3,559 event sentences from protest event news with labels indicating whether each sentence contains causal relations. Our corpus is known as the Causal News Corpus (CNC). A neural network built upon a state-of-the-art pre-trained language model performed well, with an F1 score of 81.20% on the test set and 83.46% in 5-fold cross-validation. CNC is transferable across two external corpora: CausalTimeBank (CTB) and Penn Discourse Treebank (PDTB). Leveraging each of these external datasets for training, we achieved up to approximately 64% F1 on the CNC test set without additional fine-tuning. CNC also served as an effective training and pre-training dataset for the two external corpora. Lastly, we demonstrate the difficulty of our task to the layman in a crowd-sourced annotation exercise. Our annotated corpus is publicly available, providing a valuable resource for causal text mining researchers.