Paper Title
Guidelines and a Corpus for Extracting Biographical Events
Paper Authors
Paper Abstract
Although biographies are widely spread across the Semantic Web, resources and approaches for automatically extracting biographical events are limited. This limitation reduces the amount of structured, machine-readable biographical information, especially about people belonging to underrepresented groups. Our work addresses this limitation by providing a set of guidelines for the semantic annotation of life events. The guidelines are designed to be interoperable with existing ISO standards for semantic annotation: ISO-TimeML (ISO-24617-1) and SemAF (ISO-24617-4). The guidelines were tested through an annotation task on Wikipedia biographies of underrepresented writers, namely authors born in non-Western countries, migrants, or members of ethnic minorities. Four annotators annotated 1,000 sentences, with an average Inter-Annotator Agreement of 0.825. The resulting corpus was mapped onto OntoNotes. This mapping allowed us to expand our corpus, showing that existing resources may be exploited for the biographical event extraction task.