Paper Title
UoB at SemEval-2020 Task 12: Boosting BERT with Corpus Level Information
Paper Authors
Paper Abstract
Pre-trained language model word representations, such as BERT, have been extremely successful in several Natural Language Processing tasks, significantly improving on the state of the art. This can largely be attributed to their ability to better capture the semantic information contained within a sentence. Several tasks, however, can benefit from information available at the corpus level, such as Term Frequency-Inverse Document Frequency (TF-IDF). In this work we test the effectiveness of integrating this corpus-level information with BERT on the task of identifying abuse on social media, and show that doing so does indeed significantly improve performance. We participated in Sub-Task A (abuse detection), in which we achieved a score within two points of the top-performing team, and in Sub-Task B (target detection), in which we ranked 4th of the 44 participating teams.
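One plausible way to combine corpus-level TF-IDF information with per-sentence embeddings is simple feature concatenation before a downstream classifier. The sketch below illustrates this idea only; the corpus, the random stand-in vectors used in place of real BERT pooled outputs, and the concatenation-based fusion are all assumptions, not the paper's confirmed method.

```python
# Hedged sketch: fusing corpus-level TF-IDF features with sentence
# embeddings via concatenation. The embeddings here are random
# placeholders standing in for a pre-trained encoder's pooled output.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only).
corpus = [
    "you are awful",
    "have a great day",
    "this is terrible behaviour",
]

# Corpus-level information: TF-IDF weights computed over all documents.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus).toarray()

# Stand-in for BERT sentence embeddings (random here; in practice the
# pooled output of a pre-trained encoder, e.g. 768-dimensional).
rng = np.random.default_rng(0)
bert_like = rng.standard_normal((len(corpus), 8))

# Fusion: concatenate the sentence-level and corpus-level feature views,
# giving one combined vector per sentence for a downstream classifier.
fused = np.concatenate([bert_like, tfidf], axis=1)
print(fused.shape)
```

The combined matrix has one row per sentence and one column per embedding dimension plus one per vocabulary term, so a classifier trained on it sees both feature views at once.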