A Transfer Learning Pipeline for Educational Resource Discovery with Application in Leading Paragraph Generation

Paper Title

A Transfer Learning Pipeline for Educational Resource Discovery with Application in Leading Paragraph Generation

Authors

Irene Li, Thomas George, Alexander Fabbri, Tammy Liao, Benjamin Chen, Rina Kawamura, Richard Zhou, Vanessa Yan, Swapnil Hingmire, Dragomir Radev

Abstract

Effective human learning depends on a wide selection of educational materials that align with the learner's current understanding of the topic. While the Internet has revolutionized human learning or education, a substantial resource accessibility barrier still exists. Namely, the excess of online information can make it challenging to navigate and discover high-quality learning materials. In this paper, we propose the educational resource discovery (ERD) pipeline that automates web resource discovery for novel domains. The pipeline consists of three main steps: data collection, feature extraction, and resource classification. We start with a known source domain and conduct resource discovery on two unseen target domains via transfer learning. We first collect frequent queries from a set of seed documents and search on the web to obtain candidate resources, such as lecture slides and introductory blog posts. Then we introduce a novel pretrained information retrieval deep neural network model, query-document masked language modeling (QD-MLM), to extract deep features of these candidate resources. We apply a tree-based classifier to decide whether the candidate is a positive learning resource. The pipeline achieves F1 scores of 0.94 and 0.82 when evaluated on two similar but novel target domains. Finally, we demonstrate how this pipeline can benefit an application: leading paragraph generation for surveys. This is the first study that considers various web resources for survey generation, to the best of our knowledge. We also release a corpus of 39,728 manually labeled web resources and 659 queries from NLP, Computer Vision (CV), and Statistics (STATS).
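
The three steps named in the abstract (data collection, feature extraction, resource classification) and the source-to-target evaluation can be illustrated with a toy sketch. This is not the authors' implementation, and every name and data point in it is hypothetical. The QD-MLM deep features are replaced by two hand-written lexical features, the tree-based classifier is reduced to a single decision stump (a depth-1 tree), and the "collected" candidates are a handful of invented query-document pairs. The sketch only shows the shape of the workflow: fit on a source domain, predict on an unseen target domain, and report F1.

```python
from typing import List, Tuple

# (query, document text, label); label 1 = useful learning resource.
# All examples are invented for illustration.
Candidate = Tuple[str, str, int]

source: List[Candidate] = [  # stand-in for the known source domain
    ("word embeddings", "word embeddings map each word to a dense vector", 1),
    ("word embeddings", "buy cheap shoes online today", 0),
    ("dependency parsing", "dependency parsing builds a tree over the sentence", 1),
    ("dependency parsing", "sign in to view this page", 0),
]
target: List[Candidate] = [  # stand-in for an unseen target domain
    ("image segmentation", "image segmentation assigns a label to every pixel", 1),
    ("image segmentation", "subscribe to our newsletter", 0),
    ("object detection", "object detection locates things with bounding boxes", 1),
    ("object detection", "a gentle tutorial on detectors", 1),
]


def extract_features(query: str, text: str) -> List[float]:
    """Assumed stand-in for deep features: query-term overlap and length."""
    q_terms = set(query.lower().split())
    d_terms = text.lower().split()
    overlap = sum(1 for t in d_terms if t in q_terms) / max(len(d_terms), 1)
    return [overlap, float(len(d_terms))]


def fit_stump(X: List[List[float]], y: List[int]) -> Tuple[int, float]:
    """Pick the (feature, threshold) with the fewest training errors.

    The stump predicts 1 when the chosen feature exceeds the threshold.
    """
    best, best_err = (0, 0.0), len(y) + 1
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            err = sum(int(row[j] > thr) != label for row, label in zip(X, y))
            if err < best_err:
                best, best_err = (j, thr), err
    return best


def predict(stump: Tuple[int, float], X: List[List[float]]) -> List[int]:
    j, thr = stump
    return [int(row[j] > thr) for row in X]


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Fit on the source domain only, then evaluate on the target domain.
X_source = [extract_features(q, d) for q, d, _ in source]
y_source = [label for _, _, label in source]
X_target = [extract_features(q, d) for q, d, _ in target]
y_target = [label for _, _, label in target]

stump = fit_stump(X_source, y_source)
target_f1 = f1_score(y_target, predict(stump, X_target))
print(f"target-domain F1: {target_f1:.2f}")
```

In this toy setup the stump misses the last target document, which is a useful resource but shares no terms with its query. Purely lexical features fail on exactly this kind of case, which is one reason a learned query-document representation such as the one the abstract describes may transfer better.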
