Paper Title

Towards Trustworthy AutoGrading of Short, Multi-lingual, Multi-type Answers

Authors

Johannes Schneider, Robin Richner, Micha Riser

Abstract

Autograding short textual answers has become much more feasible due to the rise of NLP and the increased availability of question-answer pairs brought about by a shift to online education. Autograding performance is still inferior to human grading. The statistical and black-box nature of state-of-the-art machine learning models makes them untrustworthy, raising ethical concerns and limiting their practical utility. Furthermore, the evaluation of autograding is typically confined to small, monolingual datasets for a specific question type. This study uses a large dataset consisting of about 10 million question-answer pairs from multiple languages covering diverse fields such as math and language, and strong variation in question and answer syntax. We demonstrate the effectiveness of fine-tuning transformer models for autograding for such complex datasets. Our best hyperparameter-tuned model yields an accuracy of about 86.5%, comparable to the state-of-the-art models that are less general and more tuned to a specific type of question, subject, and language. More importantly, we address trust and ethical concerns. By involving humans in the autograding process, we show how to improve the accuracy of automatically graded answers, achieving accuracy equivalent to that of teaching assistants. We also show how teachers can effectively control the type of errors made by the system and how they can efficiently validate that the autograder's performance on individual exams is close to the expected performance.
