Paper Title
Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals
Paper Authors
Paper Abstract
Publicly accessible benchmarks that allow for assessing and comparing model performance are important drivers of progress in artificial intelligence (AI). While recent advances in AI capabilities hold the potential to transform medical practice by assisting and augmenting the cognitive processes of healthcare professionals, the coverage of clinically relevant tasks by AI benchmarks remains largely unclear. Furthermore, there is a lack of systematized meta-information that would allow clinical AI researchers to quickly determine the accessibility, scope, content, and other characteristics of datasets and benchmarks relevant to the clinical domain. To address these issues, we curated and released a comprehensive catalogue of datasets and benchmarks pertaining to the broad domain of clinical and biomedical natural language processing (NLP), based on a systematic review of literature and online resources. A total of 450 NLP datasets were manually systematized and annotated with rich metadata, such as targeted tasks, clinical applicability, data types, performance metrics, accessibility and licensing information, and availability of data splits. We then compared the tasks covered by AI benchmark datasets with the tasks that medical practitioners reported as highly desirable targets for automation in a previous empirical study. Our analysis indicates that AI benchmarks of direct clinical relevance are scarce and fail to cover most work activities that clinicians want to see addressed. In particular, tasks associated with routine documentation and patient data administration workflows are not represented, despite the significant workloads they entail. Currently available AI benchmarks are therefore poorly aligned with the desired targets for AI automation in clinical settings, and novel benchmarks should be created to fill these gaps.
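To make the annotation scheme concrete, the sketch below shows what one catalogue entry might look like as a Python dataclass. The field names mirror the metadata dimensions listed in the abstract; the class name, field types, and example values are assumptions made for illustration and do not reflect the actual schema of the released catalogue.

```python
# A minimal sketch of one catalogue entry, assuming a simple dataclass-based
# schema. Field names mirror the metadata dimensions listed in the abstract;
# the class name, types, and example values are invented for illustration.
from dataclasses import dataclass


@dataclass
class CatalogueEntry:
    """Hypothetical record for one of the 450 annotated NLP datasets."""
    name: str                       # dataset / benchmark name
    targeted_tasks: list[str]       # NLP tasks the dataset addresses
    clinical_applicability: str     # degree of direct clinical relevance
    data_types: list[str]           # e.g. clinical notes, biomedical literature
    performance_metrics: list[str]  # metrics used to report benchmark results
    accessibility: str              # how the data can be obtained
    license: str                    # licensing information
    has_data_splits: bool           # whether train/dev/test splits are provided


# Invented example entry showing how such a record could be populated:
example = CatalogueEntry(
    name="ExampleClinicalNER",
    targeted_tasks=["named entity recognition"],
    clinical_applicability="direct",
    data_types=["clinical notes"],
    performance_metrics=["F1"],
    accessibility="credentialed access",
    license="data use agreement required",
    has_data_splits=True,
)
```

A structured record of this kind is what would enable the comparison described above: entries can be filtered by clinical applicability and grouped by targeted task, then set against the task categories that clinicians reported as desirable automation targets.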