Paper title
Model-Free Prediction of Partially Observable Spatiotemporal Chaotic Systems
High-Resource Methodological Bias in Low-Resource Investigations
Paper authors
Paper abstract
The central bottleneck for low-resource NLP is typically regarded as the quantity of accessible data, a view that overlooks the contribution of data quality. This is particularly evident in the development and evaluation of low-resource systems via downsampling of high-resource language data. In this work we investigate the validity of this approach, focusing our empirical investigations on two well-known NLP tasks: POS-tagging and machine translation. We show that downsampling from a high-resource language results in datasets whose properties differ from those of low-resource datasets, which affects model performance for both POS-tagging and machine translation. Based on these results, we conclude that naive downsampling of datasets gives a biased view of how well these systems work in a low-resource scenario.
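
The naive downsampling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' experimental setup: the helper names `downsample` and `corpus_stats`, the toy sentences, and the choice of statistics (type/token ratio and mean sentence length) are all assumptions made for illustration. The idea is only to show how one might draw a uniform random subset of a larger corpus and compare its surface properties with those of another corpus of the same size.

```python
import random
from statistics import mean


def downsample(sentences, n, seed=0):
    """Naively draw n sentences uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(sentences, n)


def corpus_stats(sentences):
    """Compute simple surface statistics for a list of tokenized sentences."""
    tokens = [tok for sent in sentences for tok in sent]
    types = set(tokens)
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "types": len(types),
        # Guard against empty input to avoid division by zero.
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "mean_sentence_length": mean(len(s) for s in sentences) if sentences else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical toy corpora, already tokenized; real data would be loaded from files.
    high_resource = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["a", "dog", "ran", "across", "the", "wide", "field"],
        ["she", "reads", "a", "long", "book", "every", "week"],
        ["they", "built", "the", "bridge", "last", "year"],
    ] * 25
    low_resource = [
        ["cat", "sat"],
        ["dog", "ran"],
        ["cat", "ran"],
        ["dog", "sat"],
    ]

    # Downsample the larger corpus to the size of the smaller one, then compare.
    sampled = downsample(high_resource, len(low_resource))
    print("downsampled high-resource:", corpus_stats(sampled))
    print("low-resource:             ", corpus_stats(low_resource))
```

Matching the number of sentences does not by itself match the other statistics, which is the kind of mismatch the abstract attributes to downsampled datasets.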