An Open-Source Cultural Consensus Approach to Name-Based Gender Classification

Title

An Open-Source Cultural Consensus Approach to Name-Based Gender Classification

Authors

Ian Van Buskirk, Aaron Clauset, Daniel B. Larremore

Abstract

Name-based gender classification has enabled hundreds of otherwise infeasible scientific studies of gender. Yet the lack of standardization, proliferation of ad hoc methods, reliance on paid services, understudied limitations, and conceptual debates cast a shadow over many applications. To address these problems, we develop and evaluate an ensemble-based open-source method built on publicly available data of empirical name-gender associations. Our method integrates 36 distinct sources, spanning over 150 countries and more than a century, via a meta-learning algorithm inspired by Cultural Consensus Theory (CCT). We also construct a taxonomy with which names themselves can be classified. We find that our method's performance is competitive with paid services and that our method, and others, approach the upper limits of performance; we show that conditioning estimates on additional metadata (e.g., cultural context), further combining methods, or collecting additional name-gender association data is unlikely to meaningfully improve performance. This work definitively shows that name-based gender classification can be a reliable part of scientific research and provides a pair of tools, a classification method and a taxonomy of names, that realize this potential.
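
The abstract does not spell out the meta-learning algorithm, so the following is only a rough illustration of the general idea behind consensus-style ensembling, not the authors' method. It assumes each source reports, for some names, an estimated probability that the name is associated with women; the function name `consensus_estimates`, the agreement-based weighting rule, and the toy data are all hypothetical. Sources that agree more closely with the emerging consensus receive more weight in the next round.

```python
from typing import Dict


def consensus_estimates(
    source_estimates: Dict[str, Dict[str, float]],
    n_iter: int = 10,
) -> Dict[str, float]:
    """Combine per-source name-gender estimates with agreement-based weights.

    source_estimates maps source -> {name: estimated P(woman | name)}.
    This is a simplified, hypothetical scheme loosely in the spirit of
    consensus methods; it is NOT the algorithm from the paper.
    """
    sources = list(source_estimates)
    names = sorted({n for est in source_estimates.values() for n in est})
    weights = {s: 1.0 for s in sources}  # start with equal trust in every source
    consensus: Dict[str, float] = {}

    for _ in range(n_iter):
        # Step 1: consensus = weighted mean over the sources that cover the name.
        for name in names:
            num = den = 0.0
            for s in sources:
                if name in source_estimates[s]:
                    num += weights[s] * source_estimates[s][name]
                    den += weights[s]
            consensus[name] = num / den

        # Step 2: a source's weight is one minus its mean absolute
        # deviation from the current consensus (floored to stay positive).
        for s in sources:
            est = source_estimates[s]
            if not est:
                continue
            err = sum(abs(p - consensus[n]) for n, p in est.items()) / len(est)
            weights[s] = max(1.0 - err, 1e-6)

    return consensus


if __name__ == "__main__":
    toy = {
        "source_a": {"maria": 0.98, "john": 0.01, "alex": 0.30},
        "source_b": {"maria": 0.99, "john": 0.02, "alex": 0.35},
        "source_c": {"maria": 0.20, "john": 0.90, "alex": 0.50},  # outlier
    }
    for name, p in consensus_estimates(toy).items():
        print(f"{name}: {p:.3f}")
```

In this toy example the outlying source is down-weighted, so the combined estimate for each name moves toward the two sources that agree with each other rather than sitting at the plain average.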
