AraWEAT: Multidimensional Analysis of Biases in Arabic Word Embeddings

Paper Title

AraWEAT: Multidimensional Analysis of Biases in Arabic Word Embeddings

Authors

Anne Lauscher, Rafik Takieddin, Simone Paolo Ponzetto, Goran Glavaš

Abstract

Recent work has shown that distributional word vector spaces often encode human biases like sexism or racism. In this work, we conduct an extensive analysis of biases in Arabic word embeddings by applying a range of recently introduced bias tests on a variety of embedding spaces induced from corpora in Arabic. We measure the presence of biases across several dimensions, namely: embedding models (Skip-Gram, CBOW, and FastText) and vector sizes, types of text (encyclopedic text, and news vs. user-generated content), dialects (Egyptian Arabic vs. Modern Standard Arabic), and time (diachronic analyses over corpora from different time periods). Our analysis yields several interesting findings, e.g., that implicit gender bias in embeddings trained on Arabic news corpora steadily increases over time (between 2007 and 2017). We make the Arabic bias specifications (AraWEAT) publicly available.
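
The abstract does not spell out the bias tests it applies. As the AraWEAT name suggests, one of them is the Word Embedding Association Test (WEAT), so the following is a minimal sketch of the WEAT effect size, not the authors' implementation. The function names, the toy 2-dimensional vectors, and the choice of the sample standard deviation are all assumptions made for illustration; real use would substitute vectors looked up from a trained Arabic embedding space.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """Mean similarity of w to attribute set A minus mean similarity to set B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """WEAT effect size: difference of the mean associations of the two
    target sets, divided by the standard deviation of the associations
    over all targets (sample standard deviation is an assumption here)."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Hypothetical toy vectors standing in for real embeddings.
X = [[1.0, 0.1], [0.9, 0.2]]   # target set 1
Y = [[0.1, 1.0], [0.2, 0.9]]   # target set 2
A = [[1.0, 0.0]]               # attribute set 1
B = [[0.0, 1.0]]               # attribute set 2

effect = weat_effect_size(X, Y, A, B)
print(f"WEAT effect size: {effect:.3f}")
```

A positive value indicates that the first target set is more strongly associated with the first attribute set than the second target set is; swapping the two target sets flips the sign.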
