A Federated Approach to Predicting Emojis in Hindi Tweets

Title

A Federated Approach to Predicting Emojis in Hindi Tweets

Authors

Deep Gandhi, Jash Mehta, Nirali Parekh, Karan Waghela, Lynette D'Mello, Zeerak Talat

Abstract

The use of emojis affords a visual modality to textual communication that is often private. The task of predicting emojis, however, poses a challenge for machine learning, as emoji use tends to cluster into frequently used and rarely used emojis. Much of the machine learning research on emoji use has focused on high-resource languages and has conceptualised the task of predicting emojis around traditional server-side machine learning approaches. However, traditional machine learning approaches for private communication can introduce privacy concerns, as these approaches require all data to be transmitted to a central storage. In this paper, we seek to address the dual concerns of emphasising high-resource languages for emoji prediction and risking the privacy of people's data. We introduce a new dataset of $118$k tweets (augmented from $25$k unique tweets) for emoji prediction in Hindi, and propose a modification to the federated learning algorithm, CausalFedGSD, which aims to strike a balance between model performance and user privacy. We show that our approach obtains scores comparable to those of more complex centralised models while reducing the amount of data required to optimise the models and minimising risks to user privacy.
