Paper Title
StackOverflow vs Kaggle: A Study of Developer Discussions About Data Science
Paper Authors
Paper Abstract
Software developers are increasingly required to understand fundamental data science (DS) concepts. Recently, the presence of machine learning (ML) and deep learning (DL) has dramatically increased in the development of user applications, whether they are leveraged through frameworks or implemented from scratch. These topics attract much discussion on online platforms. This paper conducts large-scale qualitative and quantitative experiments to study the characteristics of 197,836 posts from StackOverflow and Kaggle. Latent Dirichlet Allocation topic modelling is used to extract twenty-four DS discussion topics. The main findings include that TensorFlow-related topics were most prevalent on StackOverflow, while meta-discussion topics were the most prevalent on Kaggle. StackOverflow tends to include lower-level troubleshooting, while Kaggle focuses on practicality and optimising leaderboard performance. In addition, across both communities, DS discussion is increasing at a dramatic rate. While TensorFlow discussion on StackOverflow is slowing, interest in Keras is rising. Finally, ensemble algorithms are the most mentioned ML/DL algorithms on Kaggle but are rarely discussed on StackOverflow. These findings can help educators and researchers to more effectively tailor and prioritise efforts in researching and communicating DS concepts towards different developer communities.