Paper Title
Fuzzy Forests For Feature Selection in High-Dimensional Survey Data: An Application to the 2020 U.S. Presidential Election
Paper Authors
Paper Abstract
An increasingly common methodological issue in the field of social science is high-dimensional and highly correlated datasets that are unamenable to the traditional deductive framework of study. Analysis of candidate choice in the 2020 Presidential Election is one area in which this issue presents itself: in order to test the many theories explaining the outcome of the election, it is necessary to use data such as the 2020 Cooperative Election Study Common Content, with hundreds of highly correlated features. We present the Fuzzy Forests algorithm, a variant of the popular Random Forests ensemble method, as an efficient way to reduce the feature space in such cases with minimal bias, while also maintaining predictive performance on par with common algorithms like Random Forests and logit. Using Fuzzy Forests, we isolate the top correlates of candidate choice and find that partisan polarization was the strongest factor driving the 2020 presidential election.