Paper title
The BP Dependency Function: a Generic Measure of Dependence between Random Variables
Paper authors
Abstract
Measuring and quantifying dependencies between random variables (RVs) can give critical insights into a data set. Typical questions are: "Do underlying relationships exist?", "Are some variables redundant?", and "Is some target variable $Y$ highly or weakly dependent on variable $X$?" Interestingly, despite the evident need for a general-purpose measure of dependency between RVs, most data analysts in practice use the Pearson correlation coefficient (PCC) to quantify dependence between RVs, even though it is well recognized that the PCC is essentially a measure of linear dependency only. Although many attempts have been made to define more generic dependency measures, there is still no consensus on a standard, general-purpose dependency function. In fact, several ideal properties of a dependency function have been proposed, but without much argumentation. Motivated by this, in this paper we discuss and revise the list of desired properties and propose a new dependency function that meets all these requirements. This general-purpose dependency function provides data analysts with a powerful means to quantify the level of dependence between variables. To this end, we also provide Python code to determine the dependency function for use in practice.
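The remark that the PCC essentially captures linear dependency only can be illustrated with a small sketch. The code below is not the dependency function proposed in the paper; it is only a plain-Python computation of the Pearson correlation coefficient, with a hypothetical helper name `pearson` and made-up sample data, comparing a linear relationship with a quadratic one in which $Y$ is fully determined by $X$.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences.

    Illustrative helper (hypothetical name), not code from the paper.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalised covariance and standard deviations; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up data: X takes evenly spaced values that are symmetric around 0.
xs = [i / 100 for i in range(-100, 101)]

# Y is a linear function of X: the PCC is 1 up to rounding error.
ys_linear = [2 * x + 1 for x in xs]

# Y = X^2 is completely determined by X, yet the PCC is close to 0
# because the relationship has no linear component on a symmetric range.
ys_quadratic = [x ** 2 for x in xs]

r_linear = pearson(xs, ys_linear)
r_quadratic = pearson(xs, ys_quadratic)

print(f"PCC for Y = 2X + 1: {r_linear:.6f}")
print(f"PCC for Y = X^2:    {r_quadratic:.6f}")
```

In the quadratic case, knowing $X$ leaves no uncertainty about $Y$, but the PCC reports almost no dependence. This is the kind of gap a general-purpose dependency measure is meant to close.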