Supervised Learning and Model Analysis with Compositional Data

Paper Title

Supervised Learning and Model Analysis with Compositional Data

Authors

Shimeng Huang, Elisabeth Ailer, Niki Kilbertus, Niklas Pfister

Abstract

The compositionality and sparsity of high-throughput sequencing data pose a challenge for regression and classification. However, in microbiome research in particular, conditional modeling is an essential tool to investigate relationships between phenotypes and the microbiome. Existing techniques are often inadequate: they either rely on extensions of the linear log-contrast model (which adjusts for compositionality, but is often unable to capture useful signals), or they are based on black-box machine learning methods (which may capture useful signals, but ignore compositionality in downstream analyses). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on par or improved predictive performance compared with state-of-the-art machine learning methods. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast models to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. Finally, we apply the KernelBiome framework to two public microbiome studies and illustrate the proposed model analysis. KernelBiome is available as an open-source Python package at https://github.com/shimenghuang/KernelBiome.
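
To make the idea of kernel-based regression on sparse compositional data concrete, the following is a minimal sketch and not the KernelBiome package or its API. It assumes a Gaussian kernel on square-root-transformed compositions (a Hellinger-type kernel, chosen here only because it stays well defined when components are exactly zero) combined with plain kernel ridge regression. The function names, the toy data, and the values of `sigma` and `lam` are all hypothetical choices for illustration.

```python
import numpy as np


def hellinger_rbf_kernel(X, Y, sigma=1.0):
    """Gaussian kernel on square-root-transformed compositions.

    Illustrative choice only: it is defined for exact zeros,
    unlike kernels built on log-ratio transforms.
    """
    SX, SY = np.sqrt(X), np.sqrt(Y)
    # Squared Euclidean distance between all pairs of transformed rows.
    sq_dist = (
        np.sum(SX**2, axis=1)[:, None]
        + np.sum(SY**2, axis=1)[None, :]
        - 2.0 * SX @ SY.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * sigma**2))


def fit_kernel_ridge(X_train, y_train, lam=1e-2, sigma=1.0):
    """Solve (K + lam * I) alpha = y for the dual coefficients."""
    K = hellinger_rbf_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)


def predict(X_train, alpha, X_new, sigma=1.0):
    """Predict as a kernel-weighted combination of training points."""
    return hellinger_rbf_kernel(X_new, X_train, sigma) @ alpha


# Toy sparse compositions: Poisson counts normalized to sum to one per row.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(60, 5)).astype(float)
counts[:, 0] += 1.0  # ensure every row has a nonzero total
X = counts / counts.sum(axis=1, keepdims=True)
# Hypothetical nonlinear signal in the first component plus noise.
y = np.sin(4.0 * X[:, 0]) + 0.1 * rng.normal(size=60)

alpha = fit_kernel_ridge(X[:40], y[:40])
y_hat = predict(X[:40], alpha, X[40:])
```

In a real analysis the kernel and its hyperparameters would be selected by cross-validation rather than fixed by hand. The abstract indicates that KernelBiome adapts model complexity automatically, but the mechanism is not described here.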
