A Statistical Testing Procedure for Validating Class Labels

Paper Title

A Statistical Testing Procedure for Validating Class Labels

Authors

Key, Melissa C.; Boukai, Ben

Abstract

Motivated by an open problem of validating protein identities in label-free shotgun proteomics work-flows, we present a testing procedure to validate class/protein labels using available measurements across instances/peptides. More generally, we present a solution to the problem of identifying instances that are deemed, based on some distance (or quasi-distance) measure, as outliers relative to the subset of instances assigned to the same class. The proposed procedure is non-parametric and requires no specific distributional assumption on the measured distances. The only assumption underlying the testing procedure is that measured distances between instances within the same class are stochastically smaller than measured distances between instances from different classes. The test is shown to simultaneously control the Type I and Type II error probabilities whilst also controlling the overall error probability of the repeated testing invoked in the validation procedure of initial class labeling. The theoretical results are supplemented with results from an extensive numerical study, simulating a typical setup for labeling validation in proteomics work-flow applications. These results illustrate the applicability and viability of our method. Even with up to 25% of instances mislabeled, our testing procedure maintains a high specificity and greatly reduces the proportion of mislabeled instances.
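The single assumption stated in the abstract, that within-class distances are stochastically smaller than between-class distances, can be illustrated with a small sketch. This is not the authors' testing procedure and it provides none of the error-control guarantees described above; it is only a hypothetical Mann-Whitney-style score, with the function names, the toy data, and the 0.5 threshold all chosen here for illustration. For each instance it computes the fraction of (within-class, between-class) distance pairs in which the within-class distance is the smaller one, and flags instances whose score falls below the threshold.

```python
import numpy as np

def within_vs_between_score(dist, labels, i):
    """Fraction of (within, between) distance pairs for instance i
    in which the within-class distance is strictly smaller.
    Hypothetical illustration only, not the procedure from the paper."""
    labels = np.asarray(labels)
    same = labels == labels[i]
    same[i] = False                      # exclude the instance itself
    other = labels != labels[i]
    within = dist[i, same]
    between = dist[i, other]
    if within.size == 0 or between.size == 0:
        return float("nan")              # score undefined without both groups
    # Compare every within-class distance with every between-class distance.
    return float(np.mean(within[:, None] < between[None, :]))

def flag_suspect_labels(dist, labels, threshold=0.5):
    """Return per-instance scores and a boolean mask of suspect labels.
    The threshold is an arbitrary assumption, not a calibrated cutoff."""
    scores = np.array([within_vs_between_score(dist, labels, i)
                       for i in range(len(labels))])
    return scores, scores < threshold

# Toy data: one-dimensional "measurements"; the point at 5.05 sits in
# cluster B but carries label A, mimicking a mislabeled instance.
points = np.array([0.0, 0.1, 0.2, 5.05, 5.0, 5.1, 5.2])
labels = ["A", "A", "A", "A", "B", "B", "B"]
dist = np.abs(points[:, None] - points[None, :])  # pairwise distance matrix

scores, flags = flag_suspect_labels(dist, labels)
print(np.flatnonzero(flags))  # prints [3]
```

In this toy setup only the instance at index 3 is flagged, because all of its within-class distances exceed its between-class distances. Any distance or quasi-distance matrix could be substituted for `dist`, since the score depends only on pairwise comparisons.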
