Paper title
Collaborative Learning of Semi-Supervised Clustering and Classification for Labeling Uncurated Data
Paper authors
Paper abstract
Domain-specific image collections hold potential value in many areas of science and business, but they are often uncurated and offer no ready way to extract relevant content. To apply contemporary supervised image analysis methods to such data, the images must first be cleaned and organized, and then manually labeled using the nomenclature of the specific domain, which is a time-consuming and expensive endeavor. To address this issue, we designed and implemented the Plud system. Plud provides an iterative semi-supervised workflow that minimizes the effort spent by an expert and handles realistically large collections of images. We believe it can support labeling datasets regardless of their size and type. Plud is an iterative sequence of unsupervised clustering, human assistance, and supervised classification. With each iteration, 1) the labeled dataset grows, 2) the generality and accuracy of the classification method increase, and 3) manual effort is reduced. We evaluated the effectiveness of our system by applying it to over a million images documenting human decomposition. In our experiment comparing manual labeling with labeling supported by Plud, we found that Plud reduces the time needed to label data and produces highly accurate models for this new domain.
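The iterative sequence described in the abstract (unsupervised clustering, human assistance, supervised classification) can be illustrated with a toy sketch. This is not the Plud implementation: the abstract does not name the algorithms used, so k-means and a nearest-centroid classifier are stand-ins chosen for brevity, the 2-D points stand in for image features, and the expert is simulated by a lookup function. All names (`label_iteratively`, `oracle`, `fit_centroids`, and so on) are hypothetical.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(pts):
    """Centroid of a non-empty list of 2-D points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def kmeans(pts, k, rng, iters=10):
    """Plain k-means (a stand-in clustering step); returns a cluster index per point."""
    centers = rng.sample(pts, k)
    assign = [0] * len(pts)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in pts]
        for c in range(k):
            members = [p for p, a in zip(pts, assign) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = mean(members)
    return assign

def fit_centroids(points, labeled):
    """Nearest-centroid 'classifier' (a stand-in): one mean point per class."""
    by_class = {}
    for i, lab in labeled.items():
        by_class.setdefault(lab, []).append(points[i])
    return {lab: mean(pts) for lab, pts in by_class.items()}

def predict(centroids, p):
    """Return the class whose centroid is closest to p."""
    return min(centroids, key=lambda lab: dist2(p, centroids[lab]))

def label_iteratively(points, oracle, k=4, rounds=3, seed=0):
    """Assumed loop: cluster the unlabeled pool, let the expert label one cluster, refit."""
    rng = random.Random(seed)
    labeled, centroids, history = {}, {}, []
    for _ in range(rounds):
        pool = [i for i in range(len(points)) if i not in labeled]
        if len(pool) < k:
            break
        # 1) Unsupervised clustering of the still-unlabeled items.
        assign = kmeans([points[i] for i in pool], k, rng)
        # 2) Human assistance (simulated): the expert labels the largest cluster.
        biggest = max(range(k), key=assign.count)
        for i, a in zip(pool, assign):
            if a == biggest:
                labeled[i] = oracle(i)
        # 3) Supervised classification: refit on everything labeled so far.
        centroids = fit_centroids(points, labeled)
        history.append(len(labeled))
    return labeled, centroids, history

# Synthetic demo data: two well-separated blobs standing in for two image classes.
rng = random.Random(1)
truth = ["a"] * 50 + ["b"] * 50
points = [(rng.gauss(0 if t == "a" else 8, 1), rng.gauss(0, 1)) for t in truth]
labeled, centroids, history = label_iteratively(points, lambda i: truth[i])
print("labeled items after each round:", history)
```

In this sketch, `history` records how the labeled set grows from round to round, which mirrors point 1) of the abstract. How the classifier feeds back into the next round to reduce manual effort is not specified in the abstract and is left out here.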