COMET: Coverage-guided Model Generation For Deep Learning Library Testing

Paper title

COMET: Coverage-guided Model Generation For Deep Learning Library Testing

Authors

Meiziniu Li, Jialun Cao, Yongqiang Tian, Tsz On Li, Ming Wen, Shing-Chi Cheung

Abstract

Recent deep learning (DL) applications are mostly built on top of DL libraries. The quality assurance of these libraries is critical to the dependable deployment of DL applications. Techniques have been proposed to generate various DL models and apply them to test these libraries. However, their test effectiveness is constrained by the diversity of layer API calls in their generated DL models. Our study reveals that these techniques can cover at most 34.1% of layer inputs, 25.9% of layer parameter values, and 15.6% of layer sequences. As a result, we find that many bugs arising from specific layer API calls (i.e., specific layer inputs, parameter values, or layer sequences) can be missed by existing techniques. Because of this limitation, we propose COMET to effectively generate DL models with diverse layer API calls for DL library testing. COMET (1) designs a set of mutation operators and a coverage-based search algorithm to diversify layer inputs, layer parameter values, and layer sequences in DL models, and (2) proposes a model synthesis method to boost test efficiency without compromising layer API call diversity. Our evaluation shows that COMET outperforms baselines, covering about twice as many layer inputs (69.7% vs. 34.1%), layer parameter values (50.2% vs. 25.9%), and layer sequences (39.0% vs. 15.6%) as the state of the art. Moreover, COMET covers 3.4% more library branches than existing techniques. Finally, COMET detects 32 new bugs in the latest versions of eight popular DL libraries, including TensorFlow and MXNet; 21 of them have been confirmed by DL library developers, and 7 of the confirmed bugs have been fixed.
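
The abstract describes the coverage-based search only at a high level. The general idea of a mutation loop that keeps a candidate model only when it exercises a new layer parameter value or layer sequence might be sketched as follows. This is a toy illustration and not COMET's implementation: the layer catalogue `LAYER_PARAMS`, the three mutation operators, and the function names are all hypothetical, and no real DL library is invoked.

```python
import random

# Hypothetical toy catalogue: layer name -> candidate parameter values.
LAYER_PARAMS = {
    "Conv2D": [1, 3, 5],
    "Dense": [8, 16],
    "ReLU": [None],
}

def coverage_items(model):
    """Coverage items exercised by a model, given as a list of (layer, param) pairs."""
    items = {("param", layer, value) for layer, value in model}
    # A layer sequence is approximated here as a pair of adjacent layer types.
    items |= {("seq", a[0], b[0]) for a, b in zip(model, model[1:])}
    return items

def mutate(model, rng):
    """Return a copy of the model with one random mutation applied."""
    model = list(model)
    idx = rng.randrange(len(model))
    op = rng.choice(["param", "insert", "replace"])
    if op == "param":
        layer = model[idx][0]
        model[idx] = (layer, rng.choice(LAYER_PARAMS[layer]))
    else:
        layer = rng.choice(sorted(LAYER_PARAMS))
        new_layer = (layer, rng.choice(LAYER_PARAMS[layer]))
        if op == "insert":
            model.insert(idx, new_layer)
        else:
            model[idx] = new_layer
    return model

def search(seed, steps, rng):
    """Keep a mutated model only if it covers something not yet covered."""
    covered = coverage_items(seed)
    pool = [seed]
    for _ in range(steps):
        candidate = mutate(rng.choice(pool), rng)
        new_items = coverage_items(candidate) - covered
        if new_items:
            covered |= new_items
            pool.append(candidate)
    return pool, covered

def total_items():
    """Number of distinct coverage items reachable in this toy catalogue."""
    n_params = sum(len(values) for values in LAYER_PARAMS.values())
    n_seqs = len(LAYER_PARAMS) ** 2
    return n_params + n_seqs

if __name__ == "__main__":
    seed = [("Conv2D", 3), ("ReLU", None), ("Dense", 8)]
    pool, covered = search(seed, steps=200, rng=random.Random(0))
    print(f"kept {len(pool)} models, covered {len(covered)}/{total_items()} items")
```

In a real setting each kept model would be built and executed against the libraries under test, and coverage would be measured over the actual layer API rather than a hand-written catalogue.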
