Paper title
Looking for non-compliant documents using error messages from multiple parsers
Paper authors
Abstract
Whether a file is accepted by a single parser is not a reliable indication of whether a file complies with its stated format. Bugs within both the parser and the format specification mean that a compliant file may fail to parse, or that a non-compliant file might be read without any apparent trouble. The latter situation presents a significant security risk, and should be avoided. This article suggests that a better way to assess format specification compliance is to examine the set of error messages produced by a set of parsers rather than a single parser. If both a sample of compliant files and a sample of non-compliant files are available, then we show how a statistical test based on a pseudo-likelihood ratio can be very effective at determining a file's compliance. Our method is format agnostic, and does not directly rely upon a formal specification of the format. Although this article focuses upon the case of the PDF format (ISO 32000-2), we make no attempt to use any specific details of the format. Furthermore, we show how principal components analysis can be useful for a format specification designer to assess the quality and structure of these samples of files and parsers. While these tests are absolutely rudimentary, it appears that their use to measure file format variability and to identify non-compliant files is both novel and surprisingly effective.
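The abstract does not spell out the test itself, so the following is only a minimal sketch of one way a pseudo-likelihood ratio over parser error messages could be computed. It assumes each file is summarised as the set of (parser, message) pairs it triggered, treats every message as an independent Bernoulli variable (the source of the "pseudo" in pseudo-likelihood here), and applies additive smoothing. The function names, the smoothing parameter `alpha`, the parser names, and the sample messages are all hypothetical and are not taken from the paper.

```python
import math


def message_rates(samples, vocabulary, alpha=1.0):
    """Smoothed probability that each message appears in a file of this class."""
    n = len(samples)
    return {
        m: (sum(m in s for s in samples) + alpha) / (n + 2 * alpha)
        for m in vocabulary
    }


def log_pseudo_lr(observed, compliant_rates, noncompliant_rates):
    """Sum of per-message log ratios; positive values favour 'compliant'."""
    total = 0.0
    for m, p in compliant_rates.items():
        q = noncompliant_rates[m]
        if m in observed:
            total += math.log(p / q)
        else:
            total += math.log((1 - p) / (1 - q))
    return total


# Hypothetical training data: one set of (parser, message) pairs per file.
compliant = [set(), {("parser_a", "warn: xref rebuilt")}, set()]
noncompliant = [
    {("parser_a", "error: bad trailer"), ("parser_b", "syntax error")},
    {("parser_b", "syntax error")},
    {("parser_a", "error: bad trailer")},
]

vocabulary = set().union(*compliant, *noncompliant)
p = message_rates(compliant, vocabulary)
q = message_rates(noncompliant, vocabulary)

# Score two unseen files: one silent, one that triggers a syntax error.
clean_score = log_pseudo_lr(set(), p, q)
broken_score = log_pseudo_lr({("parser_b", "syntax error")}, p, q)
```

In this sketch a file is labelled compliant when its score exceeds a threshold (zero is the natural default). The independence assumption is deliberately crude, in keeping with the abstract's description of the tests as rudimentary, and the paper's actual construction may differ.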