Paper Title
Raw Filtering of JSON Data on FPGAs
Paper Authors
Paper Abstract
Many Big Data applications process data streams in semi-structured formats such as JSON. A disadvantage of such formats is that an application may spend a significant amount of processing time just on unselectively parsing all data. To mitigate this issue, the concept of raw filtering has been proposed, which removes data from a stream before the costly parsing stage. However, as accurate filtering of raw data is often only possible after the data has been parsed, raw filters are designed to be approximate, in the sense that they allow false positives so that they can be implemented efficiently. In contrast to previously proposed CPU-based raw filtering techniques, which are restricted to string matching, we present FPGA-based primitives for filtering strings, numbers, and number ranges. In addition, we propose a primitive that respects the basic structure of JSON data and can be used to further increase the accuracy of the introduced raw filters. The proposed raw filter primitives are designed to be composed according to the filter expression of a given query. Thus, complex raw filters can be created for FPGAs that drastically reduce the number of generated false positives, particularly for IoT workloads. As there is a trade-off between accuracy and resource consumption, we evaluate the primitives as well as composed raw filters using different queries from the RiotBench benchmark. Our results show that up to 94.3% of the raw data can be filtered without producing any observed false positives using only a few hundred LUTs.
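The raw filtering idea described in the abstract can be illustrated with a small software sketch. This is not the FPGA design from the paper; it is a minimal CPU-side illustration under stated assumptions. All names (`contains_string`, `contains_number_in_range`, `all_of`, `run_query`), the sample records, and the query are hypothetical. The sketch shows a string primitive and a number-range primitive operating on unparsed bytes, their conjunctive composition, and how a record can pass the raw filter yet fail the exact predicate after parsing (a false positive).

```python
import json
import re

# Hypothetical names throughout: a software illustration, not the paper's FPGA primitives.
# Matches JSON-like numeric tokens in raw bytes (ASCII digits only).
NUMBER = re.compile(rb"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")


def contains_string(needle: bytes):
    """Primitive: pass a record if the raw bytes contain `needle` anywhere."""
    return lambda raw: needle in raw


def contains_number_in_range(lo: float, hi: float):
    """Primitive: pass a record if any numeric token in the raw bytes lies in [lo, hi]."""
    def check(raw: bytes) -> bool:
        return any(lo <= float(tok.decode("ascii")) <= hi
                   for tok in NUMBER.findall(raw))
    return check


def all_of(*primitives):
    """Compose primitives conjunctively, mirroring an AND in a filter expression."""
    return lambda raw: all(p(raw) for p in primitives)


def run_query(records, raw_filter, exact_predicate):
    """Reject records on raw bytes first; parse and check only the survivors."""
    parsed = 0
    hits = []
    for raw in records:
        if not raw_filter(raw):
            continue  # dropped without paying the parsing cost
        parsed += 1
        obj = json.loads(raw)
        if exact_predicate(obj):
            hits.append(obj)
    return hits, parsed


def exact(obj) -> bool:
    """Exact query predicate, only decidable once the record has been parsed."""
    value = obj.get("value")
    return (obj.get("sensor") == "temperature"
            and isinstance(value, (int, float))
            and 30.0 <= value <= 40.0)


records = [
    b'{"sensor": "temperature", "value": 31.5}',
    b'{"sensor": "humidity", "value": 30.2}',
    b'{"sensor": "temperature", "value": 12.0}',
    b'{"sensor": "light", "value": 700, "note": "temperature 30 outside"}',
]

raw_filter = all_of(contains_string(b"temperature"),
                    contains_number_in_range(30.0, 40.0))
hits, parsed = run_query(records, raw_filter, exact)
print(f"parsed {parsed} of {len(records)} records, {len(hits)} true match(es)")
```

In this sketch the last record passes the raw filter because the string and an in-range number both occur somewhere in its bytes, even though they belong to unrelated fields. A filter that takes the JSON structure into account, as the abstract proposes, is aimed at reducing exactly this kind of false positive.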