Title
Learning from Uncurated Regular Expressions
Authors
Abstract
Significant work has been done on learning regular expressions from a set of data values. Depending on the domain, this approach can be very successful. However, learning these expressions takes significant time, and the resulting expressions can become either very complex or inaccurate in the presence of dirty data. The alternative of manually writing regular expressions becomes unattractive when a large number of values must be matched. Instead, we propose learning from a large corpus of manually authored but uncurated regular expressions mined from a public repository. The advantage of this approach is that we can extract salient features from a set of strings with limited feature-engineering overhead. Since the set of regular expressions covers a wide range of application domains, we expect them to be widely applicable. To demonstrate the potential effectiveness of our approach, we train a model using the extracted corpus of regular expressions for the task of semantic type classification. While our approach yields results that are overall inferior to the state of the art, our feature extraction code is an order of magnitude smaller, and our model outperforms a popular existing approach on some classes. We also demonstrate the possibility of using uncurated regular expressions for unsupervised learning.
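To make the idea of using regular expressions as features concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each regular expression in the corpus contributes one feature: the fraction of values in a column that the expression fully matches. The pattern list, the function names `compile_corpus` and `featurize`, and the choice of full matching are all illustrative assumptions; the abstract does not specify them.

```python
import re

# Hypothetical stand-in for a mined, uncurated corpus. One pattern is
# deliberately malformed, as uncurated patterns may not compile.
RAW_PATTERNS = [
    r"\d{4}-\d{2}-\d{2}",            # ISO-style date
    r"[^@\s]+@[^@\s]+\.[a-zA-Z]+",   # rough e-mail shape
    r"\d+",                          # digits only
    r"[A-Z]{2}",                     # two uppercase letters
    r"(unclosed",                    # invalid: unbalanced parenthesis
]


def compile_corpus(patterns):
    """Compile each pattern, silently skipping those that are invalid."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            continue
    return compiled


def featurize(values, regexes):
    """Return one feature per regex: the fraction of values it fully matches."""
    if not values:
        return [0.0] * len(regexes)
    return [
        sum(1 for v in values if rx.fullmatch(v)) / len(values)
        for rx in regexes
    ]


REGEXES = compile_corpus(RAW_PATTERNS)
date_features = featurize(["2021-01-05", "1999-12-31"], REGEXES)
email_features = featurize(["a@b.com", "x.y@z.org"], REGEXES)
```

Feature vectors of this kind could then be passed to any standard classifier for semantic type classification, or to a clustering method for the unsupervised setting; which models the paper actually uses is not stated in the abstract.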