Efficiently extracting full parse trees using regular expressions with capture groups
- Published
- Accepted
- Subject Areas
- Algorithms and Analysis of Algorithms, Programming Languages
- Keywords
- Regular expressions, Parsing, Algorithms
- Copyright
- © 2015 Schwarz et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
- Cite this article
- 2015. Efficiently extracting full parse trees using regular expressions with capture groups. PeerJ PrePrints 3:e1248v1 https://doi.org/10.7287/peerj.preprints.1248v1
Abstract
Regular expressions with capture groups offer a concise and natural way to define parse trees over the text that they are parsing; however, classical algorithms return only a single match for each capture group, not the full parse tree. We describe an algorithm based on finite-state automata that extracts full parse trees from text in Θ(nm) time and Θ(dn + m) space (where n is the size of the text, m the size of the pattern, and d the number of groups in the pattern). It is the first to do so in a single pass with complete control over greediness. This allows the algorithm to process streaming data using all constructs familiar to users of regular expressions.
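The limitation of classical matchers mentioned in the abstract can be seen in a small sketch using Python's standard `re` module. This is only an illustration of the problem, not the algorithm described in the paper. The sample text, the pattern, and the variable names are assumptions chosen for the example.

```python
import re

text = "ab,cd,ef,"

# Group 1 sits inside a repetition, so it matches three times in `text`.
pattern = re.compile(r"(?:([a-z]+),)*")

match = pattern.fullmatch(text)

# A classical matcher keeps one result per capture group:
# here, only the last iteration of the repetition survives.
last_capture = match.group(1)
last_span = match.span(1)

# Recovering every iteration needs a second pass over the text with a
# pattern written specifically for this case. This workaround is not
# the single-pass approach the abstract refers to.
all_captures = [m.group(1) for m in re.finditer(r"([a-z]+),", text)]

print(last_capture, last_span)
print(all_captures)
```

A full parse tree would instead record every match of group 1, together with its position, as a child of the enclosing repetition.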
Author Comment
This is a submission to PeerJ Computer Science for review.