# Regex performance Slides: [hackmd.io/@instanceofme/H13VSoKYn](https://hackmd.io/@instanceofme/H13VSoKYn) --- ## Why? DSS had [this](https://github.com/dataiku/dip/blob/12.1.0/src/main/java/com/dataiku/dip/code/CodeConversionService.java#L346), executed under transaction, when copying code between recipe & notebook: ```java sLine.replaceAll("\\s*$", "") // strip trailing spaces ``` …which worked well enough <!-- .element: class="fragment" data-fragment-index="1" --> …until it's fed a line with 262344 spaces followed by a comment ([yes](https://app.shortcut.com/dataiku/story/143835/redos-in-recipe-to-notebook-cell)) <!-- .element: class="fragment" data-fragment-index="2" --> ---- ### ReDoS Regular Expression Denial-of-Service Note: Could happen in an adversarial setting, where a hostile user provides an input known to slow down the program. --- ### Backtracking Regex = non-deterministic finite automaton (NFA)* E.g. `abab|abbb` ```mermaid graph LR; i> ]-->a0(( )) a0-->|a|a1(( )) a1-->|b|a2(( )) a2-->|a|a3(( )) a3-->|b|a4(( )) a4-->e[ ] i-->b0(( )) b0-->|a|b1(( )) b1-->|b|b2(( )) b2-->|b|b3(( )) b3-->|b|b4(( )) b4-->e ``` <small>*most of the time</small> Example matching `abbb` <!-- .element: class="fragment" data-fragment-index="1" --> Note: Some regex extensions (like recursive matching or code execution on match) make the regex not regular anymore, hence not homogenous to a NFA. --- ### Catastrophic backtracking How many steps to fail the match of `1234567890!` against `(\d+)*$` Note: 511. Push those digits to 20, and it's 1M. 2^n-1. ---- #### Known red flags :triangular_flag_on_post: - Compounded quantifiers `(A+)*` - Contiguous quantified tokens not mutually exclusive `\w+\d+` - Alternatives not mutually exclusive `(\w|\d)+` - Remote quantifiers able to venture into each other's territory `^.*?".*?"$` Note: They might appear in much more complex expressions that would make them much harder to spot. Some may be automatically flagged by tools such as [Sonar](https://rules.sonarsource.com/java/RSPEC-5998/), but not always & not in all situations (i.e. mostly watch `Pattern.compile`?) ---- #### Solutions - Explicitly prevent backtrack: - Possessive quantifiers `(ab?)*+` - Atomic grouping `(?>(x*y)+)z` - Mutually exclusive alternatives: `([a-z]|\d)+`, `"[^"]*"` or `\S*\s` Note: Backtracking is the default because it "always works", at the cost of having these issues. Think of it like brute force, trying all possible combinations. A possessive quantifier is just an atomic group over the same greedy quantifier. In the Javascript regex flavor, neither are supported but atomic grouping can be emulated by a capturing group in lookahead then referenced: `(?=((x*y)+))\1z` --- Example techniques to match `.*{start} (stuff) .*{end}` Explicit greedy alternation `(?:[^{]++|{(?!START}))*+{START}` Unrolled star alternation `[^{]*(?:(?:{(?!START}))+[^{]*)*{START}` Tempered greedy quantifier `(?:(?!{START}).)*{START}` --- ### Here? `\s*$` is not a compound quantifier, so what gives? This is not strictly a _catastrophic backtracking_, but… Finding the first match is also an implicit repetition. After failing on `…(___Z)` if will look at `…_(__Z)` <!-- .element: class="fragment" data-fragment-index="1" --> Note: Not as bad as 2^n, more like n^2 —trying n times a O(n) match—, but still gets really slow quickly given a large input. ---- ### Attempt 1 ✗ Possessive quantifier / atomic group: `\s++$` or `(?>\s+)$` Note: Will not backtrack, but still trying n times a O(n) match ---- ### Attempt 2 ✓ `(?<!\s)\s++$` Early dismiss by negative lookahead Note: Now trying n times a O(1) match. ---- ### Actual solution 👍 ```java StringUtils.stripEnd(sLine, null) ``` Note: Clearer + ever so slightly faster. Internally a simple loop over characters. --- ### General regex performance advice - Use atomic grouping & possessive quantifiers - Use mutually exclusive alternations - Anchor your regex, if possible at the beginning - Use region matching to trade a bit of complexity for performance - Test your regexes against long inputs - Leverage dedicated alternative methods Note: This does not include more general performance regex tricks like avoinding unneeded alternation or capturing groups. --- As always: in case of doubt, ask. [![I know regular expressions](https://imgs.xkcd.com/comics/regular_expressions.png "Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee.")](https://xkcd.com/208) --- ### Thank you! Sources / more reading :books: - [SC-143835](https://app.shortcut.com/dataiku/story/143835/redos-in-recipe-to-notebook-cell) ReDoS in recipe ↔ notebook - ReDoS: [en.wikipedia.org/wiki/ReDoS](https://en.wikipedia.org/wiki/ReDoS) - Backtracking vs Thomson NFA: [reidy-p.github.io/regex-performance/](https://reidy-p.github.io/regex-performance/) - Catastrophic backtracking: [regular-expressions.info/catastrophic.html](https://www.regular-expressions.info/catastrophic.html) - Explosive quantifiers: [rexegg.com/regex-explosive-quantifiers.html](https://www.rexegg.com/regex-explosive-quantifiers.html)
{"description":"A primer about regex performance issues & how to solve them","slideOptions":"{\"slideOptions\":{\"center\":false,\"spotlight\":{\"enabled\":true}}}","title":"Regex performance","contributors":"[{\"id\":\"0bcb2984-dec3-4f69-a209-04eaf8f9be75\",\"add\":9069,\"del\":3655}]"}
    151 views