# Regex performance
Slides: [hackmd.io/@instanceofme/H13VSoKYn](https://hackmd.io/@instanceofme/H13VSoKYn)
---
## Why?
DSS had [this](https://github.com/dataiku/dip/blob/12.1.0/src/main/java/com/dataiku/dip/code/CodeConversionService.java#L346), executed under transaction, when copying code between recipe & notebook:
```java
sLine.replaceAll("\\s*$", "") // strip trailing spaces
```
…which worked well enough
<!-- .element: class="fragment" data-fragment-index="1" -->
…until it's fed a line with 262344 spaces followed by a comment ([yes](https://app.shortcut.com/dataiku/story/143835/redos-in-recipe-to-notebook-cell))
<!-- .element: class="fragment" data-fragment-index="2" -->
----
### ReDoS
Regular Expression Denial-of-Service
Note:
Could happen in an adversarial setting, where a hostile user provides an input known to slow down the program.
---
### Backtracking
Regex = non-deterministic finite automaton (NFA)*
E.g. `abab|abbb`
```mermaid
graph LR;
i> ]-->a0(( ))
a0-->|a|a1(( ))
a1-->|b|a2(( ))
a2-->|a|a3(( ))
a3-->|b|a4(( ))
a4-->e[ ]
i-->b0(( ))
b0-->|a|b1(( ))
b1-->|b|b2(( ))
b2-->|b|b3(( ))
b3-->|b|b4(( ))
b4-->e
```
<small>*most of the time</small>
Example matching `abbb`
<!-- .element: class="fragment" data-fragment-index="1" -->
Note:
Some regex extensions (like recursive matching or code execution on match) make the regex not regular anymore, hence not homogenous to a NFA.
---
### Catastrophic backtracking
How many steps to fail the match of `1234567890!` against `(\d+)*$`
Note: 511. Push those digits to 20, and it's 1M. 2^n-1.
----
#### Known red flags :triangular_flag_on_post:
- Compounded quantifiers `(A+)*`
- Contiguous quantified tokens not mutually exclusive `\w+\d+`
- Alternatives not mutually exclusive `(\w|\d)+`
- Remote quantifiers able to venture into each other's territory `^.*?".*?"$`
Note:
They might appear in much more complex expressions that would make them much harder to spot.
Some may be automatically flagged by tools such as [Sonar](https://rules.sonarsource.com/java/RSPEC-5998/), but not always & not in all situations (i.e. mostly watch `Pattern.compile`?)
----
#### Solutions
- Explicitly prevent backtrack:
- Possessive quantifiers `(ab?)*+`
- Atomic grouping `(?>(x*y)+)z`
- Mutually exclusive alternatives:
`([a-z]|\d)+`, `"[^"]*"` or `\S*\s`
Note:
Backtracking is the default because it "always works", at the cost of having these issues. Think of it like brute force, trying all possible combinations.
A possessive quantifier is just an atomic group over the same greedy quantifier.
In the Javascript regex flavor, neither are supported but atomic grouping can be emulated by a capturing group in lookahead then referenced: `(?=((x*y)+))\1z`
---
Example techniques to match
`.*{start} (stuff) .*{end}`
Explicit greedy alternation
`(?:[^{]++|{(?!START}))*+{START}`
Unrolled star alternation
`[^{]*(?:(?:{(?!START}))+[^{]*)*{START}`
Tempered greedy quantifier
`(?:(?!{START}).)*{START}`
---
### Here?
`\s*$` is not a compound quantifier, so what gives?
This is not strictly a _catastrophic backtracking_, but…
Finding the first match is also an implicit repetition.
After failing on `…(___Z)` if will look at `…_(__Z)`
<!-- .element: class="fragment" data-fragment-index="1" -->
Note:
Not as bad as 2^n, more like n^2 —trying n times a O(n) match—, but still gets really slow quickly given a large input.
----
### Attempt 1 ✗
Possessive quantifier / atomic group:
`\s++$` or `(?>\s+)$`
Note:
Will not backtrack, but still trying n times a O(n) match
----
### Attempt 2 ✓
`(?<!\s)\s++$`
Early dismiss by negative lookahead
Note:
Now trying n times a O(1) match.
----
### Actual solution 👍
```java
StringUtils.stripEnd(sLine, null)
```
Note:
Clearer + ever so slightly faster. Internally a simple loop over characters.
---
### General regex performance advice
- Use atomic grouping & possessive quantifiers
- Use mutually exclusive alternations
- Anchor your regex, if possible at the beginning
- Use region matching to trade a bit of complexity for performance
- Test your regexes against long inputs
- Leverage dedicated alternative methods
Note:
This does not include more general performance regex tricks like avoinding unneeded alternation or capturing groups.
---
As always: in case of doubt, ask.
[](https://xkcd.com/208)
---
### Thank you!
Sources / more reading :books:
- [SC-143835](https://app.shortcut.com/dataiku/story/143835/redos-in-recipe-to-notebook-cell) ReDoS in recipe ↔ notebook
- ReDoS: [en.wikipedia.org/wiki/ReDoS](https://en.wikipedia.org/wiki/ReDoS)
- Backtracking vs Thomson NFA:
[reidy-p.github.io/regex-performance/](https://reidy-p.github.io/regex-performance/)
- Catastrophic backtracking:
[regular-expressions.info/catastrophic.html](https://www.regular-expressions.info/catastrophic.html)
- Explosive quantifiers:
[rexegg.com/regex-explosive-quantifiers.html](https://www.rexegg.com/regex-explosive-quantifiers.html)
{"description":"A primer about regex performance issues & how to solve them","slideOptions":"{\"slideOptions\":{\"center\":false,\"spotlight\":{\"enabled\":true}}}","title":"Regex performance","contributors":"[{\"id\":\"0bcb2984-dec3-4f69-a209-04eaf8f9be75\",\"add\":9069,\"del\":3655}]"}