---
tags: resources
---
# RegEx Guide
## What is RegEx?
RegEx is short for Regular Expression. A RegEx is a sequence of characters that defines a search pattern. It lets you write a set of rules describing the kind of String you want, and then search through a larger body of text to find the Strings that fit those rules.
If you would like a more in-depth guide to RegEx tools, you can look [here](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet)! If you would like to test or practice RegEx, we recommend using [regexr.com](https://regexr.com) or [regex101.com](https://regex101.com).
### Warmup Examples: Creating a regex (same as in lab)
If we were just looking for every instance of the word “cat” in a larger document, we would use the RegEx `cat`. However, this won’t get us very far in practice, and we can do much more than this. Let’s say instead that we’re looking for every three-letter word in a body of text that ends in “at” (e.g. “cat”, “bat”, “sat”, etc). For this, we can use `[a-z]at`. The brackets create a set of characters to search for, and `a-z` gives a range covering every character between a and z in the ASCII character list (that is, the lowercase letters).
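
As a rough illustration of the two patterns above, here is a minimal sketch using Python's standard `re` module. The sample sentence and variable names are made up for this example, and Python is only one possible choice of language; these patterns should behave the same way in most RegEx engines.

```python
import re

# Hypothetical sample text, made up for this example.
text = "the cat sat on a mat near a bat"

# A literal pattern only finds the exact characters "cat".
literal_matches = re.findall(r"cat", text)

# [a-z] matches any one lowercase letter, so this pattern finds
# any lowercase letter followed by "at".
set_matches = re.findall(r"[a-z]at", text)

print(literal_matches)  # ['cat']
print(set_matches)      # ['cat', 'sat', 'mat', 'bat']
```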
Now instead, suppose we want every single word ending in “at”, regardless of length (e.g. “at”, “bat”, “seat”, etc). For this, we can use a quantifier. The `*` quantifier says that there can be any number, including 0, of the character preceding it. This means `[a-z]*at` will do what we want: find every String consisting of some number of letters followed by “at”. (If we want to exclude “at” and only find words with 3 or more letters, we could instead use `[a-z]+at`, as the `+` quantifier requires one or more a-z characters.)

This RegEx still has some issues though. For example, if we used it to find all of the words ending in “at” in the string “The cat was late,” it would match “cat”, but it would also match the “lat” inside “late”. We need some way to specify that the “at” must actually be at the end of the word. For this, we can use a “word boundary,” which requires that there be “word characters” (letters, numbers or underscores) on only one side of the boundary. This is represented as `\b` in the RegEx syntax. So our new RegEx, `[a-z]*at\b`, will match “at”, “cat”, “seat”, and “great”, but not “ate”, “late”, “bat9”, or “at_home”.

If we want only three-letter words again, and want to avoid matches like “seat” or “great” (or the “eat” inside them), we can go back to a single `[a-z]` and add a word boundary to the beginning as well, for a final RegEx of `\b[a-z]at\b`.
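
The walkthrough above can be checked with a short sketch, again assuming Python's standard `re` module. The word list is simply the examples from the text joined into one string, and the variable names are hypothetical.

```python
import re

sentence = "The cat was late"

# * allows zero or more letters before "at", so the "lat" in "late" matches too.
star_matches = re.findall(r"[a-z]*at", sentence)

# A trailing \b requires the match to stop at the end of a word.
bounded_matches = re.findall(r"[a-z]*at\b", sentence)

print(star_matches)     # ['cat', 'lat']
print(bounded_matches)  # ['cat']

# The example words from the text, joined into one string.
words = "at cat seat great ate late bat9 at_home"

end_bounded = re.findall(r"[a-z]*at\b", words)   # any length, ends in "at"
plus_bounded = re.findall(r"[a-z]+at\b", words)  # at least one letter before "at"
both_bounded = re.findall(r"\b[a-z]at\b", words) # exactly three letters

print(end_bounded)   # ['at', 'cat', 'seat', 'great']
print(plus_bounded)  # ['cat', 'seat', 'great']
print(both_bounded)  # ['cat']
```

Note that “bat9” and “at_home” are skipped because digits and underscores count as word characters, so there is no word boundary right after the “at”.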
## Characters and Quantifiers
| Characters | Explanation |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `\w` | Matches any word character (letters, digits, or underscores) |
| `\d` | Matches any digit |
| `\s` | Matches any whitespace character (space, tab, newline) |
| `\W` | Matches any non-word character (NOT a letter, digit, or underscore) |
| `\D` | Matches any non-digit character |
| `\S` | Matches any non-whitespace character |
| `\b` | Matches a word boundary (there must be word characters on only one side of the boundary) |
| `.` | Matches any one character except a newline. |
| `[xyz]` | Matches any one character within the brackets (i.e. x or y or z) |
| `[x-z]` | Matches any one character in the range between x and z on the ASCII character list |
| `[^x]` | Matches any one character that is not x; this can be combined with other techniques (e.g. `[^b-x]` matches any character not in the range b-x) |
| `(abc)` | Matches the string inside the parentheses (i.e. “abc”) and treats it as a single group, so a quantifier placed after the parentheses applies to the whole group |
| `^` | If at the beginning of the RegEx, requires that the pattern be found at the start of the String being searched. |
| `$` | If at the end of the RegEx, requires that the pattern end at the end of the String being searched. |
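
A few of the entries in the table above can be tried out with a minimal sketch like the following, assuming Python's standard `re` module. The sample line and variable names are made up for this example.

```python
import re

# Hypothetical sample text, made up for this example.
line = "Room 204 opens at 9 am"

digits = re.findall(r"\d", line)          # each digit
spaces = re.findall(r"\s", line)          # each whitespace character
vowels = re.findall(r"[aeiou]", line)     # any one character from the set
not_lower = re.findall(r"[^a-z ]", line)  # anything except a-z or a space
wildcard = re.findall(r"2.4", line)       # . matches any one character

print(digits)       # ['2', '0', '4', '9']
print(len(spaces))  # 5
print(vowels)       # ['o', 'o', 'o', 'e', 'a', 'a']
print(not_lower)    # ['R', '2', '0', '4', '9']
print(wildcard)     # ['204']

# ^ and $ anchor the pattern to the start and end of the String.
starts_with_room = re.search(r"^Room", line) is not None
starts_with_opens = re.search(r"^opens", line) is not None
ends_with_am = re.search(r"am$", line) is not None

print(starts_with_room)   # True
print(starts_with_opens)  # False
print(ends_with_am)       # True
```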

| Quantifiers | Explanation |
| ------------------ | ------------------------------------------------------------ |
| `+` | One or more of the preceding character, set, or group |
| `{x}` | Exactly x of the preceding character, set, or group |
| `{x,y}` | Between x and y (inclusive) of the preceding character, set, or group |
| `{x,}` | x or more of the preceding character, set, or group |
| `*` | 0 or more of the preceding character, set, or group |
| `?` | 0 or 1 of the preceding character, set, or group |

If no quantifier follows a character, set, or group, the RegEx matches exactly one occurrence of it. Write `{x,y}` and `{x,}` without a space after the comma, since many RegEx engines do not recognize the quantifier otherwise.
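
To see the quantifiers side by side, here is a small sketch assuming Python's standard `re` module. The test strings and variable names are made up for this example, and `\b` is used on both sides so that each pattern only matches whole words.

```python
import re

# Hypothetical sample text, made up for this example.
text = "a aa aaa aaaa"

exactly_two = re.findall(r"\ba{2}\b", text)     # {x}: exactly 2
two_to_three = re.findall(r"\ba{2,3}\b", text)  # {x,y}: between 2 and 3
three_or_more = re.findall(r"\ba{3,}\b", text)  # {x,}: 3 or more
one_or_more = re.findall(r"\ba+\b", text)       # +: 1 or more

print(exactly_two)    # ['aa']
print(two_to_three)   # ['aa', 'aaa']
print(three_or_more)  # ['aaa', 'aaaa']
print(one_or_more)    # ['a', 'aa', 'aaa', 'aaaa']

# *: zero or more of the preceding character.
zero_or_more = re.findall(r"\bba*\b", "b ba baa")
print(zero_or_more)   # ['b', 'ba', 'baa']

# ?: the preceding character is optional.
optional_u = re.findall(r"colou?r", "color colour")
print(optional_u)     # ['color', 'colour']

# A quantifier after a group repeats the whole group;
# without parentheses it only repeats the last character.
group_repeats = re.fullmatch(r"(ab)+", "ababab") is not None
char_repeats = re.fullmatch(r"ab+", "ababab") is not None
print(group_repeats)  # True
print(char_repeats)   # False
```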
## Escaping Characters
A number of characters have special meanings in regular expressions. For example, `.` is used as a wildcard, matching any one character besides a newline. But what if we’re actually looking for every period in a document? For this, `\.` will find every character that’s actually a period. This is known as “escaping” the special character. This will work for any character that has a special meaning in regular expressions (e.g. `. ( ) [ ] { } * + ? \ ^ $ |`). A double quote (`"`) has no special meaning in RegEx itself, but it may still need escaping inside a string literal in the programming language you are using.
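
The difference between an escaped and an unescaped `.` can be seen in a short sketch, assuming Python's standard `re` module. The sample text and variable names are made up for this example. `re.escape` is a standard-library helper that escapes special characters in a string for you; other languages may or may not offer an equivalent.

```python
import re

# Hypothetical sample text, made up for this example.
text = "Pi is about 3.14. Is 3514 close?"

# Unescaped, . is a wildcard, so both "3.14" and "3514" match.
wildcard = re.findall(r"3.14", text)

# Escaped, \. only matches a literal period.
literal = re.findall(r"3\.14", text)

# Every literal period in the text.
periods = re.findall(r"\.", text)

print(wildcard)  # ['3.14', '3514']
print(literal)   # ['3.14']
print(periods)   # ['.', '.']

# + and ? are special, so the raw string "1+1=2?" does not match itself.
raw = "1+1=2?"
sample = "is 1+1=2? yes"
unescaped_matches = re.findall(raw, sample)
escaped_matches = re.findall(re.escape(raw), sample)

print(unescaped_matches)  # []
print(escaped_matches)    # ['1+1=2?']
```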
______________________________
*Please let us know if you find any mistakes, inconsistencies, or confusing language in this or any other CSCI0200 document by filling out the [anonymous feedback form](https://docs.google.com/forms/d/e/1FAIpQLSdFzM6mpDD_1tj-vS0SMYAohPAtBZ-oZZH0-TbMKv-_Bw5HeA/viewform?usp=sf_link).*