---
title: Regular Expressions
tags: Programming Concepts, Python Cheatbook
---

# Regular Expressions (RegEx)

A regular expression, also known as *RegEx*, is a pattern that describes what to search for in text. It is one of the important tools data wranglers use to search, filter and validate text data in their applications.

Let's say I have the text

> Quick brown fox jumps over a quick fox.

Let's search for the text `quick` (case-insensitive)

> ==Quick== brown fox jumps over a ==quick== fox.

:::success
2 matches
:::

That was easy, we got 2 matches. But now we want to match 'Quick' only at the start of the text. What do we do?

1. Search for `Quick` with match-case enabled. This might fail if the text has capitalization mistakes.
2. Use the regex search `^quick`, where the `^` meta-character says to match the following **keyword** only at the start of the text.

Before we start, it is important to note that each language has a slightly different implementation of regex. **Most** implementations share a common set of features, and those are what the basics cover.

In the following sections we will look at basic regex usage with examples. :star: marks patterns that I assume are popular.

# Anchors `^` `$` `\b` `\B`

:star: Search: `^quick` (case-insensitive)

> ==Quick== brown fox jumps over a quick fox.

:::success
1 match
:::

Search: `fox$` (case-insensitive)

> Quick brown fox jumps over a quick fox.

:::danger
No match. This failed because there is a full stop at the end of the text. Update the regex to include the full stop: `fox\.$`.
:::

:star: Search: `fox\.$` (case-insensitive)

> Quick brown fox jumps over a quick ==fox.==

:::success
1 match. Use `\` to **escape special regex characters**. `.` is a special regex character (more on it below) that must be escaped in this case.
:::

Search: `fox\b` (case-insensitive)

> Quick brown ==fox== jumps over a quick **foxy**

:::success
1 match. `\b` is a **word boundary match**: it matches the position between a word character (letter, digit or underscore) and a non-word character, or the start or end of the text next to a word character. It does not consume any character itself.
:::

Search: `fox\B` (case-insensitive)

> Quick brown fox jumps over a quick **==fox==y**

:::success
1 match. `\B` is a **not-word boundary match**: it matches any position that is not a word boundary, for example between two letters.
:::

# Examples

Search for N-letter words in a document.

Search: `^[A-Za-z]{N}$` (where N is an integer)

Because of the `^` and `$` anchors, this matches only when the whole text (or, in multiline mode, a whole line) consists of exactly N letters. To find N-letter words inside a longer sentence, use word boundaries instead: `\b[A-Za-z]{N}\b`.

# Resources

1. Regulex helps visualize regex patterns: https://jex.im/regulex/#!flags=&re=%5E(a%7Cb)*%3F%24
2. Compile and test regex: https://regexr.com/
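
The anchor searches in this note can be tried out in Python with the standard `re` module. The snippet below is a minimal sketch, not a definitive implementation. The variable names and the two helper functions (`is_n_letter_word`, `n_letter_words`) are made up for illustration, and the `\b[A-Za-z]{N}\b` variant is an assumption about how to find N-letter words inside a longer sentence.

```python
import re

text = "Quick brown fox jumps over a quick fox."

# Plain case-insensitive search: finds both occurrences.
all_quick = re.findall(r"quick", text, flags=re.IGNORECASE)
print(all_quick)  # ['Quick', 'quick']

# ^ anchors the match to the start of the text.
start_quick = re.findall(r"^quick", text, flags=re.IGNORECASE)
print(start_quick)  # ['Quick']

# $ anchors the match to the end; the trailing full stop blocks it.
fox_end = re.search(r"fox$", text, flags=re.IGNORECASE)
print(fox_end)  # None

# Escaping the full stop lets the end anchor succeed.
fox_dot_end = re.search(r"fox\.$", text, flags=re.IGNORECASE)
print(fox_dot_end.group())  # fox.

foxy_text = "Quick brown fox jumps over a quick foxy"

# \b: "fox" must end at a word boundary, so "foxy" is skipped.
fox_word = [m.start() for m in re.finditer(r"fox\b", foxy_text, flags=re.IGNORECASE)]
print(fox_word)  # [12]

# \B: "fox" must be followed by another word character, so only "foxy" hits.
fox_prefix = [m.start() for m in re.finditer(r"fox\B", foxy_text, flags=re.IGNORECASE)]
print(fox_prefix)  # [35]


def is_n_letter_word(candidate, n):
    """Hypothetical helper: True if the whole string is exactly n letters."""
    return re.search(r"^[A-Za-z]{%d}$" % n, candidate) is not None


def n_letter_words(document, n):
    """Hypothetical helper: all n-letter words found inside a longer text."""
    return re.findall(r"\b[A-Za-z]{%d}\b" % n, document)


print(is_n_letter_word("fox", 3))  # True
print(is_n_letter_word(text, 3))   # False
print(n_letter_words(text, 3))     # ['fox', 'fox']
```

`re.findall` returns the matched strings, while `re.finditer` gives match objects, which is why the `\b` and `\B` examples can report where in the text each match starts.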