
**Regular Expression**

In conclusion: regular expressions help in manipulating textual data, which is often a prerequisite for data science projects involving text mining.

- Application:
  + Search for things

`pattern = r'cookie'`
`sequence = 'In this Cookie store we sell cookie'`
`match_obj = re.search(pattern, sequence)`
`print(match_obj)`

=> only works with an exact, case-sensitive match: a different form of the word (capitalized, plural/singular) doesn't work

- `\t`: a tab in a string
- `\n`: creates a new line
- `pattern = r'cookie'` => the `r` prefix makes a raw string, so backslashes are kept as literal characters instead of starting escape sequences
- `strip()`: removes all the whitespace at the beginning and the end of the string (https://www.w3schools.com/python/ref_string_strip.asp)

Note that `re.search` scans through the string looking for the first location where the pattern matches.

`match_obj = re.search(pattern, sequences[0])`
`match_obj.group()` => returns the matched text as a string, which is easier to work with than the match object

`import re` => imports the `re` module, which ships with Python but is not available until it is imported

`[A-C][^a-c][abc]`

Other approach: A more convenient way is to specify how many repetitions of each character we want using the curly braces notation. For example, `a{3}` will match the a character exactly three times. Certain regular expression engines will even allow you to specify a range for this repetition such that `a{1,3}` will match the a character no more than 3 times, but no less than once for example.

This quantifier can be used with any character, or special metacharacters, for example `w{3}` (three w's), `[wxy]{5}` (five characters, each of which can be a w, x, or y) and `.{2,6}` (between two and six of any character).

Answer: `waz{3,5}up`

Answer: `a+b+c+` `a+b*c+`

Answer: `\d{1,2} files? found\?`

Answer:

**Lesson 9: All this whitespace**

When dealing with real-world input, such as log files and even user input, it's difficult not to encounter whitespace. We use it to format pieces of information to make it easier to read and scan visually, and a single space can put a wrench into the simplest regular expression.
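Before the whitespace details, here is a minimal sketch of the search and quantifier notes above, using Python's `re` module. The cookie sentence comes from the notes; the `greetings` list and the padded string are made up for illustration, and `re.IGNORECASE` is only one possible way around the capitalization problem, not something the notes prescribe.

```python
import re

pattern = r'cookie'
sequence = 'In this Cookie store we sell cookie'

# re.search returns only the first match; 'Cookie' (capital C) is skipped
# because matching is case-sensitive by default.
match_obj = re.search(pattern, sequence)
first_match = match_obj.group()   # the matched text as a plain string
first_start = match_obj.start()   # index where that match begins

# Assumption: re.IGNORECASE as one way to accept 'Cookie' as well.
ci_match = re.search(pattern, sequence, flags=re.IGNORECASE)

# strip() with no arguments removes leading and trailing whitespace.
cleaned = '  cookie \n'.strip()

# Curly-brace quantifier: z{3,5} means between three and five z's.
greetings = ['wazzzup', 'wazzzzzup', 'wazup']   # hypothetical samples
quantified = [bool(re.search(r'waz{3,5}up', g)) for g in greetings]

print(first_match, first_start)
print(ci_match.group(), ci_match.start())
print(repr(cleaned))
print(quantified)
```

The lowercase match is found near the end of the sentence, while the case-insensitive search stops at the earlier `Cookie`, which shows that `re.search` always reports the first location that satisfies the pattern.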
The most common forms of whitespace you will use with regular expressions are the space (␣), the tab (`\t`), the new line (`\n`) and the carriage return (`\r`) (useful in Windows environments), and these special characters match each of their respective whitespaces. In addition, a whitespace special character `\s` will match any of the specific whitespaces above and is extremely useful when dealing with raw input text.

In the strings below, you'll find that the content of each line is indented by some whitespace from the index of the line (the number is a part of the text to match). Try writing a pattern that can match each line containing whitespace characters between the number and the content. Notice that the whitespace characters are just like any other character and the special metacharacters like the star and the plus can be used as well.

`\d\.\s+abc`

**Lesson 10: Starting and ending**

`^Mission: successful$` => must match the entire string

`^` and `$` must be placed at the beginning and the end of the pattern

`^(file.+)\.pdf$` => `.+` matches one or more of any character, so it takes everything between `file` and `.pdf`

**Lesson 12:**

`(\w{3}\s(\d{4}))`

**Lesson 14: Using denotation |**

Specifically when using groups, you can use the `|` (logical OR, aka. the pipe) to denote different possible sets of characters. In the above example, I can write the pattern "Buy more (milk|bread|juice)" to match only the strings Buy more milk, Buy more bread, or Buy more juice.

**[?]** Additionally, there is a special metacharacter `\b` which matches the boundary between a word and a non-word character. It's most useful in capturing entire words (for example by using the pattern `\w+\b`) => what does this mean?

`([^1(\s]\d{2})`

2nd solution: `(\d{3})`

https://regexone.com/problem/matching_emails?

**[?]** `(\w+(\.\w+[^+@)*)` vs. `^([\w\.]*)`
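The whitespace, anchor, group, alternation and `\b` notes above can be tried out in Python. This is a sketch under assumptions: every sample string is made up for illustration, and the last part is one reading of the open `\b` question rather than an authoritative answer.

```python
import re

# \s+ matches one or more whitespace characters (space, tab, newline, ...).
lines = ['1.   abc', '2.\tabc', '3.abc']   # hypothetical input
ws_hits = [bool(re.search(r'\d\.\s+abc', s)) for s in lines]

# ^ and $ tie the pattern to the start and end of the string.
exact = re.search(r'^Mission: successful$', 'Mission: successful')
extra_prefix = re.search(r'^Mission: successful$', 'Last Mission: successful')

# Parentheses capture: group(1) is the text matched by (file.+).
pdf = re.search(r'^(file.+)\.pdf$', 'file_record_transcript.pdf')
pdf_name = pdf.group(1)

# Nested groups are numbered by the position of their opening parenthesis.
date = re.search(r'(\w{3}\s(\d{4}))', 'Jan 1987')
date_groups = date.groups()

# | inside a group means "any one of these alternatives".
buy = re.search(r'Buy more (milk|bread|juice)', 'Buy more bread')
item = buy.group(1)

# \b matches the empty position between a word character (\w) and a
# non-word character (or the string edge), so \bcat\b finds 'cat' only
# when it stands alone as a word, not inside 'concat'.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')

print(ws_hits)
print(exact is not None, extra_prefix is None)
print(pdf_name, date_groups, item)
print(whole_words)
```

On the `\b` question: it consumes no characters. It only asserts that a word starts or ends at that position, which is why `\w+\b` captures a whole word up to the point where the word characters stop.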