# MLE - Week 3

Summary
-------
- Regular Expression (RegEx)
- List Comprehensions
- Random
- Generator, Map, Zip
- Arguments and Keyword Arguments
- Others

RegEx
-----
### Raw strings
> :point_right: A raw string completely ignores all escape characters and prints any backslash that appears in the string. [[3]]

:penguin: Raw strings are helpful if you are typing string values that **contain many backslashes**, such as the strings used for Windows file paths like `r'C:\Users\Al\Desktop'` or **regular expressions**.

```python=
>>> pattern = r'A word\tAnother word\nA new line'
>>> print(pattern)
A word\tAnother word\nA new line
```

### RegEx functions

| Function | Description |
| -------- | ----------- |
| `findall` | Returns **a list of strings** containing all matches |
| `search` | Returns a [Match object](#Match-object) if there is a match **anywhere** in the string |
| `match` | Returns a [Match object](#Match-object) if there is a match starting from **the beginning** of the string. :exclamation: Avoid using this :exclamation: |
| `split` | Returns **a list of strings** where the string has been split at each match |
| `sub` | Replaces one or many matches with a string |

Both `search` and `match` return `None` if no match is found.
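
Because both functions can return `None`, calling a Match method such as `.group()` directly on the result raises `AttributeError` when nothing matched. A minimal sketch of guarding against that (the helper name `first_match` is made up for illustration and is not part of `re`):

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None if there is no match."""
    m = re.search(pattern, text)
    # re.search returns None when nothing matches, so check before using it
    return m.group() if m else None

print(first_match(r"ai", "The rain in Spain"))   # ai
print(first_match(r"xyz", "The rain in Spain"))  # None
```
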
Example:

```python=
>>> import re
>>> txt = "The rain in Spain"

# findall
>>> x = re.findall("ai", txt)
>>> print(x)
['ai', 'ai']

# search
>>> x = re.search("ai", txt)
>>> print(x)
<re.Match object; span=(5, 7), match='ai'>

# match
>>> x = re.match("ai", txt)
>>> print(x)
None
>>> x = re.match(".*ai", txt)
>>> print(x)
<re.Match object; span=(0, 16), match='The rain in Spai'>

# split
# split the string at the first white-space character only
>>> x = re.split(r"\s", txt, maxsplit=1)
>>> print(x)
['The', 'rain in Spain']
>>> x = re.split(r"\s", txt)
>>> print(x)
['The', 'rain', 'in', 'Spain']

# sub
# replace the first two occurrences of a white-space character with the digit 9
>>> x = re.sub(r"\s", "9", txt, count=2)
>>> print(x)
The9rain9in Spain
>>> x = re.sub(r"\s", "9", txt)
>>> print(x)
The9rain9in9Spain
```

:penguin: A more readable way to define a long regex is to pass `re.VERBOSE`, which allows whitespace and comments inside the pattern:

```python=
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code
    (\s|-|\.)?                      # separator
    (\d{3})                         # first 3 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))? # extension
    )''', re.VERBOSE)
```

### Match object
> :point_right: The Match object has properties and methods used to retrieve information about the search, and the result [[4]]:
> - `.span()` returns a tuple containing the `start` and `end` positions of the match.
> - `.string` returns the string passed into the function
> - `.group()` returns the part of the string where there was a match

Example:

```python=
>>> txt = "The rain in Spain in Spring"
>>> x = re.search(r"(\bS\w+)", txt) # search returns the first match only
>>> print(x.span())
(12, 17)
>>> print(x.string)
The rain in Spain in Spring
>>> print(x.group())
Spain
>>> print(x.groups())
('Spain',)
>>> x = re.findall(r"(\bS\w+)", txt)
>>> print(x)
['Spain', 'Spring']
```

### Character classes
Cheatsheet: https://regexr.com

| Pattern | Meaning |
| ------- | ------- |
| `.` | any character **except newline** - [matching newlines with the Dot character](#Matching-newlines-with-the-Dot-character) |
| `\w\d\s` | word, digit, whitespace |
| `\W\D\S` | not word, digit, whitespace |
| `[abc]` | any of a, b, or c |
| `[^abc]` | not a, b, or c |
| `[a-g]` | character between a & g |
| **Anchors** | |
| `^abc$` | start / end of the string |
| `\b\B` | word, not-word boundary |
| **Escaped characters** | |
| `\.\*\\` | escaped special characters - [more details](#Escape-character) |
| `\t\n\r` | tab, linefeed, carriage return |
| **Groups & Lookaround** | |
| `(abc)` | capture group |
| `\1` | backreference to group #1 |
| `(?:abc)` | non-capturing group |
| `(?=abc)` | positive lookahead |
| `(?!abc)` | negative lookahead |
| **Quantifiers & Alternation** | |
| `a* a+ a?` | 0 or more, 1 or more, 0 or 1 |
| `a{5} a{2,}` | exactly five, two or more |
| `a{1,3}` | between one & three |
| `a+? a{2,}?` | match as few as possible - [more details](#Greedy-and-Non-greedy-matching) |
| `ab\|cd` | match ab or cd |

### Matching newlines with the Dot character
:penguin: By passing `re.DOTALL` as the second argument to `re.compile()`, you can make the dot character match *all* characters, including the newline character.
[[2]]

```python=
>>> newlineRegex = re.compile('.*', re.DOTALL)
>>> newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.\nProtect the innocent.\nUphold the law.'
```

### Escape character
:point_right: To match these characters literally in a pattern, escape them with a backslash:
`\. \^ \$ \* \+ \? \{ \} \[ \] \\ \| \( \)`

:penguin: Inside a set (`[...]`), the characters `+`, `*`, `.`, `|`, `(`, `)`, `$`, `{`, `}` have no special meaning. [[4]]

### Greedy and Non-greedy matching
> :point_right: Python’s regular expressions are *greedy* by **default**, which means that in ambiguous situations they will **match the longest string possible**. The *non-greedy* (also called *lazy*) version of the braces, which **matches the shortest string possible**, has the closing brace followed by **a question mark**.[[2]]

Example:

```python=
>>> greedyHaRegex = re.compile(r'(Ha){3,5}')
>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'
>>> nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
>>> mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
>>> mo2.group()
'HaHaHa'
```

:bangbang: The **question mark** can have **two** meanings in regular expressions: declaring a *non-greedy* match or flagging an *optional* group. These meanings are entirely unrelated.[[2]]

List Comprehension
------------------
```python=
even_numbers = [i for i in range(5) if i % 2 == 0]    # [0, 2, 4]
pairs = [(x, y) for x in range(2) for y in range(3)]  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

# Common mistake: range[5] (square brackets) instead of range(5)
print(x for x in range[5]) # TypeError: 'type' object is not subscriptable
```

:exclamation: Avoid more than two nested loops in a comprehension; beyond that, readability suffers.

Generator
---------
:point_right: Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a `for` loop [[8]].

:penguin: To turn a function into a generator, use the keyword `yield` instead of `return`.
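
A minimal sketch of a generator function (the name `countdown` is a made-up example, not taken from the cited sources): each `yield` hands one value to the caller and pauses the function until the next value is requested.

```python
def countdown(n):
    """Yield n, n-1, ..., 1, one value at a time."""
    while n > 0:
        yield n  # pause here and hand n to the caller
        n -= 1

gen = countdown(3)
print(next(gen))  # 3
print(list(gen))  # [2, 1] (the values that were not consumed yet)

for value in countdown(2):
    print(value)  # prints 2, then 1
```

Calling `countdown(3)` runs none of the function body; the body only advances when `next()` or a `for` loop asks for a value.
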

:penguin: Generators are useful when the data is too large to fit in limited memory: a generator produces values on demand, one at a time, instead of loading everything at once.

![](https://miro.medium.com/max/1400/1*7X8rtWOiz5RKENZ_vugmKg.png)

Random
------
`random` is a standard-library module for generating pseudo-random numbers. Despite the term "random number generator", it is not a Python [generator](#Generator) function.

| Type | Functions | Descriptions |
| ---- | --------- | ------------ |
| Bookkeeping functions | `random.seed(a=None)` | Initialize the random number generator. If `a` is omitted or `None`, the current **system time** is used (or an OS-provided randomness source, if available). |
| Functions for *integers* | `random.randrange(stop)`<br/><br/>`random.randrange(start, stop[, step])` <br/><br/> `random.randint(a, b)` | Return a randomly selected element from `range(start, stop, step)`.<br/><br/>:bangbang: `random.randint(a, b)` is equivalent to `randrange(a, b+1)`. |
| Functions for *sequences* | `random.choice(seq)` | Return a random element from the non-empty *sequence* `seq`. If `seq` is empty, raises `IndexError` [[7]].<br/><br/>:bangbang: *sequence* types include: `str` `list` `tuple` `range` ([details](https://hackmd.io/gAhGf8pEQVaIkX7tiBy7tQ?view#Comparison))<br/><br/>:bangbang: `random.choice(sequence)` is equivalent to `sequence[random.randint(0, len(sequence) - 1)]` |
| | `random.shuffle(x)` | Shuffle the *sequence* `x` in place (so `x` must be mutable, e.g. a `list`). |
| Real-valued distributions | `random.random()` | Return the next random *floating point number* in the range `[0.0, 1.0)`. |

Map
---
:point_right: Often you want to apply a `function` (one designed to work with **a single input**) to every value in a `list`.

```python=
def add_2(n):
    return n+2

# Add 2 to every number using map()
numbers = [1, 2, 3, 4]
result = map(add_2, numbers)
print(list(result)) # [3, 4, 5, 6]
```

A `map` object is a lazy iterator, similar to a [generator](#Generator); it only produces values on request.

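
To see this on-demand behaviour, here is a small sketch. The `calls` list is an illustrative addition that records when the function actually runs; it is not part of `map` itself.

```python
calls = []  # records every input that add_2 has actually processed

def add_2(n):
    calls.append(n)
    return n + 2

result = map(add_2, [1, 2, 3, 4])
print(calls)         # [] : nothing has been computed yet
print(next(result))  # 3
print(calls)         # [1] : only the first element was processed
print(list(result))  # [4, 5, 6]
print(list(result))  # [] : a map object can only be consumed once
```

If the values are needed more than once, convert the `map` object to a `list` first.
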

Zip
---
:point_right: The `zip` function transforms multiple `iterables` into a single `iterable` of `tuples`.

<img src="https://miro.medium.com/max/1200/1*rzvEG0LqZBSfa1rJgNOJrQ.png" width=400>

```python=
>>> list(zip([1, 2, 3], 'abc'))
[(1, 'a'), (2, 'b'), (3, 'c')]
>>> list(zip(range(3), range(4, 7)))
[(0, 4), (1, 5), (2, 6)]
>>> list(zip({'a': 1, 'b': 2}, {'c': 3, 'd': 4})) # iterating a dict yields its keys
[('a', 'c'), ('b', 'd')]
```

A `zip` object is a lazy iterator, similar to a [generator](#Generator); it only produces values on request.

Arguments and Keyword Arguments
-------------------------------
Arguments (`*args` - `tuple`) and keyword arguments (`**kwargs` - `dict`) are useful when the number of function inputs is not known in advance.

```python=
def big_function(*args, **kwargs):
    print('args = ', args) # this is a tuple
    print('kwargs = ', kwargs) # this is a dictionary
    print("----------------")
    for name in args:
        print(f'Hi {name}')
    for key in kwargs:
        print(f'{key}: {kwargs[key]}')

# Valid: all positional arguments come before the keyword arguments
big_function('Minh', 'Tom', 1, ['Quan'], coder=True, school=True, test=False, hihi=6)

# Invalid (left commented out so the block still runs):
# big_function('Tom', 'Quan', 'Nhan', year='1992', school='coderschool', 'Ai')
# SyntaxError: positional argument follows keyword argument
```

Others
------
### Iterators
:point_right: The iterator objects themselves are required to support the following two methods, which together form the iterator protocol [[6]]:
- `iterator.__iter__()`: returns the iterator object itself; this allows both *containers* and *iterators* to be used with the **for** and **in** statements.
- `iterator.__next__()`: returns the *next item* from the iterator. If there are no further items, raises the **StopIteration** exception.

Which built-in data types are iterable? (Strictly speaking these are *iterables*, not iterators: calling `iter()` on them returns an iterator.)
- Mutable: `list` `set` `dict`
- Immutable: `str` `tuple` `range`

### Exception Handling
:point_right: When code in a `try` clause causes an error, the program execution immediately moves to the code in the `except` clause. After running that code, the execution continues as normal.
[[1]]

```python=
def spam(divideBy):
    try:
        return 42 / divideBy
    except ZeroDivisionError:
        print('Error: Invalid argument.')
```

### `PyInputPlus` library for input validation
PyInputPlus has several functions for different kinds of input [[5]]:

| Function | Description |
| -------- | ----------- |
| `inputStr()` | Is like the built-in `input()` function but has the general `PyInputPlus` features. You can also pass a custom validation function to it |
| `inputNum()` | Ensures the user enters a number and returns an `int` or `float`, depending on whether the number has a decimal point in it |
| `inputChoice()` | Ensures the user enters one of the provided choices |
| `inputMenu()` | Is similar to `inputChoice()`, but provides a menu with *numbered* or *lettered* options |
| `inputDatetime()` | Ensures the user enters *a date and time* |
| `inputYesNo()` | Ensures the user enters a `yes` or `no` response |
| `inputBool()` | Is similar to `inputYesNo()`, but takes a `True` or `False` response and returns a *Boolean* value |
| `inputEmail()` | Ensures the user enters a valid email address |
| `inputFilepath()` | Ensures the user enters a valid file path and filename, and can optionally check that a file with that name exists |
| `inputPassword()` | Is like the built-in `input()`, but displays `*` characters as the user types so that passwords, or other sensitive information, aren’t displayed on the screen |

These functions automatically reprompt the user for as long as they enter invalid input.

```python=
>>> import pyinputplus as pyip
>>> response = pyip.inputNum()
five
'five' is not a number.
42
>>> response
42
```

The `min`, `max`, `greaterThan`, and `lessThan` keyword arguments:

```python=
>>> response = pyip.inputNum('Enter num: ', min=4, lessThan=6)
Enter num: 6
Input must be less than 6.
Enter num: 3
Input must be at minimum 4.
Enter num: 4
>>> response
4
```

The `blank` Keyword Argument:

```python=
>>> response = pyip.inputNum(blank=True)
(blank input entered here)
>>> response
''
```

The `limit`, `timeout`, and `default` Keyword Arguments:

```python=
>>> import pyinputplus as pyip

# limit
>>> response = pyip.inputNum(limit=2)
blah
'blah' is not a number.
Enter num: number
'number' is not a number.
Traceback (most recent call last):
    --snip--
pyinputplus.RetryLimitException

# timeout
>>> response = pyip.inputNum(timeout=10)
42 (entered after 10 seconds of waiting)
Traceback (most recent call last):
    --snip--
pyinputplus.TimeoutException

# default
>>> response = pyip.inputNum(limit=2, default='N/A')
hello
'hello' is not a number.
world
'world' is not a number.
>>> response
'N/A'
```

The `allowRegexes` and `blockRegexes` Keyword Arguments:

```python=
>>> import pyinputplus as pyip

# allowRegexes
>>> response = pyip.inputNum(allowRegexes=[r'(I|V|X|L|C|D|M)+', r'zero'])
XLII
>>> response
'XLII'

# blockRegexes
>>> response = pyip.inputNum(blockRegexes=[r'[02468]$'])
42
This response is invalid.
44
This response is invalid.
43
>>> response
43
```

[1]: https://automatetheboringstuff.com/2e/chapter3/
[2]: https://automatetheboringstuff.com/2e/chapter7/
[3]: https://automatetheboringstuff.com/2e/chapter6/
[4]: https://www.w3schools.com/python/python_regex.asp
[5]: https://automatetheboringstuff.com/2e/chapter8/
[6]: https://docs.python.org/3/library/stdtypes.html#iterator-types
[7]: https://docs.python.org/3/library/random.html
[8]: https://wiki.python.org/moin/Generators

###### tags: `mle` `week3` `regex` `generator` `map` `zip` `random`