###### tags: FTMLE-Philipines-2020

# Week 2

<div style="text-align: justify">

The topics of this week are:
- **Intermediate Python**
- **Regular expressions**
- **Python OOP**
- **Data structures and algorithms**

</div>

## Monday 25.05.2020

### Intermediate Python

#### List comprehension

##### Generator

<div style="text-align: justify">

- A generator produces its values one at a time and does not store the whole sequence in RAM. Iterating over a generator is therefore cheaper than looping through a pre-defined list (which has to be stored first!).
- You can convert a generator into a list by using ```list()```.

```python
from itertools import islice

def natural_numbers():
    """returns 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1

data = natural_numbers()
evens = (x for x in data if x % 2 == 0)
even_squares = (x ** 2 for x in evens)
print(even_squares)                   # a generator object; nothing is computed yet
print(list(islice(even_squares, 5)))  # [4, 16, 36, 64, 100]
```

What if we run ```list(data)```? $\rightarrow$ It never finishes: the generator is infinite, so Python keeps growing the list until it runs out of memory (OOM) and Colab crashes. You tried to create an infinite list!

</div>

#### Automated Testing via assert

<div style="text-align: justify">

Use ```assert``` to check that a condition holds. If it does not, an ```AssertionError``` is raised with the optional message.

```python
assert 1 + 1 == 2
assert 1 + 1 == 3, "An error message"  # fails with AssertionError: An error message
```

You can put some test cases below a function to check whether it works correctly. For example

```python
def smallest_item(xs):
    return min(xs)

assert smallest_item([10, 20, 5, 40]) == 5
assert smallest_item([1, 0, -1, 2]) == -1
```

</div>

#### Randomness

<div style="text-align: justify">

We use the library ```random```

```python
import random
random.seed(102)  # this ensures we get the same results every time

four_uniform_randoms = [random.random() for _ in range(4)]
four_uniform_randoms
```

Here ```random.seed(102)``` fixes the seed to 102. Every time you run this code, the same numbers are generated. If no seed is specified, Python seeds the generator from the system (the current time or the operating system's randomness source), so the results differ between runs. In both cases ```random.random()``` returns a float in $[0, 1)$.
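To see the effect of the seed directly, here is a minimal sketch. The helper name ```draw``` and the second seed value 103 are made up for this illustration.

```python
import random

def draw(seed, count=4):
    """Seed the generator, then return `count` uniform random floats."""
    random.seed(seed)
    return [random.random() for _ in range(count)]

first = draw(102)
second = draw(102)   # same seed again
other = draw(103)    # a different (arbitrary) seed

print(first == second)  # True: the same seed reproduces the same sequence
print(first == other)   # False: another seed gives another sequence
```

Re-seeding resets the generator's internal state, which is why the first two lists are identical.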
We use a different function to draw an integer from a range:

```python
random.randrange(10)    # choose randomly from range(10) = [0, 1, ..., 9]
random.randrange(3, 6)  # choose randomly from range(3, 6) = [3, 4, 5]
```

We can use ```random.sample``` to draw several distinct samples

```python
lottery_numbers = range(49)
winning_numbers = random.sample(lottery_numbers, 6)  # e.g. [16, 36, 10, 6, 25, 9]
winning_numbers
```

or ```random.shuffle``` to shuffle a list in place. Example:

```python
up_to_ten = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
random.shuffle(up_to_ten)
print(up_to_ten)
```

#### Zip and Argument unpacking

What does the ```zip``` function do? We use ```zip``` to pair up the elements of several lists, position by position.

```python
list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]

# zip is lazy, so you have to do something like the following
[pair for pair in zip(list1, list2)]  # is [('a', 1), ('b', 2), ('c', 3)]
```

Unpacking: assign multiple variables at once!

- Remark on the asterisk ```*```: it takes all the values that are left and stores them as a list in one variable. For example, if we assign as below

```python
a = [1, 2, 3, 4]
x, y, z = a
```

- This raises ```ValueError: too many values to unpack (expected 3)```, because there are four values but only three variables. So we need to use the asterisk ```*``` as in the following code

```python
a = [1, 2, 3, 4]
x, y, *z = a
```

- This gives ```x = 1, y = 2, z = [3, 4]```
- Remark on keyword arguments ```**kargs```: a dictionary storing ***keys*** and ***values***

What if we combine the function ```zip``` and ```*```? The list of pairs is "unzipped" back into separate tuples.

```python
pairs = [('a', 1), ('b', 2), ('c', 3)]
list(zip(*pairs))  # is [('a', 'b', 'c'), (1, 2, 3)]
```

One more example to understand the keyword arguments ```kargs```

```python
def f(*args, **kargs):
    a, b = args
    operator = kargs['operator']
    print(args, kargs)
    if operator == 'Add':
        return a + b
    elif operator == 'Mul':
        return a * b
    else:
        return a - b

f(3, 2, divide_by_2=True, operator='Mul')
```

Here ```divide_by_2=True``` and ```operator='Mul'``` are passed as keyword arguments, each with a specific value.
The output of the above code is:

```python
(3, 2) {'divide_by_2': True, 'operator': 'Mul'}
6
```

The two lines ```a, b = args``` and ```operator = kargs['operator']``` unpack the positional arguments and read the keyword argument ```operator```; the call ```print(args, kargs)``` shows everything that was passed to ```f```. The parameter ```*args``` means that we can pass as many positional arguments as we want; they are collected into a tuple. The double asterisk ```**``` collects the keyword arguments into a dictionary.

</div>

#### Pointer in python

<div style="text-align: justify">

When we pass a list/dictionary into a function and modify it inside the function, the original list/dictionary is modified too. One should keep in mind the following example:

```python
a = [1, 2, 3]
b = a
b[0] = 99
print(a)
```

The output is ```[99, 2, 3]```. That means, once we change ```b```, we also change ```a```, because both names refer to the same list. In order to change ```b``` without affecting ```a```, we need to use the method ```_.copy()```.

```python
a = [1, 2, 3]
b = a.copy()  # make a copy of a and store it in b
b[0] = 99
print(a)      # [1, 2, 3]
```

The concept of **local** and **global** variables.

```python
global a  # inside a function: assignments to a now change the global variable a
```

</div>

#### Python Decorator

<div style="text-align: justify">

We can pass a function as an input of a function!

```python
def say_hello(name):
    return f"Hello {name}"

def be_awesome(name):
    return f"Yo {name}, together we are the awesomest!"

def greet_bob(greeter_func):
    return greeter_func("Bob")

print(greet_bob(say_hello))
print(greet_bob(be_awesome))
```

We can define a function inside another function!
```python
def parent():
    print("Printing from the parent() function")

    def first_child():
        print("Printing from the first_child() function")

    def second_child():
        print("Printing from the second_child() function")

    second_child()
    first_child()

parent()
```

**Function Wrapper and Debugger**

Check out the following examples. Let us define a function called ```greeting```

```python
def greeting(*args, **kwargs):
    print(args)
    print(kwargs)
    name = args[0]
    domain = kwargs['domain']
    print(f'Hello {name} with email {name}@{domain}')

greeting('minhdh', domain='coderschool.vn')
```

We want to add some **decorator** here! Why do we need that?

```python
def my_decorator(func):
    def wrapper(*args, **kwargs):
        print("Something is happening before the function is called.")
        func(*args, **kwargs)
        print("Something is happening after the function is called.")
    return wrapper

@my_decorator
def greeting(name):
    """Just a docstring"""
    print("Hi", name)

greeting('Minh')
```

So, ```@my_decorator``` is just an easier way of saying ```greeting = my_decorator(greeting)```. It's how you apply a decorator to a function. But how does it work? Let's analyze the code above and then give one more example!

- Notice that the syntax ```@my_decorator``` is placed on top of the function ```greeting```; that means we are applying the decorator ```my_decorator``` to the function ```greeting```. Here is how the program flows up to the last command ```greeting('Minh')```:
  - When Python reaches ```def greeting```, the function ```greeting``` is passed to the decorator ```my_decorator``` as the input ```func```, and the name ```greeting``` is rebound to whatever the decorator returns.
  - Inside the function ```my_decorator```, the function ```wrapper``` is defined and returned. It accepts the same arguments as ```greeting```. So when we execute ```greeting('Minh')```, we actually call ```wrapper('Minh')```. Roughly speaking, it just executes the original function ```greeting``` with some extra work before and after.
  - But what is this extra work?
It is usually some condition that needs to be checked before executing the function ```func``` (```greeting``` in this example).

```python
def my_decorator(func):
    def wrapper(*args, **kwargs):
        print("Something is happening before the function is called.")
        func(*args, **kwargs)
        print("Something is happening after the function is called.")
    return wrapper
```

- Now we try the command ```help(greeting)```. It shows

```
Help on function wrapper in module __main__:

wrapper(*args, **kwargs)
```

which means that the name ```greeting``` now refers to ```wrapper```, so the name and docstring of ```greeting``` have been replaced by those of ```wrapper```. In order to preserve the docstring of ```greeting```, we have to add ```@functools.wraps(func)```

```python
import functools

def my_decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print("Something is happening before the function is called.")
        func(*args, **kwargs)
        print("Something is happening after the function is called.")
    return wrapper

@my_decorator
def greeting(name):
    """The docstring of greeting"""
    print("Hi", name)

greeting('Minh')
print('-----------------')
help(greeting)
```

**In general, the ```wrapper``` function inside the ```decorator``` function takes the following form**

```python
import functools

def decorator(func):
    @functools.wraps(func)
    def wrapper_decorator(*args, **kwargs):
        # Do something before
        value = func(*args, **kwargs)
        # Do something after
        return value
    return wrapper_decorator
```

Let us consider the next example to get a clearer understanding of the advantage of using decorators! Suppose that we have a list of dictionaries storing customers' information

```python
customers = [
    {'name': 'A', 'phone': '0981231234', 'age': 10},
    {'name': 'B', 'phone': '0981231234123123', 'age': 100},
    {'name': 'C', 'phone': '0981231235', 'age': 20},
    {'name': 'D', 'phone': '0981231236', 'age': 30},
]
```

We notice that one entry in it is fake. Our task is to find and eliminate it. We only extract real information.
In order to determine which entry is fake, the number of digits in the ``'phone'`` number will be our criterion: a valid phone number has exactly 10 digits. We first write a function that prints the information of all real customers.

```python
import functools

def valid_phone_number(func):
    @functools.wraps(func)
    def wrapper(customers):
        # Do something before
        valid_customers = []
        for cus in customers:
            if len(cus['phone']) == 10:
                valid_customers.append(cus)
        func(valid_customers)
        # Do something after
    return wrapper

@valid_phone_number
def read_customer_info(customers):
    for cus in customers:
        print(cus)

read_customer_info(customers)
```

When ```read_customer_info``` is defined, it is passed to the decorator, which is ```valid_phone_number()``` in this case, as an input. When we then execute ```read_customer_info(customers)```, the input ```customers``` goes to the ```wrapper``` function. Instead of checking the valid-phone-number condition inside the function ```read_customer_info```, we now check it in the decorator ```valid_phone_number```, or more precisely, inside the function ```wrapper```. The extra work that we mentioned in the previous example is here the check of the valid-phone-number condition.

```python
def valid_phone_number(func):
    def wrapper(customers):
        # Do something before executing func (= read_customer_info): this is the extra work!
        valid_customers = []
        for cus in customers:
            if len(cus['phone']) == 10:
                valid_customers.append(cus)
        func(valid_customers)  # execute the main function read_customer_info
        # Do something after
    return wrapper
```

Next, the **advantage** of using this decorator is that we can re-use it very easily! Suppose that we now want to extract another piece of information that is also based on the valid-phone-number condition. Let's say we want to calculate the mean age of real customers.
All we need to do is

```python
@valid_phone_number
def mean_age(customers):
    s = 0
    for cus in customers:
        s += cus['age']
    print('Mean age:', s / len(customers))

mean_age(customers)  # Mean age: 20.0
```

If we did not use the decorator as above, we would have to write two separate functions, each containing exactly the same part that checks the valid-phone-number condition! Thanks to the decorator, we just need to place it on top of the ```mean_age``` function and get the result. Roughly speaking, the decorator lets us check a common condition once and re-use it for several different functions in our program. Overall, the program becomes cleaner, better organized and easier to read. The complete program:

```python
import functools

customers = [
    {'name': 'A', 'phone': '0981231234', 'age': 10},
    {'name': 'B', 'phone': '0981231234123123', 'age': 100},
    {'name': 'C', 'phone': '0981231235', 'age': 20},
    {'name': 'D', 'phone': '0981231236', 'age': 30},
]

def valid_phone_number(func):
    @functools.wraps(func)
    def wrapper(customers):
        # Do something before
        valid_customers = []
        for cus in customers:
            if len(cus['phone']) == 10:
                valid_customers.append(cus)
        func(valid_customers)
        # Do something after
    return wrapper

@valid_phone_number
def read_customer_info(customers):
    for cus in customers:
        print(cus)

@valid_phone_number
def mean_age(customers):
    s = 0
    for cus in customers:
        s += cus['age']
    print('Mean age:', s / len(customers))

mean_age(customers)
```

**KEEP IN MIND HOW TO USE THE DECORATORS !!!!**

**Solve Repl.it exercise 6G by using generator**

My solution

```python
x = int(input())
lst = [1, 1]
temp = 2
while temp <= x:
    lst.append(lst[-1] + lst[-2])
    temp = lst[-1] + lst[-2]
if x in lst:
    print(len(lst))
else:
    print(-1)
```

Solution that keeps only the last two values instead of storing the whole list

```python
def fibo_index(n):
    a, b = 1, 1
    index = 2
    while n > b:  # advance until b reaches or passes n
        a, b = b, a + b
        index += 1
    # n is a Fibonacci number only if we landed exactly on it
    return index if b == n else -1
```

</div>

### Regular expression

<div style="text-align: justify">

A regular expression is a way to define patterns of strings, so that you can search or manipulate text!

Exercises on www.regexone.com

There are some important functions for regular expressions:

- ```re.search(pattern, string)``` Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
- ```re.match(pattern, string)``` If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match. Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line. If you want to locate a match anywhere in string, use search() instead.
- ```re.findall(pattern, string)``` Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
- ```re.sub(pattern, repl, string)``` Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, ```\n``` is converted to a single newline character, ```\r``` is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as ```\&``` are left alone. Backreferences, such as ```\6```, are replaced with the substring matched by group 6 in the pattern.
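The four functions above can be tried on a short string. This is only a sketch: the sample text and the simplified e-mail pattern are made up for the illustration and are not a complete e-mail validator.

```python
import re

# Hypothetical sample text
text = "Contact: minh@example.com, mia@example.org"

# re.search finds the first match anywhere in the string
first = re.search(r"\w+@\w+\.\w+", text)
print(first.group())                    # minh@example.com

# re.match only matches at the beginning of the string
at_start = re.match(r"\w+@", text)
print(at_start)                         # None: the string starts with "Contact:"

# re.findall with one group returns the list of that group's matches
users = re.findall(r"(\w+)@", text)
print(users)                            # ['minh', 'mia']

# re.sub with backreferences \1 and \2 to the two groups
masked = re.sub(r"(\w+)@(\w+)\.\w+", r"\1 at \2", text)
print(masked)                           # Contact: minh at example, mia at example
```

Raw strings (```r"..."```) are used so that backslashes reach the ```re``` module unchanged.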
</div>

## Tuesday 26.05.2020

<div style="text-align: justify">

### Big-O notation

**Definition** (Big-O notation) In mathematics, big-O notation describes the limiting behaviour of a function when the argument tends to some specific value or to infinity. In computer science, people use big-O notation to analyze algorithms. Precisely, it describes how the run time and space requirements of an algorithm grow with the input size $n$. We will usually see something like $O(\log n)$, $O(n)$, $O(n^2)$, ..., where ```n``` is the input's size. We have the following formal mathematical definition:

Let $f$ be a complex valued function and $g$ be a real valued function. We say that
$$ f(x) = O(g(x)) \qquad \text{as } x\to \infty $$
if, for all sufficiently large $x$, the absolute value of $f(x)$ is at most $g(x)$ up to a constant factor $M$. That is, $f(x) = O(g(x))$ if there exist a positive real number $M$ and a real number $x_0$ such that
$$|f(x)| \leq M g(x) \qquad \text{for all } x\geq x_0$$

Let us illustrate how this expression can be used to analyze the runtime of an algorithm. In analyzing an algorithm, we are interested in its worst case, which means the case that takes the longest time to process. Let's think of a graph:

<div style="text-align: center">

![](https://i.imgur.com/jspbvd4.png)

</div>

The horizontal coordinate is $n$, the size of the input. In general, the runtime grows as the input's size grows. Let's say our algorithm takes $O(f(n))$. This means that we can draw the bound on the runtime like the blue line $k \cdot f(n)$, where $k$ is some positive constant (see again the mathematical definition of big-O). This represents the worst case of running this algorithm. But it does not mean that the algorithm takes that much time every time we execute it! The actual runtime can be anything below that blue line; the red line is an example.
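As a small numeric illustration of the definition, the sketch below checks a bound of the form $|f(n)| \leq M g(n)$ for $n \geq x_0$. The step-count function ```f``` is made up for the example; the constants $M = 4$ and $x_0 = 5$ are one possible choice that works for it, since $3n^2 + 5n \leq 4n^2$ exactly when $n \geq 5$.

```python
def f(n):
    """Hypothetical step count of some algorithm: 3n^2 + 5n."""
    return 3 * n**2 + 5 * n

def g(n):
    """The simpler comparison function n^2."""
    return n**2

M, x0 = 4, 5  # constants witnessing f(n) = O(n^2)

# The bound holds from x0 onwards (checked here on a finite range only)
holds_from_x0 = all(f(n) <= M * g(n) for n in range(x0, 10_000))
print(holds_from_x0)        # True

# Below x0 the bound is allowed to fail: f(4) = 68 but M * g(4) = 64
print(f(4) <= M * g(4))     # False
```

The finite check is of course not a proof; the proof is the one-line inequality in the paragraph above.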
Note that the constant $k$ is not important here: $O(k \cdot f(n))$ and $O(f(n))$ describe the same class of runtime functions. In summary, we use big-O notation for **asymptotic upper bounds**, since it bounds the growth of the running time from above for large enough input sizes.

Based on this big-O notation, we can estimate how the runtime changes when we increase the input size. Let's consider the following example:

```python
@timer
def bubble_sort(a):
    """Sort the list a"""
    n = len(a)
    for j in range(n-1):
        for i in range(n-1):
            if a[i] > a[i+1]:
                a[i], a[i+1] = a[i+1], a[i]
```

We are using a ```@timer``` decorator (not shown here) to see the runtime of ```bubble_sort()```. Because of the two nested loops over $n-1$ positions, bubble sort takes $O(n^2)$ time: doubling the input size roughly quadruples the runtime.

### Python OOP

Consider the example of cats and dogs in the colab_notebook and try to compare the normal way and the OOP way. ```class Cat(Animal)``` inherits from the class ```Animal```. Some things are awkward when data is stored in plain dictionaries, such as attaching behaviour (methods) to the data and sharing it through inheritance; defining a ```class``` makes them possible.

```python
class Cat(Animal):
    def __init__(self, name, age):
        ...  # body omitted in these notes
```

Names of the form ```__function-name__``` (two underscores on each side) are special methods in Python. They are defined inside a class to customize behaviour that is already built into Python, such as object creation (```__init__```), printing (```__repr__```) or operators (```__add__```).
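Since the notebook's cats-and-dogs code is not reproduced in these notes, here is a minimal sketch of what such a hierarchy could look like. The attributes, the ```speak``` method and the pet names are assumptions made for this illustration; the notebook's version may differ.

```python
class Animal:
    """Hypothetical base class; not the notebook's actual code."""

    def __init__(self, name, age):
        self.name = name
        self.age = age

    def speak(self):
        return "..."

    def __repr__(self):
        # type(self).__name__ gives 'Cat' or 'Dog' for the subclasses
        return f"{type(self).__name__}(name={self.name!r}, age={self.age})"


class Cat(Animal):
    def speak(self):  # overrides Animal.speak
        return "Meow"


class Dog(Animal):
    def speak(self):
        return "Woof"


pets = [Cat("Tom", 3), Dog("Rex", 5)]
for pet in pets:
    print(pet, "says", pet.speak())
# Cat(name='Tom', age=3) says Meow
# Dog(name='Rex', age=5) says Woof
```

Both subclasses reuse ```__init__``` and ```__repr__``` from ```Animal``` and only override ```speak```, which is the kind of sharing that is hard to get with plain dictionaries.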
For example, we can also define mathematical operators, as in the following exercise

```python
class Fraction():
    def __init__(self, num, de):
        self.num = num
        self.de = de
        # Reduce the fraction to lowest terms
        divisor = self.gcd(self.num, self.de)
        self.num = self.num // divisor
        self.de = self.de // divisor

    def gcd(self, a, b):
        while b > 0:
            a, b = b, a % b
        return a

    def __add__(self, other):
        return Fraction(self.num * other.de + other.num * self.de, self.de * other.de)

    def __sub__(self, other):
        return Fraction(self.num * other.de - other.num * self.de, self.de * other.de)

    def __mul__(self, other):
        return Fraction(self.num * other.num, self.de * other.de)

    def __repr__(self):
        return str(self.num) + "/" + str(self.de)

a = Fraction(4, 8)
b = Fraction(2, 1)
print(a * b)  # 1/1
```

In the class ```Fraction``` above, we define the methods ```__add__```, ```__sub__``` and ```__mul__```, so that the operators $+, -, \times$ work on fractions just as they do on ordinary numbers.

One more example showing a ```class```, some of its attributes, and its subclasses with their own attributes.
```python
class Employee():
    def __init__(self, name, gender, email, salary, raising_rate=1.2):
        self.name = name
        self.gender = gender
        self.email = email
        self.salary = salary
        self.raising_rate = raising_rate

    def contact_info(self):
        return f'{self.name} - {self.email}'

    @classmethod
    def raising_salary(cls, employee):
        employee.salary = employee.salary * employee.raising_rate

class Developer(Employee):
    def __init__(self, name, gender, email, salary, programming_language):
        super().__init__(name, gender, email, salary)
        self.programming_language = programming_language

class Manager(Employee):
    def __init__(self, name, gender, email, salary, developers):
        super().__init__(name, gender, email, salary)
        self.developers = developers
        self.raising_rate = 1.5

    def team_members(self):
        for dev in self.developers:
            print(dev.contact_info())

minh = Developer('Minh', 'male', 'minhdh@coderschool.vn', 1000, 'Python')
mia = Developer('Mia', 'female', 'mia@coderschool.vn', 2000, 'C')
charles = Manager('Charles', 'male', 'sang@coderschool.vn', 1, [minh, mia])
```

The base ```class``` is ```Employee```, and the child classes ```Developer``` and ```Manager``` inherit from it. We define a class method by writing

```python
@classmethod
def raising_salary(cls, employee):
    employee.salary = employee.salary * employee.raising_rate
```

This means that the method ```raising_salary``` belongs to the class, so we call it through the class:

```python
Employee.raising_salary(charles)
```

</div>

## Wednesday 27.05.2020

### Numpy

Some remarks on ```numpy```

- The library ```numpy``` is much faster than a normal Python ```list``` when working with arrays of numbers, especially multi-dimensional arrays. See the chart comparing the time needed in the lecture note.
- A dimension can only be removed (e.g. with ```np.squeeze```) if its length is 1.
- The most important example of using ```numpy``` today is image processing, since a digital image can be considered a 2D array in which each element is a value representing the color.
Actually, a color image is a 3D tensor: there are 3 layers, and each layer is a 2D array.

```python
!wget -q 'https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/images/street.jpg'  # photo sample

import matplotlib.pyplot as plt
import numpy as np
from skimage.io import imread  # the library we use to read the image

%matplotlib inline

photo = imread('street.jpg')

# Show the original image
def show_image(photo):
    plt.figure(figsize=(12, 12))
    plt.axis("off")
    plt.imshow(photo)

show_image(photo)
```

We can rotate the image by reversing the array

```python
# Rotation 180 degree
# Hint: syntax to reverse an array in Python: '::-1'
# Reversing only the rows (photo[::-1]) flips the image upside down;
# reversing rows and columns together rotates it by 180 degrees.
photo_flip = photo[::-1, ::-1]
show_image(photo_flip)
```

It is simply done with numpy's slicing. We can also reverse the columns to get a mirror image

```python
# reverse the columns, so we get a mirror image
photo_mirror = np.flip(photo, axis=1)
show_image(photo_mirror)
```

Crop the image by dropping some rows and columns of the array

```python
# Crop 10% of the width and height of the photo
# Hint: coordinate of the top left pixel is (0, 0)
# Note: this keeps the top-left 90% of the rows and columns
a = photo.shape
print('Original shape', a)
b0 = int(a[0] - a[0]*0.1)
b1 = int(a[1] - a[1]*0.1)
crop_photo = photo[:b0, :b1, :]
show_image(crop_photo)
print('new shape', crop_photo.shape)
```

Reduce the resolution of the image

```python
# Reduce the quality (keep only 1 of every 4 pixels along both axes)
# Double check by printing out the shape
small_photo = photo[::4, ::4]
print(small_photo.shape)
show_image(small_photo)
```

And finally we can also separate the 3 layers of the image

```python
# The color system is RGB with 3 layers Red, Green, and Blue
# Write a function that takes the index of a layer (0 to 2) as input
# and displays that layer on the screen.
# (not in one line this time)
def show_layer(n, photo):
    """Plot the n_th layer of the image.
    n is in [0, 1, 2]
    """
    # Start from an all-black image and copy only the chosen layer into it
    black_img = np.zeros(photo.shape, dtype='uint8')
    layer = photo[:, :, n]
    black_img[:, :, n] = layer
    show_image(black_img)

show_layer(1, photo)
```

Calling ```show_layer``` with 0, 1 and 2 shows a red image, a green image and a blue image, respectively. See the exercise note.

## Thursday 28.05.2020

<div style="text-align: justify">

SQL: do the assignments on www.sqlbolt.com to get familiar with the SQL commands.

Remark: SQL Lesson 12: Order of execution of a Query.

```SQL
SELECT DISTINCT column, AGG_FUNC(column_or_expression), …
FROM mytable
    JOIN another_table
      ON mytable.column = another_table.column
    WHERE constraint_expression
    GROUP BY column
    HAVING constraint_expression
    ORDER BY column ASC/DESC
    LIMIT count OFFSET COUNT;
```

What is executed first?

1. ```FROM``` and ```JOIN```s
   The FROM clause, and subsequent JOINs are first executed to determine the total working set of data that is being queried. This includes subqueries in this clause, and can cause temporary tables to be created under the hood containing all the columns and rows of the tables being joined.
2. ```WHERE```
   Once we have the total working set of data, the first-pass WHERE constraints are applied to the individual rows, and rows that do not satisfy the constraint are discarded. Each of the constraints can only access columns directly from the tables requested in the FROM clause. Aliases in the SELECT part of the query are not accessible in most databases since they may include expressions dependent on parts of the query that have not yet executed.
3. ```GROUP BY```
   The remaining rows after the WHERE constraints are applied are then grouped based on common values in the column specified in the GROUP BY clause. As a result of the grouping, there will only be as many rows as there are unique values in that column. Implicitly, this means that you should only need to use this when you have aggregate functions in your query.
4. ```HAVING```
   If the query has a GROUP BY clause, then the constraints in the HAVING clause are then applied to the grouped rows, discarding the grouped rows that don't satisfy the constraint. Like the WHERE clause, aliases are also not accessible from this step in most databases.
5. ```SELECT```
   Any expressions in the SELECT part of the query are finally computed.
6. ```DISTINCT```
   Of the remaining rows, rows with duplicate values in the column marked as DISTINCT will be discarded.
7. ```ORDER BY```
   If an order is specified by the ORDER BY clause, the rows are then sorted by the specified data in either ascending or descending order. Since all the expressions in the SELECT part of the query have been computed, you can reference aliases in this clause.
8. ```LIMIT / OFFSET```
   Finally, the rows that fall outside the range specified by the LIMIT and OFFSET are discarded, leaving the final set of rows to be returned from the query.

**Check the exercises on SQL**

</div>

## Friday 29.05.2020

<div style="text-align: justify">

Morning session: REVIEW THE FIRST 2 WEEKS

Afternoon session: MODULE TEST.

**Weekly Project 2**

We initialize the project by importing the necessary libraries. A database named "tikiproject1.db" is created.

```python
conn = sqlite3.connect('tikiproject1.db')
cur = conn.cursor()
```

The above code creates a connection to the database. Notice that we also have to drop tables left over from a previous run. If the database is filled with both old and new data, the result could be wrong.

```python
# ---------------------------------------------------------------------
# Import libraries
from bs4 import BeautifulSoup
import requests
import sqlite3
import re
import pandas as pd
# ---------------------------------------------------------------------
TIKI_URL = 'https://tiki.vn'

# Create database and make a connection.
conn = sqlite3.connect('tikiproject1.db')
cur = conn.cursor()
# ---------------------------------------------------------------------
# Drop existing tables so that old and new data are not mixed
# ---------------------------------------------------------------------
# IF EXISTS avoids an error on the first run, when no table exists yet,
# and makes sure one missing table does not skip the remaining drops.
for table in ('categories', 'sub_categories', 'products'):
    cur.execute(f'DROP TABLE IF EXISTS {table};')
# ---------------------------------------------------------------------
```

In the database, we want to create 3 tables containing the following information:
- table ```categories``` contains all the main categories fetched from tiki.vn
- table ```sub_categories``` contains all the sub-categories of the above main categories. Note that these sub-categories are of all layers, which means there could be a sub-category that is itself a sub-category of another sub-category. In our program, we do not take this into account. We just focus on pointing out the lowest sub-category and its products.
- table ```products``` contains the products listed under the sub-categories.

```python
#-----------------------------------------------------------------------
# Create Main categories table "categories"
def create_Main_categories_table():
    query = """
        CREATE TABLE IF NOT EXISTS categories (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name VARCHAR(255),
            url TEXT,
            create_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """
    # Execute the SQL query.
    try:
        cur.execute(query)
    except Exception as err:
        print('ERROR BY CREATE TABLE', err)

create_Main_categories_table()

#-----------------------------------------------------------------------
# Create Sub-categories table "sub_categories"
def create_Sub_categories_table():
    query = """
        CREATE TABLE IF NOT EXISTS sub_categories (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name VARCHAR(255),
            url TEXT,
            parent_id INTEGER,
            create_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """
    try:
        cur.execute(query)
    except Exception as err:
        print('ERROR BY CREATE TABLE', err)

create_Sub_categories_table()

#-----------------------------------------------------------------------
# Create Products table "products"
def create_products_table():
    query = """
        CREATE TABLE IF NOT EXISTS products (
            product_id INTEGER PRIMARY KEY AUTOINCREMENT,
            name VARCHAR(255),
            price VARCHAR(255),
            url TEXT,
            sub_cat_id INTEGER,
            create_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """
    try:
        cur.execute(query)
    except Exception as err:
        print('ERROR BY CREATE TABLE', err)

create_products_table()
```

Next, we define two classes: ```Category``` for categories (both Main and Sub-categories) and ```products``` for products. Their instance methods are ```save_into_mainCat()``` and ```save_into_subCat()``` for ```Category```, and ```save_into_products()``` for ```products```. These methods push the information into the table corresponding to its type. (We have three separate tables!)

```python
#-----------------------------------------------------------------------
# Define the class Category for all Main Categories and Sub Categories.
# Note that only sub-categories have a parent_id, which points to the Main Category
# (or to another sub-category) that is its parent.
class Category:
    def __init__(self, name, url, parent_id=None, cat_id=None):
        self.cat_id = cat_id
        self.name = name
        self.url = url
        self.parent_id = parent_id

    def __repr__(self):
        return f"ID: {self.cat_id}, Name: {self.name}, URL: {self.url}, Parent: {self.parent_id}"

    def save_into_mainCat(self):
        query = """
            INSERT INTO categories (name, url)
            VALUES (?, ?);
        """
        val = (self.name, self.url)
        try:
            cur.execute(query, val)
            self.cat_id = cur.lastrowid
        except Exception as err:
            print('ERROR BY INSERT:', err)

    def save_into_subCat(self):
        query = """
            INSERT INTO sub_categories (name, url, parent_id)
            VALUES (?, ?, ?);
        """
        val = (self.name, self.url, self.parent_id)
        try:
            cur.execute(query, val)
            self.cat_id = cur.lastrowid
        except Exception as err:
            print('ERROR BY INSERT:', err)

class products:
    def __init__(self, name, price, url, sub_cat_id, product_id=None):
        self.name = name
        self.price = price
        self.url = url
        self.sub_cat_id = sub_cat_id
        self.product_id = product_id

    def save_into_products(self):
        query = """
            INSERT INTO products (name, price, url, sub_cat_id)
            VALUES (?, ?, ?, ?);
        """
        val = (self.name, self.price, self.url, self.sub_cat_id)
        try:
            cur.execute(query, val)
            self.product_id = cur.lastrowid
        except Exception as err:
            print('ERROR BY INSERT:', err)
```

After defining the database and classes, we now start to scrape the information from tiki and classify it. First of all, we need to send a request to Tiki.vn

```python
# Get request TIKI
def get_url(url):
    """Get parsed HTML from url
      Input: url to the webpage
      Output: Parsed HTML text of the webpage
    """
    # print('Getting URL')
    # Send GET request
    r = requests.get(url)
    # Parse HTML text
    tiki = BeautifulSoup(r.text, 'html.parser')
    return tiki
```

Then get all main categories on the left-hand side of the webpage

```python
#-----------------------------------------------------------------------
# Get all main categories.
#-----------------------------------------------------------------------
def get_main_categories(save_db=False):
    # print('Getting Main Categories')
    tiki = get_url(TIKI_URL)
    result = []
    for a in tiki.find_all('a', {'class': 'MenuItem__MenuLink-sc-181aa19-1 fKvTQu'}):
        name = a.find('span', {'class': 'text'}).text
        url = a['href']
        main_cat = Category(name, url)
        if save_db:
            main_cat.save_into_mainCat()
        result.append(main_cat)
    return result
```

Then, for each main category, we want to get its subcategories. Since there could be a subcategory that lies under another subcategory, we need to get down to the lowest layer of categories.

```python
#-----------------------------------------------------------------------
# For each main category, we get all subcategories
#-----------------------------------------------------------------------
def get_sub_categories(parent_category, save_db=False):
    # print('Getting Sub Categories')
    url = parent_category.url
    result = []
    try:
        soup = get_url(url)
        div_containers = soup.find_all('div', {'class': 'list-group-item is-child'})
        for div in div_containers:
            name = div.a.text
            # Collapse runs of whitespace (raw string so that \s reaches re unchanged)
            name = re.sub(r'\s{2,}', ' ', name)
            name = name.replace('\n', '')
            url = TIKI_URL + div.a['href']
            cat = Category(name, url, parent_category.cat_id)
            if save_db:
                cat.save_into_subCat()
            result.append(cat)
    except Exception as err:
        print('ERROR BY GET SUB CATEGORIES:', err)
    return result
```

```python
#-----------------------------------------------------------------------
# Now we combine the two functions above to take all sub-categories in all levels
# Note that we use a recursive function here.
#-----------------------------------------------------------------------
def get_all_categories(categories):
    # print('Getting All Sub-Categories!')
    result = []
    if len(categories) == 0:
        return result  # always return a list, even when there is nothing to do
    for cat in categories:
        sub_categories = get_sub_categories(cat, save_db=True)
        result.append(sub_categories)
        # Deeper levels are saved into the database by the recursive call,
        # but only the first level is returned to the caller.
        get_all_categories(sub_categories)
    return result
```

Then, we get all the products listed in each sub-category

```python
# Function to get num_page pages from a sub_category. In order to do that,
# we just need to manipulate the url.
def get_product_multipage(sub_category, save_db=False, num_page=1):
    # print('Getting products')
    original_url = sub_category.url
    data = []
    for page in range(1, num_page + 1):
        url = original_url + '&page=' + str(page)
        tiki = get_url(url)
        try:
            products_tiki = tiki.find_all('div', {'class': 'product-item'})
            for p in products_tiki:
                d = {}
                d['name'] = p['data-title']
                d['price'] = int(p['data-price'])
                d['url'] = p.a['href']
                # Attention: This is the sub_cat_id of the lowest sub-category.
                d['sub_cat_id'] = sub_category.cat_id
                product = products(d['name'], d['price'], d['url'], d['sub_cat_id'])
                if save_db:
                    product.save_into_products()
                data.append(d)
        except Exception as err:
            print('ERROR BY GET PRODUCTS', err)
    return data
```

The final function combines all of the above features; it gets all the information, including Main categories, Sub categories and the products listed in ```num_page``` pages.

```python
# all in one function!
def Tiki_please_give_me_the_data(url="https://tiki.vn"):
    # print('Getting TIKI PLEASE')
    # First we get the main categories and store them in the table Main Categories
    print('Get Main Categories')
    main_categories = get_main_categories(save_db=True)
    print('Get Main Categories DONE')

    # We get the subcategories of each main category and store them in the table Sub Categories
    print('Get all Sub Categories')
    sub_categories = get_all_categories(main_categories)
    print('Get all Sub Categories DONE')

    # Define output of products
    output = []
    print('Get All the Products')
    for subcat in sub_categories:
        for cat in subcat:
            product = get_product_multipage(cat, save_db=True, num_page=2)
            output.append(product)
    return output

output = Tiki_please_give_me_the_data()
conn.commit()  # persist the inserted rows to tikiproject1.db

pd.read_sql_query(
    """
    SELECT p.name AS ProductName, p.price AS Price, s.name AS Category
    FROM products AS p
    JOIN sub_categories AS s ON p.sub_cat_id = s.id
    """,
    conn,
)
```

</div>
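The join used in the final query, together with the execution order from Thursday's notes, can be tried without scraping anything. The sketch below uses an in-memory SQLite database; the two simplified tables and all rows in them are made up for the illustration and are not scraped data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk
cur = conn.cursor()
cur.execute("CREATE TABLE sub_categories (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE products (name TEXT, price INTEGER, sub_cat_id INTEGER)")

# Hypothetical sample rows
cur.executemany(
    "INSERT INTO sub_categories (id, name) VALUES (?, ?)",
    [(1, "Books"), (2, "Phones")],
)
cur.executemany(
    "INSERT INTO products (name, price, sub_cat_id) VALUES (?, ?, ?)",
    [("Novel", 100, 1), ("Comic", 50, 1), ("Phone X", 900, 2)],
)
conn.commit()

# FROM/JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
query = """
    SELECT s.name AS category, COUNT(*) AS n_products, AVG(p.price) AS avg_price
    FROM products AS p
    JOIN sub_categories AS s ON p.sub_cat_id = s.id
    WHERE p.price >= 50
    GROUP BY s.name
    HAVING COUNT(*) >= 1
    ORDER BY avg_price DESC
"""
rows = cur.execute(query).fetchall()
print(rows)  # [('Phones', 1, 900.0), ('Books', 2, 75.0)]
conn.close()
```

Note that the alias ```avg_price``` can be used in ```ORDER BY``` because the ```SELECT``` expressions have already been computed at that step, while ```HAVING``` repeats ```COUNT(*)``` instead of relying on an alias.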