
---
tags: FTMLE-Philipines-2020
---

###### tags: FTMLE-Philipines-2020

Nguyen Trong Hieu - trhnguyenvn(at)gmail.com

# Week 1

## Monday 18.05.2020

<div style="text-align: justify">

The main topic today is an introduction to Data Science and Machine Learning. Several definitions are given to get familiar with the concepts.

</div>

### What is Data?

<div style="text-align: justify">

Data are characteristics or information, usually numerical, that are collected through observation. In general, any collection of facts, such as measurements, numbers, words, or observations, is data. From a technical point of view, data can be classified into two types:

- Qualitative data
- Quantitative data

Roughly speaking, **qualitative** data (***descriptive information***) is information that cannot be measured in a numerical unit. Qualitative data represents the status or a characteristic of an object, for example the color of a fruit or the taste of a meal.

**Quantitative** data (***numerical information***) can be represented by a numerical value. Some examples are distance (10 meters, 100 meters), length, height, and weight. An easy way to check whether data is quantitative is to ask the question "How many/How much?".

The two subsets of quantitative data are

- Discrete data
- Continuous data

</div>

### Why do we need data?

<div style="text-align: justify">

To understand why we need data, let us go back to the 1960s-1970s. Back then, the emphasis of computer science was on programming languages and on how to make computers useful. As the power of computers has gradually increased, a challenging task we now ask of computers is to "make decisions by themselves". As human beings, we judge a situation by observing it and using our own knowledge and experience. For computers, data is their "experience".
The computer cannot think like us, but it can analyze data systematically, apply mathematical algorithms, handle extremely large amounts of information, and finally give us the hidden insights.

</div>

#### How do we collect data?

<div style="text-align: justify">

Depending on the type of data, there are several ways to collect it. For example, in business, a company can run a survey to collect its customers' reviews. Medical data can be recorded in hospitals by doctors and nurses. The most important thing is that the collected data must be large enough and must be collected randomly. Otherwise, we cannot expect a statistical model to give reliable results.

#### Census and Sample

- A census is a study of every unit, everyone or everything, in a population.
- A sample is a subset of units in a population, selected to represent all units in a population of interest. In statistics, the sample is picked randomly from a large set of data.

</div>

#### Unstructured and Structured data

<div style="text-align: justify">

Structured data is data that is well organized. More precisely, this type of data can be placed neatly in a database (a table, a spreadsheet, ...) with specific names and numerical values. Otherwise, it is unstructured data.

</div>

### What is Data science?

<div style="text-align: justify">

Roughly speaking, data science is a field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured or unstructured data. Many factors have contributed to the rise of data science:

- The merging of computing, communications, and e-commerce.
- The enhanced ability of modern technologies to observe, collect and store data in many fields, such as the natural sciences.
- The rise of social networks.

and many more. With data science, we can now extract usable information from the massive data arising in applications.
Depending on the role, a person working in data science may hold different titles and perform different tasks:

- Data analyst
  Analyzes data to support better decisions. Their tasks are to clean data, visualize data, build summary dashboards and present them to customers.
- Data engineer
  Develops, constructs, tests and maintains the data processing model.
- Data scientist
  The task of a data scientist (D.S) is similar to that of a data engineer (D.E); however, it requires deeper knowledge and understanding of the field. A D.S usually works on more complex problems.

Since their tasks are different, their skill-set requirements are also different.

</div>

### What is Machine Learning?

<div style="text-align: justify">

Let us start this section with an example: suppose that we are given a dataset of 100000 images of handwritten digits. Each image has a label showing which digit it is. The goal of a "Machine Learning" model applied to this problem is to **learn** the relationship between the images and their labels, so that, in the future, if we input the 100001st image, the computer can predict which digit it contains with as few errors as possible.

In other words, the process of machine learning is somewhat similar to the way humans gain knowledge in the sciences. Instead of programming the computer explicitly to perform a task, we now **teach** it to learn general principles from observations and to make predictions.

However, statistics can do a similar task if we define Machine Learning like this! In statistics, we also deduce general principles (statistical properties) from given data. The difference is: **Machine Learning focuses mainly on accurate prediction rather than on analyzing the model!**

There are many application areas of machine learning:

- Bioinformatics
- Computer vision
- Economics
- ...

There are different types of Machine Learning models, which will be specified later in an upcoming note.
(Supervised/Unsupervised learning, Reinforcement learning, Deep learning, ...)

</div>

## Tuesday 19.05.2020 (Happy Birthday, HCM.)

**Basic Python Programming and Exercises**

<div style="text-align: justify">

Python has several useful basic built-in functions, such as

```python
print()  # print the value of a variable or a string to the screen
help()   # print the description and usage of a function
```

Try the code

```python
help(print)
```

to see that the function `print` accepts several more optional arguments (`sep`, `end`, `file`, `flush`). In many cases we just leave them out, which means that `print` is executed with their default values.

</div>

#### Conditional statement, For/While Loops:

<div style="text-align: justify">

```python
x = 7
if x > 10:
    print("large")    # runs when the first condition is True
elif x > 5:
    print("medium")   # runs when the first condition fails and this one is True
else:
    print("small")    # runs when no condition is True
```

Indentation in Python refers to the spaces and tabs at the beginning of a statement. Statements with the same indentation belong to the same group. Not paying attention to indentation can lead to errors, for example

```python
x, y = 7, 3
if x > 5:
    if y > 5:
        print("something 1")
    elif y > 2:
        print("something 2")
    else:
        print("something 3")
```

In this case the line `print("something 1")` sits inside the first if-statement, so it runs only when the first condition is also true.

ALWAYS CHECK THE INDENTATION, ESPECIALLY WHEN YOU HAVE A BUNCH OF IF/FOR/WHILE STATEMENTS

```python
for i in range(3):  # for each item in something
    print(i)        # do this
```

```python
n = 3
while n > 0:        # while the condition is still satisfied
    n -= 1          # do this
```

</div>

#### How to define our own function?

<div style="text-align: justify">

```python
def thisismyfunction(input):
    """
    Use a pair of triple quotes to write the details of our function.
    For example: what should be in the input, what the output is, ...
    These lines will be shown when one uses the function help:
    help(thisismyfunction)
    """
    output = input  # do something with the input here
    return output   # <-- the output of our function
```

If the function has several inputs and we want some of them to have a default value, here is the way:

```python
def function_with_default_input(input1, input2=10):
    output = input1 + input2  # do something here
    return output
```

Then when we execute

```python
output = function_with_default_input(1)
```

Python automatically treats the call as

```python
output = function_with_default_input(1, 10)
```

</div>

#### PEMDAS

<div style="text-align: justify">

What does PEMDAS mean?

- P: Parentheses first
- E: Exponents (i.e. powers, square roots, etc.)
- MD: Multiplication and Division (left-to-right)
- AS: Addition and Subtraction (left-to-right)

Python has precedence rules that determine the order in which operations are evaluated in an expression. For example, `and` has a higher precedence than `or`. See https://docs.python.org/3/reference/expressions.html#operator-precedence

**Always use parentheses liberally to avoid bugs!!!**

Consider the following exercise (Exercise 3) to see how important liberal parentheses are:

In the main lesson we talked about deciding whether we're prepared for the weather. I said that I'm safe from today's weather if...

- I have an umbrella...
- or if the rain isn't too heavy and I have a hood...
- otherwise, I'm still fine unless it's raining *and* it's a workday

The function below uses our first attempt at turning this logic into a Python expression. I claimed that there was a bug in that code. Can you find it? To prove that `prepared_for_weather` is buggy, come up with a set of inputs where it returns the wrong answer.

```python
def prepared_for_weather(have_umbrella, rain_level, have_hood, is_workday):
    # Don't change this code. Our goal is just to find the bug, not fix it!
    return have_umbrella or (rain_level < 5 and have_hood) or not (rain_level > 0 and is_workday)

# Change the values of these inputs so they represent a case where prepared_for_weather
# returns the wrong answer.
have_umbrella = True
rain_level = 0.0
have_hood = True
is_workday = True

# Check what the function returns given the current values of the variables above
actual = prepared_for_weather(have_umbrella, rain_level, have_hood, is_workday)
print(actual)
```

</div>

#### Methods

<div style="text-align: justify">

A function attached to an object is called a method. (Non-function things attached to an object, such as `imag` on a complex number, are called attributes.) For example, some list methods are

```python
mylist = [1, 2]
mylist.extend([3, 4])  # add every item of another list
mylist.append(5)       # add a single item at the end
mylist.index(3)        # position of the first occurrence of 3
```

</div>

### Tuple vs List? What are their differences and what to remember?

<div style="text-align: justify">

- Tuples are slightly faster than lists. If you're defining a constant set of values and all you're ever going to do with it is iterate through it, use a tuple instead of a list.
- It makes your code safer if you “write-protect” data that does not need to be changed. Using a tuple instead of a list is like having an implied assert statement that this data is constant, and that special thought (and a specific function) is required to override that.
- Some tuples can be used as dictionary keys (specifically, tuples that contain immutable values like strings, numbers, and other tuples). Lists can never be used as dictionary keys, because lists are mutable.

</div>

### List comprehension - The nice way to write one-line code!!!
<div style="text-align: justify">

This code

```python
mylist = ["1", "2", "3"]
for i in range(len(mylist)):
    mylist[i] = int(mylist[i])
```

is equivalent to this one-line code

```python
mylist = [int(x) for x in mylist]
```

One can also add an if-statement to it

```python
mylist = [int(x) for x in mylist if x != ""]  # keep only the items that satisfy the condition
```

</div>

### Remarks from exercise:

<div style="text-align: justify">

`int(True)` returns 1 and `int(False)` returns 0. One can make use of this, for example in the following exercise (Ex. 6)

```python
def exactly_one_topping(ketchup, mustard, onion):
    """Return whether the customer wants exactly one of the three available toppings
    on their hot dog.
    """
    # Since int(True) returns 1 and int(False) returns 0, if the customer
    # wants exactly one topping, the sum of int(ketchup), int(mustard),
    # and int(onion) must be 1
    return (int(ketchup) + int(mustard) + int(onion)) == 1

# test
exactly_one_topping(True, False, False)
```

</div>

## Wednesday 20.05.2020

**Notes on Exercise from Tue and the repl.it exercises.**

#### The function enumerate() in python

<div style="text-align: justify">

Let's consider the following exercise

**Searching a Word**

A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word. Complete the function below to help her filter her list of articles. Your function should meet the following criteria

- Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.”
- She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”
- Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.
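
Before writing the full function, it can help to try the three criteria above on a single sentence. The sketch below is only an illustration: the helper name `normalize_words` and the sample sentence are assumptions made for this note, not part of the exercise.

```python
def normalize_words(document):
    """Split a document into lowercase words without surrounding periods or commas."""
    # split() cuts on whitespace, strip('.,') removes punctuation at both ends,
    # lower() makes the comparison case-insensitive
    return [word.strip('.,').lower() for word in document.split()]


tokens = normalize_words("It is Closed, but the letter was enclosed.")
print(tokens)

# Whole-word matching: "closed" is found as its own word,
# while "enclosed" alone would not count as a match for "closed".
print('closed' in tokens)
print('closed' in normalize_words("Enclosed."))
```

Comparing whole words from a list, instead of searching for a substring in the raw text, is what keeps “enclosed” from matching “closed”.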

*HINT*: Some methods that may be useful here: `str.split()`, `str.strip()`, `str.lower()`

```python
def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword.
    Returns list of the index values into the original list for all documents
    containing the keyword.

    Example:
    doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
    >>> word_search(doc_list, 'casino')
    [0]
    """
    output = []
    i = 0
    for document in doc_list:
        # remove commas and periods, then split on blank spaces
        for word in document.replace(',', '').replace('.', '').split(' '):
            if word.lower() == keyword.lower():
                output.append(i)
                break
        i = i + 1
    return output

doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
word_search(doc_list, 'car')
```

In my function, I have to use the expression

```python
i = i + 1
```

in each iteration to keep track of the index of the current document. There is a better way to do this, which is to use the function enumerate()

```python
def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword.
    Returns list of the index values into the original list for all documents
    containing the keyword.

    Example:
    doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
    >>> word_search(doc_list, 'casino')
    [0]
    """
    output = []
    for i, document in enumerate(doc_list):
        # remove commas and periods, then split on blank spaces
        for word in document.replace(',', '').replace('.', '').split(' '):
            if word.lower() == keyword.lower():
                output.append(i)
                break
    return output

doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
word_search(doc_list, 'car')
```

</div>

## Thursday 21.05.2020

**Git/GitHub and the exciting Murder Mystery game in the command line**

### Some commands to keep in mind:

<div style="text-align: justify">

```mkdir dir1 dir2 dir3``` creates three directories

```rmdir dir1 dir2 dir3``` removes the empty directories

(Note: ```dir``` is the directory. If you give only a name without a path, the directory is created in the current location.)

```rm -R dir``` deletes a non-empty directory

```cp file1 file2``` copies a file; use ```-i``` (interactive mode) to ask the user before overwriting an existing file.

```mv file1 file2``` moves (or renames) a file

```touch file_name``` creates an empty file

</div>

### Commands for inspecting a file/folder

<div style="text-align: justify">

```cat file_name``` views a file's contents or concatenates several files

```less large_file``` displays one page at a time; space bar to page down; q to quit (with several files: :n to move to the next file, :p to go back, :q to quit)

```head -n 3 file_name``` prints the first 3 lines of the file

```tail -n 3 file_name``` prints the last 3 lines of the file

```shuf -n 3 file_name``` prints 3 random lines of the file

```wc file_name``` counts the number of lines, words, and characters in the file

```column -s"," -t example_data.csv``` displays a comma-separated file as an aligned table (be careful when using column on very large files)

```sort file_name``` sorts the content (-r to reverse, -u to get rid of duplicates)

</div>

### Git/Github

<div style="text-align: justify">

- Github, in a nutshell, is a way to share files with other people collaboratively without overwriting other people's work. Github is a sort of community, and it hosts a lot of open-source projects, which means the project's code is available to look at. Github is a web-based platform, and it is based on a technology called Git.
- Draw a diagram to have an overview of the usage of Github. Starting from your **local** computer, when you work on a project with a team, you first add your changes to the **staging** area, where they wait to be included in the next saved version (commit). Once committed, the new version can be pushed to the shared GitHub repository, the **remote**, where everyone else in your team can see it.
- Always remember to pull the newest version of the project to avoid conflicts with your teammates. (Merge conflicts: let's say two developers work on the same line of a file, and both of them push the code up to Github. Git is not able to know what to do in that situation, because it doesn't know which version of the code to believe.
So you have to resolve the merge conflict by explicitly telling Git which version of the code you want to use.)
- There can be several branches in your **local** repository and in the **remote**. Make sure you know which branch you are working on. When you work with a team, treat the branch named **master** carefully.

</div>

#### Git commands

<div style="text-align: justify">

```git clone <url>```: makes a copy of a repository and stores it on your computer.

```git init```: turns the current folder into a git repository

```git add```: adds a file to the "staging area", which tells git to include the file the next time it saves a version of the repo. Use ```git add .``` to add all changed files.

```git status```: a helpful command to see what's currently going on in the git repo

```git commit```: saves a version of the repository.

```git push -u origin master```: pushes the changes I made on my computer up to the Github repo

```git pull```: the opposite of git push; pulls the most recent version of the online repository.

```git log```: shows all the versions that you have saved. The commit message is a very good indicator or reminder of what was changed in a particular version, so make sure you use good commit messages.

```git reset --hard <commit>```: reverts code back to a previous commit

```git reset --hard origin/master```: reverts code back to the remote repository version

```git branch```: shows all branches of code

```git branch <branch_name>```: creates a new branch

```git checkout <branch_name>```: switches to a branch

```git merge <branch_name>```: merges the branch into the current branch

```git config --global alias.sla 'log --oneline --decorate --graph --all'```: defines the alias ```git sla```, which makes git log a bit prettier.
</div>

### The Murder Mystery - My solution

<div style="text-align: justify">

First we navigate to the main folder, in this case it is

```
cd /home/hieunguyen/mm/clmm/
```

We navigate to the folder "mystery" and collect the clues by searching for the keyword CLUE in the crimescene file

```
cd ./mystery
grep CLUE crimescene
```

The result is:

```
CLUE: Footage from an ATM security camera is blurry but shows that the perpetrator is a tall male, at least 6'.
CLUE: Found a wallet believed to belong to the killer: no ID, just loose change, and membership cards for AAA, Delta SkyMiles, the local library, and the Museum of Bash History. The cards are totally untraceable and have no name, for some reason.
CLUE: Questioned the barista at the local coffee shop. He said a woman left right before they heard the shots. The name on her latte was Annabel, she had blond spiky hair and a New Zealand accent.
```

Thanks to the last clue, we continue by searching for information about the witness Annabel. Hence, we search for Annabel in "people"

```
grep Annabel people
```

The result is:

```
Annabel Sun F 26 Hart Place, line 40
Oluwasegun Annabel M 37 Mattapan Street, line 173
Annabel Church F 38 Buckingham Place, line 179.
Annabel Fuglsang M 40 Haley Street, line 176
```

Notice that only two of the "Annabel" entries are female. We check each of them by reading the information in the folder "streets". We take the information from the corresponding lines: line 40 of Hart_Place (Annabel Sun) and line 179 of Buckingham_Place (Annabel Church).

```
cd ./streets
head -40 Hart_Place | tail -1
```

The result is

```
SEE INTERVIEW #47246024
```

That means we have to look at interview no. 47246024 to get the information

```
cd ..
cd ./interviews
cat interview-47246024
```

The result is: "Ms. Sun has brown hair and is not from New Zealand. Not the witness from the cafe." It follows that Ms. Sun is not the one we are looking for. She is not the witness. Now we try the second person:

```
cd ..
cd ./streets
head -179 Buckingham_Place | tail -1
```

The result now is: SEE INTERVIEW #699607. We then look at this interview in the folder "interviews"

```
cd ..
cd ./interviews
cat interview-699607
```

We get more information about the murderer from this result: "Interviewed Ms. Church at 2:04 pm. Witness stated that she did not see anyone she could identify as the shooter, that she ran away as soon as the shots were fired. However, she reports seeing the car that fled the scene. Describes it as a blue Honda, with a license plate that starts with "L337" and ends with "9""

This leads us to search for a blue Honda with a license plate starting with L337. To do this we take advantage of the grep command

```
cd ..
grep -B 4 "6'" vehicles | grep -A 4 L337| grep -A 3 Honda| grep -A1 Blue > result1
```

This command line expresses the condition: blue Honda vehicles with a license plate starting with L337 whose owner is at least 6' tall. (Since all license plates starting with L337 end with 9, we don't have to worry about that condition.) This reduces the list to the following names

```
Color: Blue
Owner: Erika Owens
--
Color: Blue
Owner: Joe Germuska
--
Color: Blue
Owner: Jeremy Bowers
--
Color: Blue
Owner: Jacqui Maher
```

The next thing we do is clean this result, which was exported to the text file "result1"

```
sed 's/Color: Blue//g' result1|sed 's/Owner: //g'| sed 's/--//g' > result1_clean
```

Now we have a short list of suspects. We are going to use the last piece of information, in "memberships". The wallet clue suggests that the murderer is a member of all 4 of the following: AAA, Delta_SkyMiles, Terminal_City_Library, Museum_of_Bash_History.
So we compare these 4 lists to get the names common to all of them

```
cd ./memberships
comm -12 <(sort AAA) <(sort Delta_SkyMiles) > test1
comm -12 <(sort Terminal_City_Library) <(sort Museum_of_Bash_History) > test2
comm -12 <(sort test1) <(sort test2) > result2
```

Now we compare the file result2 (containing everyone who is a member of all 4 groups) with the file result1_clean (containing everyone who has a blue Honda with an L337 license plate and is at least 6' tall)

```
mv result2 /home/hieunguyen/mm/clmm/mystery
cd ..
comm -12 <(sort result1_clean) <(sort result2)
```

The final list of suspects is

```
Jacqui Maher
Jeremy Bowers
```

The final thing to do is to check the information about these people in the "people" file

```
grep "Jacqui Maher" people
grep "Jeremy Bowers" people
```

Then we check their information in streets to get the interview codes. We finally read the interviews and find out that Jacqui has an alibi while Jeremy doesn't. Moreover, the police consider Jeremy a suspect. We conclude that Jeremy Bowers is the murderer!

```
cd ./streets
head -284 Dunstable_Road | tail -1
cd ..
cd ./interviews
cat interview-9620713
```

Result for Jeremy

```
Home appears to be empty, no answer at the door. After questioning neighbors, appears that the occupant may have left for a trip recently. Considered a suspect until proven otherwise, but would have to eliminate other suspects to confirm.
```

```
cd ..
cd ./streets
head -224 Andover_Road | tail -1
cd ..
cd ./interviews
cat interview-904020
```

Result for Jacqui

```
Maher is not considered a suspect. Video evidence confirms that she was away at a professional soccer game on the morning in question, even though it was a workday.
```

Another way to exclude Jacqui is that she is a woman, while the footage shows a male perpetrator! The murderer is Jeremy Bowers.

</div>

## Friday 22.05.2020

**Web Scraping and the weekly project**

<div style="text-align: justify">

In this section, we write the explanations and code for the weekly project: Tiki Web Scraper.

Weekly Project - Tiki Web Scraper (Hieu - Minh)

In this project, we build a function to crawl data from www.tiki.vn

The first thing we need to do is to send a request to the website to fetch the data. The Python library requests will be used.

The second step is to make use of the library BeautifulSoup. We need this library to simplify the raw information we receive from tiki.vn. By passing the information through a parser, we can then easily find and extract the parts we need.

In general, the steps can be summarized as follows:

1. **GET** (send a request to the website we want to crawl)
2. The information obtained from the website is raw text in HTML (if we get a response; otherwise we need to use another method)
3. **BS4** The BeautifulSoup library, a parser. It adds a data structure to the raw text and allows us to manipulate the content easily.

The following function, named ```get_url```, executes these steps. The output is tiki, which contains all the information we want.

```python
import requests
from bs4 import BeautifulSoup

def get_url(url):
    """Get parsed HTML from url
    Input: url to the webpage
    Output: Parsed HTML text of the webpage
    """
    # Send GET request
    r = requests.get(url)

    # Parse HTML text
    tiki = BeautifulSoup(r.text, 'html.parser')

    return tiki
```

- After printing the object ``` tiki ``` and inspecting the website, we can see that the class of every product listed on the page is ``` product-item ```. This suggests a way to find all the products. (Remember the method ``` _.find_all()``` and the structure of its inputs)

```
products = tiki.find_all('div', {'class':'product-item'})
```

Now ``` products ``` contains the information of all products.

- Looping through this object,

```
for product in products:
```

note that there are several child tags under the tag ``` div ``` (which we already searched for in the previous line). We need to go to those child tags. For example, the link to the product is on the tag ``` <a> ```, attribute ``` href ```

```
product.a['href']
```

and the link to the product's image is on the tag ``` <img>```, attribute ``` src ```

```
product.img['src']
```

- We also look for some more information such as: Discount, Original Price, Discount Code, Price after using discount code, Installment Price and Reviews. This information is stored under the tag ``` span ```, so we need to search again:

```
product.find('span',{'class':'sale-tag sale-tag-square'}).text
```

We also use the attribute ``` _.text``` to extract the text. For the price, we use the method ``` _.strip() ``` to remove the currency sign and ``` _.replace() ``` to remove the dots that separate thousands. Then we use ``` int() ``` to make it an integer. For instance, with the original price

```
int(product.find('span',{'class':'price-regular'}).text.strip('đ').replace('.',''))
```

- Note that we define a dictionary

```
d = {'Product id':'','Product Name':'','Brand':'','Category':'','Price':'','Url':'','Image':'','Discount':'','Original Price':'','Discount Code':'', 'Price after code':'','Installment Price':'','Reviews':''}
```

at the beginning of the loop to store all the information of one product. The final output is ``` data ```

```python
def scrape_tiki(url="https://tiki.vn/dien-thoai-may-tinh-bang/c1789?src=c.1789.hamburger_menu_fly_out_banner&_lc=Vk4wMzkwMTkwMDc="):
    """Scrape the product page of tiki
    Input: url to the webpage.
    Default: https://tiki.vn/dien-thoai-may-tinh-bang/c1789?src=c.1789.hamburger_menu_fly_out_banner&_lc=Vk4wMzkwMTkwMDc=
    Output: A list containing scraped data of all articles
    """
    # Get parsed HTML
    tiki = get_url(url)

    # Find all product tags
    products = tiki.find_all('div', {'class':'product-item'})

    # List containing data of all articles
    data = []

    # Extract information of each product
    for product in products:
        # Each product is a dictionary containing the required information
        d = {'Product id':'','Product Name':'','Brand':'','Category':'','Price':'','Url':'','Image':'','Discount':'','Original Price':'','Discount Code':'', 'Price after code':'','Installment Price':'','Reviews':''}

        try:
            d['Product id'] = product['data-id']
            d['Product Name'] = product['data-title']
            d['Brand'] = product['data-brand']
            d['Category'] = product['data-category']
            d['Price'] = int(product['data-price'])
            d['Url'] = product.a['href']
            d['Image'] = product.img['src']
            d['Discount'] = product.find('span',{'class':'sale-tag sale-tag-square'}).text
            d['Original Price'] = int(product.find('span',{'class':'price-regular'}).text.strip('đ').replace('.',''))
            d['Discount Code'] = product.find('span',{'class':'code'}).text
            d['Price after code'] = int(product.find('span',{'class':'price'}).span.text.strip('đ').replace('.',''))
            d['Installment Price'] = product.find('span',{'class':'installment-price-v2'}).text
            d['Reviews'] = product.find('p',{'class':'review review-wrap'}).text
        except (KeyError, AttributeError, TypeError, ValueError):
            # A field is missing for this product: keep the fields filled in so far
            pass

        # The product is stored whether or not every field was found
        data.append(d)

    return data
```

The function scrape_tiki is complete. We now use it with the following link from www.tiki.vn

To turn our output data into a ```spreadsheet```, we use the library pandas and the function ```DataFrame```.

```python
import pandas as pd

data = scrape_tiki('https://tiki.vn/the-thao-da-ngoai/c1975?src=c.1975.hamburger_menu_fly_out_banner')

# The column names are the keys of the first product dictionary
pd.DataFrame(data=data, columns=data[0].keys())
```

Link to Google Colab notebook: https://colab.research.google.com/drive/1yD7dreb7ytKQQzS-HIINcY-SKwWmPP_E?usp=sharing

</div>
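
As a complement to the scraper, the price cleaning and the export to a table can be tried offline, without sending any request. The sketch below is only an illustration under stated assumptions: the helper names `parse_price` and `rows_to_csv` and the sample rows are hypothetical and not part of the project, and the standard `csv` module is used here in place of pandas.

```python
import csv
import io


def parse_price(text):
    """Convert a price string such as '7.990.000đ' into an integer."""
    # Remove surrounding spaces and the currency sign, then the dots between thousands
    cleaned = text.strip().strip('đ').strip().replace('.', '')
    return int(cleaned)


def rows_to_csv(rows):
    """Serialize a list of dictionaries that share the same keys into CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows, shaped like the dictionaries built for each product
sample = [
    {'Product Name': 'Phone A', 'Price': parse_price('7.990.000đ')},
    {'Product Name': 'Phone B', 'Price': parse_price(' 12.490.000 đ ')},
]
print(rows_to_csv(sample))
```

Keeping the price parsing in its own small function makes it easy to test on a few strings before running the scraper on a live page.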