---
tags: FTMLE-Philipines-2020
---

# Week 3

## Monday 1.6.2020

### Some questions?

Google Cloud Platform, BigQuery. What is **feature engineering**? Why do we have to visualize data?

### Descriptive Statistics

<div style="text-align: justify">

- Qualitative data vs Quantitative data
- Discrete data vs Continuous data
- Sample vs Population

Three types of mean: weighted mean, geometric mean, harmonic mean. When to use which? Mean, Median, or Mode?

#### Population variance and Sample variance? Why are they different?

A short code sketch computing these quantities follows below.

</div>
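To keep these definitions handy, here is a minimal Python sketch (assuming ```numpy``` and ```scipy``` are installed; the data and weights are made up for illustration) that computes the three kinds of mean and contrasts population variance with sample variance, which divides by $n-1$ instead of $n$ (Bessel's correction):

```python
import numpy as np
from scipy import stats

# Made-up sample for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
w = np.array([1, 1, 2, 2, 1, 1, 1, 1])  # hypothetical weights

# Three types of mean
weighted_mean = np.average(x, weights=w)  # when observations carry different importance
geometric_mean = stats.gmean(x)           # for multiplicative quantities, e.g. growth rates
harmonic_mean = stats.hmean(x)            # for rates/ratios, e.g. average speed

# Population variance (divide by n) vs sample variance (divide by n - 1)
pop_var = np.var(x)             # ddof=0 by default
sample_var = np.var(x, ddof=1)  # Bessel's correction: unbiased estimate from a sample

print(weighted_mean, geometric_mean, harmonic_mean)
print(pop_var, sample_var)
```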
## Tuesday 2.6.2020

**ADVANCED SQL**

<div style="text-align: justify">

- We use a database to store data and SQL to communicate with the database.
- There can be many tables in a database, and they are connected. This is called a relational database.
- There are other types of databases, for example NoSQL.
- If the data does not fit a fixed schema, many missing values will appear when we store it with SQL. For such data it is better to use a NoSQL database. Items in a NoSQL database look like dictionaries.
- NoSQL is suitable for hierarchical data storage, but we cannot define a hierarchy in SQL.
- NoSQL is much more flexible than SQL.
- Scaling: vertically vs horizontally. With NoSQL we typically do not make a single database bigger; instead we add more database nodes. That is horizontal scaling. Vertical scaling means making one database server bigger.
- Give some examples of applications in which we should use SQL. Which type of database should be used for a specific type of data?

</div>

### Order of execution in SQL

<div style="text-align: justify">

Recall the order of execution with more details.

Use sub-queries: a sub-query is executed before the main query. A sub-query is a SQL query nested inside a bigger query. Consider the following example, where we write a sub-query:

```SQL
WHERE i.Total > (
    SELECT AVG(Total)
    FROM invoices
)
```

```python
# List all the customers who have at least one invoice with value higher than the average
query = '''
SELECT DISTINCT c.CustomerId
FROM invoices AS i
JOIN customers AS c
ON i.CustomerId = c.CustomerId
WHERE i.Total > (
    SELECT AVG(Total)
    FROM invoices
)
'''
pd.read_sql_query(query, conn)
```

Notice that the order of execution requires us to use a sub-query!

There are two types of sub-query:

- Correlated sub-query: it refers to columns of the outer query, so it is re-evaluated for every row of the outer query.
- Uncorrelated sub-query: it does not depend on the outer query and can be executed separately from the main query. It still returns a result on its own.

Example of a correlated sub-query:

```python=
# List all the tracks that have a file size greater than or equal to
# the average file size of their corresponding genre
query = '''
SELECT *
FROM tracks AS t
JOIN genres AS g
ON t.GenreId = g.GenreId
WHERE Bytes >= (
    SELECT AVG(Bytes)
    FROM tracks
    WHERE GenreId = t.GenreId
)
'''
pd.read_sql_query(query, conn)
```

Explain this example! In the inner query's condition,

```SQL
WHERE GenreId = t.GenreId
```

the ```t.GenreId``` refers to the current row of the whole joined table above, not only to ```tracks``` anymore.

The complexity of the correlated sub-query is $O(n^2)$, while the uncorrelated version is $O(2n) = O(n)$.

Another way to do that is:

```python=
query = '''
SELECT *
FROM tracks AS t
JOIN (SELECT GenreId, AVG(Bytes) AS AvgBytes
      FROM tracks
      GROUP BY GenreId) AS g
ON t.GenreId = g.GenreId
WHERE t.Bytes > g.AvgBytes
'''
pd.read_sql_query(query, conn)
```

This one uses an uncorrelated sub-query, so the complexity is smaller. We use ```GROUP BY``` to compute all the average values first, then ```JOIN``` and compare the values in the final table.

</div>

### Common table expression (CTE)

<div style="text-align: justify">

A CTE is a temporary result set that can be used by other queries. It helps organize complex queries.

```SQL
WITH temporaryTable (averageValue) AS (
    SELECT AVG(Attr1)
    FROM Table
)
SELECT Attr1
FROM Table, temporaryTable
WHERE Table.Attr1 > temporaryTable.averageValue;
```

(This is what we have done in the previous example.)

```python
query = '''
WITH avg_genres (GenreId, AvgBytes) AS (
    SELECT GenreId, AVG(Bytes) AS AvgBytes
    FROM tracks
    GROUP BY GenreId
)
SELECT *
FROM tracks AS t
JOIN avg_genres AS g
ON t.GenreId = g.GenreId
WHERE t.Bytes > g.AvgBytes
'''
pd.read_sql_query(query, conn)
```

Find the average unit price by album of the above tracks:

```python=
# Find the average unit price by album of the above tracks
query = '''
WITH avg_genres (GenreId, AvgBytes) AS (
    SELECT GenreId, AVG(Bytes) AS AvgBytes
    FROM tracks
    GROUP BY GenreId
)
SELECT AlbumId, AVG(UnitPrice)
FROM tracks AS t
JOIN avg_genres AS g
ON t.GenreId = g.GenreId
WHERE t.Bytes > g.AvgBytes
GROUP BY AlbumId
'''
pd.read_sql_query(query, conn)
```

In the final ```GROUP BY``` clause, we can replace the name ```AlbumId``` by the number ```1```, which refers to the first column of the SELECT list: ```GROUP BY 1```. We can also ```GROUP BY``` two columns.

</div>

### Set Operators

<div style="text-align: justify">

- Set operators allow the results of multiple queries to be combined together (see the sketch below).
- We need to match the number of columns in the combined queries.
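To make the set operators concrete, here is a minimal sketch in sqlite3-style SQL (the table names ```table_a```, ```table_b``` and the column ```id``` are hypothetical). Note that BigQuery requires the explicit forms ```UNION DISTINCT``` / ```INTERSECT DISTINCT``` / ```EXCEPT DISTINCT```, as in exercise 4 below.

```SQL
-- Rows appearing in either query (duplicates removed)
SELECT id FROM table_a
UNION
SELECT id FROM table_b;

-- Rows appearing in both queries
SELECT id FROM table_a
INTERSECT
SELECT id FROM table_b;

-- Rows of the first query that do not appear in the second
SELECT id FROM table_a
EXCEPT
SELECT id FROM table_b;
```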
In the following, we will walk through some exercises/examples on advanced SQL queries. In each example, we will try to explain the idea/solution before writing the SQL command. The first important thing is: **ORDER OF EXECUTION**

</div>

## Wednesday 3.6.2020

<div style="text-align: justify">

### BigQuery

Google Cloud Platform: an online database service. We work with a dataset from a Brazilian e-commerce company. Here are some exercises to remember the commands of SQL.

1. Query the top 10 category_name_english having the most number of products on this e-commerce site.

```SQL
-- Paste your query here
SELECT c.category_name_english, COUNT(c.category_name) AS Number_of_Products
FROM e_commerce.products AS p
JOIN e_commerce.category_name_translation AS c
ON p.product_category_name = c.category_name
GROUP BY c.category_name_english
ORDER BY Number_of_Products DESC
LIMIT 10
```

2. Query the distribution (count) of late and on-time orders delivered to customers.

```SQL
-- Paste your query here
WITH main1 AS (
    SELECT *
    FROM e_commerce.orders
    WHERE order_delivered_customer_date IS NOT NULL
),
main2 AS (
    SELECT *,
        (CASE
            WHEN main1.order_delivered_customer_date <= main1.order_estimated_delivery_date THEN "ONTIME"
            ELSE "LATE"
        END) AS Status
    FROM main1
)
SELECT Status, COUNT(Status) AS Count
FROM main2
GROUP BY Status
```

3. Query the top 5 seller_id by the total value of orders delivered to customers in the first quarter of 2017.

```SQL
-- Paste your query here - THIS IS WRONG!!!!
WITH main1 AS (
    SELECT *
    FROM e_commerce.orders AS o
    WHERE o.order_delivered_customer_date > "2017-01-01 00:00:00 UTC"
      AND o.order_delivered_customer_date < "2017-03-31 00:00:00 UTC"
),
main2 AS (
    SELECT order_items.order_id, SUM(order_items.price + order_items.freight_value) AS Total
    FROM main1
    JOIN e_commerce.order_items AS order_items
    ON order_items.order_id = main1.order_id
    GROUP BY order_items.order_id
)
SELECT order_items.seller_id, main2.Total
FROM e_commerce.order_items AS order_items
JOIN main2
ON main2.order_id = order_items.order_id
GROUP BY order_items.seller_id
ORDER BY Total DESC
LIMIT 5
```

Solution:

```SQL
-- Paste your query here
WITH main1 AS (
    SELECT *
    FROM e_commerce.orders AS o
    WHERE o.order_delivered_customer_date > "2017-01-01 00:00:00 UTC"
      AND o.order_delivered_customer_date < "2017-03-31 00:00:00 UTC"
),
main2 AS (
    SELECT order_items.seller_id, SUM(order_items.price + order_items.freight_value) AS Total
    FROM main1
    JOIN e_commerce.order_items AS order_items
    ON order_items.order_id = main1.order_id
    GROUP BY order_items.seller_id
)
SELECT *
FROM main2
ORDER BY Total DESC
LIMIT 5
```

4. Query all order_id that have the following exact payment sequence: credit_card - voucher - voucher.

```sql
-- Paste your query here
WITH main1 AS (
    SELECT *
    FROM e_commerce.order_payments AS op
    WHERE op.payment_sequential = 1 AND op.payment_type = "credit_card"
),
main2 AS (
    SELECT *
    FROM e_commerce.order_payments AS op
    WHERE op.payment_sequential = 2 AND op.payment_type = "voucher"
),
main3 AS (
    SELECT *
    FROM e_commerce.order_payments AS op
    WHERE op.payment_sequential = 3 AND op.payment_type = "voucher"
)
SELECT order_id FROM main1
INTERSECT DISTINCT
SELECT order_id FROM main2
INTERSECT DISTINCT
SELECT order_id FROM main3
```

Note that the syntax in BigQuery is slightly different from what we have used in sqlite3.

</div>

## Thursday 4.6.2020

<div style="text-align: justify">

In this session, we will work with the ```pandas``` library in Python, which is a very powerful and popular library for data manipulation and analysis. Here are some remarks (a short sketch of ```loc``` vs ```iloc``` follows after this list):

- ```loc``` will look for the row first! Take the example in which the rows' index is given by integers to see how we need to indicate rows and columns to access elements in the dataframe.
- If we do not specify and just use ```[]```, pandas will try ```loc``` first and then ```iloc```; ```loc``` is the default.
- ```loc``` uses the name (label) index, while ```iloc``` uses the position index.
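To make the difference concrete, here is a minimal sketch (the small DataFrame ```df``` and its index values are made up for illustration):

```python
import pandas as pd

# Made-up DataFrame with a non-default integer index to show the difference
df = pd.DataFrame({'Name': ['Anna', 'Ben', 'Chris'], 'Age': [22, 35, 41]},
                  index=[10, 20, 30])

df.loc[10]         # label-based: the row whose index label is 10
df.iloc[0]         # position-based: the first row (the same row here)

df.loc[10, 'Age']  # row label 10, column 'Age'
df.iloc[0, 1]      # first row, second column

df.loc[10:20]      # label slicing includes both ends -> labels 10 and 20
df.iloc[0:2]       # position slicing excludes the end -> positions 0 and 1
```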
In the following, we note some exercises that combine several techniques from ```pandas```.

### Use the method ```contains``` to look for names

```python
# Find the customers who have the name James or Rose
# Your code here
df1 = titanic[titanic['Name'].str.contains('James')]
df2 = titanic[titanic['Name'].str.contains('Rose')]
result2 = pd.concat([df1, df2])
result2
```

</div>

### Define a function and use the method ```apply``` to apply it to the dataframe columns/rows

<div style="text-align: justify">

```python
# Analyze the Cabin: which one is for an individual, which one is for a group (or family)?
# Your code here
def checkcabin(x):
    if x == 1:
        return 'Individual'
    else:
        return 'Group'

cabin = titanic.groupby('Cabin').count()
cabin['Type'] = cabin['PassengerId'].apply(checkcabin)
cabin
```

and analyze the data:

```python
# Had women a better chance to survive than men?
# Your code here
df1 = titanic.groupby('Sex').Survived.sum()    # number of survivors per sex
df2 = titanic.groupby('Sex').Survived.count()  # number of passengers per sex
df1 / df2                                      # survival rate per sex
```

Classify the data's range and add one more column, for instance:

```python
# Write a function that takes in an age and returns the age range defined as:
# Infants: < 1, Children: < 10, Teens: < 18, Adults: < 40, Middle Age: < 60, Elders: >= 60
# Create a new column called AgeRange with the function
# How was the chance of survival in different Age Ranges?
# Your code here
def agerange(x):
    if x < 1:
        return 'Infants'
    elif x < 10:
        return 'Children'
    elif x < 18:
        return 'Teens'
    elif x < 40:
        return 'Adults'
    elif x < 60:
        return 'Middle Age'
    else:
        return 'Elders'

titanic['Age Range'] = titanic['Age'].apply(agerange)
titanic.groupby('Age Range').Survived.count()
```

I myself find these two exercises below very exciting and useful:

```python
# Imagine the first time you had this dataset, there was no feature Sex.
# How can you create that feature based on the other features?
# Your code here
def gettitle(x):
    # Names look like "Braund, Mr. Owen Harris" -> extract the title between ", " and ". "
    return x.split(', ')[1].split('. ')[0]

titanic['Title'] = titanic['Name'].apply(gettitle)
titanic.sample()

def checkgender(x):
    male_title = ['Mr', 'Master', 'Don', 'Rev', 'Dr', 'Major', 'Sir', 'Col', 'Capt', 'Jonkheer']
    if x in male_title:
        return 'male'
    else:
        return 'female'

titanic['NewGender'] = titanic['Title'].apply(checkgender)
```

```python
# After you finished the previous step, one of your co-workers found data on the gender.
# You are curious about how good your prediction is compared to the true label.
# Write a function to print out the accuracy score.
# Your code here
def check_accu(inputdata):
    if inputdata['Sex'] == inputdata['NewGender']:
        return 1
    else:
        return 0

titanic['Check accuracy'] = titanic[['Sex', 'NewGender']].apply(check_accu, axis=1)
a = titanic['Check accuracy'].sum()
print(a / len(titanic) * 100)
titanic[titanic['Check accuracy'] == 0]
```

</div>

## Friday 5.6.2020

### Data Visualization

For data visualization, we need to remember the following details (a minimal plotting sketch follows at the end of this section):

- For Categorical Univariate Data
  - Count Plot: frequency of the categories in a variable is represented by the height of the bars.
  - Pie Chart: frequency of the categories in a variable is represented by the size of the slices.
- For Numerical Univariate Data
  - Bar Chart: size of the values in a variable is represented by the height of the bars.
  - Line Plot: represents the trend of a variable over an ordinal variable.
  - Histogram: distribution of a variable is represented by the height of the bins.
  - Box Plot: distribution of a variable is represented based on the five-number summary.
- For Multivariate Data Analysis
  Analyzing data using two (or more) variables, used to show relationships between the variables:
  - Scatter Plot: relationship between two variables is represented as individual points.
  - Heat Map: correlation between two (or more) variables is represented by the color degree.

We will use a lot of data visualization in this weekly project. The figures will be added once it is completed.
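As a starting point for the weekly project, here is a minimal plotting sketch, assuming ```matplotlib``` and ```seaborn``` are installed and reusing the Titanic dataframe from Thursday (the column ```Fare``` is assumed to exist, as in the standard Titanic dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Count plot: frequency of each category, shown by bar heights
sns.countplot(x='Sex', data=titanic, ax=axes[0, 0])

# Histogram: distribution of a numerical variable
titanic['Age'].plot(kind='hist', bins=20, ax=axes[0, 1], title='Age distribution')

# Scatter plot: relationship between two numerical variables
axes[1, 0].scatter(titanic['Age'], titanic['Fare'], alpha=0.5)
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Fare')

# Heat map: correlation between numerical variables, shown by color
sns.heatmap(titanic[['Age', 'Fare', 'Survived']].corr(), annot=True, ax=axes[1, 1])

plt.tight_layout()
plt.show()
```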