---
tags: FTMLE-Philipines-2020
---
# Week 3
## Monday 1.6.2020
### Some questions?
Google Cloud Platform, BigQuery.
What is **feature engineering**?
Why do we have to visualize data?
### Descriptive Statistics
<div style="text-align: justify">
- Qualitative data vs Quantitative data
- Discrete data vs Continuous data
- Sample vs Population
Three types of mean: weighted mean, geometric mean, harmonic mean.
When to use which? And when do we use the mean, the median, or the mode?
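A minimal sketch (with hypothetical numbers) of how the three means are computed:
```python
import numpy as np

x = np.array([2.0, 4.0, 8.0])   # hypothetical data
w = np.array([0.2, 0.3, 0.5])   # hypothetical weights summing to 1

weighted_mean = np.sum(w * x)                 # when observations carry different importance
geometric_mean = np.prod(x) ** (1 / len(x))   # for growth rates / multiplicative data
harmonic_mean = len(x) / np.sum(1 / x)        # for rates such as speeds
print(weighted_mean, geometric_mean, harmonic_mean)
```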
#### Population variance and Sample variance?
Why are they different?
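As a small sketch (hypothetical data): the sample variance divides by $n - 1$ instead of $n$ (Bessel's correction), because the sample mean is estimated from the same data and dividing by $n$ would underestimate the true variance.
```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical sample

population_var = np.var(data)         # divides by n
sample_var = np.var(data, ddof=1)     # divides by n - 1 (Bessel's correction)
print(population_var, sample_var)     # 4.0 vs ~4.57
```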
</div>
## Tuesday 2.6.2020
**ADVANCED SQL**
<div style="text-align: justify">
- Use database and SQL to communicate with the database.
- There can be many tables in a database and they are connected. This is called relational database.
- There are other types of databases, for example NoSql
- Some types of data do not fit neatly into fixed columns, so missing values appear if we store them in SQL. For such data it is better to use a NoSQL database: items in a NoSQL database look like dictionaries (see the sketch after this list).
- NoSQL is suitable for hierarchical data storage, while we cannot directly define a hierarchy in SQL.
- NoSQL is much more flexible than SQL.
- Scaling: vertically vs horizontally. Vertical scaling means making a single database server bigger; horizontal scaling means adding more servers. NoSQL databases are typically scaled horizontally.
- Give some examples of applications in which we should use SQL. Which type of database should be used for a specific type of data?
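A minimal sketch (hypothetical record) of what a hierarchical NoSQL document looks like, written here as a Python dictionary; in a SQL table the nested, variable-length `orders` list would have to be split into a separate table or padded with missing values:
```python
# Hypothetical customer document in a document-oriented NoSQL database
customer = {
    "customer_id": 42,
    "name": "Jane Doe",
    "orders": [                          # nested, variable-length data lives inside the document
        {"order_id": 1, "total": 9.99},
        {"order_id": 2, "total": 14.50},
    ],
}
```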
</div>
### Order of execution in SQL
<div style="text-align: justify">
Recall the order of execution in more detail.
When we use a sub-query, the sub-query is executed before the main query.
A sub-query is a SQL query nested inside a bigger query. Consider the following example, where we write the sub-query
```SQL
WHERE i.Total > (SELECT AVG(Total)
                 FROM invoices)
```
```python
# List all the customers who have at least one invoice with a total higher than the average
query = '''
SELECT DISTINCT c.CustomerId
FROM invoices AS i
JOIN customers AS c
ON i.CustomerId = c.CustomerId
WHERE i.Total > (SELECT AVG(Total)
                 FROM invoices)
'''
pd.read_sql_query(query, conn)
```
Notice that the order of execution requires us to use a sub-query here: ```WHERE``` is evaluated row by row before any aggregation, so we cannot call ```AVG``` directly in the ```WHERE``` clause.
There are two types of sub-query:
- Correlated sub-query: it references columns from the outer query, so it cannot run on its own and is conceptually re-evaluated for each row of the main query.
- Uncorrelated sub-query: it does not depend on the main query and can be executed separately; it still returns a result on its own.
Example of a correlated sub-query:
```python=
# List all the tracks that have a file size greater than or equal to the average file size of their genre
query = '''
SELECT *
FROM tracks AS t
JOIN genres AS g
ON t.GenreId = g.GenreId
WHERE t.Bytes >=
    (SELECT AVG(Bytes)
     FROM tracks
     WHERE GenreId = t.GenreId)
'''
pd.read_sql_query(query, conn)
```
Explain this example!!!
In the inner condition
```SQL
WHERE GenreId = t.GenreId
```
the reference ```t.GenreId``` points to the current row of the outer (joined) query, not just to ```tracks```. That is what makes the sub-query correlated: it has to be re-evaluated for every row of the main query.
The complexity of a correlated sub-query is roughly $O(n^2)$, while the uncorrelated version is $O(2n) = O(n)$: the sub-query and the main query each scan the table once.
Another way to do that is:
```python=
query = '''
SELECT *
FROM tracks AS t
JOIN
(SELECT GenreId, AVG(Bytes) AS AvgBytes
FROM tracks
GROUP BY GenreId) AS g
ON t.GenreId = g.GenreId
WHERE t.Bytes > g.AvgBytes
'''
pd.read_sql_query(query, conn)
```
So this one is an uncorrelated sub-query and the complexity is smaller: we use ```GROUP BY``` to compute all the average values once, then ```JOIN``` and compare the values in the final table.
</div>
### Common table expression (CTE)
<div style="text-align: justify">
A CTE is a named temporary result set that we can reference in the query that follows it. It helps organize complex queries.
```SQL
WITH temporaryTable (averageValue) AS (
    SELECT AVG(Attr1)
    FROM Table
)
SELECT Attr1
FROM Table, temporaryTable
WHERE Table.Attr1 > temporaryTable.averageValue;
```
(This is what we have done in the previous example.)
```python
query = '''
WITH avg_genres (GenreId, AvgBytes) AS (
SELECT GenreId, AVG(Bytes) AS AvgBytes
FROM tracks
GROUP BY GenreId
)
SELECT *
FROM tracks AS t
JOIN
avg_genres AS g
ON t.GenreId = g.GenreId
WHERE t.Bytes > g.AvgBytes
'''
pd.read_sql_query(query, conn)
```
Find the average unitprice by album of the above tracks
```python=
# Find the average unitprice by album of the above tracks
query = '''
WITH avg_genres (GenreId, AvgBytes) AS (
SELECT GenreId, AVG(Bytes) AS AvgBytes
FROM tracks
GROUP BY GenreId
)
SELECT AlbumID, AVG(UnitPrice)
FROM tracks AS t
JOIN
avg_genres AS g
ON t.GenreId = g.GenreId
WHERE t.Bytes > g.AvgBytes
GROUP BY AlbumId
'''
pd.read_sql_query(query, conn)
```
In the ```GROUP BY AlbumId``` clause, we can replace the column name by the number ```1```, which refers to the first column of the ```SELECT``` list: ```GROUP BY 1```. We can also ```GROUP BY``` two columns, e.g. ```GROUP BY 1, 2```.
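A short sketch of this positional shorthand on the same tables (an illustration only; the shorthand works here, but ordinal references in ```GROUP BY``` are not supported by every database engine):
```python
# Positional shorthand: group by the first column of the SELECT list
query = '''
SELECT AlbumId, AVG(UnitPrice)
FROM tracks
GROUP BY 1
'''
pd.read_sql_query(query, conn)
```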
</div>
### Set Operators
<div style="text-align: justify">
- Set operators (```UNION```, ```INTERSECT```, ```EXCEPT```) allow the results of multiple queries to be combined together.
- The combined queries need to return the same number of columns (with compatible types); a short sketch follows this list.
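A minimal sketch of a set operator on the same database (assuming the ```customers``` table has a ```Country``` column, as in the Chinook sample database):
```python
# CustomerIds that appear in both results: customers from the USA
# who also have at least one invoice with a total above 10
query = '''
SELECT CustomerId FROM customers WHERE Country = 'USA'
INTERSECT
SELECT CustomerId FROM invoices WHERE Total > 10
'''
pd.read_sql_query(query, conn)
```
Both queries return a single ```CustomerId``` column, so the column counts match.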
In the following, we will walk through some exercises/examples on advanced SQL queries. In each example, we will try to explain the idea/solution before writing the SQL command. The first important thing is: **ORDER OF EXECUTION**
</div>
## Wednesday 3.6.2020
<div style="text-align: justify">
### BigQuery
BigQuery is an online data warehouse on the Google Cloud Platform.
We work with a dataset from a Brazilian e-commerce company.
Here are some exercises to practice the SQL commands:
1. Query the top 10 category_name_english having the most products on this e-commerce site.
```SQL
-- Paste your query here
SELECT c.category_name_english, COUNT(c.category_name) AS Number_of_Products
FROM e_commerce.products AS p
JOIN e_commerce.category_name_translation AS c
ON p.product_category_name = c.category_name
GROUP BY c.category_name_english
ORDER BY Number_of_Products DESC LIMIT 10
```
2. Query the distribution (Count) of late and on-time orders delivered to customers
```SQL
-- Paste your query here
WITH main1 AS (
    SELECT * FROM e_commerce.orders
    WHERE order_delivered_customer_date IS NOT NULL
),
main2 AS (
    SELECT *, (CASE
        WHEN main1.order_delivered_customer_date <= main1.order_estimated_delivery_date THEN "ONTIME"
        ELSE "LATE"
    END) AS Status
    FROM main1
)
SELECT Status, COUNT(Status) AS Count FROM main2
GROUP BY Status
```
3. Query the top 5 seller_id by the total value of orders delivered to customers in the first quarter of 2017.
```SQL
-- Paste your query here - THIS IS WRONG!!!!
WITH main1 AS (SELECT *,
FROM e_commerce.orders as o
WHERE o.order_delivered_customer_date > "2017-01-01 00:00:00 UTC"
AND o.order_delivered_customer_date < "2017-03-31 00:00:00 UTC"),
main2 AS (SELECT order_items.order_id, SUM(order_items.price+order_items.freight_value) as Total FROM main1
JOIN e_commerce.order_items as order_items
ON order_items.order_id = main1.order_id
GROUP BY order_items.order_id
)
SELECT order_items.seller_id, main2.total
FROM e_commerce.order_items as order_items
JOIN main2
ON main2.order_id = order_items.order_id
GROUP BY order_items.seller_id
ORDER BY Total DESC LIMIT 5
```
This attempt is wrong because ```main2``` computes the total per order (an order can contain items from several sellers), and the final ```SELECT``` groups by ```seller_id``` while referencing ```main2.total``` without aggregating it. Instead, we should sum ```price + freight_value``` per seller directly.
Solution:
```SQL
-- Paste your query here
WITH main1 AS (SELECT *
    FROM e_commerce.orders AS o
    -- first quarter of 2017: from January 1st (inclusive) up to April 1st (exclusive)
    WHERE o.order_delivered_customer_date >= "2017-01-01 00:00:00 UTC"
    AND o.order_delivered_customer_date < "2017-04-01 00:00:00 UTC"),
main2 AS (SELECT order_items.seller_id, SUM(order_items.price + order_items.freight_value) AS Total
    FROM main1
    JOIN e_commerce.order_items AS order_items
    ON order_items.order_id = main1.order_id
    GROUP BY order_items.seller_id
)
SELECT * FROM main2
ORDER BY Total DESC LIMIT 5
```
4. Query all order_id that have the following exact payment sequence: credit_card - voucher - voucher
```sql
-- Paste your query here
WITH main1 AS (
    SELECT * FROM e_commerce.order_payments AS op
    WHERE op.payment_sequential = 1 AND op.payment_type = "credit_card"
),
main2 AS (
    SELECT * FROM e_commerce.order_payments AS op
    WHERE op.payment_sequential = 2 AND op.payment_type = "voucher"
),
main3 AS (
    SELECT * FROM e_commerce.order_payments AS op
    WHERE op.payment_sequential = 3 AND op.payment_type = "voucher"
)
SELECT order_id FROM main1
INTERSECT DISTINCT
SELECT order_id FROM main2
INTERSECT DISTINCT
SELECT order_id FROM main3
```
Note that the syntax in BigQuery is slightly different from what we have used in sqlite3; for example, BigQuery requires ```INTERSECT DISTINCT``` (or ```INTERSECT ALL```), while sqlite3 simply uses ```INTERSECT```.
</div>
## Thursday 4.6.2020
<div style="text-align: justify">
In this session, we will work with the ```pandas``` library in python, which is a very powerful and popular library for data manipulation and analysis. Here are some remarks:
- ```loc``` will look for the row first!! Take the example where the rows' index is given by integers to see how we need to index to access elements in the dataframe (see the sketch after this list).
- If we do not specify and just use ```[]```, label-based (```loc```-style) lookup is tried first; ```loc``` is the default.
- ```loc``` uses the label (name) index while ```iloc``` uses the positional index.
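A minimal sketch (with a hypothetical DataFrame) of the difference between label-based and position-based indexing:
```python
import pandas as pd

# Hypothetical DataFrame whose row index consists of the integer labels 10, 20, 30
df = pd.DataFrame({'Name': ['Anna', 'Bob', 'Carl']}, index=[10, 20, 30])

print(df.loc[10])    # label-based: the row whose index label is 10 -> 'Anna'
print(df.iloc[0])    # position-based: the first row -> 'Anna'
# df.loc[0] would raise a KeyError, because no row has the label 0
```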
In the following, we note some exercises that combine several techniques from ```pandas```.
### Use the method ```str.contains``` to look for names
```python
# Find the customers who have the name James or Rose
# Your code here
df1 = titanic[titanic['Name'].str.contains('James')]
df2 = titanic[titanic['Name'].str.contains('Rose')]
result2 = pd.concat([df1, df2])
result2
```
</div>
### Define a function and use the method ```apply``` to apply it to the dataframe columns/rows
<div style="text-align: justify">
```python
# Analyze the Cabin, which one is for an individual, which one is for a group (or family)?
# Your code here
def checkcabin(x):
    if x == 1:
        return 'Individual'
    else:
        return 'Group'
cabin = titanic.groupby('Cabin').count()                # number of passengers per cabin
cabin['Type'] = cabin['PassengerId'].apply(checkcabin)  # 1 passenger -> Individual, otherwise Group
cabin
```
and analyze the data
```python
# Did women have a better chance to survive than men?
# Your code here
df1 = titanic.groupby('Sex').Survived.sum()    # number of survivors by sex
df2 = titanic.groupby('Sex').Survived.count()  # number of passengers by sex
df1 / df2                                      # survival rate by sex
```
Classify the data into ranges and add one more column, for instance:
```python
# Write a function that takes in an age and return the age range defined as:
# Infants: < 1, Children: < 10, Teens: < 18, Adults: < 40, Middle Age: < 60, Elders: >= 60
# Create a new column called AgeRange with the function
# How was the chance of survival in different Age Ranges?
# Your code here
def agerange(x):
    if x < 1:
        return 'Infants'
    elif x < 10:
        return 'Children'
    elif x < 18:
        return 'Teens'
    elif x < 40:
        return 'Adults'
    elif x < 60:
        return 'Middle Age'
    else:
        return 'Elders'   # note: missing ages (NaN) also fall into this branch
titanic['AgeRange'] = titanic['Age'].apply(agerange)
titanic.groupby('AgeRange').Survived.mean()   # survival rate per age range
```
I find the two exercises below very interesting and useful:
```python
# Imagine that when you first got this dataset, there was no Sex feature
# How can you create that feature based on the other features?
# Your code here
def gettitle(x):
    # e.g. "Braund, Mr. Owen Harris" -> "Mr"
    a = x.split(', ')[1].split('. ')[0]
    return a
titanic['Title'] = titanic['Name'].apply(gettitle)
titanic.sample()
def checkgender(x):
male_title = ['Mr', 'Master', 'Don', 'Rev', 'Dr',
'Major', 'Sir', 'Col', 'Capt', 'Jonkheer']
if x in male_title:
return 'male'
else:
return 'female'
titanic['NewGender'] = titanic['Title'].apply(checkgender)
```
```python
# After you finished the previous step, one of your co-workers found data on the gender
# You are curious about how good your prediction is compared to the true label
# Write a function to print out the accuracy score
# Your code here
def check_accu(inputdata):
    x = inputdata['Sex']        # true label
    y = inputdata['NewGender']  # our prediction
    if x == y:
        return 1
    else:
        return 0
titanic['Check accuracy'] = titanic[['Sex', 'NewGender']].apply(check_accu, axis=1)
a = titanic['Check accuracy'].sum()
print(a / len(titanic) * 100)            # accuracy in percent
titanic[titanic['Check accuracy'] == 0]  # the mislabeled passengers
```
</div>
## Friday 5.6.2020
### Data Visualization
For data visualization, we need to remember the following details:
- For Categorical Univariate Data
  - Count Plot: frequency of the categories in a variable is represented by the height of the bars.
  - Pie Chart: frequency of the categories in a variable is represented by the size of the slices.
- For Numerical Univariate Data
  - Bar Chart: size of the values in a variable is represented by the height of the bars.
  - Line Plot: represents the trend of a variable over an ordinal variable.
  - Histogram: distribution of a variable is represented by the height of the bins.
  - Box Plot: distribution of a variable is represented based on the five-number summary.
- For Multivariate Data Analysis: analyzing data using two (or more) variables, to show the relationships between them.
  - Scatter Plot: relationship between two variables is represented as individual points.
  - Heat Map: correlation between two (or more) variables is represented by the color intensity.
We will use a lot of data visualization in this weekly project. The figures will be added once it is completed.
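A minimal sketch of a few of these plots using matplotlib/seaborn (assuming both are installed and the Titanic DataFrame from Thursday's session is still loaded as ```titanic```):
```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Count plot: frequency of each category shown as bar heights
sns.countplot(x='Sex', data=titanic, ax=axes[0])

# Histogram: distribution of a numerical variable
titanic['Age'].plot.hist(bins=20, ax=axes[1], title='Age distribution')

# Heat map: correlation between numerical variables shown as color intensity
sns.heatmap(titanic[['Age', 'Fare', 'Survived']].corr(), annot=True, ax=axes[2])

plt.tight_layout()
plt.show()
```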