---
tags: coderschool, note, week 3, statistics
---
# Week 3: Data Visualization and Analysis
Quote of the Week
> Correlation does not imply causation
>
> Garbage in, garbage out
>
> Graphs that take 2 seconds to understand are :shit:
___
**Good tips:**
How to read documentation?
* Check the object before the function
* Check definition
* Check returns
* Check the parameters last, then read the additional notes and examples for details
___
**Table of Contents**
[TOC]
## Week 2 Homework
* Check Team Dicky-Ly-Khoa's web scraper: https://github.com/TranLySFW/mariana_w1_project
* Check Bao's scraper: https://github.com/brookyct95/tiki_scraping
* Check Cuong-Felix's UI: https://github.com/FelixBecquart1990/Tiki_scraping_Cuong_Felix
* Check Tien's solution: https://github.com/txtien/tiki-scraping
* Check Tam's scraper: https://github.com/thtamho/tiki_PostGreSQL_app
___
## Day 1: Fundamental Statistics
**My team this week:**
**Young Buffalo** (Tuc - Ly - Tam)
Data analysis: https://www.kaggle.com/START-UMD/gtd
**Weekly homework**
Each team has been given a different dataset to analyze, and will present its findings and the story behind the data next Monday.
### Statistics: What to know
Reference: https://www.beautiful.ai/player/-LuVIkNIgFI9K47ycWlU/Mariana-Week-3-Basic-Stats
For further study: https://seeing-theory.brown.edu/?fbclid=IwAR2NSvrkF_JfHMZA3c3bC4aNz7iHXo3BoDvxPOg9snwkuJVZwW_X469kx4w
#### 1. What are types of data?
* What are categorical data and numerical data?
* **Categorical data** describe categories or groups, including 2 types: Nominal & Ordinal.
* Nominal data represent discrete units and are used to label variables with no quantitative value and no order.
Example: Genders (Male, Female), Languages (English, French,etc.)
* Ordinal data are similar to nominal data but with ordering.
Example: Satisfaction level (1- Totally Satisfied, 2- Satisfied, 3- Neutral, 4- Dissatisfied, 5- Totally Dissatisfied)
* **Numerical data** are measurable and expressed in numbers, including 2 types: Continuous & Discrete.
* Continuous data can take any value within a range, so they are measured rather than counted.
Example: weight, height, distance, area, etc.
* Discrete data take separate, countable values.
Example: shoe size, dice result, etc.
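
As a rough illustration of the four data types above, the sketch below builds a small pandas DataFrame. The column names and values are invented for the example, and an ordered `Categorical` is just one possible way to tell pandas that ordinal levels have an order.

```python
import pandas as pd

# Hypothetical survey data, invented for illustration only
df = pd.DataFrame({
    "language": ["English", "French", "English"],              # nominal
    "satisfaction": ["Neutral", "Satisfied", "Dissatisfied"],  # ordinal
    "height_cm": [171.5, 160.2, 182.0],                        # continuous
    "dice_result": [3, 6, 1],                                  # discrete
})

# Ordinal data: declare the order of the levels explicitly
levels = ["Dissatisfied", "Neutral", "Satisfied"]
df["satisfaction"] = pd.Categorical(df["satisfaction"], categories=levels, ordered=True)

# This comparison only makes sense because the categories are ordered
at_least_neutral = df[df["satisfaction"] >= "Neutral"]

print(df.dtypes)
print(len(at_least_neutral))
```

Nominal columns such as `language` have no meaningful order, so a comparison like the one above would not apply to them.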

#### 2. What is the measure of central tendency?
* What are Mean, Median, Mode?
* **Mean**: arithmetic average; sensitive to outliers
* **Median**: 50th percentile; the middle value of the sorted dataset, dividing the distribution into 2 equal halves
* **Mode**: the most frequent value; on its own it says little about the rest of the data

* Outliers are values that lie far away from the bulk of the dataset
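
A minimal sketch of how one outlier moves the mean much more than the median or mode, using Python's standard `statistics` module. The ages are made-up numbers.

```python
from statistics import mean, median, mode

# Made-up ages; 95 is an outlier far from the rest
ages = [22, 25, 25, 27, 30]
ages_with_outlier = ages + [95]

# Without the outlier the three measures are close together
print(mean(ages), median(ages), mode(ages))

# With the outlier the mean jumps, while the median and mode barely move
print(mean(ages_with_outlier), median(ages_with_outlier), mode(ages_with_outlier))
```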

#### 3. What is the measure of variability?
Measures of variability describe how spread out the data are.
* What is range?

Range is the difference between the lowest and highest values. It is useful for checking whether the data points fall on the expected scale.
* What are quartiles?

Quartiles are points that cut the sorted data into four equal parts (quarters). They are the most common kind of quantile.
* 1st quartile (Q1): 25%, middle value of the first half
* The median (Q2): 50%
* 3rd quartile (Q3): 75%, middle value of the second half
* How to detect outliers with quartiles?

* The interquartile range (IQR) is a measure of variability based on dividing a dataset into quartiles.
Interquartile range (IQR) = Q3 - Q1
* By a common rule of thumb, anything below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR can be considered an outlier
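
The quartile and IQR ideas above can be sketched with the standard library. The data are invented, the 1.5 multiplier is the conventional choice rather than a fixed law, and `method="inclusive"` is only one of several quartile conventions, so other tools may return slightly different quartiles.

```python
from statistics import quantiles

# Made-up measurements; 40 looks suspicious
data = [1, 3, 4, 5, 6, 7, 8, 9, 40]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences using the conventional 1.5 x IQR multiplier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)
print(outliers)
```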
* What is variance?
**Definition:** The average of the squared differences from the Mean.
* Variance measures how far each number in the set is from the mean and therefore from every other number in the set.
* Variance amplifies the effect of large differences
* Small variance means data points stay close together

For reference:
https://www.quora.com/Why-is-variance-squared
https://stats.stackexchange.com/questions/118/why-square-the-difference-instead-of-taking-the-absolute-value-in-standard-devia
* What is standard deviation?
* The Standard Deviation is a measure of how spread out numbers are.
* Its symbol is σ (the Greek letter sigma).
* It is the square root of the Variance.
* Example for variance and standard deviation: https://www.mathsisfun.com/data/standard-deviation.html
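
A small worked sketch of the two definitions above, on made-up numbers. It computes the population variance (dividing by n, matching "the average of the squared differences"). The sample variance, which divides by n - 1, is a different quantity and is not shown here.

```python
from math import sqrt
from statistics import pvariance, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

# Variance: average of the squared differences from the mean
mu = sum(data) / len(data)
variance = sum((x - mu) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance
std_dev = sqrt(variance)

# The standard library gives the same population values
print(variance, pvariance(data))
print(std_dev, pstdev(data))
```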
* What is kernel density estimation?
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV0405/MISHRA/kde.html
#### 4. What is a distribution?
A distribution of statistical data shows all the possible values (or intervals) of the data and how often they occur (probability).
* What is an independent event?
Two events are independent of each other when the probability that one event occurs in no way affects the probability of the other event occurring.
* What is binomial distribution? Its formula?
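The binomial distribution describes the number of successes in a fixed number of independent trials, each with only two possible outcomes and the same probability of success. For example, a coin toss has only two possible outcomes (heads or tails), and taking a test could have two possible outcomes (pass or fail).
* Formula:
$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, where $n$ is the number of trials, $x$ the number of successes, and $p$ the probability of success in a single trial.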
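
A minimal sketch of the binomial formula above, using `math.comb` for the number of combinations. The function name `binomial_pmf` and the coin-toss numbers are chosen for this example only.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Probability of exactly 3 heads in 5 tosses of a fair coin
prob_three_heads = binomial_pmf(3, 5, 0.5)

# The probabilities of all possible outcomes (0 to 5 heads) add up to 1
total = sum(binomial_pmf(x, 5, 0.5) for x in range(6))

print(prob_three_heads)
print(total)
```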
* What is normal distribution?

* Also known as Gaussian distribution
* Mean = Median = Mode
* Total area under the curve is 1
* Is a continuous distribution
* What is continuous probability distribution?
For further reading: https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/basic-statistics/probability-distributions/supporting-topics/basics/continuous-and-discrete-probability-distributions/
* What is skewness?
Skewness indicates whether the data are concentrated on one side of the distribution. Comparing the mean and the median gives a quick check.
Example:
* mean > median: positive or right skew
* mean = median: symmetrical distribution
* mean < median: negative or left skew
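
The mean-versus-median comparison above can be sketched as a small helper. The function name `skew_direction` and the three datasets are invented for illustration, and the comparison is a rule of thumb rather than a formal skewness statistic.

```python
from statistics import mean, median

def skew_direction(values):
    """Classify skew with the mean-vs-median rule of thumb."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "symmetric"

right_skewed = [20, 22, 25, 27, 30, 120]  # one very large value pulls the mean up
symmetric = [1, 2, 3, 4, 5]
left_skewed = [1, 50, 52, 55, 58]         # one very small value pulls the mean down

for values in (right_skewed, symmetric, left_skewed):
    print(skew_direction(values))
```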

#### 5. What is descriptive vs inferential?
* Descriptive statistics summarize and describe a sample of data without making inferences from the sample to the whole population
* Inferential statistics use a random sample of data taken from a population to describe and make inferences about that population.

### Visualization techniques: What to know
#### 1. What are categorical data representations?
* Frequency tables, pie charts, and bar charts are the most appropriate graphical displays for categorical variables
* Pie charts are not recommended for most cases (https://bernardmarr.com/default.asp?contentID=1779)
#### 2. What are numerical data representations?
For further reading: https://www2.le.ac.uk/offices/ld/resources/numerical-data/numerical-data
### Relationship between two variables
* What is correlation? Correlation simply means that there is some type of relationship between two variables. For reference: https://www.displayr.com/what-is-correlation/
* Correlation is not causation. Causation is the relationship between cause and effect. So, when a cause results in an effect, that's a causation. In other words, correlation between two events or variables simply indicates that a relationship exists, whereas causation is more specific and says that one event actually causes the other. For reference: https://www.mathtutordvd.com/public/Why-Correlation-does-not-Imply-Causation-in-Statistics.cfm
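
One common way to put a number on correlation is the Pearson correlation coefficient, which is an addition of this sketch and is not named in the notes above. It ranges from -1 to 1 and only measures how linear the relationship is. The helper name `pearson_r` and the data are made up.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # goes up with hours
falling = [10, 8, 6, 4, 2]   # goes down with hours

r_rising = pearson_r(hours, rising)
r_falling = pearson_r(hours, falling)
print(round(r_rising, 6), round(r_falling, 6))
```

A coefficient close to 1 or -1 only says the two variables move together. It does not say which one, if either, causes the other.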

### Basic steps of Data Analysis
**Step 1**
* Find the question
* Business application
* Sufficient data for the answer?
**Step 2**
* Clean the data: find missing values, check data types, and look for miscoded data and outliers
**Step 3**
* Univariate plots (histogram, distplot, boxplot)
* Correlation between variables (scatterplot, jointplot, kde plot)
* Feature engineering (using domain knowledge)
**Step 4**
* Provide clear brief
* Explain the data
* Make representations visible
___
## Day 2: Data Manipulation with Pandas
### Numpy Array vs Python List
Numpy is a library for Python that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
* What is the difference between Python list and Numpy arrays?
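| Python lists | Numpy arrays |
| -------- | -------- |
| printed as `[1, 2, 4, 5]` | printed as `[1 2 4 5]` |
| can hold many types at once | hold one type for all elements |
| each element is a separate Python object | compact, optimized memory layout |
| element-wise operations need loops | element-wise operations are built in |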
Reference: https://medium.com/backticks-tildes/list-vs-array-python-data-type-40ac4f294551
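
A short sketch of the differences in the table, with made-up values. It assumes `numpy` is installed and imported as `np`.

```python
import numpy as np

prices_list = [1, 2, 4, 5]
prices_array = np.array(prices_list)

# Lists need a loop or comprehension for element-wise arithmetic
doubled_list = [p * 2 for p in prices_list]

# Arrays apply the operation to every element at once
doubled_array = prices_array * 2

# Careful: multiplying a list repeats it instead of doing arithmetic
repeated_list = prices_list * 2

# Arrays hold a single type, so mixed numbers are converted to a common one
mixed = np.array([1, 2.5])

print(doubled_list, doubled_array, repeated_list, mixed.dtype)
```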
* What is broadcasting with Numpy Arrays?
<iframe width="560" height="315" src="https://www.youtube.com/embed/tKcLaGdvabM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
https://machinelearningmastery.com/broadcasting-with-numpy-arrays/
https://www.geeksforgeeks.org/python-broadcasting-with-numpy-arrays/
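
As a rough illustration of broadcasting: when shapes differ but are compatible, Numpy stretches the smaller operand across the larger one without copying it by hand. The arrays below are made up for the example.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

plus_scalar = matrix + 1       # the scalar is applied to every element
plus_row = matrix + row        # the row is added to each row of the matrix
plus_column = matrix + column  # the column is added to each column of the matrix

print(plus_scalar)
print(plus_row)
print(plus_column)
```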
### Pandas
* What is a Pandas Series?
Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet. Reference: https://www.geeksforgeeks.org/python-pandas-series/
* What is Pandas Dataframe?
Pandas Dataframe is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal components, the data, rows, and columns.
https://www.geeksforgeeks.org/python-pandas-dataframe/
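
A minimal sketch of both structures, with invented names and ages. It also shows that selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
ages = pd.Series([22, 38, 26], index=["a", "b", "c"], name="Age")

# A DataFrame: rows and columns, here built from a dict of columns
df = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Age": [22, 38, 26]})

age_of_b = ages["b"]     # look up a value by its index label
age_column = df["Age"]   # a single DataFrame column is a Series

print(age_of_b)
print(type(age_column))
print(df.shape)
```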
Reference:
What is lambda? https://medium.com/@luijar/understanding-lambda-expressions-4fb7ed216bc5
pd.cut vs. pd.qcut: https://pbpython.com/pandas-qcut-cut.html
### Operations
### Data Exploration
* Create a Pandas DataFrame: `pd.DataFrame()`
Note: DataFrame can be empty
* Count the number of rows: `len(df)`
* Show 5 random rows: `df.sample(5)`
* Show first 10 rows: `df.head(10)`
* Show last 15 rows: `df.tail(15)`
* Print a summary of the data: `df.info()`
* Convert the strings in a column to lower case: `df["Name"].str.lower()`
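
The exploration calls above can be tried on a tiny DataFrame. The rows below are invented for illustration, and `random_state=0` is only there to make the random sample repeatable.

```python
import pandas as pd

# Hypothetical rows, invented for this sketch
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Miss. Jane", "Poe, Mrs. Ann"],
    "Age": [22.0, 26.0, None],
})

n_rows = len(df)                            # number of rows
first_two = df.head(2)                      # first 2 rows
last_one = df.tail(1)                       # last row
random_two = df.sample(2, random_state=0)   # 2 random rows
lower_names = df["Name"].str.lower()        # lower-cased copy of the column

df.info()  # prints column names, non-null counts and dtypes
print(n_rows)
print(lower_names.tolist())
```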
### Data Pre-processing
* Apply a function to every value in a DataFrame column: `df['Name'].apply(func)`
* Delete columns or rows: `df.drop(['col_or_row1','col_or_row2'], axis=1, inplace=True)`; axis = 1 for columns, axis = 0 for rows
* Change the type of a column: `df['column'] = df['column'].astype('new_type')` (`astype` returns a new object and has no `inplace` argument)
* Check missing data: `df['column'].isnull()` returns True or False for each value
* Fill in missing data: `df.fillna('value', inplace = True)`
`df['Age'] = df['Age'].fillna(df['Age'].mean())` replaces missing data in the Age column with the column mean
* Check this case of `.apply` method:
```
# Fill missing values in the Embarked column with the column's mode,
# then replace each port code with the port's full name
replace_embark = {"C": "Cherbourg", "S": "Southampton", "Q": "Queenstown"}
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked).apply(lambda x: replace_embark[x])
```
* Application of `.qcut` :
```
# Split the Fare column into 5 quantile-based bins (roughly equal counts) and count the rows in each
pd.qcut(df['Fare'], q = 5).value_counts()
```
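
To see how `pd.qcut` differs from `pd.cut`, the sketch below bins the same made-up values both ways: `pd.cut` makes bins of equal width, while `pd.qcut` makes bins with roughly equal numbers of rows.

```python
import pandas as pd

fares = pd.Series([1, 2, 3, 4, 100])  # made-up values with one large fare

# Equal-width bins: the range 1..100 is split in the middle
equal_width = pd.cut(fares, bins=2)

# Quantile bins: the split point is the median of the values
equal_count = pd.qcut(fares, q=2)

print(equal_width.value_counts())
print(equal_count.value_counts())
```

With skewed data like this, equal-width bins leave most rows in one bin, which is one reason to prefer quantile bins for columns such as fares.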
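*Note: The function passed to `.apply` receives each value of the column as its argument*
```
def func(name):
    # For a name shaped like 'Surname, Title. Given names', keep only the title;
    # strip() removes the space left over after the comma
    return name.split('.')[0].split(',')[1].strip()

df['Title'] = df['Name'].apply(func)
```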
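
The pre-processing calls from this section can be put together in one small, self-contained sketch. The records and column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical passenger records, invented for this sketch
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Miss. Jane", "Poe, Mrs. Ann"],
    "Age": [30.0, None, 50.0],
    "Fare": ["7.25", "71.28", "8.05"],   # numbers stored as text
    "Ticket": ["A1", "B2", "C3"],
})

missing_age = df["Age"].isnull()                 # True where Age is missing
df["Age"] = df["Age"].fillna(df["Age"].mean())   # fill with the mean of the known ages
df["Fare"] = df["Fare"].astype(float)            # text -> float
df = df.drop(["Ticket"], axis=1)                 # remove a column

# Extract the title between the comma and the first period
df["Title"] = df["Name"].apply(lambda name: name.split(".")[0].split(",")[1].strip())

print(df)
```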
### Data Visualization: Seaborn
Seaborn is a Python data visualization library based on matplotlib.
* Distribution plot:
```
import seaborn as sns

survived = df[df['Survived'] == 1]
non_survived = df[df['Survived'] == 0]
# histplot replaces distplot(..., kde=False), which is deprecated since seaborn 0.11
sns.histplot(survived['Age'])
sns.histplot(non_survived['Age'])
```
* Side-by-side bar chart: `sns.countplot(data = df, x = 'Sex', hue = 'Survived')`
___
## Day 3: Data Visualization
Reference: https://datavizcatalogue.com/
https://realpython.com/python-zip-function/
### What is matplotlib?
Matplotlib is a Python library for plotting graphs and one of the most widely used plotting libraries.
**Note:** Matplotlib has two interfaces: an object-oriented (OO) interface and a MATLAB-style state-based interface (`pyplot`).
Reference: https://matplotlib.org/3.1.1/tutorials/introductory/lifecycle.html
Reference: https://towardsdatascience.com/subplots-in-matplotlib-a-guide-and-tool-for-planning-your-plots-7d63fa632857
Check exercise to review: https://colab.research.google.com/drive/11UKGDHPy0XAiqeesCGD6DRNXyvrR7Ml-
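
A minimal sketch contrasting the two Matplotlib interfaces on the same made-up data. The `Agg` backend is chosen here only so the example runs without a display. In a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption of this sketch
import matplotlib.pyplot as plt

months = [1, 2, 3, 4]
sales = [10, 12, 9, 15]  # made-up numbers

# State-based (MATLAB-style): pyplot keeps track of the "current" figure and axes
plt.figure()
plt.plot(months, sales)
plt.title("State-based interface")
state_title = plt.gca().get_title()

# Object-oriented: create the Figure and Axes explicitly and call methods on them
fig, ax = plt.subplots()
ax.plot(months, sales)
ax.set(title="Object-oriented interface", xlabel="Month", ylabel="Sales")
oo_title = ax.get_title()

plt.close("all")
print(state_title, "|", oo_title)
```

The object-oriented style tends to be easier to follow once a figure has several subplots, because every call names the axes it changes.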
### Example of Stacked Plot:
```
import matplotlib.pyplot as plt

# df is expected to hold one row per month and one sales column per product
month_list = df['month_number'].tolist()
data = {
    'Face Cream': df['facecream'].values,
    'Face Wash': df['facewash'].values,
    'Tooth Paste': df['toothpaste'].values,
    'Bathing Soap': df['bathingsoap'].values,
    'Shampoo': df['shampoo'].values,
    'Moisturizer': df['moisturizer'].values
}

fig, ax = plt.subplots()
# Stack the product series on top of each other, one band per product
ax.stackplot(month_list, *data.values(),
             labels=list(data.keys()),
             colors=['m', 'c', 'r', 'k', 'g', 'y'])
ax.set(title='All Product Sales Data',
       xlabel='Month Number',
       ylabel='Sales Units in Number')
ax.legend(loc='upper left')
plt.show()
```

___
## Day 4: Working with Geo Data
http://geopandas.org/mapping.html
* Polygons are one of the geometry shapes used in GeoPandas
* Useful Pandas techniques:
```
# Rename columns
df.columns = ['Maker', 'BarName']
# Count missing values per column
df.isnull().sum()
# Drop rows containing null or NaN values
df.dropna(inplace=True)
# Get sorted unique values
df['BeanType'].sort_values().unique()
```
* Rework this exercise for better understanding: https://colab.research.google.com/drive/13qsHbeuoxVzB9si8moxQzoTXK4SXDRZk#scrollTo=1VYWGY5vNBBf
* Example of a plotted map with GeoPandas

___
## Day 5: Google Data Studio
* Recreate this report: https://datastudio.google.com/s/mCLFaAwi6Ew
* Useful templates: https://datastudio.google.com/gallery?category=community