---
tags: Kermadec Class, Machine Learning, Data Visualization, Data Analysis, Advanced SQL, Pandas, Data Cleaning, Manipulation, Matplotlib, Chart Gallery - Preset code, Interactive Plot, Inferential Statistics
---
Machine Learning Week 3
=
# Day 1:
## Review Data Type:
Most real-world data is continuous.
Discrete | Continuous
---|---
Fixed, pre-defined levels/categories | Infinite levels of measurement
Shoe size... | House price, Temperature, Weight, Age...
| |Descriptive Statistics | Inferential Statistics
|---|---|---
|**How to get the result**| Survey | Test
|**Purpose**|Describe a sample | Use the sample's results to draw conclusions about the population
## Data Distribution:

### 1. Central Tendency
Where the data is concentrated: the most representative value of the data.

- **Mean (Arithmetic Mean):** sum of values / number of values. "Average" usually refers to the mean.
Try to remove **outliers** first, because outliers make the mean **biased**.
- **Median:** the center-most value: sort the data -> take the central value. With an **even number** of values -> take the **average** of the 2 central values.
Good for ordered categorical data (ordinal data) ~ Ranking.
- **Mode:** the value(s) that **appear the most times** (highest appearance frequency).
**Bimodal:** 2 modes.
**Multimodal:** more than 2 modes.
_**Not useful for continuous data**_ unless the continuous data is converted to ranges/categories.
| |Mean | Median | Mode
-|-|-|-
Categorical Data | | | x
Has Outliers | | x | x
Symmetrical | x | x | x
Skewed | | x |
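The three measures of central tendency can be checked quickly with pandas; a minimal sketch on a made-up sample containing one outlier:

```python
import pandas as pd

# Made-up sample with one outlier (1000) to show how the mean gets biased
s = pd.Series([1, 2, 2, 3, 4, 1000])

print(s.mean())    # pulled far above the typical value by the outlier
print(s.median())  # middle of the sorted values: robust to the outlier
print(s.mode())    # the most frequent value(s)
```

Note how the median (2.5) stays representative while the outlier drags the mean far away.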
### 2. Dispersion:
How spread out (shallow or concentrated) the data distribution is.


- **Range**: Difference between the Min and Max values
- **Quartiles**: Break the data into 4 parts **using medians** (like a binary-search split): the quartiles are the medians of each half.
- **Interquartile Range (IQR)** to cross out outliers.
**IQR = Q3 - Q1**
Lower whisker: Q1 - 1.5\*IQR
Upper whisker: Q3 + 1.5\*IQR
In theory, anything outside the lower and upper whiskers is an **outlier**.
_Treat the outliers differently -> e.g. move them into special categories._

- **Variance:** the average squared distance of **all values** from the **mean**.
- **Standard Deviation** = $\sqrt{Variance}$
Measures how spread out the distribution is, in the same units as the data.
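A sketch of the dispersion measures above, using pandas on a made-up sample (`quantile` for Q1/Q3, the 1.5×IQR whisker rule for outliers):

```python
import pandas as pd

s = pd.Series([1, 2, 4, 4, 5, 5, 7, 9, 30])  # made-up data; 30 is extreme

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1                       # IQR = Q3 - Q1
lower = q1 - 1.5 * iqr              # lower whisker
upper = q3 + 1.5 * iqr              # upper whisker
outliers = s[(s < lower) | (s > upper)]

print(s.var())            # variance: average squared distance from the mean
print(s.std())            # standard deviation = sqrt(variance)
print(outliers.tolist())  # values outside the whiskers
```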

# Day 2
## Relational vs Non-relational Database
| |Relational | Non-relational Database
|-|-|-
Concept | Tables with relations; pre-defined schema | Collections with no/few relations -> no need to join tables; no pre-defined schema
Language | SQL (MySQL, SQLite, PostgreSQL) | A query language unique to each service/database
Schema (ERD) | Designed up front | None / flexible

MongoDB is a non-relational database, a document database.
### Non-relational database structure:
* Database
* Collection
* Documents
All data is stored as **JSON** documents (similar to Python dictionaries); each document is one self-contained record holding all of its values.
Documents in the same collection can have different columns/value types.
**No joins -> faster and more efficient queries.**
Changing values across documents is harder, so this suits data that rarely changes. Example: user emails are rarely changed.
**Want to be fast -> use a NoSQL database.**

**Hierarchical data storage**: stored as dictionaries with multiple levels, lists, tuples...
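A hypothetical document sketched as a Python dictionary (every field name here is invented for illustration) shows the JSON-like, hierarchical shape:

```python
import json

# Hypothetical user document; all field names are made up for illustration
doc = {
    "_id": "u123",
    "name": "Alice",
    "email": "alice@example.com",
    "orders": [                      # nested list of sub-documents: hierarchical
        {"item": "book", "qty": 2},
        {"item": "pen", "qty": 10},
    ],
}

# The whole record travels as one JSON document; no joins needed to read it
print(json.dumps(doc, indent=2))
```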
### Scaling
- SQL: vertical scaling -> hits a **hardware limit** at some point.
- NoSQL: horizontal scaling -> more advantageous because there is **no single-machine hardware limit**.


## Advanced SQL
### Sub Query
**Correlated vs Uncorrelated**
- **Correlated:** there is **some relation** between the inner and outer query: the inner query uses data from the outer query's current row.
- **Uncorrelated:** **no relation** between the inner and outer query: the inner query runs independently of the outer query.
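The distinction can be sketched with SQLite's in-memory database (table name, columns, and data are all made up):

```python
import sqlite3

# Made-up table and data, just to contrast the two subquery styles
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept TEXT, salary REAL);
INSERT INTO employees VALUES
  ('a', 'eng', 100), ('b', 'eng', 200), ('c', 'hr', 50), ('d', 'hr', 80);
""")

# Correlated: the inner query references e.dept from the outer query's row
correlated = conn.execute("""
    SELECT name FROM employees e
    WHERE salary > (SELECT AVG(salary) FROM employees WHERE dept = e.dept)
""").fetchall()

# Uncorrelated: the inner query runs once, independently of the outer query
uncorrelated = conn.execute("""
    SELECT name FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
""").fetchall()

print(correlated)    # employees above their own department's average
print(uncorrelated)  # employees above the company-wide average
```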
### Set Operators
Union (distinct)
Union All
Intersect
Except
`Union` also performs a `distinct` when concatenating tables, and many engines return the rows sorted by the first column of the result (this order is only guaranteed with `ORDER BY`).
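A quick sketch of the four set operators using SQLite in-memory tables (the sample tables and values are invented):

```python
import sqlite3

# Two made-up single-column tables to demonstrate each set operator
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (x INTEGER);
CREATE TABLE t2 (x INTEGER);
INSERT INTO t1 VALUES (1), (2), (2), (3);
INSERT INTO t2 VALUES (2), (3), (4);
""")

def q(sql):
    return [row[0] for row in conn.execute(sql)]

print(q("SELECT x FROM t1 UNION SELECT x FROM t2"))      # duplicates removed
print(q("SELECT x FROM t1 UNION ALL SELECT x FROM t2"))  # duplicates kept
print(q("SELECT x FROM t1 INTERSECT SELECT x FROM t2"))  # rows in both
print(q("SELECT x FROM t1 EXCEPT SELECT x FROM t2"))     # rows only in t1
```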
# Day 4: Intermediate Pandas
## Data Analysis steps:
-> Get data
-> Read data
-> Decide what to do with the data (form the big question) to know which columns/rows need cleaning -> clean data
-> Analyze data (query, compute basic aggregated numbers, create new columns with modified data...) + Visualize data
Cheat sheets -> not updated
Library documentation -> always updated
## Pandas Component:
### 1. Series:
- Each column is a Series; it works like a 1D NumPy array, but is a little more advanced.
- Each Series can hold only 1 data type at a time.
**Implicit Index:** positional integers `0, 1, 2, ...` (negatives count from the end when slicing)
**Explicit Index:** the index can be assigned as anything
A Series has an **explicit index**; data inside a Series can be accessed by any index type (string, number). The index of a Series can be assigned.

After a Series' index is re-assigned to string labels, label lookups use the new string index (positional access still works through `iloc`).
-> **Similar to how Dictionary works**
**Series as specialized dictionary**

**Can do slicing!!!**

Data inside a NumPy array can only be accessed by position; NumPy's index cannot be re-assigned (always `0, 1, 2, ...`, with negatives counting from the end).
**Series as generalized NumPy array**
The index can be assigned when creating a Series (and re-assigned later via `s.index = ...`).
### Create Series

`pd.Series(data, index=index)`
-> **Can use a dictionary to define both data and index** (keys become the index)
When the index is already defined by a dictionary, passing `index=` keeps only the labels listed in `index=`. Any label in `index=` that is not among the dictionary's keys gets `NaN`.

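A minimal sketch of this behavior (city names and numbers invented):

```python
import pandas as pd

# Invented city populations, just to show dictionary-based construction
pop = {"hanoi": 8, "hcmc": 9, "danang": 1}

s = pd.Series(pop)       # dict keys become the explicit index
print(s["hanoi"])        # label-based access, like a dictionary

# index= keeps only the listed keys; 'hue' is not in the dict, so it gets NaN
s2 = pd.Series(pop, index=["hanoi", "hue"])
print(s2)
```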
### 2. Index:
Each row/column has an index; however, the index here refers to **rows** only.
**Index as immutable array**
Similar to a 1D NumPy array, but **immutable**: its values cannot be changed.
### 3. DataFrame:
Can create a DataFrame from a dictionary, or from multiple Series/NumPy arrays with the same index.
Dictionary -> `key`s become columns, `value`s become the column data (rows).
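A minimal sketch of both construction paths (column and index names invented):

```python
import pandas as pd

# From a dictionary: keys become column names, values become the column data
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
print(df)

# From multiple Series sharing the same index
a = pd.Series([1, 2], index=["r1", "r2"])
b = pd.Series([3, 4], index=["r1", "r2"])
df2 = pd.DataFrame({"a": a, "b": b})
print(df2)
```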
## Data Exploration:
`df.T`: transpose -> swap rows and columns (e.g. turn timestamp column names into row labels).
Most DataFrame methods default to `inplace=False`, i.e. they return a new object.
### Select Columns:
- df['A']
- df[['A', 'B']]
### Select Rows and/or Columns:
- df.loc['20200531'] # using a label
- df.loc[:, ['A', 'B']] # by list of columns
- df.loc['20200531', 'A'] # Getting a scalar value at a specific position
`df.loc` -> label-based: can only use column/row **names**.
`iloc`: **integer** **loc**ate
`df.iloc[row, column]` -> position-based: can only use column/row **integer** positions. Similar to slicing a 2D NumPy array.
- df.iloc[3] # 4th row
- df.iloc[3:5, 0:2] # slicing, similar to numpy/python
- df.iloc[[1, 2], [1, 3]] # by list of integer position
- df.iloc[2, 2] # getting a value explicitly
`df.loc[df['B'] > 0]` works the same as `df[df['B'] > 0]` (boolean mask)
`df.drop(labels=df.index[:2], inplace=True)`: drop rows by a list of index labels
`df.sort_index(axis=1/0, ascending=False)`: sort by column (`axis=1`) or row (`axis=0`) labels
`df.sort_values(by=['b', 'a', 'D'], ascending=[True, True, False])`: `ascending` can be a list; the sorting priority follows the list order.
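The selection and sorting methods above can be sketched on a small made-up frame:

```python
import numpy as np
import pandas as pd

# A small made-up frame: 3 labeled rows, 4 labeled columns
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=["r1", "r2", "r3"],
    columns=["A", "B", "C", "D"],
)

print(df.loc["r2", "A"])     # label-based lookup
print(df.iloc[1, 0])         # position-based lookup of the same cell
print(df[df["B"] > 1])       # boolean mask, same as df.loc[df["B"] > 1]
print(df.sort_values(by="A", ascending=False))
```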
## Data Manipulation:
### Missing Data

Pandas represents both `None` and `np.nan` as `NaN`.

### Methods on Null Values
* Delete the cases with missing observations. This is OK if it only loses a relatively small number of cases; it is the simplest solution.
* Fill in the missing value with the mean, mode, median, or another constant value.
* Use the rest of the data to predict the missing values by regression, KNN, ...
* Use the available data pairwise: even when one member of an (xi, yi) pair is missing, the observed values of x and y can still be used to estimate the means and SDs.
`data[data.isnull()]`: select the entries that are None/NaN.
`data.dropna(axis=1, how='any')`: `how='any'`, drop columns that have at least 1 NaN.
`data.dropna(axis=1, how='all')`: `how='all'`, drop columns whose values are all NaN.
`data.dropna(axis=1, thresh=2)`: `thresh=integer`, keep only columns with at least that many non-NaN values.
`data.fillna(method='ffill', axis=1)`: fill in NaN following a method: `ffill` forward fill, `bfill` backward fill (newer pandas prefers `data.ffill(axis=1)` / `data.bfill(axis=1)`).
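A sketch of `dropna`/`fillna` on a small frame with missing values (column names and data invented):

```python
import numpy as np
import pandas as pd

# Made-up frame: A has one NaN, B is all NaN, C is complete
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, np.nan],
    "C": [1.0, 2.0, 3.0],
})

print(df.dropna(axis=1, how="all"))   # drops only B (entirely NaN)
print(df.dropna(axis=1, how="any"))   # keeps only C (no NaN at all)
print(df.dropna(axis=1, thresh=2))    # keeps columns with >= 2 non-NaN values
print(df.fillna(df.mean(numeric_only=True)))  # fill NaN with column means
```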
`df['A'].apply(lambda x: x + 100)` is equivalent to `df['A'] + 100`
`df = pd.concat([df1, df2], axis=1)`: place df2 to the right of df1
## Tips & Tricks:
Taking the **mean()** of a column with only 2 values, `0` and `1`, returns the proportion of `1` values.
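A one-liner sketch of this trick (the indicator column is invented):

```python
import pandas as pd

# Invented 0/1 indicator column, e.g. whether a user converted
flags = pd.Series([1, 0, 1, 1, 0])

print(flags.mean())  # 0.6 -> 60% of the rows are 1
```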
# Day 5: Data Visualization
## Type of Data:
Ordinal: Ranking
Categorical: discrete labels (continuous/discrete values can be binned into categories)
Continuous
Discrete
## Type of Charts:
- **Summary chart:** Takes time to read and understand every detail; may contain many data types (Ordinal, Categorical, Continuous, Discrete). Examples:
- Box-and-whisker plot

- **Univariate:** Describe one variable - Distribution of 1 variable. Most of the charts are univariate.
- Histogram
- Box-and-whisker plot: Shows outliers and quartiles. Harder to read: requires understanding the lower/upper whiskers and the IQR.
- Bar Chart: Comparison between bars
- Line Chart: Trend
- Pie Chart: Difficult to compare the (pie) sections.
- 3D charts can be visually misleading (illusion bias).
- **Bivariate:** Show the relationship between two variables.
- Scatter Plot:
- Correlation Direction.

- Correlation Strength.

- Correlation Coefficient (r): The strength and direction of a linear relationship.
`-1 <= r <= 1`
- Strong: 0.7 <= |r| <= 1.0
- Moderate: 0.3 <= |r| < 0.7
- Weak: 0.0 <= |r| < 0.3
- Heat Map
- **Multivariate:**
- Breakdown Dimension: Series in charts.
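The correlation-strength thresholds above can be checked with NumPy's `corrcoef` on a made-up, roughly linear sample:

```python
import numpy as np

# Made-up, roughly linear sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])  # small noise

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(r)  # close to 1: a strong positive linear relationship
```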
## Matplotlib:
### Components
<img src="https://matplotlib.org/_images/anatomy.png"/>
Matplotlib needs a `figure` to draw on; charts are drawn in subplots (aka "Axes") placed on a `figure`.
`figure` -> `subplot` ("Axes") -> `charts`
A `figure` is like a big board that tells the computer where to draw.
`fig = plt.figure()`: an empty figure with no axes
`fig.suptitle('No axes on this figure')`: Add a title so we know which it is
`fig, ax = plt.subplots(2, 3)`: a figure with a 2x3 grid of Axes
`fig, ax = plt.subplots(figsize=(10,10))`: 1 chart with size in **inches**.
Can draw multiple charts in 1 Axes (subplot).
```
# Assumes fig, ax = plt.subplots(2, 3) and a 1D array `data` already exist
ax[0][0].plot(data, data)        # two lines in the same Axes
ax[0][0].plot(data, data**2)
ax[0][0].set_title("Plot 1")
ax[0][1].plot(data, data**2)
ax[0][1].set_title("Plot 2")
```
`ax[row][column]`
OOP Style
```
import numpy as np
import matplotlib.pyplot as plt

data = np.arange(0, 5, 0.2)  # sample x values

fig, ax = plt.subplots(2, 2, figsize=(12, 12))
fig.suptitle("plots")
ax[0][0].plot(data, data)
ax[0][0].plot(data, data**2)
ax[0][0].set_title("Plot 1")
ax[0][1].plot(data, data**2)
ax[0][1].set_title("Plot 2")
ax[1][0].plot(data, data**3)
ax[1][0].set_title("Plot 3")
ax[1][1].plot(data, data**4)
ax[1][1].set_title("Plot 4")
plt.show()
```
Draw a single plot
MATLAB Style
```
import numpy as np
import matplotlib.pyplot as plt

data = np.arange(0, 5, 0.2)

plt.figure(figsize=(8, 8))
plt.plot(data, data**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title('Power of 3')
plt.legend()
plt.show()
```

[Chart Gallery - Preset code](https://python-graph-gallery.com/)
[Options for styles](https://matplotlib.org/2.1.1/api/_as_gen/matplotlib.pyplot.plot.html):
* linestyle
* markers
* color
`plt.plot(t, t**2, 'b-o')` = `plt.plot(t, t**2, color='blue', linestyle='solid', marker='o')`
**[Preset Styles](https://matplotlib.org/3.3.1/gallery/style_sheets/style_sheets_reference.html):**
```
plt.style.available
plt.style.use('Solarize_Light2')
```
Can put text inside charts
[ax.text()](https://matplotlib.org/3.3.0/api/_as_gen/matplotlib.axes.Axes.text.html)
[Save figure](https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.pyplot.savefig.html)
`fig.savefig()`
Example:
`fig.savefig('sales.png', transparent=False, dpi=80, bbox_inches="tight")`
## Tips/Tricks:
`plt.axvline()`: vertical line
`plt.subplots(constrained_layout=True)` or `plt.tight_layout()`: prevents Axes from overlapping on 1 figure

## Pandas Plot:
`df.plot()`: draw a quick chart straight from pandas
Visualization customization of Pandas Plot is limited.
Need high customization -> use Matplotlib.
The result of `df.plot` is a subplot (Axes) -> Matplotlib functions can be used to customize the visualization.
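A minimal sketch of customizing a pandas plot through the returned Axes (the Agg backend is used so it runs headless; the data is invented):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"sales": [3, 5, 4, 7]})  # invented numbers

ax = df.plot()                  # df.plot returns a Matplotlib Axes
ax.set_title("Quick look")      # so normal Matplotlib methods work on it
ax.axvline(2, linestyle="--")   # e.g. mark a point in time
print(ax.get_title())
```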