---
tags: Kermadec Class, Machine Learning, Data Visualization, Data Analysis, Advanced SQL, Pandas, Data Cleaning, Manipulation, Matplotlib, Chart Gallery - Preset code, Interactive Plot, Inferential Statistics
---

Machine Learning Week 3
=

# Day 1:

## Review

Data types: most data is continuous.

| Discrete | Continuous |
|---|---|
| Fixed levels of unit, pre-defined categories | Infinite levels of unit |
| Shoe size... | House price, temperature, weight, age... |

| | Descriptive Statistics | Inferential Statistics |
|---|---|---|
| **How to get the result** | Survey | Test |
| **Purpose** | Describe a sample | Use the result from that sample to make claims about the population |

## Data Distribution:

![](https://i.imgur.com/xN2SkKn.png)

### 1. Central Tendency

Where the data is concentrated; the most representative value of the data.

![](https://i.imgur.com/KliKuHz.png)

- **Average vs. Mean**: the mean is only one kind of average; the median and mode are averages too.
- **Arithmetic Mean:** sum of values / number of values. Consider removing **outliers**, because outliers make the mean **biased**.
- **Median:** the center-most value: sort the data, then take the central value. With an **even number** of values, take the **average** of the 2 central values. Good for ordered categorical data (ordinal data), e.g. rankings.
- **Mode:** the value that **appears the most times** (highest appearance frequency). **Bimodal:** 2 modes; **multimodal:** more than 2. _**Not useful for continuous data**_ unless the continuous data is converted into ranges/categories.

| | Mean | Median | Mode |
|-|-|-|-|
| Categorical data | | | x |
| Has outliers | | x | x |
| Symmetrical | x | x | x |
| Skewed | | x | |

### 2. Dispersion:

How tightly or widely the data is distributed.

![](https://i.imgur.com/9Rxqrby.png)
![](https://i.imgur.com/i4fSep1.png)

- **Range**: difference between the min and max values.
- **Quartiles**: break the data into 4 parts **using the median** (like a binary search): the quartiles are the medians of each half.
- **Interquartile Range (IQR)**: used to cross out outliers.
**IQR = Q3 - Q1**

Lower whisker: Q1 - 1.5\*IQR
Upper whisker: Q3 + 1.5\*IQR

In theory, anything outside the lower and upper whiskers is an outlier. _Treat the outliers differently -> special categories._

![](https://i.imgur.com/ivsiSlM.png)

- **Variance:** how **far apart** **all values** are from the **mean**.
- **Standard Deviation** = $\sqrt{Variance}$: how spread out the distribution is.

![](https://i.imgur.com/YO5Tp9X.png)

# Day 2

## Relational vs Non-relational Database

| | Relational | Non-relational Database |
|-|-|-|
| Concept | Tables with relations; predefined schema | Tables with no/few relations -> no need to join tables; no predefined schema |
| Language | SQL (MySQL, SQLite, PostgreSQL) | A query language unique to each service/database |

Schema (ERD):

![](https://i.imgur.com/kCaiMK5.png)

MongoDB is a non-relational database, a document database.

### Non-relational database structure:

* Database
* Collection
* Documents

All stored as **JSON** (similar to a Python dictionary); each document (record) is stored as a single self-contained unit, and each collection can hold documents with different fields/value types.

**No joins -> faster and more efficient queries**, but it is harder to change values across documents. Example: user emails are rarely changed. **Want to be fast -> use a NoSQL database.**

![](https://i.imgur.com/sskON0n.png)

**Hierarchical data storage**: stored as a dictionary with multiple levels, lists, tuples...

### Scaling

- SQL: vertical scaling -> hits a **hardware limit** at some point.
- NoSQL: horizontal scaling -> more advantageous because there is **no single-machine hardware limit**.

![](https://i.imgur.com/68DKgHq.png)
![](https://i.imgur.com/YO1XovA.png)

## Advanced SQL

### Sub Query

**Correlated vs Uncorrelated**

- **Correlated:** there is **some relation** between the inner query and the outer query: the inner query uses results/data from the outer query.
- **Uncorrelated:** **no relation** between the inner query and the outer query: the inner query uses no results/data from the outer query.
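A minimal sketch of the correlated/uncorrelated distinction, run from Python with the built-in `sqlite3` module. The `employees` table and its columns are made up for illustration, not from the notes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", "IT", 90), ("Bob", "IT", 60), ("Cat", "HR", 50), ("Dan", "HR", 70)],
)

# Uncorrelated: the inner query runs once, independent of the outer rows.
above_avg = conn.execute(
    "SELECT name FROM employees "
    "WHERE salary > (SELECT AVG(salary) FROM employees)"
).fetchall()

# Correlated: the inner query references e.dept from the outer query,
# so it is conceptually re-evaluated for each outer row.
above_dept_avg = conn.execute(
    "SELECT name FROM employees e "
    "WHERE salary > (SELECT AVG(salary) FROM employees WHERE dept = e.dept)"
).fetchall()

print(above_avg)       # employees above the company-wide average
print(above_dept_avg)  # employees above their own department's average
```

The correlated version typically costs more, since the inner query depends on each outer row instead of being computed once.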
### Set Operators

- `UNION` (distinct)
- `UNION ALL`
- `INTERSECT`
- `EXCEPT`

`UNION` also removes duplicates (like `DISTINCT`), and in some engines the result comes back sorted by the first column of the result table when concatenating tables (don't rely on this; use `ORDER BY`).

# Day 4: Intermediate Pandas

## Data Analysis steps:

Get data -> read data -> decide what to do with the data (create the big question) to know which columns/rows need cleaning -> clean data -> analyze data (query, create some basic aggregated numbers, create new columns with modified data...) + visualize data.

Cheat sheets -> not updated.
Library documentation -> always updated.

## Pandas Components:

### 1. Series:

- Each column is a Series; it works like a 1D NumPy array, but a little more advanced.
- Each Series can only have 1 data type at a time.

**Implicit index:** positional, `0, 1, 2, ...` (negative positions like `-1, -2` work in slicing).
**Explicit index:** the index can be assigned as anything.

A Series has an **explicit index**: data inside a Series can be accessed by any type of index label (string, number), and the index of a Series can be assigned.

![](https://i.imgur.com/w6ihtLI.png)

After the index of a Series is re-assigned with new string labels, values can be looked up by those string labels -> **similar to how a dictionary works**.

**Series as specialized dictionary**

![](https://i.imgur.com/j3EZbsy.png)

**Can do slicing!!!**

![](https://i.imgur.com/qjpo0QP.png)

Data inside a NumPy array can only be accessed by position; NumPy indices can't be re-assigned (always `0, 1, 2, ...`). **Series as generalized NumPy array.** The index can only be assigned to a Series when creating it.

### Create Series

![](https://i.imgur.com/s8fOs9b.png)

`pd.Series(data, index=index)` -> **can use a dictionary to define the index.**

When the data already defines an index (e.g. a dictionary), passing `index=` returns only the labels listed in `index=`. If a label in `index=` is not among the pre-defined labels (in `data`), its value will be `NaN`.

![](https://i.imgur.com/auY9FJg.png)

### 2. Index:

Each row/column has an index; however, "index" here refers to **rows** only.

**Index as immutable array**: similar to a 1D NumPy array.
Index is immutable, not changeable.

### 3. DataFrame:

Can create a DataFrame from a dictionary, or from multiple Series/NumPy arrays with the same index. Dictionary -> `key`s become columns, `value`s become rows.

## Data Exploration:

`df.T`: transpose -> e.g. turn timestamp column names into rows.

Most DataFrame functions default to `inplace=False`.

### Select Columns:

- `df['A']`
- `df[['A', 'B']]`

### Select Rows and/or Columns:

- `df.loc['20200531']` # using a label
- `df.loc[:, ['A', 'B']]` # by list of columns
- `df.loc['20200531', 'A']` # getting a scalar value at a specific position

`df.loc` -> only works with column/row labels.

`iloc`: **integer** **loc**ate. `df.iloc[row, column]` -> only works with integer column/row positions. Similar to slicing a 2D NumPy array.

- `df.iloc[3]` # 4th row
- `df.iloc[3:5, 0:2]` # slicing, similar to numpy/python
- `df.iloc[[1, 2], [1, 3]]` # by list of integer positions
- `df.iloc[2, 2]` # getting a value explicitly

`df.loc[df['B'] > 0]` works the same as `df[df['B'] > 0]`.

`df.drop(labels=df.index[:2], inplace=True)`: drop a list of index labels.

`df.sort_index(axis=1/0, ascending=False)`: sort by the column/row index.

`df.sort_values(by=['b', 'a', 'D'], ascending=[True, True, False])`: `ascending` can be a list; the sorting priority follows the list order.

## Data Manipulation:

### Missing Data

![](https://i.imgur.com/9NhUxP8.png)

Pandas turns both `None` and `np.nan` into `NaN`.

![](https://i.imgur.com/oZHfSFL.png)

### Methods on Null Values

* Delete the cases with missing observations. This is OK if it only causes the loss of a relatively small number of cases; it is the simplest solution.
* Fill in the missing value with the mean, mode, median, or another constant value.
* Use the rest of the data to predict the missing values by regression, KNN, ...
* Missing-observation correlation: consider just the (xi, yi) pairs with some observations missing. The means and SDs of x and y can be used in the estimate even when a member of a pair is missing.
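A minimal sketch of the first two strategies above (delete vs. fill with the mean), using a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [10.0, 20.0, np.nan, 40.0]})

# Strategy 1: delete rows that contain any NaN (simplest solution).
dropped = df.dropna()

# Strategy 2: fill each column's NaN with that column's mean.
filled = df.fillna(df.mean())

print(len(dropped))        # 2 rows survive (rows 0 and 3)
print(filled.loc[1, "a"])  # mean of [1, 3, 4] = 8/3
```

Which strategy is appropriate depends on how much data would be lost and whether the column distribution makes the mean a sensible stand-in.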
`data[data.isnull()]`: get the entries that are None/NaN.

`data.dropna(axis=1, how='any')`: `how='any'` drops columns that have at least 1 NaN.

`data.dropna(axis=1, how='all')`: `how='all'` drops columns whose values are all NaN.

`data.dropna(axis=1, thresh=2)`: `thresh=` integer; keep only columns with at least that many non-NaN values.

`data.fillna(method='ffill', axis=1)`: fill in NaN following a certain method; `ffill` = forward fill, `bfill` = backward fill.

`df['A'].apply(lambda x: x + 100)` = `df['A'] + 100`

`df = pd.concat([df1, df2], axis=1)`: puts df2 to the right side of df1.

## Tips & Tricks:

`mean()` on a column containing only the values `0` and `1` -> the result is the proportion of `1` values.

# Day 5: Data Visualization

## Types of Data:

- Ordinal: ranking
- Categorical: (can be) ranges of discrete values
- Continuous
- Discrete

## Types of Charts:

- **Summary chart:** takes time to read and understand every detail; may contain many data types (ordinal, categorical, continuous, discrete). Example:
    - Box-and-whisker plot ![](https://i.imgur.com/WCb2RLa.png)
- **Univariate:** describes one variable - the distribution of 1 variable. Most charts are univariate.
    - Histogram
    - Box-and-whisker plot: shows outliers and quartiles. Harder to read: you must understand the lower/upper whiskers and the IQR.
    - Bar chart: comparison between bars
    - Line chart: trend
    - Pie chart: hard to compare the (pie) sections.
    - 3D charts can create an optical illusion of bias.
- **Bivariate:** shows the relationship between two variables.
    - Scatter plot:
        - Correlation direction. ![](https://i.imgur.com/OZacD0h.png)
        - Correlation strength. ![](https://i.imgur.com/3XouRa8.png)
        - Correlation coefficient (r): the strength and direction of a linear relationship. `r = [-1, 1]`
            - Strong: 0.7 <= |r| <= 1.0
            - Moderate: 0.3 <= |r| < 0.7
            - Weak: 0.0 <= |r| < 0.3
    - Heat map
- **Multivariate:**
    - Breakdown dimension: series in charts.
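The correlation coefficient r above can be computed with NumPy's `corrcoef` (the sample data here is made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_linear = 2 * x + np.array([0.1, -0.1, 0.2, -0.2, 0.0])  # nearly a straight line
y_noisy = np.array([3.0, -1.0, 4.0, 1.0, 5.0])            # little linear pattern

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_noisy = np.corrcoef(x, y_noisy)[0, 1]

print(round(r_linear, 3))  # close to 1.0 -> strong positive correlation
print(round(r_noisy, 3))   # much smaller magnitude -> weaker correlation
```

r only measures *linear* relationships: a perfect curve (e.g. y = x^2 over symmetric x) can still have r near 0.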
## Matplotlib:

### Components

<img src="https://matplotlib.org/_images/anatomy.png"/>

Matplotlib needs a `figure` to draw on; charts live in subplots (aka "axes") drawn on a `figure`:

`figure` -> `subplot` ("axes") -> `charts`

A `figure` is like a big board that tells the computer where to draw.

`fig = plt.figure()`: an empty figure with no axes
`fig.suptitle('No axes on this figure')`: add a title so we know which figure it is
`fig, ax = plt.subplots(2, 3)`: a figure with a 2x3 grid of axes
`fig, ax = plt.subplots(figsize=(10,10))`: 1 chart, with the size in **inches**

Multiple charts can be drawn in 1 axes (subplot); `ax[row][column]` selects the subplot:

```
ax[0][0].plot(data, data)
ax[0][0].plot(data, data**2)
ax[0][0].set_title("Plot 1")
ax[0][1].plot(data, data**2)
ax[0][1].set_title("Plot 2")
```

OOP style:

```
fig, ax = plt.subplots(2, 2, figsize=(12,12))
fig.suptitle("plots")
ax[0][0].plot(data, data)
ax[0][0].plot(data, data**2)
ax[0][0].set_title("Plot 1")
ax[0][1].plot(data, data**2)
ax[0][1].set_title("Plot 2")
ax[1][0].plot(data, data**3)
ax[1][0].set_title("Plot 3")
ax[1][1].plot(data, data**4)
ax[1][1].set_title("Plot 4")
plt.show()
```

Drawing a single plot, MATLAB style:

```
plt.figure(figsize=(8, 8))
plt.plot(data, data**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title('Power of 3')
plt.legend()
plt.show()
```

![](https://i.imgur.com/zuSIvF1.png)

[Chart Gallery - Preset code](https://python-graph-gallery.com/)

[Options for styles](https://matplotlib.org/2.1.1/api/_as_gen/matplotlib.pyplot.plot.html):

* linestyle
* markers
* color

`plt.plot(t, t**2, 'b-o')` = `plt.plot(t, t**2, color='blue', linestyle='solid', marker='o')`

**[Preset Styles](https://matplotlib.org/3.3.1/gallery/style_sheets/style_sheets_reference.html):**

```
plt.style.available
plt.style.use('Solarize_Light2')
```

Text can be put inside charts with [ax.text()](https://matplotlib.org/3.3.0/api/_as_gen/matplotlib.axes.Axes.text.html).

[Save figure](https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.pyplot.savefig.html)
`fig.savefig()`. Example: `fig.savefig('sales.png', transparent=False, dpi=80, bbox_inches="tight")`

## Tips/Tricks:

`plt.axvline()`: draw a vertical line.

`plt.subplots(constrained_layout=True)` or `plt.tight_layout()`: no more overlapping axes on 1 figure.

![](https://i.imgur.com/oc6neDW.png)

## Pandas Plot:

`df.plot()`: draw a quick chart straight from pandas.

Visualization customization with Pandas plot is limited; for heavy customization, use matplotlib. The result of `df.plot` is a subplot (axes) -> matplotlib functions can be used on it to customize the visualization.
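A minimal sketch of that last point: `df.plot()` returns a matplotlib `Axes`, so ordinary matplotlib calls work on it afterwards. The sample data and file name are made up; the `Agg` backend is selected so no display window is needed:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"sales": [3, 5, 4, 8], "costs": [2, 3, 3, 5]})

ax = df.plot()                  # quick chart straight from pandas
ax.set_title("Sales vs Costs")  # ...then customize with matplotlib
ax.axvline(2, linestyle="--")   # e.g. mark a point in time
ax.figure.savefig("sales.png", dpi=80, bbox_inches="tight")
```

The same pattern works for `df.plot(kind="bar")`, `df['col'].hist()`, etc., which also return `Axes` objects.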