---
title: 'Tâm - Week 3'
tags: CoderSchool, Mariana
---
Week 3
===
## Table of Contents
[TOC]
## Monday
### Weekly project presentation
> Our group's (Long and Tam) GitHub repo and Heroku app are below (if there is an error, just reload the page)
https://github.com/thtamho/tiki_PostGreSQL_app
https://tiki-postgresql-app.herokuapp.com/
- Khoa, Dicky and Ly have a website that can filter data by rating/comments, newest (by id), and sale off (by discount)
- Nhan and Tien have a website that implements a search bar in addition to a top menu that shows categories
- Thich and Sean have descriptive statistics on the side
- Cuong and Felix have a details pop-up box that shows the statistics, a search bar, and a left-side menu with the main categories
:::info
**Lesson:** to prevent someone from typing "DROP DATABASE;" into the search bar, the input has to be wrapped as a string and stripped of ';' (`strip(';')`) before it reaches the query
:::
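A minimal sketch of that lesson (the `products` table and helper names are hypothetical; the robust defense is a parameterized query, where the database driver escapes the value itself):

```python
# Sketch: never build SQL by concatenating raw user input into the string.
# Stripping ';' blocks the naive "DROP DATABASE;" case, but passing the value
# as a parameter is what actually prevents SQL injection.

def sanitize_search_term(raw: str) -> str:
    """Strip whitespace and statement-terminating semicolons from the input."""
    return raw.strip().strip(';')

def build_query(term: str):
    # Parameterized form: %s is a placeholder filled in by the driver,
    # not by Python string formatting.
    sql = "SELECT * FROM products WHERE name ILIKE %s"
    params = ('%' + sanitize_search_term(term) + '%',)
    return sql, params

sql, params = build_query("tivi")
print(sql, params)  # SELECT * FROM products WHERE name ILIKE %s ('%tivi%',)
```

With psycopg2 this pair would be executed as `cursor.execute(sql, params)`, so even a semicolon left inside the term is treated as data, not as SQL.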
- Tan and Nam's website shows the main categories, the smaller categories, and finally the products of the smallest categories as we click through
- Bao and Natalie's website has an input box that can be used to type a raw SQL query
### Data Analysis
- Our team will explore global terrorism dataset from Kaggle.
- The candidate datasets were UFC fights, NYC Airbnb, NBA players, TED talks, global terrorism, and World Cup
- our team's notebook: https://colab.research.google.com/drive/1QA3N-P9Hsrtvio0vec-jE7yOUgsKCZtG
### Fundamental Statistics
> Lecture slides
https://www.beautiful.ai/player/-LuVIkNIgFI9K47ycWlU/Mariana-Week-3-Basic-Stats
- Categorical data: nominal data and ordinal data
- Numerical data: discrete (finite) and continuous (infinite)
- **Measure of central tendency:** mean (affected by outliers), median (the 50% percentile), mode (the highest frequency)
- **Measure of variability:** range (the maximum minus the minimum of a distribution), quartiles (give an overview of where data points sit in the distribution, e.g. the 90th percentile), box-and-whisker plot (outliers lie outside the whiskers: outlier < Q1 − 1.5·IQR or outlier > Q3 + 1.5·IQR, with IQR = Q3 − Q1)
- Variance is $\sigma^2 = \frac {\sum (x - \bar{x})^2}{N}$; the standard deviation $\sigma$ is $\sqrt{variance}$
- **Distribution**
- **Binomial distribution:** success/failure
- independent trials: the outcome of one trial does not affect the others; the two outcomes of a trial (success/failure) are mutually exclusive
- formula: $C_n^x\,p^x(1-p)^{n-x}$
- **Normal distribution:**

- This shows what percentage of the data lies within some number of standard deviations of the mean (the 68–95–99.7 rule)
- Skewness: mean compared with median -> left (mean<median) or right skew (mean>median)
- **Descriptive stats and inferential stats:** descriptive -> describe our data; inferential -> try to infer about the population from sample
- **Visualization techniques:** pie vs bar chart.
- histogram plot: show frequency distribution
- side-by-side bar chart: compare frequency distributions across groups
> D3.js tutorial links
https://www.freecodecamp.org/news/learn-d3-js-in-5-minutes-c5ec29fb0725/
https://www.tutorialsteacher.com/d3js
https://www.tutorialspoint.com/d3js/index.htm
- **Causality vs Correlation:** causality means one variable causes another variable; correlation means two variables moving together.
- **Basic steps of data analysis:**
- cleaning data
- pandas syntax: `df.shape` (attribute, no parentheses), `df.info()`
- imputation for missing values
- clean out missing values
- convert string values if necessary
- check for duplication
- compare median with mean
- what to do with outliers? Outliers tell us useful info about our data; sometimes they indicate a special phenomenon
- presentation: tell an interesting story
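The central-tendency and variability ideas above can be sketched in a few lines of pandas (the data here is a made-up toy series; note how the single outlier pulls the mean but not the median, and how the 1.5·IQR fences flag it):

```python
import numpy as np
import pandas as pd
from math import comb

data = pd.Series([2, 3, 3, 4, 5, 6, 7, 8, 9, 40])  # 40 is an outlier

# Central tendency: the mean is pulled up by the outlier, the median is not
print(data.mean(), data.median(), data.mode()[0])  # 8.7 5.5 3

# Variability: quartiles and the 1.5*IQR whisker rule
q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(list(outliers))  # [40]

# Population variance and standard deviation (ddof=0)
variance = ((data - data.mean()) ** 2).sum() / len(data)
std = np.sqrt(variance)

# Binomial: P(x successes in n trials) = C(n, x) * p^x * (1-p)^(n-x)
n, x, p = 10, 3, 0.5
print(comb(n, x) * p**x * (1 - p)**(n - x))  # 0.1171875
```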
## Tuesday
### Pandas intro
- Difference between a Python list and a NumPy array:
- element types can vary in a list, whereas a NumPy array holds only one kind of element
- data will be converted to the same type if possible upon array creation
- a list element is a reference to an object in memory: the list stores addresses of objects, and each reference costs 8 bytes
- the array stores the values themselves
- operations on array will perform **element-wise**
- pandas is built on top of numpy
- an image is a 3-dimensional piece of data, H×W×(RGB channels)
- a Series can be understood as a one-dimensional array with flexible indices. Flexible indices mean we can use data types other than integers as the index
- explicit indexing: `series['item 1']`
- implicit indexing: `series[0]`
- slicing: `series[1:]` or `series['Lele':]`
- fancy indexing: `series[['Uke','Lele']]`
- filtering/boolean indexing `series[series == 1]`
- in pandas the higher priority indexing system is explicit indexing
- loc -> explicit indexing. iloc -> implicit indexing (integer location)
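The indexing styles above, in one runnable sketch (index labels reuse the note's `'Uke'`/`'Lele'` examples; values are made up). Note that label slicing with `.loc` includes the end point, while positional slicing does not:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['Uke', 'Lele', 'Gui'])

print(s['Uke'])    # explicit indexing by label -> 1
print(s.iloc[0])   # implicit (positional) indexing -> 1

# Label slicing is inclusive of the end point; positional slicing is not
print(list(s.loc['Lele':'Gui']))  # [2, 3]
print(list(s.iloc[1:2]))          # [2]

# Fancy indexing and boolean filtering
print(list(s[['Uke', 'Lele']]))   # [1, 2]
print(list(s[s == 1]))            # [1]
```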
**DataFrame creation**
- we can create a DataFrame from a dictionary of lists; each list has to be the same size, and the data has to be in the correct order
- we can also create a DF from a dictionary of dictionaries or a dictionary of pandas Series; we can have missing data (NaN) this way
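Both construction styles side by side (toy column names and values; note how the dict-of-Series version aligns on the index and leaves a NaN where a Series has no entry):

```python
import pandas as pd

# From a dict of equal-length lists: order matters, no gaps allowed
df1 = pd.DataFrame({'name': ['Uke', 'Lele'], 'price': [50, 80]})
print(df1.shape)  # (2, 2)

# From a dict of Series: indices are aligned, missing entries become NaN
prices = pd.Series({'Uke': 50, 'Lele': 80})
ratings = pd.Series({'Uke': 4})           # no rating for 'Lele'
df2 = pd.DataFrame({'price': prices, 'rating': ratings})
print(df2['rating'].isnull().sum())       # 1 missing value
```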
**Dataselection in DF**
- each column in a DF is a Series. We can select a column by name: `df['Column1']` or `df.Column1` (we avoid the latter)
- loc: `df.loc['row1','column1']`
- iloc: `df.iloc[0,0]`
- filtering: `df[df['CatServant'] == 'Natalie']`
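The selection methods above in one sketch (the `'CatServant' == 'Natalie'` filter comes from the note; the rest of the toy data is made up):

```python
import pandas as pd

df = pd.DataFrame({'CatServant': ['Natalie', 'Bao'], 'Cats': [2, 1]},
                  index=['row1', 'row2'])

print(df['CatServant'])                    # column selection -> a Series
print(df.loc['row1', 'CatServant'])        # explicit labels -> 'Natalie'
print(df.iloc[0, 0])                       # implicit positions -> 'Natalie'
print(df[df['CatServant'] == 'Natalie'])   # boolean filtering -> 1 row
```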
### Data Operation in Pandas
- read csv into pandas by `pd.read_csv`
- `df.shape` (attribute, no parentheses)
- `df.head()`
- `df.tail()`
- `df.sample()`
- `df.info()`
- `df.describe()`
- `str.lower()` lowercases strings
- why don't `and`/`or` work in multiple-condition filtering? They need a single truth value, and the truth value of a whole Series is ambiguous; we use the element-wise operators `&` and `|` instead (with parentheses around each condition)
- We have to be careful when subsetting data with multiple conditions: apply the conditions in order. For example, in the Titanic dataset, to put an age condition on the non-survived passengers, we apply the age condition to the dataframe that has already been filtered by the survived condition.
- `df.groupby()`
- `value_counts` will count the unique values
- we can apply `str.split` to a column and count the number of unique values, e.g. `df['Name'].str.split(".").str[0].str.split(', ').str[1].value_counts()`. Notice we have to use `str[0]` to select the first element of each split result
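That chained `str.split` line, run on a few sample Titanic-style names so the intermediate steps are visible (the names here are just a hand-typed sample, not the real file):

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley',
    'Heikkinen, Miss. Laina',
    'Allen, Mr. William Henry',
])

# Step 1: split on '.' and keep the left part -> 'Braund, Mr'
# Step 2: split that on ', ' and keep the right part -> 'Mr'
titles = names.str.split('.').str[0].str.split(', ').str[1]
print(titles.value_counts())
# Mr      2
# Mrs     1
# Miss    1
```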
- `apply` method: apply a function to every value in the dataframe column `df['Col'].apply(func)`
- we can delete columns or rows with `df.drop(['label1','label2'], axis=1, inplace=True)`; axis=1 indicates columns, axis=0 indicates rows, and inplace can be True or False
- `df['col'] = df['col'].astype('category')` adjusts the type of a column (`astype` has no `inplace` parameter, so assign the result back)
- what about missing data:
```python=
# Replace missing data in the Age column with its mean (assign the result back)
df['Age'] = df['Age'].fillna(df['Age'].mean())
```
`df['col'].notnull()` `df['col'].isnull()`
- you cannot do `None + 1`, but you can do `np.nan + 1`, which returns `np.nan`
:::info
**Casting:** python will cast the type of an integer (array) to float if we try to add a float element
:::
- we can use `df.replace(to_replace=..., value=...)`, or we can build a dictionary with the `to_replace` values as keys and the replacements as values and pass it directly: `df.replace(dict)`
- we can use `df.fillna(value)` method on object type and numeric type
`df.mode(axis=0, numeric_only=False, dropna=True)` returns the most frequent values, which can be used to replace NaN in a df (it returns a DataFrame, so e.g. `df.fillna(df.mode().iloc[0])`)
`inplace=True` makes a method return `None`, so to keep chaining (e.g. into `df.apply`) we have to leave `inplace=False`
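The missing-data tools above in one sketch (toy `Age`/`Sex` columns shaped like the Titanic data): `fillna` with the mean on a numeric column, `fillna` with the mode on an object column, and a dict-based `replace`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, np.nan, 38.0, np.nan],
                   'Sex': ['male', 'female', None, 'female']})

# fillna returns a new object (inplace=False), so assign it back
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'].tolist())  # [22.0, 30.0, 38.0, 30.0]

# fillna works on object columns too, e.g. with the mode (most frequent value)
df['Sex'] = df['Sex'].fillna(df['Sex'].mode()[0])

# replace accepts a dict of {to_replace: value}
df['Sex'] = df['Sex'].replace({'male': 'M', 'female': 'F'})
print(df['Sex'].tolist())  # ['M', 'F', 'F', 'F']
```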
- **lambda function:** lambda function is an anonymous function `lambda x: func(x)` and use in `df.apply`
- **Cutting df into parts:**
`pd.cut(x=array-like, bins=num_bins)` cuts into equal-width bins; `pd.qcut(x=1-d ndarray or Series, q=int number of quantiles)` cuts into equal-frequency bins
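A quick contrast of the two on a toy age column (equal-width bins vs equal-frequency quartiles; labels are hypothetical):

```python
import pandas as pd

ages = pd.Series([5, 15, 25, 35, 45, 55, 65, 75])

# cut: 4 equal-width bins over the value range
bins = pd.cut(ages, bins=4)
print(bins.value_counts().tolist())  # [2, 2, 2, 2]

# qcut: 4 equal-frequency bins (quartiles), here with custom labels
quartiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(quartiles.tolist())  # ['Q1', 'Q1', 'Q2', 'Q2', 'Q3', 'Q3', 'Q4', 'Q4']
```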
### Seaborn
`import seaborn as sns`
- distribution plot: `sns.distplot(series or array, kde=True)`
- side-by-side bar chart: `sns.countplot(data=df,x='Sex',hue='Survived')`
**Titanic dataset prediction:** if we predict survival by gender alone, we can get accuracy of more than 70%
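That gender baseline is just a rule, not a model, so it fits in a few lines (the data here is a tiny made-up sample in the shape of the Titanic columns, not the real file; on the real dataset the note reports >70% accuracy):

```python
import pandas as pd

# Made-up sample with Titanic-shaped columns
df = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'male'],
    'Survived': [1,         1,        0,        0,      0,      1,      0,      0],
})

# Baseline rule: predict survival for women, death for men
pred = (df['Sex'] == 'female').astype(int)
accuracy = (pred == df['Survived']).mean()
print(accuracy)  # 0.75 on this sample
```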
## Wednesday
### DataViZ with Matplotlib
- `import matplotlib.pyplot as plt`
- **matplotlib.pyplot.bar:** `plt.bar(x=x_data, height=y_data)`
- Normalize user input to CAPITAL CASE (e.g. `input().upper()`) so comparisons are case-insensitive
- show plot with `plt.show()`
### Matplotlib OO style
- OO style: create the `Figure` and `Axes` objects explicitly (`fig, ax = plt.subplots()`) and call methods on them instead of relying on `plt.*` global state
- the `eval` function evaluates a Python expression stored in a string
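A minimal OO-style sketch (the chart data and title are made up; the `Agg` backend renders without a display, and the last line just demonstrates `eval` on an expression string):

```python
import io
import matplotlib
matplotlib.use('Agg')          # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# OO style: explicit Figure and Axes instead of plt.* global state
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(['A', 'B', 'C'], [3, 7, 5])
ax.set_title('OO-style bar chart')
ax.set_xlabel('category')
ax.set_ylabel('count')

buf = io.BytesIO()
fig.savefig(buf, format='png')  # render to memory instead of a file

# eval evaluates a Python expression held in a string
print(eval("ax.get_title()"))  # OO-style bar chart
```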
### Things I learn doing data preprocessing
- Change a single value in a pandas DataFrame with `df.at[row_label, col_label] = value`
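A quick sketch of `.at` (toy index labels and column; `.iat` is the positional counterpart):

```python
import pandas as pd

df = pd.DataFrame({'Rating': [4.5, 3.8]}, index=['VN', 'US'])

# .at gets/sets a single scalar by labels (fast path for one cell)
df.at['VN', 'Rating'] = 4.9
print(df.at['VN', 'Rating'])  # 4.9

# .iat is the same idea with integer positions
df.iat[1, 0] = 4.0
print(df['Rating'].tolist())  # [4.9, 4.0]
```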
## Thursday
### Geopandas
- Importing geopandas
```python
import geopandas as gpd
import geoplot as gplt
```
- read data
```python
geo_data = gpd.read_file('/content/drive/My Drive/CoderSchool/CoderSchool-Mariana/datasets/geo_data/ne_10m_admin_0_countries.shp')
```
- Convert pandas into geopandas
```python
plot_data = gpd.GeoDataFrame(plot_data, geometry='geometry')
```
- Plot with geopandas
```python
gplt.choropleth(plot_data, hue='Rating', cmap='Blues', figsize=(20, 10))
for _, data in plot_data.iterrows():
    plt.text(x=data['coords'][0], y=data['coords'][1],
             s=data['country'], ha='center', color='red')
    plt.text(x=data['coords'][0], y=data['coords'][1] - 2,
             s=f"Rating: {data['Rating']:.2f}", ha='center', color='red')
```
- POLYGON is a shape in geopandas
## Friday
### Google Data Studio
- GDS is really convenient for analyzing data. Making charts is really fast with drag and drop
- A new Data Field can be made like
```sql
CASE
WHEN state='successful' THEN 1
ELSE 0
END
```
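The same derived field as that CASE expression, written in pandas (toy `state` values; `np.where` plays the role of CASE/WHEN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['successful', 'failed', 'successful', 'canceled']})

# CASE WHEN state='successful' THEN 1 ELSE 0 END
df['is_successful'] = np.where(df['state'] == 'successful', 1, 0)
print(df['is_successful'].tolist())  # [1, 0, 1, 0]
```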

:::info
**Find this document incomplete?** Leave a comment!
:::
###### tags: `CoderSchool` `Mariana` `MachineLearning`