---
title: 'Tâm - Week 3'
tags: CoderSchool, Mariana
---

Week 3
===

## Table of Contents
[TOC]

## Monday

### Weekly project presentation

> Our group (Long, Tâm): the GitHub repo and Heroku app are at (if there is an error, just reload the website)
> https://github.com/thtamho/tiki_PostGreSQL_app
> https://tiki-postgresql-app.herokuapp.com/

- Khoa, Dicky, and Ly have a website that can filter data by rating/comments, newest (by id), and sales off (by discount)
- Nhan and Tien have a website that implements a search bar in addition to a top menu that shows categories
- Thich and Sean have descriptive statistics on the side
- Cuong and Felix have a details pop-up box that shows the statistics, a search bar, and a left side menu with the main categories

:::info
**Lesson:** To stop someone from typing "DROP DATABASE;" into the search bar, the input has to be sanitized, e.g. wrapped as a string and stripped of ';'
:::

- Tan and Nam's website shows the main categories, then the smaller categories, and finally the products of the smallest category as we click through
- Bao and Natalie's website has an input box that accepts a raw SQL query

### Data Analysis

- Our team will explore the Global Terrorism dataset from Kaggle.
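As a sketch of the sanitization lesson above: rather than stripping characters, the usual fix is a parameterized query, which makes the driver treat user input as data instead of SQL. This example uses `sqlite3` as a stand-in for the PostgreSQL app, with a hypothetical `products` table:

```python
import sqlite3

# In-memory database standing in for the app's PostgreSQL backend
# (hypothetical schema and rows, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES ('Uke', 10.0), ('Lele', 20.0)")

def search_products(conn, term):
    # Parameterized query: the `?` placeholder escapes `term`, so input like
    # "'; DROP TABLE products; --" is matched as plain text, never executed.
    cur = conn.execute(
        "SELECT name, price FROM products WHERE name LIKE ?",
        (f"%{term}%",),
    )
    return cur.fetchall()

print(search_products(conn, "Uke"))                      # normal search
print(search_products(conn, "'; DROP TABLE products;"))  # no match, table survives
```

The same placeholder style works with PostgreSQL drivers such as `psycopg2` (which uses `%s` instead of `?`).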
- Other options were: UFC fights, NYC Airbnb, NBA players, TED talks, global terrorism, World Cup
- Our team's notebook: https://colab.research.google.com/drive/1QA3N-P9Hsrtvio0vec-jE7yOUgsKCZtG

### Fundamental Statistics

> Lecture slides
> https://www.beautiful.ai/player/-LuVIkNIgFI9K47ycWlU/Mariana-Week-3-Basic-Stats

- Categorical data: nominal data and ordinal data
- Numerical data: discrete (finite) and continuous (infinite)
- **Measures of central tendency:** mean (affected by outliers), median (the 50th percentile), mode (the most frequent value)
- **Measures of variability:** range (maximum minus minimum of a distribution), quartiles (give an overview of where data points sit in the distribution, e.g. the 90th percentile), box-and-whisker plot (outliers lie outside the whiskers: outlier < Q1 - 1.5·IQR or outlier > Q3 + 1.5·IQR, with IQR = Q3 - Q1)
- Variance is $\frac{\sum (x - \bar{x})^2}{N}$; the standard deviation $\sigma$ is $\sqrt{variance}$
- **Distributions**
    - **Binomial distribution:** success/failure
        - assumes independent trials
        - formula: $C_n^x \, p^x (1-p)^{n-x}$
    - **Normal distribution:**
        ![image alt](https://miro.medium.com/max/24000/1*IZ2II2HYKeoMrdLU5jW6Dw.png)
        - This shows the share of the data lying within some number of standard deviations of the mean
- Skewness: compare the mean with the median -> left skew (mean < median) or right skew (mean > median)
- **Descriptive stats vs inferential stats:** descriptive -> describe our data; inferential -> try to infer properties of the population from a sample
- **Visualization techniques:** pie vs bar chart
    - histogram: shows a frequency distribution
    - side-by-side bar chart

> D3.js tutorial links
> https://www.freecodecamp.org/news/learn-d3-js-in-5-minutes-c5ec29fb0725/
> https://www.tutorialsteacher.com/d3js
> https://www.tutorialspoint.com/d3js/index.htm

- **Causality vs correlation:** causality means one variable causes another; correlation means two variables move together.
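The central-tendency and variability measures above can be computed in plain Python. A minimal sketch with a made-up sample (`statistics.quantiles` needs Python 3.8+):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9, 30]  # made-up sample; 30 is an outlier

mean = statistics.mean(data)      # affected by the outlier 30
median = statistics.median(data)  # the 50th percentile
mode = statistics.mode(data)      # the most frequent value

# Population variance and standard deviation, matching the formulas above
variance = sum((x - mean) ** 2 for x in data) / len(data)
std = variance ** 0.5

# Quartiles and the 1.5*IQR fences used by box-and-whisker plots
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(mean, median, mode, outliers)
```

Here the mean (about 7.8) is pulled well above the median (5) by the single outlier, which is exactly the sensitivity the notes mention.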
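The binomial formula $C_n^x p^x (1-p)^{n-x}$ from the notes above can be sketched directly with `math.comb` (Python 3.8+):

```python
from math import comb

def binomial_pmf(x, n, p):
    # C(n, x) * p^x * (1 - p)^(n - x): probability of exactly x successes
    # in n independent trials, each with success probability p
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of exactly 3 heads in 5 fair coin flips:
# C(5,3) / 2^5 = 10/32 = 0.3125
p_three_heads = binomial_pmf(3, 5, 0.5)

# Sanity check: the probabilities over all outcomes sum to 1
total = sum(binomial_pmf(x, 10, 0.3) for x in range(11))
```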
- **Basic steps of data analysis:**
    - cleaning data
        - pandas syntax: `pd.DataFrame.shape`, `pd.DataFrame.info()`
        - imputation for missing values
        - clean out missing values
        - convert string values if necessary
        - check for duplication
    - compare the median with the mean
    - what to do with outliers? Outliers tell us useful things about our data; sometimes they capture a special phenomenon
    - presentation: tell an interesting story

## Tuesday

### Pandas intro

- Differences between a Python list and a NumPy array:
    - element types can vary within a list, whereas a NumPy array only holds one kind of element
        - data will be converted to a common type, if possible, upon array creation
    - a list element is a reference to an object in memory: the list contains the addresses of objects, and each reference costs 8 bytes
    - an array contains the objects themselves
    - operations on an array are performed **element-wise**
- pandas is built on top of NumPy
- an image is a 3-dimensional piece of data, H x W x (RGB)
- a Series can be understood as a one-dimensional array with flexible indices; flexible indices mean we can use data types other than integers as the index
    - explicit indexing: `series['item 1']`
    - implicit indexing: `series[0]`
    - slicing: `series[1:]` or `series['Lele':]`
    - fancy indexing: `series[['Uke','Lele']]`
    - filtering/boolean indexing: `series[series == 1]`
    - in pandas, explicit indexing takes priority
    - `loc` -> explicit indexing; `iloc` -> implicit indexing (integer location)
- **DataFrame creation**
    - we can create a DataFrame from a dictionary of lists; each list has to be the same length, and the data has to be in the correct order
    - we can also create a DataFrame from a dictionary of dictionaries or a dictionary of pandas Series; this way we can have missing data (NaN)
- **Data selection in a DataFrame**
    - each column in a DataFrame is a Series.
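A quick illustration of the list-vs-array differences described above:

```python
import numpy as np

py_list = [1, 2, 3]
arr = np.array([1, 2, 3])

# Lists repeat/concatenate; arrays operate element-wise
doubled_list = py_list * 2   # [1, 2, 3, 1, 2, 3]
doubled_arr = arr * 2        # array([2, 4, 6])

# Mixed types are converted to a common type on array creation
mixed = np.array([1, 2.5])   # both elements become float64
```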
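The indexing styles above can be sketched on a toy Series (the 'Uke'/'Lele' labels follow the notes; the values are made up):

```python
import pandas as pd

s = pd.Series([3, 7, 5], index=['Uke', 'Lele', 'Guitar'])

explicit = s['Lele']      # explicit (label) indexing -> 7
implicit = s.iloc[1]      # implicit (position) indexing -> 7

# loc slices are inclusive of the end label; iloc follows Python slicing
label_slice = s.loc['Uke':'Lele']   # includes both 'Uke' and 'Lele'
pos_slice = s.iloc[0:2]             # positions 0 and 1 only

fancy = s[['Uke', 'Lele']]          # fancy indexing with a list of labels
filtered = s[s > 4]                 # boolean indexing -> 'Lele' and 'Guitar'
```

Using `.loc`/`.iloc` explicitly avoids the ambiguity of plain `series[0]` when the index itself contains integers.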
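A sketch of the two DataFrame-creation patterns above, with made-up data, showing how a dictionary of Series aligns by label and fills gaps with NaN:

```python
import pandas as pd
import numpy as np

# From a dictionary of lists: lists must be the same length,
# and rows are matched by position
df1 = pd.DataFrame({'name': ['Uke', 'Lele'], 'price': [10, 20]})

# From a dictionary of Series: entries are aligned by index label,
# and any missing entry becomes NaN
price = pd.Series({'Uke': 10, 'Lele': 20})
stock = pd.Series({'Uke': 5})          # no entry for 'Lele'
df2 = pd.DataFrame({'price': price, 'stock': stock})
# df2.loc['Lele', 'stock'] is NaN
```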
    - we can select a column by name: `df['Column1']` or `df.Column1` (we avoid the latter)
    - `loc`: `df.loc['row1','column1']`
    - `iloc`: `df.iloc[0,0]`
    - filtering: `df[df['CatServant'] == 'Natalie']`

### Data Operations in Pandas

- read a CSV into pandas with `pd.read_csv`
- `df.shape`, `df.head()`, `df.tail()`, `df.sample()`, `df.info()`, `df.describe()`
- `str.lower()` lowercases strings
- why don't `and`/`or` work for multi-condition filtering? They don't operate on arrays of truth values; we use the bitwise operators `&` and `|` instead
- We have to be careful when subsetting data with multiple conditions: apply them in order. For example, in the Titanic dataset, to put an age condition on the non-survived passengers, we apply the age condition to the DataFrame that has already been filtered on the survival condition.
- `df.groupby()`
- `value_counts()` counts the unique values
- we can apply `str.split` to a column and count the number of unique values, e.g. `df['Name'].str.split(".").str[0].str.split(', ').str[1].value_counts()`.
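The `&`-based filtering and ordered subsetting above can be sketched on a tiny made-up Titanic-like frame:

```python
import pandas as pd

# Made-up rows in the Titanic layout
df = pd.DataFrame({
    'Name': ['Braund, Mr. Owen', 'Cumings, Mrs. John', 'Heikkinen, Miss. Laina'],
    'Survived': [0, 1, 1],
    'Age': [22.0, 38.0, 26.0],
})

# `and`/`or` raise an error on Series; use & and | with parentheses
older_survivors = df[(df['Survived'] == 1) & (df['Age'] >= 30)]

# Subsetting step by step: filter on survival first,
# then apply the age condition to that already-filtered frame
survivors = df[df['Survived'] == 1]
young_survivors = survivors[survivors['Age'] < 30]
```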
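The `str.split` chain quoted above extracts the passenger title (the text after ', ' and before the first '.'); it can be checked on a few made-up names:

```python
import pandas as pd

df = pd.DataFrame({'Name': [
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley',
    'Heikkinen, Miss. Laina',
    'Allen, Mr. William Henry',
]})

# 'Braund, Mr. Owen Harris' -> split('.') -> ['Braund, Mr', ' Owen Harris']
# -> str[0] -> 'Braund, Mr' -> split(', ') -> ['Braund', 'Mr'] -> str[1] -> 'Mr'
titles = df['Name'].str.split('.').str[0].str.split(', ').str[1]
counts = titles.value_counts()
# counts: Mr -> 2, Mrs -> 1, Miss -> 1
```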
- Notice we have to use `str[0]` to select the first element of the split result
- `apply` method: apply a function to every value in a DataFrame column: `df['Col'].apply(func)`
- we can delete columns or rows with `df.drop(['colorrow1','colorrow2'], axis=1, inplace=True)`; axis 1 indicates columns, axis 0 indicates rows; `inplace` can be True or False
- `df['col'] = df['col'].astype('category')` adjusts the type of a column (`astype` has no `inplace` parameter, so we assign the result back)
- what about missing data?
```python
# Replace missing data in the Age column with its mean
# (assign the result back, e.g. df['Age'] = ..., to keep it)
df['Age'].fillna(df['Age'].mean())
```
`df['col'].notnull()`, `df['col'].isnull()`
- you cannot do `None + 1`, but you can do `numpy.nan + 1`, which returns `np.nan`

:::info
**Casting:** python will cast the type of an integer (array) to float if we try to add a float element
:::

- we can use `df.replace(to_replace=..., value=...)`, or we can build a dictionary with `to_replace` entries as keys and `value` entries as values and pass it directly: `df.replace(dict)`
- we can use the `df.fillna(value)` method on both object-type and numeric columns
- `df.mode(axis=0, numeric_only=False, dropna=True)` gives the most frequent values, which can be used to replace NaN in a DataFrame
- with `inplace=True` a method returns nothing, so to keep working with the result (e.g. chaining into `df.apply`) we have to set `inplace` to `False`
- **lambda functions:** a lambda is an anonymous function, `lambda x: func(x)`, often used with `df.apply`
- **Cutting a DataFrame into parts:** `pd.cut(x=array-like, bins=num_bins)` and `pd.qcut(x=1d ndarray or Series, q=int number of quantiles)`

### Seaborn

`import seaborn as sns`

- distribution plot: `sns.distplot(series or array, kde=True)`
- side-by-side bar chart: `sns.countplot(data=df, x='Sex', hue='Survived')`
- **Titanic dataset prediction:** if we predict survival by gender alone, we can get an accuracy above 70%

## Wednesday

### DataViz with Matplotlib

- `import matplotlib.pyplot as plt`
- `matplotlib.pyplot.bar(x=x_data, height=y)`
- set user input to UPPER CASE
- show the plot with `plt.show()`

### Matplotlib OO style

- `eval` function

### Things I learned doing data preprocessing

- Change a value in a pandas DataFrame with `df.at[]`

## Thursday

### Geopandas

- Importing geopandas
```python
import geopandas as gpd
import geoplot as gplt
```
- read data
```python
geo_data = gpd.read_file('/content/drive/My Drive/CoderSchool/CoderSchool-Mariana/datasets/geo_data/ne_10m_admin_0_countries.shp')
```
- Convert a pandas DataFrame into a GeoDataFrame
```python
plot_data = gpd.GeoDataFrame(plot_data, geometry='geometry')
```
- Plot with geopandas
```python
gplt.choropleth(plot_data, hue='Rating', cmap='Blues', figsize=(20, 10))
for _, data in plot_data.iterrows():
    plt.text(x=data['coords'][0], y=data['coords'][1], s=data['country'], ha='center', color='red')
    plt.text(x=data['coords'][0], y=data['coords'][1] - 2, s=f"Rating: {data['Rating']:.2f}", ha='center', color='red')
```
- POLYGON is a geometry type in geopandas
- ![](https://i.imgur.com/KavnzoT.png)

## Friday

### Google Data Studio

- GDS is really convenient for analyzing data; making charts is fast with drag and drop
- A new data field can be made like:
```sql
CASE WHEN state = 'successful' THEN 1 ELSE 0 END
```
![](https://i.imgur.com/V1xVu6O.png)

:::info
**Find this document incomplete?** Leave a comment!
:::

###### tags: `CoderSchool` `Mariana` `MachineLearning`