---
tags: coderschool, note, week 3, statistics
---

# Week 3: Data Visualization and Analysis

Quote of the Week
> Correlation does not imply causation
>
> Garbage in, garbage out
>
> Graphs that take 2 seconds to understand are :shit:

___

**Good tips:** How to read documentation?
* Check the object before the function
* Check the definition
* Check the returns
* Check the parameters later for additional information, and the examples for details

___

**Table of Contents**

[TOC]

## Week 2 Homework

* Check Team Dicky-Ly-Khoa's web scraper: https://github.com/TranLySFW/mariana_w1_project
* Check Bao's scraper: https://github.com/brookyct95/tiki_scraping
* Check Cuong-Felix's UI: https://github.com/FelixBecquart1990/Tiki_scraping_Cuong_Felix
* Check Tien's solution: https://github.com/txtien/tiki-scraping
* Check Tam's scraper: https://github.com/thtamho/tiki_PostGreSQL_app

___

## Day 1: Fundamental Statistics

**My team this week:** **Young Buffalo** (Tuc - Ly - Tam)

Data analysis: https://www.kaggle.com/START-UMD/gtd

**Weekly homework**
Each team has been given a different dataset to analyze and will present its findings and the story behind the data next Monday.

### Statistics: What to know

Reference: https://www.beautiful.ai/player/-LuVIkNIgFI9K47ycWlU/Mariana-Week-3-Basic-Stats
For further study: https://seeing-theory.brown.edu/?fbclid=IwAR2NSvrkF_JfHMZA3c3bC4aNz7iHXo3BoDvxPOg9snwkuJVZwW_X469kx4w

#### 1. What are types of data?

* What are categorical data and numerical data?
    * **Categorical data** describe categories or groups and come in 2 types: Nominal & Ordinal.
        * Nominal data represent discrete units and are used to label variables with no quantitative value and no order. Example: Genders (Male, Female), Languages (English, French, etc.)
        * Ordinal data are similar to nominal data but with ordering.
          Example: Satisfaction level (1 - Totally Satisfied, 2 - Satisfied, 3 - Neutral, 4 - Dissatisfied, 5 - Totally Dissatisfied)
    * **Numerical data** are measurable and can be expressed in numbers. They come in 2 types: Continuous & Discrete.
        * Continuous data can take any value within a range, so they are measured rather than counted. Example: weight, height, distance, area, etc.
        * Discrete data take separate, countable values. Example: shoe size, dice result, etc.

![](https://i.imgur.com/Oe9ehip.jpg)

#### 2. What is the measure of central tendency?

* What are Mean, Median, Mode?
    * **Mean**: the arithmetic average; it is affected by outliers
    * **Median**: the 50th percentile, i.e. the middle of the sorted dataset, dividing the distribution into 2 equal groups
    * **Mode**: the value with the highest frequency; on its own it gives little information

  Example:
  ![](https://i.imgur.com/e47LyWE.png)
* Outliers are values that lie far from the rest of the data
  ![](https://i.imgur.com/JiyVfBt.png)

#### 3. What is the measure of variability?

Measures of variability show how spread out the data are.

* What is range?
  ![](https://i.imgur.com/ETflmoc.png)
  The range is the difference between the lowest and highest values. It is useful for checking whether the data points are on the expected scale.
* What are quartiles?
  ![](https://i.imgur.com/nuj8eTO.png)
  Quartiles are the points that cut the sorted data into four equal parts (quarters). They are a special case of q-quantiles, with q = 4.
    * 1st quartile (Q1): 25%, the middle value of the first half
    * The median (Q2): 50%
    * 3rd quartile (Q3): 75%, the middle value of the second half
* How to detect outliers with quartiles?
  ![](https://i.imgur.com/NIuT4h8.png)
    * The interquartile range (IQR) is a measure of variability based on dividing a dataset into quartiles: IQR = Q3 - Q1
    * By the common rule of thumb, anything below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR can be considered an outlier
* What is variance?
  **Definition:** The average of the squared differences from the mean.
    * Variance measures how far each number in the set is from the mean and therefore from every other number in the set.
    * Variance amplifies the effect of large differences
    * A small variance means the data points stay close together

  ![](https://i.imgur.com/pdUIPqi.png)

  *For reference:*
  https://www.quora.com/Why-is-variance-squared
  https://stats.stackexchange.com/questions/118/why-square-the-difference-instead-of-taking-the-absolute-value-in-standard-devia
* What is standard deviation?
    * The standard deviation is a measure of how spread out the numbers are.
    * Its symbol is σ (the Greek letter sigma).
    * It is the square root of the variance.
    * Example for variance and standard deviation: https://www.mathsisfun.com/data/standard-deviation.html
* What is kernel density estimation?
  http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV0405/MISHRA/kde.html

#### 4. What is a distribution?

A distribution of statistical data shows all the possible values (or intervals) of the data and how often they occur (their probability).

* What is an independent event?
  Two events are independent of each other when the probability that one event occurs in no way affects the probability of the other event occurring.
* What is binomial distribution? Its formula?
  The binomial distribution describes the number of successes in a fixed number of independent trials, each with only two possible outcomes. For example, a coin toss has only two possible outcomes (heads or tails), and taking a test could have two possible outcomes (pass or fail).
    * Formula: $P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, where $n$ is the number of trials, $x$ the number of successes, and $p$ the probability of success in a single trial
* What is normal distribution?
  ![](https://i.imgur.com/FCdk4VU.png)
    * Also known as the Gaussian distribution
    * Mean = Median = Mode
    * The total area under the curve is 1
    * It is a continuous distribution
* What is continuous probability distribution?
  For further reading: https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/basic-statistics/probability-distributions/supporting-topics/basics/continuous-and-discrete-probability-distributions/
* What is skewness?
  Skewness indicates whether the data are concentrated on one side of the distribution; comparing the mean and the median gives a quick check.
  Example:
    * mean > median: positive or right skew
    * mean = median: symmetrical distribution
    * mean < median: negative or left skew

  ![](https://i.imgur.com/wwBNkjT.jpg)

#### 5. What is descriptive vs inferential?

* Descriptive statistics summarize a sample of data without making inferences from the sample to the whole population
* Inferential statistics use a random sample of data taken from a population to describe and make inferences about that population

![](https://i.imgur.com/tS5uWuZ.png)

### Visualization techniques: What to know

#### 1. What are categorical data representations?

* Frequency tables, pie charts, and bar charts are the most appropriate graphical displays for categorical variables
* Pie charts are not recommended in most cases (https://bernardmarr.com/default.asp?contentID=1779)

#### 2. What are numerical data representations?

For further reading: https://www2.le.ac.uk/offices/ld/resources/numerical-data/numerical-data

### Relationship between two variables

* What is correlation?
  Correlation simply means that there is some type of relationship between two variables: they tend to change together.
  For reference: https://www.displayr.com/what-is-correlation/
* Correlation is not causation.
  Causation is the relationship between cause and effect: when a cause results in an effect, that is causation. In other words, correlation between two events or variables only indicates that a relationship exists, whereas causation is more specific and says that one event actually causes the other.
  For reference: https://www.mathtutordvd.com/public/Why-Correlation-does-not-Imply-Causation-in-Statistics.cfm

![](https://i.imgur.com/giqNNhg.png)

### Basic steps of Data Analysis

**Step 1**
* Find the question
* Business application
* Is there sufficient data for the answer?


**Step 2**
* Clean the data: find missing values, check data types, miscoded data, and outliers

**Step 3**
* Univariate plots (histogram, distplot, boxplot)
* Correlation between variables (scatterplot, jointplot, kde plot)
* Feature engineering (using domain knowledge)

**Step 4**
* Provide a clear brief
* Explain the data
* Make the representations visible

___

## Day 2: Data Manipulation with Pandas

### Numpy Array vs Python List

NumPy is a Python library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

* What is the difference between Python lists and Numpy arrays?

| Python lists | Numpy arrays |
| -------- | -------- |
| printed as `[1, 2, 4, 5]` | printed as `[1 2 4 5]` |
| can hold many types | hold one type (dtype) |
| more memory overhead per element | compact, optimized memory |
| element-wise operations need loops | element-wise operations are built in |

Reference: https://medium.com/backticks-tildes/list-vs-array-python-data-type-40ac4f294551

* What is broadcasting with Numpy Arrays?

<iframe width="560" height="315" src="https://www.youtube.com/embed/tKcLaGdvabM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

https://machinelearningmastery.com/broadcasting-with-numpy-arrays/
https://www.geeksforgeeks.org/python-broadcasting-with-numpy-arrays/

### Pandas

* What is a Pandas Series?
  A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet.
  Reference: https://www.geeksforgeeks.org/python-pandas-series/
* What is a Pandas DataFrame?
  A Pandas DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
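
To make the two structures concrete, here is a minimal sketch on made-up values. The names `ages` and `people` and all the data are illustrative assumptions, not taken from the course material. It builds a Series with its index, then a DataFrame, and prints the pieces described above (data, row labels, column labels).

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels (made-up data)
ages = pd.Series([22, 38, 26], index=["a", "b", "c"], name="Age")

# A DataFrame: a table whose columns are Series sharing one index
people = pd.DataFrame(
    {"Name": ["An", "Binh", "Chi"], "Age": [22, 38, 26]},
    index=["a", "b", "c"],
)

print(ages.index.tolist())      # row labels of the Series
print(people.columns.tolist())  # column labels of the DataFrame
print(people.shape)             # (number of rows, number of columns)
print(type(people["Age"]))      # selecting one column gives back a Series
```

Selecting a single column with `people["Age"]` returns a Series, which is why a Series is often described as one column of a table.
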
Reference: https://www.geeksforgeeks.org/python-pandas-dataframe/

*Reference:*
What is lambda? https://medium.com/@luijar/understanding-lambda-expressions-4fb7ed216bc5
pd.cut vs. pd.qcut: https://pbpython.com/pandas-qcut-cut.html

### Operations

### Data Exploration

* Create a Pandas DataFrame: `pd.DataFrame()`
  Note: a DataFrame can be empty
* Count the number of rows: `len(df)`
* Show 5 random rows: `df.sample(5)`
* Show the first 10 rows: `df.head(10)`
* Show the last 15 rows: `df.tail(15)`
* Print a summary of the data: `df.info()`
* Convert a string column to lower case: `df["Name"].str.lower()`

### Data Pre-processing

* Apply a function to every value of a column: `df['Name'].apply(func)`
* Delete columns or rows: `df.drop(['colorrow1', 'colorrow2'], axis=1, inplace=True)`, with axis = 1 for columns and axis = 0 for rows
* Change the type of a column: `df['column'] = df['column'].astype("new_type")` (`astype` returns a new object and has no `inplace` option)
* Check for missing data: `df['column'].isnull()` returns a Series of True (missing) or False
* Fill in missing data: `df.fillna('value', inplace=True)`
  `df['Age'] = df['Age'].fillna(df['Age'].mean())` replaces missing data in the Age column with the column mean
* Check this case of the `.apply` method:
```
# Fill missing values in the Embarked column with the mode (hard-coded here as 'S'),
# then replace each code with the full port name.
replace_embark = {"C": "Cherbourg", "S": "Southampton", "Q": "Queenstown"}
replace_embark['C']  # looking up a single code gives 'Cherbourg'
df['Embarked'] = df['Embarked'].fillna('S').apply(lambda x: replace_embark[x])
```
* Application of `.qcut`:
```
import pandas as pd

# Bin the Fare column into 5 quantile-based categories (roughly equal counts per bin);
# the same call can be used for the Age column.
pd.qcut(df['Fare'], q=5).value_counts()
```
*Note: the argument passed to the function is each value of the column it is applied to*
```
def func(name):
    # Take the text between the comma and the first period of the name,
    # and strip the surrounding spaces
    return name.split('.')[0].split(',')[1].strip()

df['Title'] = df['Name'].apply(func)
```

### Data Visualization: Seaborn

Seaborn is a Python data visualization library based on matplotlib.
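
Seaborn's plotting functions take a DataFrame and column names, so the numbers behind a chart can be checked in pandas before plotting. The sketch below is only an illustration under assumptions: the column names follow the Titanic-style columns used in these notes, but the six rows are invented. It computes the per-group counts that a count plot with `hue` would draw as side-by-side bars.

```python
import pandas as pd

# Tiny made-up sample; only the column names mirror the notes
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age": [22.0, 38.0, None, 35.0, 54.0, 27.0],
})

# Fill the missing age with the column mean before comparing distributions
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Rows = x-axis categories, columns = hue levels, cells = bar heights
counts = pd.crosstab(df["Sex"], df["Survived"])
print(counts)

# Age summary per outcome, i.e. what two overlaid histograms would compare
print(df.groupby("Survived")["Age"].describe())
```

If the table looks wrong (unexpected categories, missing groups), the data are the likely cause, and it is easier to spot that here than in a finished chart.
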

* Distribution plot:
```
import seaborn as sns

survived = df[df['Survived'] == 1]
non_survived = df[df['Survived'] == 0]
# The notes used sns.distplot(..., kde=False); distplot is deprecated, histplot replaces it
sns.histplot(survived['Age'])
sns.histplot(non_survived['Age'])
```
* Side-by-side bar chart: `sns.countplot(data = df, x = 'Sex', hue = 'Survived')`

___

## Day 3: Data Visualization

Reference:
https://datavizcatalogue.com/
https://realpython.com/python-zip-function/

### What is matplotlib?

Matplotlib is a Python library for plotting graphs. It is one of the most widely used plotting libraries in Python.

**Note:** Matplotlib has two interfaces: an object-oriented (OO) interface and a MATLAB-style state-based interface.

Reference: https://matplotlib.org/3.1.1/tutorials/introductory/lifecycle.html
Reference: https://towardsdatascience.com/subplots-in-matplotlib-a-guide-and-tool-for-planning-your-plots-7d63fa632857

Check exercise to review: https://colab.research.google.com/drive/11UKGDHPy0XAiqeesCGD6DRNXyvrR7Ml-

### Example of Stacked Plot:

```
import matplotlib.pyplot as plt

# df is expected to hold one row per month and one column per product
month_list = df['month_number'].tolist()
data = {
    'Face Cream': df['facecream'].values,
    'Face Wash': df['facewash'].values,
    'Tooth Paste': df['toothpaste'].values,
    'Bathing Soap': df['bathingsoap'].values,
    'Shampoo': df['shampoo'].values,
    'Moisturizer': df['moisturizer'].values
}

fig, ax = plt.subplots()
# One stacked band per product, in the same order as the labels
ax.stackplot(month_list,
             data['Face Cream'],
             data['Face Wash'],
             data['Tooth Paste'],
             data['Bathing Soap'],
             data['Shampoo'],
             data['Moisturizer'],
             labels=data.keys(),
             colors=['m', 'c', 'r', 'k', 'g', 'y'])
ax.set(title='All Product Sales Data',
       xlabel='Month Number',
       ylabel='Sales Units in Number')
ax.legend(loc='upper left')
plt.show()
```

![](https://i.imgur.com/tnq5FYK.png)

___

## Day 4: Working with Geo Data

http://geopandas.org/mapping.html

* Polygons are the geometries GeoPandas uses to represent shapes (areas)
* Useful Pandas techniques:
```
# Rename all columns (one name per existing column)
df.columns = ['Maker', 'BarName']
# Count the missing values in each column
df.isnull().sum()
# Drop rows that contain null or NaN values
df.dropna(inplace=True)
# Get the sorted unique values of a column
df['BeanType'].sort_values().unique()
```
* Rework this exercise for better understanding: https://colab.research.google.com/drive/13qsHbeuoxVzB9si8moxQzoTXK4SXDRZk#scrollTo=1VYWGY5vNBBf
* Example of a map plotted with GeoPandas:
  ![](https://i.imgur.com/KWb7qKt.png)

___

## Day 5: Google Data Studio

* Recreate this report: https://datastudio.google.com/s/mCLFaAwi6Ew
* Useful templates: https://datastudio.google.com/gallery?category=community