# Week 3

## Day 1

### Data Analyst

Job description of a Data Analyst:

- Obtain data: using a query language such as SQL (for example, on BigQuery)
- Clean data: using tools like Python/Pandas to remove incorrect data and detect outliers
- Analyze data:
  - Exploratory Data Analysis (EDA) is an approach to analyzing data in order to summarize it and get insight from it
  - Feature Engineering is a technique to transform raw data and create features that help represent insight
  - Needed skills: Data Manipulation (Pandas), Data Visualization (Matplotlib, Seaborn)
- Visualize data: put data into visualization tools like Matplotlib, Seaborn, Data Studio
- Present data: present the data and help turn it into decisions

### Descriptive Analysis

Summary:

- Qualitative data: non-numeric data
- Quantitative data: numeric data

What is Statistics:

- A branch of math that deals with data collection, organization, analysis, interpretation and presentation.
- In statistics, we can use a sample to support claims about the whole population.

Central tendency:

- The tendency of data to cluster around some central value
- Dispersion: how scattered is the data?
- There are 3 commonly used measures of central tendency:
  - Mean: the sum of the values divided by the number of values
  - Median: the middle number in the ordered sequence (the average of the two middle numbers when the count is even)
  - Mode: the value that occurs most often in the sequence. A sequence in which every value is unique has no mode

![](https://i.imgur.com/tDftKy4.png)

The mean works well for regular, symmetrical datasets, but if the dataset has outliers or is heavily skewed, the mean no longer represents the central tendency of the data.

The median and mode work well for datasets that have outliers, since these two measures are not pulled toward the outliers when determining the central point of the dataset. The mode can even be used for non-numeric datasets.

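
As a small illustration of the three measures and of how an outlier affects them, here is a minimal sketch using Python's built-in `statistics` module. The salary figures and color names are made up for the example.

```python
import statistics

# Hypothetical salaries; the last value is a deliberate outlier.
salaries = [10, 12, 12, 13, 15, 100]

mean_value = statistics.mean(salaries)      # sum of values / number of values
median_value = statistics.median(salaries)  # average of the two middle values here
mode_value = statistics.mode(salaries)      # most frequent value

print("mean:", mean_value)      # pulled upward by the outlier
print("median:", median_value)  # stays near the bulk of the data
print("mode:", mode_value)

# The mode also works on non-numeric (qualitative) data.
colors = ["red", "blue", "red", "green"]
color_mode = statistics.mode(colors)
print("most common color:", color_mode)
```

Here the single outlier drags the mean well above most of the values, while the median and mode stay close to where the data is clustered.
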
## Day 2

### SQL vs NoSQL

- NoSQL databases do not have a rigid structure like relational databases, which makes them flexible for writing
- SQL is good for vertical scaling (upgrading the server to more powerful hardware) while NoSQL is good for horizontal scaling (adding more servers)
- SQL suits highly structured data like finance; NoSQL suits less structured data like blogs and social networks.

### Advanced SQL

- Subquery
- CTE (Common Table Expression): syntax for writing subqueries in a more organized way.
- UNION/UNION ALL/EXCEPT/INTERSECT: combine the rows of two tables using different rules

## Day 3

- Continue to work on exercises
- Hackathon

## Day 4

**About Pandas:**

- Python package built on top of NumPy
- Two types: Series (1-D) & DataFrame (2-D)
  - DataFrame: container for Series
  - Series: container for scalars (one row or one column of a DataFrame is a Series)
- Learning Python syntax for:
  - Creating and loading a DataFrame
  - Selecting columns from a DataFrame
  - Selecting rows from a DataFrame
  - Indexing by name or by index number
  - Dropping rows/columns
  - Sorting
  - Handling NA/null values
  - Aggregation: sum, count, mean, min, max, median, mode, ...
  - Apply and lambda functions
  - Histograms
  - Merge
  - Groupby
- Titanic analysis

## Day 5

**Matplotlib**

- Advanced Python package for visualization
- Two ways to create a chart:
  - Matplotlib style (the `pyplot` interface)
  - OOP style
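
The two chart-creation styles can be sketched roughly as follows. This is a minimal example that assumes Matplotlib is installed; the data, the labels, and the choice of the `Agg` backend (so the script runs without a display) are illustrative assumptions, not part of the course material.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Made-up data for illustration only.
days = [1, 2, 3, 4, 5]
hours = [2, 3, 5, 4, 6]

# Matplotlib (pyplot) style: functions act on an implicit "current" figure.
pyplot_fig = plt.figure()
plt.plot(days, hours)
plt.title("pyplot style")
plt.xlabel("Day")
plt.ylabel("Hours studied")

# OOP style: create Figure and Axes objects, then call methods on them.
oop_fig, ax = plt.subplots()
ax.plot(days, hours)
ax.set_title("OOP style")
ax.set_xlabel("Day")
ax.set_ylabel("Hours studied")
```

Both versions draw the same line chart. The OOP style tends to be easier to manage once a figure contains several subplots, because each `Axes` object is addressed explicitly.
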