Week 3 - HackMD

# Week 3 ## Monday: Data Analysis ### Data Obtaining Database in SQL: SQLite BigQuery: data warehouse ### Data cleaning Potential errors: - Wrong data label - Incomplete data - Wrong sampling method Skill used: Pandas ### Data analyzing Feature Engineering: adding more features, transforming raw data into features ### Data visualizing and presenting Easier to look at the result of data analyzing Way to provide communicate method to present the results for non-analysts PowerBI, Tableau, Google Data Studio *Week 3: more code in how to clean the data *Week 4: more about presentations ### Descriptive Statistics Using statistic to describe a sample Central tendency: use one point to describe the data Different types of mean: - Arithmetic mean = Average - Geometric mean = nth root of the product of n numbers - Harmonic mean Mode mostly won't be used for quantitative data. For categorical data, can only use mode. For ordered catergorical data, median can also be used. Dispersion: how spread the data is from the mean IQR is a better measurement of dispersion than range since it is unaffected by outliers Standard deviation is easier to use than variance in measuring dispersion since the unit for variance is squared The smaller the standard deviation, the less the data spread. ## Tuesday: Advanced SQL SQL Database is a Relational Database and a structured one whereas NoSQL is a Non-relational database likely to be more document and distributed than structured | SQL | NoSQL | | -------- | -------- | | Smaller size | Bigger | | Structured | Unstructured (don't organize data in tables but in a collection of items) | | No hierarchy data storage | Can store hierarchy data | | Require consistency but easy application | More flexible but much more difficult to work with | | Scaling vertically (buy bigger storage) | Scaling horizontally (build more databases) | GROUP BY required an aggregated function along with it Sub-Queries: SQL query nested inside a larger query. 1. Uncorrelated sub-queries: the sub-query doesn't depend on anything else 2. Correlated sub-queries: the sub-query depends on the outer query Sub-query will always going to be executed first. CTE: Common Table Expression: WITH tablename (columnname) as (SELECT * ); always need to be in the beginning of the query Group by 2 columns will group the distinct combination of value from both columns UNION will return all unique values from two tables while UNION ALL will return all values from two tables ## Wednesday: BigQuery Data warehouse platform (temporary place for data) for analysis. Faster than running query on a database. Cloud computing: provide online server Query editor: typing query Syntax will be different from SQL. ## Thursday: Pandas Pandas is built on top of Numpy so it has many similar features. Two new data structures: 1. Series: like a column in a table (1D array) 2. Dataframe: like a table (2D) Matplotlib: library to visualize data Index is shown here in pandas. One series can only contain one type of data similar to NumPy. Categorical data: can only contain a limited number of values; it's more memory efficient Object: can be stretch out to match the shape of the array ```shape```: returns number of rows/columns in a tuple; shape is an attribute (chaaracteristic of an object) ```iloc```: integer location to select by integer position instead of by label ```loc``` NA values will be considered as float Joining the two tables: If the columns have different name: ...,table1_on = 'column1_name', table2_on = 'column2_name' The difference between using ```loc``` , ```iloc```and ```[]```: 1. ```loc``` will select the row first and ```[]``` will select the whole column 2. pandas will prefer to use ```loc``` before ```iloc``` so if you don't specify which, it will use ```loc``` 3. ```iloc``` uses python index, ```loc``` uses defined index 4. When the defined index is integers, using ```[]``` is not recommened (especially for rows) ## Friday: Data Visualization Count Plot: type of bar chart with the height of the bar is the frequency of the categorical variable; the visualization version of pd.value_counts() Bar chart is more recommended than pie chart Univariate: (use a bar chart with the height of the bar is the size of the values) analyzing data using one variable Line Plot: trends or time series; x-axis is ordered Histogram: (type of bar charts with the height of the bins is the frequency of numerical variable (continuous)) KDE: kernel density estimation (ex. histograms): how we present continuous data Box-plot: showing distribution of variable based on five-num summary; useful to get the outliers but it's not good to visualize the shape of the data Multivariate/Bivariate: correlation between two and multiple variables Scatter plot: relationship between two variables is represented as individual points Heat maps: relationship among multiple variables Matplotlib: python library for data visualization Subplots: a figure that contain many plots (axes) Plot can be a function or a method depending on whether or not the figure contains subplots