# Week 3
## Monday: Data Analysis
### Data Obtaining
Database in SQL: SQLite
BigQuery: data warehouse
### Data cleaning
Potential errors:
- Wrong data label
- Incomplete data
- Wrong sampling method
Skill used: Pandas
### Data analyzing
Feature Engineering: adding more features, transforming raw data into features
### Data visualizing and presenting
Easier to look at the result of data analyzing
Way to provide communicate method to present the results for non-analysts
PowerBI, Tableau, Google Data Studio
*Week 3: more code in how to clean the data
*Week 4: more about presentations
### Descriptive Statistics
Using statistic to describe a sample
Central tendency: use one point to describe the data
Different types of mean:
- Arithmetic mean = Average
- Geometric mean = nth root of the product of n numbers
- Harmonic mean
Mode mostly won't be used for quantitative data.
For categorical data, can only use mode. For ordered catergorical data, median can also be used.
Dispersion: how spread the data is from the mean
IQR is a better measurement of dispersion than range since it is unaffected by outliers
Standard deviation is easier to use than variance in measuring dispersion since the unit for variance is squared
The smaller the standard deviation, the less the data spread.
## Tuesday: Advanced SQL
SQL Database is a Relational Database and a structured one whereas NoSQL is a Non-relational database likely to be more document and distributed than structured
| SQL | NoSQL |
| -------- | -------- |
| Smaller size | Bigger |
| Structured | Unstructured (don't organize data in tables but in a collection of items) |
| No hierarchy data storage | Can store hierarchy data |
| Require consistency but easy application | More flexible but much more difficult to work with |
| Scaling vertically (buy bigger storage) | Scaling horizontally (build more databases) |
GROUP BY required an aggregated function along with it
Sub-Queries:
SQL query nested inside a larger query.
1. Uncorrelated sub-queries: the sub-query doesn't depend on anything else
2. Correlated sub-queries: the sub-query depends on the outer query
Sub-query will always going to be executed first.
CTE: Common Table Expression: WITH tablename (columnname) as (SELECT * ); always need to be in the beginning of the query
Group by 2 columns will group the distinct combination of value from both columns
UNION will return all unique values from two tables while UNION ALL will return all values from two tables
## Wednesday: BigQuery
Data warehouse platform (temporary place for data) for analysis.
Faster than running query on a database.
Cloud computing: provide online server
Query editor: typing query
Syntax will be different from SQL.
## Thursday: Pandas
Pandas is built on top of Numpy so it has many similar features.
Two new data structures:
1. Series: like a column in a table (1D array)
2. Dataframe: like a table (2D)
Matplotlib: library to visualize data
Index is shown here in pandas.
One series can only contain one type of data similar to NumPy.
Categorical data: can only contain a limited number of values; it's more memory efficient
Object: can be stretch out to match the shape of the array
```shape```: returns number of rows/columns in a tuple; shape is an attribute (chaaracteristic of an object)
```iloc```: integer location to select by integer position instead of by label ```loc```
NA values will be considered as float
Joining the two tables:
If the columns have different name: ...,table1_on = 'column1_name', table2_on = 'column2_name'
The difference between using ```loc``` , ```iloc```and ```[]```:
1. ```loc``` will select the row first and ```[]``` will select the whole column
2. pandas will prefer to use ```loc``` before ```iloc``` so if you don't specify which, it will use ```loc```
3. ```iloc``` uses python index, ```loc``` uses defined index
4. When the defined index is integers, using ```[]``` is not recommened (especially for rows)
## Friday: Data Visualization
Count Plot: type of bar chart with the height of the bar is the frequency of the categorical variable; the visualization version of pd.value_counts()
Bar chart is more recommended than pie chart
Univariate: (use a bar chart with the height of the bar is the size of the values) analyzing data using one variable
Line Plot: trends or time series; x-axis is ordered
Histogram: (type of bar charts with the height of the bins is the frequency of numerical variable (continuous))
KDE: kernel density estimation (ex. histograms): how we present continuous data
Box-plot: showing distribution of variable based on five-num summary; useful to get the outliers but it's not good to visualize the shape of the data
Multivariate/Bivariate: correlation between two and multiple variables
Scatter plot: relationship between two variables is represented as individual points
Heat maps: relationship among multiple variables
Matplotlib: python library for data visualization
Subplots: a figure that contain many plots (axes)
Plot can be a function or a method depending on whether or not the figure contains subplots