Week_3 Note
**Monday - Statistic**
* Helps describe and understand the features of a specific dataset by giving short summaries.
* Mean, or the average is caculated by adding all then dividing by the number of features:
* Use all available data values, hence is a good representation of the data
* Sensivtive outliers
* Cannot be calculated for qualitative data
* Median is the figure situated in the middle of the dataset.
* Unaffected by outlier
* Can be Used in ordered categorical data(ordinal data)
* Does not use all available information in the data
* Incapable of further algebraic treament.
* Mode is value appearing most often
* Unaffected by outliers
* Can be used in categorical data.
* Some data may have more than one mode, or may not have mode at all.
* Could be an in accurate representative of the data
* Range:
* The difference between the smallest value and the largest value in the dataset.

* Quartiles:
* Values that break down the dataset into quarters, or quartiles.

* Interquatile Range(IQR):
* Difference between the upper (Q3) and lower (Q1) uqrtiles
* Describes the middle 50% of values when ordered from lowest to highest
* IQR is not affected by outliers.

* Ouliers:
* Values that are far from the middle.
* Can be indetified by using IQR:
* Any value smaller than Q1-1.5*IQR
* Any value larger than Q3-1.5*IQR

* Variance:
* Average of the squared differences from the Mean.
* Standard Deviation(std):
* Square Root of Variance
* Using Std, we have a "standard" way of working what is normal and what is extra large of extra small.

* Descriptive statistics consists of two basic categories of measures:
* Measures of central tendency:
* Describes the center of a dataset
* The statiscal measures that identifies a single value as representative of an entire data
* Measures of variability or spread.
* Describes the dispersion of data within the set.
---
**Tuesday - Advanced SQL**
* Structured data is data that is highly-organized and formatted in a way so it's easily searchable in relational databases. SQL is the programming language for managing relational database.

* Order of Execution:
* FROM & JOIN: Get the base data to query on
* WHERE: Filter the base data
* GROUP BY: Aggregate(Group) the base data
* HAVING: Filter the aggregated(grouped) data
* SELECT: Return the final data
* DISTINCT: Return non-duplicated final data
* UNION: Combining final data with other sets of data
* ORDER BY: Sort the final data
* LIMIT & OFFSET: Return only a certain row of final data

Sub-queries:
* Is SQL query nested inside a larger query
* Often used to calculate summary value to use for a query
* Two types of Sub-query:
* Correlated sub-query:
* Used to select data from a table referenced in the outer query.
* A table alias(also called a correlation name) must be used to specify which table reference is to be used
* Uncorrelated sub-query
Common Table Expression(CTE):
* Is a temporary result to use with other queries

* Helps organize complex queries.
* Always returns a result set. They are used to simplify queries, for example, you could use one to eliminate a derived table from the main query body.
* Prefer to use common table expressions rather than to use subqueries because common table expressions are more readable.
---
**Wednesday - Pandas**
Some common syntax used with pandas.
[Pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
s.astype('category') # This will convert the datatype of the series to float *Please note that “s” here is a Pandas Series
s.replace(1,'one') # This will replace all values equal to 1 with ‘one’
s.replace([1,3],['one','three']) # This will replace all 1 with ‘one’ and 3 with ‘three’
data.rename(columns=lambda x: x + 1) # Mass renaming of columns
data.rename(columns={'oldname': 'new name'}) # Selective renaming
data.set_index('column_one') # This will change the index
data.rename(index=lambda x: x + 1) # Mass renaming of index
data.dropna() # Remove missing values
data.fillna(x) # This will replaces all null values with x
s.fillna(s.mean()) # This will replace all null values with the mean (mean can be replaced with almost any function from the below section) :
data.corr() # This will return the correlation between columns in a DataFrame
data.count() # This will return the number of non-null values in each DataFrame column
data.max() # This will return the highest value in each column
data.min() # This will return the lowest value in each column
data.median() # This will return the median of each column
data.std() # This will returns the standard deviation of each column
---
**Thursday - Data Visualization**
Is the graphical representation of information and data.
* We can quicky see trends, outliers, and paterns in data.
* A good visualization tells a story, removing the noise from data and highlighting the useful information.
* There are 2 libraries supporting visualization:
* Matplotlib is old, powerful, more customizable and low-level library providing various manipulations on visualizing data.
* Seaborn is more elegant, built-on-top of Matplotlib and provide high-level usage to plot various plots by just passing data to the plot.
* Before visualizing data, first thing we need to do is cleaning and organizing data.
* Types of visualization:
* Univariate:
* Analyzing data using one variable. It is used to describe some thing.
* Count Plot, Pie Chart describe frequency of the categories in a variable represented by the size of the slices.
* Bar chart: size of the values in a variable represented by the height of the bars.
* Line plot represents the trend of a variable over an ordinal variable.
* Histogram: Distribution of a variable represented by the height of the bins
* Box plot contains 5 metrics which is used to describe the distribution of a variable.
* Bivariate: Analyzing data using two variables. It is used to show relationships between the variables.
* Scatter plot: relationship of the two variables represented as individual points.
* Heat Map: Correlation between two(or more) variables is represented by color degree.
Matplotlib
Creating Plots
Figure
Operator Description
fig = plt.figures() a container that contains all plot elements
Axes
Operator Description
fig.add_axes() Initializes subplot
a = fig.add_subplot(222) A subplot is an axes on a grid system row-col-num.
fig, b = plt.subplots(nrows=3, nclos=2) Adds subplot
ax = plt.subplots(2, 2) Creates subplot
Plotting
Operator Description
lines = plt.plot(x,y) Plot data connected by lines
plt.scatter(x,y) Creates a scatterplot, unconnected data points
plt.bar(xvalue, data , width, color...) simple vertical bar chart
plt.barh(yvalue, data, width, color...) simple horizontal bar
plt.hist(x, y) Plots a histogram
plt.boxplot(x,y) Box and Whisker plot
plt.violinplot(x, y) Creates violin plot
ax.fill(x, y, color='lightblue');ax.fill_between(x,y,color='yellow') Fill area under/between plots
Limits
Operators Description
plt.xlim(0, 7) Sets x-axis to display 0 - 7
plt.ylim(-0.5, 9) Sets y-axis to display -0.5 - 9
ax.set(xlim=[0, 7], ylim=[-0.5, 9]);ax.set_xlim(0, 7) Sets limits
plt.margins(x=1.0, y=1.0) Set margins: add padding to a plot, values 0 - 1
plt.axis('equal') Set the aspect ratio of the plot to 1
Legend/labels
Operator Description
plt.title('just a title') Sets title of plot
plt.xlabel('x-axis') Sets label next to x-axis
plt.ylabel('y-axis') Sets label next to y-axis
ax.set(title='axis', ylabel='Y-Axis', xlabel='X-Axis') Set title and axis labels
ax.legend(loc='best') No overlapping plot elements
Tick
Operator Description
plt.xticks(x, labels, rotation='vertical') Set ticks
ax.xaxis.set(ticks=range(1,5), ticklabels=[3,100,-12,"foo"]) Set x-ticks
ax.tick_params(axis='y', direction='inout', length=10) Make y-ticks longer and go in and out