HackMD - Collaborative Markdown Knowledge Base

Week_3 Note **Monday - Statistic** * Helps describe and understand the features of a specific dataset by giving short summaries. * Mean, or the average is caculated by adding all then dividing by the number of features: * Use all available data values, hence is a good representation of the data * Sensivtive outliers * Cannot be calculated for qualitative data * Median is the figure situated in the middle of the dataset. * Unaffected by outlier * Can be Used in ordered categorical data(ordinal data) * Does not use all available information in the data * Incapable of further algebraic treament. * Mode is value appearing most often * Unaffected by outliers * Can be used in categorical data. * Some data may have more than one mode, or may not have mode at all. * Could be an in accurate representative of the data * Range: * The difference between the smallest value and the largest value in the dataset. ![](https://i.imgur.com/j6SjsMq.png) * Quartiles: * Values that break down the dataset into quarters, or quartiles. ![](https://i.imgur.com/SOvBvpD.png) * Interquatile Range(IQR): * Difference between the upper (Q3) and lower (Q1) uqrtiles * Describes the middle 50% of values when ordered from lowest to highest * IQR is not affected by outliers. ![](https://i.imgur.com/frFA3aD.png) * Ouliers: * Values that are far from the middle. * Can be indetified by using IQR: * Any value smaller than Q1-1.5*IQR * Any value larger than Q3-1.5*IQR ![](https://i.imgur.com/NQ9ndJ8.png) * Variance: * Average of the squared differences from the Mean. * Standard Deviation(std): * Square Root of Variance * Using Std, we have a "standard" way of working what is normal and what is extra large of extra small. ![](https://i.imgur.com/K41xYY3.png) * Descriptive statistics consists of two basic categories of measures: * Measures of central tendency: * Describes the center of a dataset * The statiscal measures that identifies a single value as representative of an entire data * Measures of variability or spread. * Describes the dispersion of data within the set. --- **Tuesday - Advanced SQL** * Structured data is data that is highly-organized and formatted in a way so it's easily searchable in relational databases. SQL is the programming language for managing relational database. ![](https://i.imgur.com/LFSKdhY.png) * Order of Execution: * FROM & JOIN: Get the base data to query on * WHERE: Filter the base data * GROUP BY: Aggregate(Group) the base data * HAVING: Filter the aggregated(grouped) data * SELECT: Return the final data * DISTINCT: Return non-duplicated final data * UNION: Combining final data with other sets of data * ORDER BY: Sort the final data * LIMIT & OFFSET: Return only a certain row of final data ![](https://i.imgur.com/S0Taycw.png) Sub-queries: * Is SQL query nested inside a larger query * Often used to calculate summary value to use for a query * Two types of Sub-query: * Correlated sub-query: * Used to select data from a table referenced in the outer query. * A table alias(also called a correlation name) must be used to specify which table reference is to be used * Uncorrelated sub-query Common Table Expression(CTE): * Is a temporary result to use with other queries ![](https://i.imgur.com/rGyOvlD.png) * Helps organize complex queries. * Always returns a result set. They are used to simplify queries, for example, you could use one to eliminate a derived table from the main query body. * Prefer to use common table expressions rather than to use subqueries because common table expressions are more readable. --- **Wednesday - Pandas** Some common syntax used with pandas. [Pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) s.astype('category') # This will convert the datatype of the series to float *Please note that “s” here is a Pandas Series s.replace(1,'one') # This will replace all values equal to 1 with ‘one’ s.replace([1,3],['one','three']) # This will replace all 1 with ‘one’ and 3 with ‘three’ data.rename(columns=lambda x: x + 1) # Mass renaming of columns data.rename(columns={'oldname': 'new name'}) # Selective renaming data.set_index('column_one') # This will change the index data.rename(index=lambda x: x + 1) # Mass renaming of index data.dropna() # Remove missing values data.fillna(x) # This will replaces all null values with x s.fillna(s.mean()) # This will replace all null values with the mean (mean can be replaced with almost any function from the below section) : data.corr() # This will return the correlation between columns in a DataFrame data.count() # This will return the number of non-null values in each DataFrame column data.max() # This will return the highest value in each column data.min() # This will return the lowest value in each column data.median() # This will return the median of each column data.std() # This will returns the standard deviation of each column --- **Thursday - Data Visualization** Is the graphical representation of information and data. * We can quicky see trends, outliers, and paterns in data. * A good visualization tells a story, removing the noise from data and highlighting the useful information. * There are 2 libraries supporting visualization: * Matplotlib is old, powerful, more customizable and low-level library providing various manipulations on visualizing data. * Seaborn is more elegant, built-on-top of Matplotlib and provide high-level usage to plot various plots by just passing data to the plot. * Before visualizing data, first thing we need to do is cleaning and organizing data. * Types of visualization: * Univariate: * Analyzing data using one variable. It is used to describe some thing. * Count Plot, Pie Chart describe frequency of the categories in a variable represented by the size of the slices. * Bar chart: size of the values in a variable represented by the height of the bars. * Line plot represents the trend of a variable over an ordinal variable. * Histogram: Distribution of a variable represented by the height of the bins * Box plot contains 5 metrics which is used to describe the distribution of a variable. * Bivariate: Analyzing data using two variables. It is used to show relationships between the variables. * Scatter plot: relationship of the two variables represented as individual points. * Heat Map: Correlation between two(or more) variables is represented by color degree. Matplotlib Creating Plots Figure Operator Description fig = plt.figures() a container that contains all plot elements Axes Operator Description fig.add_axes() Initializes subplot a = fig.add_subplot(222) A subplot is an axes on a grid system row-col-num. fig, b = plt.subplots(nrows=3, nclos=2) Adds subplot ax = plt.subplots(2, 2) Creates subplot Plotting Operator Description lines = plt.plot(x,y) Plot data connected by lines plt.scatter(x,y) Creates a scatterplot, unconnected data points plt.bar(xvalue, data , width, color...) simple vertical bar chart plt.barh(yvalue, data, width, color...) simple horizontal bar plt.hist(x, y) Plots a histogram plt.boxplot(x,y) Box and Whisker plot plt.violinplot(x, y) Creates violin plot ax.fill(x, y, color='lightblue');ax.fill_between(x,y,color='yellow') Fill area under/between plots Limits Operators Description plt.xlim(0, 7) Sets x-axis to display 0 - 7 plt.ylim(-0.5, 9) Sets y-axis to display -0.5 - 9 ax.set(xlim=[0, 7], ylim=[-0.5, 9]);ax.set_xlim(0, 7) Sets limits plt.margins(x=1.0, y=1.0) Set margins: add padding to a plot, values 0 - 1 plt.axis('equal') Set the aspect ratio of the plot to 1 Legend/labels Operator Description plt.title('just a title') Sets title of plot plt.xlabel('x-axis') Sets label next to x-axis plt.ylabel('y-axis') Sets label next to y-axis ax.set(title='axis', ylabel='Y-Axis', xlabel='X-Axis') Set title and axis labels ax.legend(loc='best') No overlapping plot elements Tick Operator Description plt.xticks(x, labels, rotation='vertical') Set ticks ax.xaxis.set(ticks=range(1,5), ticklabels=[3,100,-12,"foo"]) Set x-ticks ax.tick_params(axis='y', direction='inout', length=10) Make y-ticks longer and go in and out