the objective is to understand the problem in order to generate testable hypotheses
clear, concise and measurable
define
target/label (dependent variable)
features (independent variables)
crucial to select the right class of algorithms
Basic Statistics
describe dimensions
type of distributions
descriptive statistics
mean, median, mode, std
correlation between features
relationships and patterns due to the structure of the data
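The checks listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed workflow: the toy dataset, the column meanings and the `pearson` helper are assumptions made for the example.

```python
import math
import statistics

# Hypothetical toy dataset: each row is (height_cm, weight_kg).
rows = [(160, 55), (165, 60), (170, 65), (170, 70), (180, 80)]

# Dimensions: number of examples (rows) and number of features (columns).
n_rows, n_cols = len(rows), len(rows[0])

heights = [r[0] for r in rows]
weights = [r[1] for r in rows]

# Descriptive statistics for one feature.
summary = {
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "mode": statistics.mode(heights),
    "std": statistics.stdev(heights),  # sample standard deviation
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(heights, weights)
print(n_rows, n_cols, summary, round(r, 3))
```

A correlation close to 1 or -1 suggests two features carry largely the same information, which is worth knowing before modeling.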
Visualization
Quintessential rules
Data visualization is a key part of communicating your work to others
less is more
choose the right type of graph: a well-chosen graph lets you tell a story
check the scale and spacing of the axis tick marks
reduce the clutter
avoid unnecessary or distracting visual elements
ornamental shading, dark gridlines
3D when not mandatory
Tips
A color can be described by three components (an alternative to specifying its RGB channels)
hue: component that distinguishes “different colors”
vary hue to distinguish categorical data
saturation: the colorfulness
vary saturation to stratify the plot
luminance: how much light is emitted, ranging from black to white
vary luminance to show ranges/bins in numerical data
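The two most common uses of these components can be sketched with the standard-library `colorsys` module. It works in HLS order (hue, lightness, saturation), and lightness is used here as a stand-in for luminance; the hue and saturation values and the `to_hex` helper are illustrative assumptions.

```python
import colorsys

def to_hex(hue, lightness, saturation):
    """Convert an HLS triple (each component in 0..1) to a hex color string."""
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))

# Categorical data: spread the hue evenly, keep lightness and saturation fixed.
categorical = [to_hex(i / 4, 0.5, 0.8) for i in range(4)]

# Numerical bins: keep one hue (0.6, a blue) and step the lightness from light to dark.
binned = [to_hex(0.6, lightness, 0.8) for lightness in (0.85, 0.65, 0.45, 0.25)]

print(categorical)
print(binned)
```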
Sequential palettes
A sequential palette ranges from a lighter shade to a darker one. The same or similar hues are used and saturation varies.
Viridis palette
is built from blue and yellow sequences (avoiding reds) to make visualizations more readable
When to use it:
intended to represent numeric values
the data range has no meaningful midpoint and no specific value needs highlighting
Diverging palettes
A diverging palette can be created by combining two sequential palettes (e.g. join them at the light colors and then let them diverge to different dark colors)
Icefire palette
When to use it:
two hues are used to indicate a division, such as positive and negative values or booleans
there is a value of importance around which the data are to be compared
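The construction described above (two sequential ramps joined at their light ends) can be sketched as follows. This is not the Icefire palette itself: the blue and red hues, the lightness steps and the `ramp` and `color_for` helpers are assumptions chosen for illustration.

```python
import colorsys

def ramp(hue, lightnesses, saturation=0.8):
    """Sequential ramp: one hue, varying lightness. Returns RGB tuples in 0..1."""
    return [colorsys.hls_to_rgb(hue, l, saturation) for l in lightnesses]

dark_to_light = (0.3, 0.5, 0.7, 0.9)

# Negative side: a blue hue running from dark to light.
negative = ramp(0.6, dark_to_light)
# Positive side: a red hue running from light to dark.
positive = ramp(0.0, reversed(dark_to_light))

# Joining the ramps at their light ends gives the diverging palette.
diverging = negative + positive

def color_for(value, vmin, vmax, palette=diverging):
    """Map a value in [vmin, vmax] to a palette entry; the centre of the range is lightest."""
    fraction = (value - vmin) / (vmax - vmin)
    index = max(0, min(int(fraction * len(palette)), len(palette) - 1))
    return palette[index]

print(color_for(-0.9, -1.0, 1.0))  # a dark blue
print(color_for(0.9, -1.0, 1.0))   # a dark red
```

With `vmin = -vmax`, the value of importance (zero) sits where the two hues meet, so the sign of a value is readable from its hue and its magnitude from its darkness.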
Visualization packages
Matplotlib
used for basic graph plotting like line charts, bar graphs
it works with datasets and arrays
it is more customizable and pairs well with pandas and NumPy
Seaborn
can perform complex visualizations with fewer commands
it works with entire datasets, each treated as a single unit
it ships with more built-in themes and offers a more organized, higher-level interface than Matplotlib
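The difference in style can be sketched by drawing the same grouped scatter plot with both packages. The example assumes Matplotlib, pandas and Seaborn are installed; the toy data and column names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data with one categorical column.
df = pd.DataFrame({
    "length": [1.4, 1.5, 1.3, 4.7, 4.5, 4.9],
    "width": [0.2, 0.2, 0.3, 1.4, 1.5, 1.5],
    "species": ["a", "a", "a", "b", "b", "b"],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(8, 3))

# Matplotlib: works on arrays, so grouping and labelling are done by hand.
for name, group in df.groupby("species"):
    ax_mpl.scatter(group["length"], group["width"], label=name)
ax_mpl.set_xlabel("length")
ax_mpl.set_ylabel("width")
ax_mpl.legend(title="species")

# Seaborn: takes the whole DataFrame plus column names and handles
# grouping, colors, axis labels and the legend in one call.
sns.scatterplot(data=df, x="length", y="width", hue="species", ax=ax_sns)

fig.tight_layout()
# fig.savefig("comparison.png")  # uncomment to write the figure to disk
```

Seaborn draws onto Matplotlib axes, so the two can be mixed: Seaborn for the quick high-level plot, Matplotlib for fine adjustments afterwards.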
Machines don't learn
A learning machine finds a mathematical formula which, when applied to a collection of inputs, produces the desired outputs. If you distort your data inputs, the output is very likely to become completely wrong
Why the name Machine Learning?
Arthur Lee Samuel was an American pioneer in the field of computer gaming and artificial intelligence.
He popularized the term "machine learning" in 1959 at IBM.
…Marketing reason…
Two Types of Learning
Supervised Learning
The dataset is a collection of labeled examples: in each example the input is called the feature vector and the output is called the label or target
Goal: use a dataset to produce a model that takes a feature vector as input and outputs information that allows deducing the label for this feature vector
Unsupervised Learning
The dataset is a collection of unlabeled examples
Goal: create a model that takes a feature vector as input and either transforms it into another vector or into a value that can be used to solve a practical problem
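One possible instance of such a model is k-means clustering, which turns each feature vector into a single value, a cluster id. The sketch below is a bare-bones version under stated assumptions: the toy points, two clusters, and initial centroids taken from the first and last point are all choices made for the example.

```python
import math

# Unlabeled examples: each one is just a feature vector.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def fit_kmeans(data, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in data:
            clusters[nearest(p, centroids)].append(p)
        # New centroid = mean of its cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids

centroids = fit_kmeans(points, centroids=[points[0], points[-1]])

# The "model": feature vector in, cluster id out.
cluster_ids = [nearest(p, centroids) for p in points]
print(cluster_ids)
```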
Classification Problem
Classification predictive modeling is the task of approximating a mapping function from input variables to discrete output variables.
A discrete output variable is a category, such as a boolean variable.
Example: Spam detection
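A toy version of the spam example can be sketched as a nearest-neighbour classifier over labeled examples. Everything specific here is an assumption made for illustration: the keyword list, the two hand-made features, the four training messages and the choice of nearest neighbour as the algorithm.

```python
import math

SPAM_WORDS = {"free", "winner", "prize", "urgent"}  # hypothetical keyword list

def features(text):
    """Feature vector: (number of spam keywords, number of exclamation marks)."""
    words = [w.strip("!.,?:") for w in text.lower().split()]
    keyword_hits = sum(w in SPAM_WORDS for w in words)
    return (keyword_hits, text.count("!"))

# Labeled examples: (message, is_spam).
training = [
    ("Free prize, you are a winner!!!", True),
    ("Urgent: claim your free prize!!", True),
    ("Meeting moved to 3pm tomorrow", False),
    ("Here are the notes from class.", False),
]
dataset = [(features(text), label) for text, label in training]

def predict(text):
    """Return the label of the training example whose feature vector is closest."""
    x = features(text)
    _, label = min(dataset, key=lambda example: math.dist(x, example[0]))
    return label

print(predict("You are a winner! Free entry!"))
print(predict("Lunch at noon?"))
```

The output is one of two categories (spam or not spam), which is what makes this a classification problem.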
Regression Problem
Regression predictive modeling is the task of approximating a mapping function from input variables to a continuous output variable.
A continuous output variable is a real-valued quantity, such as an amount or a size, typically stored as an integer or floating-point value.
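A minimal regression sketch is a straight line fitted by ordinary least squares. The data below are hypothetical and deliberately noise-free (price = 2 * size + 10) so the fitted parameters are easy to check by hand; real data would not fit exactly.

```python
# Hypothetical training data: flat size in square metres -> price in thousands.
sizes = [50, 60, 80, 100]
prices = [110, 130, 170, 210]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(sizes, prices)

def predict(size):
    """Continuous output: a predicted price for any size, not a category."""
    return slope * size + intercept

print(slope, intercept, predict(70))
```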