
Hello


Patricio Reyes


You ?


Your Name here


Dadaist approach

feel free to collaborate on this presentation

  • suggestions?
  • new content?
  • errors?
    • typos
    • Italenglish?
    • Spanglish?

Share your roadmap


Wise Apple Bowl 2020


A different approach


Let's collaborate

  • Slack/Discord group?
    • learning community

Next steps

  • share your notes (example: fastai course)
  • share your ML roadmap (example)
  • start a repository
    • wiki?
    • README file
    • tools
      • markdown
  • show your own data science roadmap

Pre-requisites


  • Python
  • Chocolate
  • Anaconda
  • GitHub + Google (Colab)
  • Brewed coffee
  • Good Will!

Tips for data scientists


  1. social skills!!!
    • documentation
    • team members
    • clients

  1. Reproducibility/Replicability



  1. Data analysis
    Stats on a MacBook

How to structure a data science project


collaboration

  • Slack/MS Teams/Discord is not enough!
  • team, team, team

1. "I work alone. I don't care"

  • you collaborate with yourself
  • your future self will need
    • documentation
    • debugging

you always need to collaborate

  • looking for advice
    • blogs: bloggers share knowledge
    • books: authors share knowledge
  • beta-testing
    • you are not the best programmer to test your own code
  • why don't you ask for collaboration?!?!

2. "I work on a team"

  • your future self is part of the same team
    • smart member
    • your future self will need documentation
  • you (your code) have to interact with others
    • documentation
    • README file
    • end-user
      • how to run the code?
    • developer
      • how to start working on the code?

Reproducibility


Data Science Template


Install the template


Directory Structure


Data is immutable

  • Don't ever edit your raw data
    • especially not manually
    • and especially not in Excel
  • Don't overwrite your raw data
  • Treat the data (and its format) as immutable.
  • data folder in .gitignore
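
A minimal sketch of the "raw data is immutable" rule. It assumes a hypothetical layout with `data/raw/` and `data/processed/` folders; the folder names, the file name and the `clean_rows` helper are illustrative assumptions, not part of any specific template. The raw file is only ever opened for reading, and every transformation writes a new file somewhere else.

```python
import csv
import tempfile
from pathlib import Path


def clean_rows(rows):
    """Hypothetical cleaning step: strip whitespace and drop empty rows."""
    for row in rows:
        stripped = [cell.strip() for cell in row]
        if any(stripped):
            yield stripped


def process(raw_path: Path, processed_path: Path) -> None:
    """Read the raw file (read-only) and write a cleaned copy elsewhere."""
    if raw_path.resolve() == processed_path.resolve():
        raise ValueError("refusing to overwrite raw data")
    processed_path.parent.mkdir(parents=True, exist_ok=True)
    with raw_path.open(newline="") as src, processed_path.open("w", newline="") as dst:
        csv.writer(dst).writerows(clean_rows(csv.reader(src)))


# Demo in a temporary directory so no real data is touched.
root = Path(tempfile.mkdtemp())
raw_file = root / "data" / "raw" / "measurements.csv"
raw_file.parent.mkdir(parents=True)
raw_file.write_text("id, value\n1, 3.5 \n\n2, 4.0\n")
raw_before = raw_file.read_text()

processed_file = root / "data" / "processed" / "measurements.csv"
process(raw_file, processed_file)
```

Because the cleaning is code rather than a manual edit, the processed file can be regenerated from the raw file at any time.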

Data version control


  • Data science project template
    • templates
    • documentation
    • README files
    • LICENSE
    • semantic versioning
    • collaboration
      • issues
      • Pull Requests

  • tools
    • GitHub
    • CookieCutter
    • documentation
      • Sphinx
      • MkDocs

Template for Workflows


AI Ethics


Data Analytics Tools


jupyter notebook


Deployment

  • according to Wikipedia

    Software deployment is all of the activities that make a software system available for use.


notebooks are just for exploration

  • I don't like notebooks - Joel Grus
  • well
  • let's deploy jupyter notebooks

papermill + nbconvert



  • jupyter notebook → webpage (HTML)
  • how to


webapps / Dashboarding


Maria Teresa Grifa

  • data scientist at Bridgestone EMA
  • github: MT-G

Data Science Steps



Data Preparation


Raw Data

  • structured data: data matrix
  • graph: web and social networks
  • spatial data
  • time series
    • sensor data
    • stock exchange data
  • unstructured data:
    • text
    • images

Data Cleaning

  • extract data from different formats
    • excel, json, csv, pdf, jpg, mp4, etc.
  • evaluation of data accuracy and reliability
    • presence of missing values
    • outliers
    • inconsistencies
    • level of noise
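
A small sketch of two of the checks above, missing values and outliers, using NumPy. The readings are made up for illustration, and the interquartile-range (IQR) rule with a factor of 1.5 is one common convention for flagging outliers, chosen here as an assumption.

```python
import numpy as np

# Hypothetical sensor readings; NaN marks a missing value.
values = np.array([10.1, 9.8, np.nan, 10.3, 55.0, 9.9, np.nan, 10.0])

# Presence of missing values
missing_mask = np.isnan(values)
n_missing = int(missing_mask.sum())

# Outliers via the IQR rule, computed on the observed values only
observed = values[~missing_mask]
q1, q3 = np.percentile(observed, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = observed[(observed < lower) | (observed > upper)]

print(f"missing: {n_missing} of {values.size}")
print(f"outliers: {outliers}")
```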


Data Consolidation

  • consistency across different data sources
  • consistency of units
  • consistency of scales
  • consistency of file and folder names, etc.
  • data testing
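
One possible reading of "data testing" is a set of automated checks that run every time data sources are merged. The sketch below assumes hypothetical records with a `weight` field expected in kilograms; the field names, the unit and the plausible range are all illustrative assumptions.

```python
def check_records(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    for i, rec in enumerate(records):
        # Consistency of units across sources
        if rec.get("unit") != "kg":
            problems.append(f"row {i}: unexpected unit {rec.get('unit')!r}")
        # Missing values and plausible ranges
        weight = rec.get("weight")
        if weight is None:
            problems.append(f"row {i}: missing weight")
        elif not 0 < weight < 500:
            problems.append(f"row {i}: weight {weight} outside (0, 500)")
    return problems


# Hypothetical records merged from two sources
records = [
    {"weight": 72.5, "unit": "kg"},
    {"weight": 160.0, "unit": "lb"},   # inconsistent unit
    {"weight": None, "unit": "kg"},    # missing value
]
problems = check_records(records)
for problem in problems:
    print(problem)
```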

Exploratory Data Analysis


Problem identification

  • analysis before modelling
  • the objective is to understand the problem in order to generate testable hypotheses
  • clear, concise and measurable
  • define
    • target/label (dependent variable)
    • features (independent variables)
  • crucial to select the right class of algorithms

Basic Statistics

  • describe dimensions
  • type of distributions
  • descriptive statistics
    • mean, median, mode, std
  • correlation between features
  • relationships and patterns due to the structure of the data
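
The quantities listed above can be computed in a few lines. This is a sketch on a tiny, made-up data matrix, using NumPy plus the standard-library `statistics` module for the mode.

```python
import statistics

import numpy as np

# Hypothetical data matrix: rows are samples, columns are features.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 6.2],
    [4.0, 7.9],
    [4.0, 8.1],
])

n_samples, n_features = X.shape                  # dimensions
mean = X.mean(axis=0)
median = np.median(X, axis=0)
std = X.std(axis=0, ddof=1)                      # sample standard deviation
mode_first = statistics.mode(X[:, 0].tolist())   # most frequent value of feature 0
corr = np.corrcoef(X, rowvar=False)              # feature-by-feature correlation matrix

print("shape:", (n_samples, n_features))
print("mean:", mean, "median:", median, "std:", std)
print("correlation between the two features:", corr[0, 1])
```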

Visualization


Quintessential rules

Data visualization is a key part of communicating your work to others


  • less is more
    • choose the right type of graph:
      a good graph lets you tell a story
    • check the size of the axis tick marks
  • reduce the clutter
    • avoid unnecessary or distracting visual elements
      • ornamental shading, dark gridlines
      • 3D when it is not needed

Tips

A color can be described using three components (an alternative to thinking in RGB channels)

  • hue: component that distinguishes “different colors”
    • vary hue to distinguish categorical data
  • saturation: the colorfulness
    • vary saturation to stratify the plot
  • luminance: how much light is emitted, ranging from black to white
    • vary luminance to represent ranges/bins in numerical data
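
A rough sketch of these ideas using only the standard-library `colorsys` module. Note the assumption: `colorsys` works in HLS space, whose "lightness" is only an approximation of perceived luminance, so dedicated palettes from plotting libraries are preferable for real figures. The hue, lightness and saturation values are arbitrary choices for illustration.

```python
import colorsys


def to_hex(rgb):
    """Convert an (r, g, b) tuple with components in [0, 1] to '#rrggbb'."""
    return "#" + "".join(f"{round(c * 255):02x}" for c in rgb)


# Categorical palette: vary the hue, keep lightness and saturation fixed.
n_categories = 4
categorical = [
    to_hex(colorsys.hls_to_rgb(i / n_categories, 0.5, 0.8))
    for i in range(n_categories)
]

# Sequential palette: keep one hue, move lightness from light to dark.
n_bins = 5
sequential = [
    to_hex(colorsys.hls_to_rgb(0.6, 0.9 - 0.7 * i / (n_bins - 1), 0.8))
    for i in range(n_bins)
]

print("categorical:", categorical)
print("sequential: ", sequential)
```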

Sequential palettes

A sequential palette ranges from a lighter shade to a darker one. The same or a similar hue is used throughout, while saturation varies.


Viridis palette

  • is built from blue-to-yellow sequences (avoiding reds) to make visualizations more readable
  • When to use it:
    • to represent numeric values
    • when the data range has no meaningful midpoint and no specific value needs highlighting

Diverging palettes

A diverging palette can be created by combining two sequential palettes (e.g. join them at the light colors and then let them diverge to different dark colors)


Icefire palette

  • When to use it:
    • two hues are used to indicate a division, such as positive and negative values or booleans
    • there is a value of importance around which the data are to be compared

Visualization packages

  • Matplotlib
    • used for basic plots such as line charts and bar graphs
    • works with datasets and arrays
    • highly customizable; pairs well with Pandas and NumPy

  • Seaborn
    • can produce complex visualizations with fewer commands
    • works with an entire dataset treated as a single unit
    • ships with more built-in themes and offers a higher-level interface built on top of Matplotlib

Hands-on


Machine Learning Intro


ML WorkFlow

Process of solving a practical problem by

  1. gathering a dataset
  2. building a statistical model on that dataset

Machines don't learn
A learning machine finds a mathematical formula which, when applied to a collection of inputs, produces the desired outputs.
If you distort your data inputs, the output is very likely to become completely wrong


Why the name Machine Learning?

Arthur Lee Samuel was an American pioneer in the field of computer gaming and artificial intelligence.

He popularized the term "machine learning" in 1959 at IBM.

Marketing reason


Two Types of Learning


Supervised Learning

The dataset is a collection of labeled examples

$\{(x_i, y_i)\}_{i=1}^{N}$
$x_i$, $i = 1, \dots, N$, is called the feature vector
$y_i$, $i = 1, \dots, N$, is called the label or target

Goal: use a dataset to produce a model that takes a feature vector as input and outputs information that allows deducing the label for this feature vector


Unsupervised Learning

The dataset is a collection of unlabeled examples

$\{x_i\}_{i=1}^{N}$

Goal: create a model that takes a feature vector as input and either transforms it into another vector or into a value that can be used to solve a practical problem


Classification Problem

Classification predictive modeling is the task of approximating a mapping function from input variables to discrete output variables.

A discrete output variable is a category, such as a boolean variable.

Example: Spam detection


Regression Problem

Regression predictive modeling is the task of approximating a mapping function from input variables to a continuous output variable.

A continuous output variable is a real-valued quantity, such as an amount or a size, typically stored as a floating-point (or integer) number.

Example: House price prediction


Machine Learning Map

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html


Scikit-Learn

https://scikit-learn.org


Linear Regression

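$\{(x_i, y_i)\}_{i=1}^{N}$

$x_i$: D-dimensional feature vector of sample $i = 1, \dots, N$

$y_i \in \mathbb{R}$, $i = 1, \dots, N$, and $x_i^{(j)} \in \mathbb{R}$, $j = 1, \dots, D$

Model:

$f_{w,b}(x) = w x + b$, where $w$ is a D-dimensional vector of parameters and $b \in \mathbb{R}$


Goal:
predict the unknown $y$ for a given $x$, $y = f_{w,b}(x)$:
find the best set of parameters $(w, b)$

How:
Minimize the objective function

$\frac{1}{N} \sum_{i=1}^{N} (f_{w,b}(x_i) - y_i)^2$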
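
A minimal NumPy sketch of this minimization. The synthetic data, the "true" parameters and the noise level are assumptions made for illustration. For the linear model the mean squared error objective has a closed-form minimizer, obtained here with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): N samples, D features.
N, D = 200, 3
X = rng.normal(size=(N, D))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 4.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=N)


def objective(w, b):
    """Mean squared error (1/N) * sum_i (f_{w,b}(x_i) - y_i)^2."""
    return np.mean((X @ w + b - y) ** 2)


# Append a column of ones so the bias b is estimated together with w.
X1 = np.hstack([X, np.ones((N, 1))])
solution, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_hat, b_hat = solution[:D], solution[D]

print("w:", np.round(w_hat, 2), "b:", round(float(b_hat), 2))
print("objective at the fit:", objective(w_hat, b_hat))
```

The recovered parameters should land close to the ones used to generate the data, and the objective at the fit is no larger than at any other choice of $(w, b)$.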


Logistic Regression

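$\{(x_i, y_i)\}_{i=1}^{N}$

$x_i$: D-dimensional feature vector of sample $i = 1, \dots, N$

$y_i \in \{0, 1\}$, $i = 1, \dots, N$, and $x_i^{(j)} \in \mathbb{R}$, $j = 1, \dots, D$


Model:

$f_{w,b}(x) = \frac{1}{1 + \exp(-(w x + b))}$, where $w$ is a D-dimensional vector of parameters and $b \in \mathbb{R}$


Goal: maximize the likelihood of the training set

$L_{w,b} = \prod_{i=1}^{N} f_{w,b}(x_i)^{y_i} (1 - f_{w,b}(x_i))^{(1 - y_i)}$

When $y_i = 1$, the factor is $f_{w,b}(x_i)$

When $y_i = 0$, the factor is $(1 - f_{w,b}(x_i))$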

No closed-form solution: use numerical optimization, for example gradient descent
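
A sketch of that numerical optimization with plain gradient descent in NumPy. The synthetic data, the learning rate and the number of iterations are untuned assumptions. Instead of maximizing the likelihood directly, the code minimizes the average negative log-likelihood, which has the same optimum and is numerically better behaved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data (an assumption for illustration).
N, D = 300, 2
X = rng.normal(size=(N, D))
y = (X @ np.array([1.5, -2.0]) + 0.3 > 0).astype(float)


def f(X, w, b):
    """Logistic model f_{w,b}(x) = 1 / (1 + exp(-(w x + b)))."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def neg_log_likelihood(w, b, eps=1e-12):
    """Average negative log of the likelihood; minimizing it maximizes the likelihood."""
    p = np.clip(f(X, w, b), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


w, b = np.zeros(D), 0.0
learning_rate = 0.5            # hypothetical value, not tuned
for _ in range(500):
    error = f(X, w, b) - y     # derivative of the average loss w.r.t. (w x + b)
    w -= learning_rate * (X.T @ error) / N
    b -= learning_rate * error.mean()

accuracy = np.mean((f(X, w, b) >= 0.5) == (y == 1))
print("negative log-likelihood:", neg_log_likelihood(w, b))
print("training accuracy:", accuracy)
```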


Basic Practice


Feature Engineering

Problem of transforming raw data into a dataset

Everything measurable can be used as a feature

Define features with high predictive power


Feature creation

  • creativity
  • domain knowledge
  • aggregation: define new features such as sum, product, linear combination, power, lags in time
  • binning: from numerical data to categorical data
  • encoding: from categorical data to numerical data
  • normalization: reduce to same range, avoid numerical overflow
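
Some of the transformations above, sketched with NumPy on made-up features. The bin edges, the category names and the choice of min-max scaling and one-hot encoding are assumptions for illustration; other encodings and scalers are equally valid.

```python
import numpy as np

ages = np.array([5, 17, 30, 42, 68, 80])    # hypothetical numeric feature
colors = np.array(["red", "blue", "red", "green", "blue", "red"])  # categorical

# Binning: numerical -> categorical (bin index per sample)
bin_edges = np.array([18, 65])              # below 18 / 18 to 64 / 65 and above
age_bins = np.digitize(ages, bin_edges)     # values in {0, 1, 2}

# Encoding: categorical -> numerical (one-hot)
categories, codes = np.unique(colors, return_inverse=True)
one_hot = np.eye(len(categories))[codes]

# Normalization: rescale to the [0, 1] range (min-max scaling)
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())

# Aggregation: a new feature built from an existing one (here a power)
ages_squared = ages_scaled ** 2

print("bins:", age_bins)
print("categories:", categories)
print("one-hot shape:", one_hot.shape)
```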

Feature Validation

  • Missing data
    • data imputation techniques
  • Imbalanced data
    • resampling techniques: oversampling the minority class using synthetic examples (SMOTE)
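
To illustrate the idea of rebalancing, here is a sketch of plain random oversampling with NumPy. This is deliberately not SMOTE: it duplicates existing minority samples instead of synthesizing new ones, and stands in as the simplest member of the same family. The dataset is made up. Oversampling of any kind should be applied to the training set only, never to validation or test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw minority samples with replacement until both classes have the same size.
n_extra = len(majority_idx) - len(minority_idx)
extra = rng.choice(minority_idx, size=n_extra, replace=True)
X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])

print("class counts before:", np.bincount(y))
print("class counts after: ", np.bincount(y_balanced))
```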

Choose the right algo for your problem

  • Try all algos
  • Explainability: black-box issue
  • Nonlinearity of the data
  • Number of features and examples

Splitting Techniques

  • Training set
    build your model
  • Holdout sets:
    • Validation set
      model selection and hyperparameter tuning
    • Test set
      evaluation

The rule of thumb
70% training set, 15% validation set, 15% test set

On Big Data: 95% training set, 2.5% validation set, 2.5% test set
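
The 70/15/15 rule of thumb can be sketched as a shuffled index split with NumPy. The function name and its parameters are hypothetical. Shuffling assumes the samples are independent; for time series the order should be preserved instead.

```python
import numpy as np


def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Return shuffled index arrays for the training, validation and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_val = int(round(val_frac * n_samples))
    n_test = int(round(test_frac * n_samples))
    val_idx = indices[:n_val]
    test_idx = indices[n_val:n_val + n_test]
    train_idx = indices[n_val + n_test:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Passing `val_frac=0.025, test_frac=0.025` gives the split suggested for very large datasets.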


Model Performance Visualization



Model Performance

Overfitting

  • high variance
  • models the training set too well
  • learns detail and noise in the training data and it negatively impacts the performance of the model on new data
    • probs: reduced ability of the model to generalize

When:

  • nonlinear models, which have more flexibility when learning a target function

How to solve:

  • try simpler model
  • dimensionality reduction
  • regularization

Underfitting

  • high bias
  • can neither model the training data nor generalize to new data
  • probs: poor performance on the training data
  • how to solve:
    • increase the algo complexity
    • engineer features with higher predictive power

Model Performance Metrics i

Qst: How good is my model on unseen data?

Linear regression metrics examples

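  • Mean squared error

    $MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

    • $MSE(\text{test}) \gg MSE(\text{train}) \Rightarrow$ Overfitting
  • Coefficient of determination

    $R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$, where $\bar{y} = \mathrm{mean}(y_i)$
    indication of the goodness of fit of a set of predictions to the actual values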
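
Both metrics in a few lines of NumPy, as a sketch on made-up values. $R^2$ is computed with the standard definition: one minus the residual sum of squares divided by the total sum of squares around the mean of the actual values.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

print("MSE:", mse(y_true, y_pred))
print("R2: ", r2(y_true, y_pred))
```

Computing `mse` on both the training and the test set and comparing the two values gives the overfitting check mentioned above.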


Model Performance Metrics ii

Qst: How good is my model on unseen data?

Classification Metrics Example

  • classification accuracy:
    ratio of the number of correct predictions to all predictions made
  • confusion matrix
    table with the predicted classes on the x-axis and the actual classes on the y-axis
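
A sketch of both metrics with NumPy, on made-up spam-detection labels. The matrix layout follows the convention above: rows (y-axis) are the actual classes and columns (x-axis) are the predicted classes.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples per (actual, predicted) pair; rows are actual classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual, predicted] += 1
    return matrix


# Hypothetical labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=2)
accuracy = np.trace(cm) / cm.sum()   # correct predictions sit on the diagonal

print(cm)
print("accuracy:", accuracy)
```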

Improve Model Performance

hyperparameter tuning
a hyperparameter is a model configuration argument specified by the developer to guide the learning process for a specific dataset

  • grid search:
    define a search space as a grid of hyperparameter values and evaluate every position in the grid

cross validation

  • k-fold cross validation
  • stratified cross validation
  • rolling cross validation (time series)
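
Grid search and k-fold cross validation combined, as a NumPy-only sketch. The model (ridge regression in closed form), the hyperparameter `alpha`, the grid values and the synthetic data are all assumptions chosen to keep the example short. In practice scikit-learn's `sklearn.model_selection` module provides `KFold`, `StratifiedKFold`, `TimeSeriesSplit` and `GridSearchCV` for this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for illustration).
N, D = 120, 5
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=N)


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression without intercept; alpha is the hyperparameter."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def k_fold_mse(X, y, alpha, k=5, seed=0):
    """Average validation MSE over k folds of a shuffled index split."""
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_ridge(X[train_idx], y[train_idx], alpha)
        scores.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(scores))


# Grid search: evaluate every value in the grid and keep the best one.
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {alpha: k_fold_mse(X, y, alpha) for alpha in grid}
best_alpha = min(cv_scores, key=cv_scores.get)

print("cross-validated MSE per alpha:", cv_scores)
print("best alpha:", best_alpha)
```

Every grid value is scored on the same folds, so the comparison between hyperparameter values is fair. The held-out test set is not used here; it stays reserved for the final evaluation.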

Hands-on


1. exploratory data analysis

colab



2. scikit-learn (Colab)

colab


  • linear regression

colab


  • logistic regression

project template


Thanks!


Tips



Acknowledgements



Thanks to all the contributors


References