---
breaks: true
---

# Running Machine Learning Experiments

March 15th, 2022

[toc]

## Computing environment

### Local-based options

It is recommended to use a Unix workstation. If you are a Windows user, the simplest solution consists of setting up a Docker container or leveraging Windows Subsystem for Linux (WSL), a compatibility layer that enables you to run Linux applications from Windows. It may seem like a hassle, but it will save you a lot of time and trouble in the long run.

#### Jupyter notebooks: the preferred way to develop and run machine learning experiments

Jupyter notebooks are a great way to develop and run machine learning experiments. They are widely used by the data science and machine learning communities.

##### What is a _notebook_?

A _notebook_ is a file generated by the [Jupyter Notebook application](https://jupyter.org). It mixes the ability to execute Python code with rich text-editing capabilities for annotating what is being done. A notebook also enables one to break up long experiments into smaller pieces that can be executed independently, which makes development interactive without requiring one to rerun all previous code if something goes wrong later in an experiment.

A Jupyter notebook supports a mix of executable code, mathematical descriptions, data visualizations, and raw text, allowing users to bring together data, code, and prose to tell an interactive, computational story. Whatever the topic under study, Jupyter notebooks can combine explanations with the interactivity of an application.

Concretely, a **notebook** comprises a set of **cells**. A **cell** may contain raw text, code, images, _markdown_ text, and $\LaTeX$.

It's recommended to use Jupyter notebooks to get familiar with machine learning development. You can also run standalone Python scripts or run code from within an IDE such as [PyCharm](https://www.jetbrains.com/pycharm).

#### Docker

Docker is an application to orchestrate (i.e., create, start, and stop) Linux containers. A **container** is an operating system-level virtualization environment. Container-based virtualization provides an elegant solution for running applications in isolated environments. Nowadays, containers are widely used to obtain reproducible computing environments.

![Difference between container- and virtual machine-based computing environments](https://i.imgur.com/eegzyAx.png)

Docker is available for all popular platforms. You can download it at [https://docs.docker.com/get-docker/](https://docs.docker.com/get-docker/).

##### Docker image

Docker containers are described through a Docker **image**. An image is a read-only template that contains a set of instructions for creating a container that can run on the Docker platform. It provides a convenient way to package up applications and preconfigured computing environments, which you can keep for your own private use or share publicly with other Docker users through the [Docker Hub](https://hub.docker.com). Docker images are also the starting point for anyone using Docker for the first time.

As images are just templates, you cannot start or run them. What you can do is use a template as a base to build a container; in this sense, a container is a running image. Once you create a container, it adds a writable layer on top of the immutable image, meaning you can now modify it. The base image from which you create a container exists separately and cannot be altered.
When you run a containerized environment, you create a read-write copy of that filesystem inside the container. This added writable layer lets you make changes in the container while the underlying image remains unchanged.

![Structure of a Docker image](https://i.imgur.com/626RYPx.png)

##### Dockerfile

We use a **Dockerfile** to describe a Docker image.

> A **[Dockerfile](https://docs.docker.com/engine/reference/builder/)** is **a text document that contains all the commands a user could call on the command line to assemble an image**.

```dockerfile
FROM continuumio/miniconda3:4.10.3p1

RUN conda install -c conda-forge \
    numpy==1.21.2 \
    pandas==1.4.1 \
    matplotlib==3.5.1 \
    scikit-learn==1.0.2 -y
```

We can build an image through the command

```shell=bash
docker build -t mlprj:1.0.0 .
```

where:

* **`-t`** specifies a name and a version (tag) for the image
* **`.`** specifies the build context, i.e., the location to look for the files

Once the image is built, we can run it with

```shell=bash
docker run -it --name mlprj --rm -v $PWD:/prj mlprj:1.0.0
```

where:

* **`-it`** activates the interactive mode
* **`--name`** sets a name for the container
* **`-v`** mounts the current directory to the `/prj` path in the container (a bind mount of the host directory into the container)

#### Docker Compose

**Docker Compose**, on the other hand, provides a descriptive way to define how to run multiple containers. It can be downloaded at [https://docs.docker.com/compose/install/](https://docs.docker.com/compose/install/).

#### Visual Studio Code

**Visual Studio Code** (or VSCode for short) is a cross-platform source-code editor with support for various programming languages, including Go, Python, Julia, C/C++, and Java. It offers support for debugging, syntax highlighting, code completion, and embedded code versioning. Besides, its features can be enhanced by installing extensions.

Visual Studio Code is available for download at [https://code.visualstudio.com/](https://code.visualstudio.com/#alt-downloads). Once you have installed it, it's recommended to install the extensions described below.

1. [Docker for Visual Studio Code](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-docker)
2. [Remote SSH](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh)
3. [Remote Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
4. [Remote Development](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack)
5. [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python)
6. [Anaconda Extension Pack](https://marketplace.visualstudio.com/items?itemName=ms-python.anaconda-extension-pack)
7. [Jupyter](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter)

#### Conda

![](https://i.imgur.com/IE0IDWm.png)

When running machine learning experiments, we usually need a specific Python version or specific versions of some libraries. This is where **virtual environments** become useful. Virtual environments isolate these dependencies in separate _sandboxes_ so you can switch between versions easily and get them running. They are analogous to Docker containers.

<center>
<img src="https://i.imgur.com/JRBqcr9.png"/>
</center>

There are multiple ways to create a Python environment, including _virtualenv_, [conda](https://docs.conda.io/en/latest/), and Docker containers.

##### What is Conda?

> Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux.
> It can quickly install, run, and update packages and their dependencies. It can also easily create, save, load, and switch between **environments** on your computer.

> Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.

##### Creating a conda environment

To quickly create an environment using conda, you can execute the command:

```shell=bash
conda create --name mlprj -c conda-forge \
    python=3.9 \
    numpy==1.21.2 \
    pandas==1.4.1 \
    matplotlib==3.5.1 \
    scikit-learn==1.0.2 -y
```

where:

* **`--name`** or **`-n`** sets the name of the new environment
* **`-c conda-forge`** specifies the repositories (channels) in which to look for the libraries
* **`python=3.9`** specifies which Python version to set up
* `numpy==1.21.2`, `pandas==1.4.1`, `matplotlib==3.5.1`, and `scikit-learn==1.0.2` are the Python libraries to install, with their corresponding versions
* **`-y`** tells the command line to answer **yes** to all prompts

It is also possible to create a **YAML** (YAML Ain’t Markup Language) file describing the environment. For the previous command, the YAML file looks something like this:

```yaml
name: mlprj
channels:
  - defaults
  - conda-forge
dependencies:
  - python==3.9
  - numpy==1.21.2
  - pandas==1.4.1
  - matplotlib==3.5.1
  - scikit-learn==1.0.2
```

The file is usually named _environment.yml_. We create an environment from it through the command

```bash
conda env create -f environment.yml
```

##### Listing existing conda environments

```shell=bash
conda env list
```

##### Activating an existing conda environment

We can activate a conda environment through the command

```shell=bash
conda activate mlprj
```

##### Deactivating an active conda environment

We can **deactivate** an **active** environment through

```shell=bash
conda deactivate
```

##### Updating a conda environment

```bash
conda env update -f environment.yml
```

##### Exporting an existing conda environment

When a conda environment already exists, we can generate a **YAML** file to duplicate or document it through the command:

```bash
conda env export > environment.yml
```

##### Removing a conda environment

```shell=bash
conda remove --name mlprj --all -y
```

### Cloud-based environment

#### Using Google Colaboratory

[Google Colaboratory](https://colab.research.google.com/) (or [Colab](https://colab.research.google.com/) for short) is a **Jupyter notebook service** that runs entirely in the cloud. It provides free GPU and TPU runtimes, so you don't have to configure your own computing environment.

#### Google Colab setup

#### Getting started

Access [https://colab.research.google.com](https://colab.research.google.com) and then click on the **New Notebook** button. You will see the standard notebook interface shown below.

![Default Google Colab Interface](https://i.imgur.com/PUzFWaJ.png)

Notice that there are two buttons in the toolbar: **+ Code** and **+ Text**. They are used for creating executable Python code cells and text cells. After entering the code in a **code cell**, you can execute it by pressing **Shift+Enter** or **Command+Enter** on macOS.

#### Mounting your Google Drive

You can mount your Google Drive on Colab to access your files stored in Google Drive.
For this, you must run the following code.

```python=3.9
from google.colab import drive
drive.mount('/content/drive')
```

When running this code, the system will ask you to confirm the operation, as depicted below.

![](https://i.imgur.com/U3Jk2tB.png)

Follow the link, sign in to your Google account, and accept the requested permissions.

#### Executing shell commands

You can use command-line instructions on Colab by prefixing the command with `%` (for line magics such as `ls` and `cd`) or `!` (for arbitrary shell commands). Example:

```python=3.9
%ls "/content/drive/MyDrive/"
```

You can use commands like `ls` and `cd` to find the folder where a notebook is stored.

#### Installing new packages

The default Colab environment already comes with many Python libraries installed, such as TensorFlow, Keras, NumPy, Pandas, Matplotlib, and scikit-learn, among others. You can check the list of installed packages through the command

```python
!pip list -v
```

A line starting with `!` indicates a shell command rather than Python code. If you ever need to install a package, you can do it with pip:

```python
!pip install some_package
```

To use a GPU or TPU runtime with Colab, select **Runtime > Change Runtime Type** in the menu and select GPU for the hardware accelerator.

![Changing runtime type in Google Colab](https://i.imgur.com/TbnjjRt.png)

### Locally running a Jupyter Notebook

You can also run Jupyter locally: install the `jupyter` package (for instance in a conda environment) and launch `jupyter notebook` from your project directory; the interface then opens in your browser.

## Scientific Computing with NumPy, Pandas, and Matplotlib

You can use the following resources: [Introduction to NumPy and Matplotlib](https://sebastianraschka.com/blog/2020/numpy-intro.html) by Sebastian Raschka and [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) to learn more about these libraries.

### NumPy

**NumPy** is a library that provides:

* an extension package to Python for working with vectors and matrices
* the capacity to handle large multi-dimensional arrays efficiently, with performance close to the hardware
* a memory-efficient container that supports fast numerical operations
* a convenient abstraction (i.e., data structures and functions) for scientific computation
* a data structure and a set of functions for handling numeric vectors and matrices in Python

The short sketch below illustrates a few of these functions, and the tables that follow describe the main NumPy functions.
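As a quick, minimal sketch (the shapes and values are arbitrary, chosen only for illustration), the following code exercises some of the creation and reduction functions summarized below:

```python
import numpy as np

# Array creation
ones = np.ones((2, 3))         # 2x3 matrix filled with ones
zeros = np.zeros(4)            # vector of four zeros
identity = np.eye(3)           # 3x3 identity matrix
diagonal = np.diag([1, 2, 3])  # diagonal matrix with 1, 2, 3 on the diagonal

# Reductions
a = np.array([3, 7, 1, 9, 4])
print(a.sum(), a.min(), a.max())   # 24 1 9
print(a.argmin(), a.argmax())      # 2 3
print(a.mean(), np.median(a))      # 4.8 4.0
```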
Array creation functions:

| Function | Description |
| -------- | ----------- |
| `np.ones` | Creates an array with the given shape filled with ones |
| `np.zeros` | Creates an array with the given shape filled with zeros |
| `np.eye` | Creates an identity matrix of the given size |
| `np.diag` | Creates a diagonal matrix |

Reduction functions:

| Function | Description |
| -------- | ----------- |
| `sum` | Computes the sum of the elements of a vector |
| `min` | Computes the minimal value of an array |
| `max` | Computes the maximal value of an array |
| `argmin` | Returns the index of the element with the lowest value |
| `argmax` | Returns the index of the element with the maximum value |
| `mean` | Computes the average of the elements of a given array |
| `median` | Computes the median of the elements of a given array |

### Plotting with Matplotlib

* Matplotlib is a 2D plotting library for Python
* It provides a quick way to visualize data from Python
* It comes with a set of plot types (e.g., scatter plots, histograms, and line plots) -- see [matplotlib.org/gallery.html](https://matplotlib.org/gallery.html)
* We can import its functions and draw a simple plot through the code

```python=3.9
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 3, 15)
y = np.linspace(0, 9, 15)

plt.plot(x, y)
plt.show()
```

* We can use the Jupyter notebook magic `%matplotlib inline` to display the plots in the notebook and to enable interactive plots
* We don't need to call the function `plt.show()` in the interactive mode

### Pandas

- Pandas is a library that provides:
    - An efficient **DataFrame** data structure for data manipulation
    - Functions for **reading** and **writing** data in different formats:
        - CSV, text, and Microsoft Excel
        - SQL databases
        - HDF5 format
    - Functions to **aggregate** and **transform** data, enabling split, apply, and combine operations on data sets
    - Flexible **reshaping** and **pivoting** of data sets
    - Flexible functions to handle **missing** data
- It is used in a variety of domains, including, for instance, finance, neuroscience, economics, and statistics
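As a minimal, self-contained sketch (the tiny data set below is made up purely for illustration and is not part of the original notes), the following code builds a small DataFrame and applies a few of the inspection and cleaning functions listed in the tables that follow:

```python
import numpy as np
import pandas as pd

# A tiny, hand-made data set with a couple of missing values
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41],
    "salary": [2300.0, 2800.0, 3100.0, np.nan],
    "city":   ["Paris", "Lyon", "Paris", "Nice"],
})

print(df.head(2))         # first two rows
print(df.shape)           # (4, 3)
print(df.describe())      # summary statistics for the numerical columns
print(df.isnull().sum())  # number of null values per column

# Replace missing values in the numerical columns by the column mean
clean = df.fillna({"age": df["age"].mean(), "salary": df["salary"].mean()})
print(clean)
```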
The main functions to inspect the data are:

| Function | Description |
| -------- | ----------- |
| `df.head(n)` | Shows the first $n$ rows of the DataFrame |
| `df.tail(n)` | Shows the last $n$ rows of the DataFrame |
| `df.shape` | Returns the dimensions of the DataFrame |
| `df.info()` | Returns index, data type, and memory information of the DataFrame |
| `df.describe()` | Shows summary statistics for numerical columns |
| `df.apply(pd.Series.value_counts)` | Shows unique values and counts for all the columns |

The main ones for data cleaning are:

| Function | Description |
| -------- | ----------- |
| `pd.isnull()` | Checks for null values |
| `pd.notnull()` | Checks for non-null values |
| `df.dropna()` | Drops all rows that have null values |
| `df.dropna(axis=1)` | Drops all columns that have null values |
| `df.dropna(axis=1, thresh=n)` | Drops all columns that have less than $n$ non-null values |
| `df.fillna(x)` | Replaces all null values by $x$ |
| `s.fillna(s.mean())` | Replaces all null values by the mean |
| `s.astype(float)` | Converts the data type of the series to float |
| `s.replace(x, y)` | Replaces all values equal to $x$ by $y$ |

## Regression

### Model Assessment

* Residual sum of squares (RSS)

$$
\begin{equation*}
RSS = \sum\limits_{i=1}^{n}{(y_i - f(x_i))^2}
\end{equation*}
$$

* Root-mean squared error (RMSE)

$$
\begin{equation*}
RMSE = \sqrt{\frac{\sum\limits_{i=1}^{n}{(y_i - f(x_i))^2}}{n}}
\end{equation*}
$$

* Relative squared error (RSE)

$$
\begin{equation*}
RSE = \frac{\sum\limits_{i=1}^{n}{(y_i - f(x_i))^2}}{\sum\limits_{i=1}^{n}{(y_i - \bar{y})^2}}
\end{equation*}
$$

* Coefficient of determination

$$
\begin{equation*}
R^2 = 1 - RSE
\end{equation*}
$$

### Learning curves

* When building a model, our goal is to design one that fits the data well
* How can we assess whether we are building a good enough model?
* In other words, what can we do to check that the model is not **overfitting** or **underfitting** the data?
* A model is **overfitting** when it performs well on training data but generalizes poorly on test data
* A model is **underfitting** when it performs poorly on both training and test sets
* We can use **learning curves** to visualize the performance of a model on training and test sets as a function of the training set size:
    - To generate them, we have to train the model on training sets of different sizes (see the sketch at the end of this section)

### Assessing performance, error types, and bias/variance trade-off

### Overfitting, regularized regression, ridge regression, lasso

![Learning curves](https://i.imgur.com/MkOpOZm.png)

* When a model is **underfitting** the training data, adding more training examples is **useless**. We must use a more complex model or come up with better features
* On the other hand, when a model is **overfitting**, we can feed it more training examples until the validation error reaches the training error

### Bias, variance, and trade-off

* A model's generalization error can be expressed as the sum of its bias, variance, and irreducible error
* **Bias** comes from wrong assumptions, such as assuming that the data follow a linear law. A high-bias model is most likely to underfit the training data
* **Variance** comes from excessive sensitivity to small variations in the training data. A model with many degrees of freedom usually has high variance, and thus is most likely to overfit the training data
* The **irreducible error** comes from the noise in the data itself. One way to reduce this part of the generalization error is to clean up the data
* Trade-off:
    - **increasing** a model's complexity commonly increases its variance and reduces its bias
    - **reducing** a model's complexity increases its bias and reduces its variance
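As a minimal, illustrative sketch of how such learning curves can be generated (the synthetic data set, the `LinearRegression` model, and the scoring choice below are placeholder assumptions, not part of the original notes), one option is scikit-learn's `learning_curve` helper:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.5, size=200)

# Train the model on training sets of increasing size, with 5-fold cross-validation
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5, scoring="neg_root_mean_squared_error",
)

# Plot the mean training and validation errors (RMSE) against the training set size
plt.plot(train_sizes, -train_scores.mean(axis=1), "o-", label="training error")
plt.plot(train_sizes, -val_scores.mean(axis=1), "o-", label="validation error")
plt.xlabel("Training set size")
plt.ylabel("RMSE")
plt.legend()
plt.show()
```

If both curves plateau at a high error, the model is underfitting; a large gap between the training and validation curves is a sign of overfitting.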
## References

1. [Python Virtual Environments: A Primer](https://realpython.com/python-virtual-environments-a-primer/)
1. [Conda: Myths and Misconceptions](https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/)
1. [Getting started with Python environments (using Conda)](https://towardsdatascience.com/getting-started-with-python-environments-using-conda-32e9f2779307)
1. [Benefits of conda vs. pip](https://www.scivision.dev/benefits-of-conda-vs-pip/)
1. [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
1. [A Beginner’s Guide to Understanding and Building Docker Images](https://jfrog.com/knowledge-base/a-beginners-guide-to-understanding-and-building-docker-images/)
1. [Docker Image vs Container: The Major Differences](https://phoenixnap.com/kb/docker-image-vs-container)
1. [Pyplot tutorial](https://matplotlib.org/3.5.1/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py)

###### tags: `NumPy` `ESME` `Machine Learning` `Conda` `Docker`