---
tags: FTMLE-Philipines-2020
---
# WEEK 5
## Monday 15.6.2020
**Module - Test solution** See in gmail.
### 1. What is Machine Learning? (What is Supervised machine learning model?)
<div style="text-align: justify">
Here we give a mathematical definition of a machine learning model. The first type we want to mention is **supervised learning**. Within this type, there are also several subtypes of models:
Given $m$ pairs of training data $T := (X_j, Y_j)_{j=1}^m$, where the training inputs are denoted by $(X_j)$ and the outputs by $(Y_j)$, the goal of a machine learning model is to derive/construct/select a function $f: \mathscr{X} \to \mathscr{Y}$, called the classifier in some cases, from a set of functions $\{f : \mathscr{X} \to \mathscr{Y}\}$. The spaces $\mathscr{X}, \mathscr{Y}$ are usually subsets of $\mathbb{R}^n$. Note that we have two important parameters:
- $m$ is the sample size
- $n$ is the dimension of the data.
Roughly speaking, from the seen dataset $(X,Y)$ we want to generate a function $f$ such that for unseen data $(\bar{X}, \bar{Y})$ we have
$$
f(\bar{X}) \approx \bar{Y}
$$
with minimal error, that is, with the highest possible accuracy.
Depending on the output space $\mathscr{Y}$, we have the following subtypes of models:
- **Binary classification**: the output space is a discrete space with cardinality $2$. A typical output for this type of model is the answer to a "Yes/No" question, or $0,1$ if we want numerical values. If the cardinality is greater than $2$, we call it **multiclass classification**.
- If $\mathscr{Y} = \mathbb{R}^d$, we call this **multivariate regression**; if $d = 1$ we simply call it the **regression problem**. Note that these regression problems are similar to those in statistical regression; however, the approach is different!
The definitions of other types of machine learning models are given in the Note [...].
An example of a simple regression model in Python with the help of ```sklearn.linear_model```:
```python
from sklearn.linear_model import LinearRegression

# Assumes a pandas DataFrame `df` with numeric columns 'x' (feature) and 'y' (target)
model = LinearRegression()
model.fit(df[['x']], df['y'])  # X must be 2-D (n_samples, n_features), y 1-D
model.predict([[9]])           # predicted y for x = 9
```
</div>
### 2. Hypothesis classes and the No-Free-Lunch theorem
<div style="text-align: justify">
1. To be more precise, the type of machine learning we consider is **statistical machine learning**. What does that mean?
- It means we always have to take into account that our training/testing data space is equipped with some probability distribution $\mathbb{P}(\mathscr{X}\times \mathscr{Y})$. What's next? We have to accept the fact that our data is generated from that distribution. Everything is uncertain (the uncertainty!). Our training data can be affected by statistical fluctuation: a small change in the input data could lead to huge changes in the output, which could destroy the robustness of the model.
2. The **[No Free Lunch Theorem](https://en.wikipedia.org/wiki/No_free_lunch_theorem)** states that every successful machine learning algorithm has to make assumptions about the dataset/distribution $P$. This also means that there is no perfect algorithm for all problems.
3. A loss function, let's call it $h$, has to be considered to see how good/bad our model is! Our goal is then to minimize the loss function as much as possible, which means making the accuracy as high as possible. The loss function depends on the type of machine learning model and its goal (a small sketch is given below).
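As an illustration (my own sketch, not part of the lecture), two standard choices for such a loss function $h$ are the squared loss for regression and the zero-one loss for classification:
```python
import numpy as np

# Two common loss functions h(y_hat, y) -- illustrative sketch
def squared_loss(y_hat, y):
    """Mean squared error, typical for regression."""
    return np.mean((y_hat - y) ** 2)

def zero_one_loss(y_hat, y):
    """Fraction of wrong predictions, typical for classification."""
    return np.mean(y_hat != y)

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0])
print(zero_one_loss(y_pred, y_true))  # 0.5, i.e. accuracy = 1 - loss
```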
</div>
#### The curse of dimensionality
<div style="text-align: justify">
Here we demonstrate an example of the **curse of dimensionality**. To build a machine learning model that fits our goal, we consider features of the training data that support our model. The more features, the higher the dimension of the data space! The advantage is that the more features we have, the more information we gain to classify our output (more information to discriminate two classes, for instance). However, high dimensions make it much more difficult to learn!
Suppose that we consider $d$ features of a problem and each feature takes values in the interval $[0,1]$, so that our input data space is $[0,1]^d$ (a $d$-dimensional cube). As $d$ grows, any fixed number of samples covers only a vanishing fraction of this cube, so the data becomes extremely sparse.
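A quick numerical illustration (my own sketch): to capture a fixed fraction of uniformly distributed data in $[0,1]^d$, a sub-cube needs an edge length that approaches $1$ as $d$ grows.
```python
import numpy as np

# To enclose a fraction p of points distributed uniformly in [0, 1]^d,
# a sub-cube needs edge length p**(1/d) -- which approaches 1 as d grows.
p = 0.1
for d in [1, 2, 10, 100]:
    edge = p ** (1 / d)
    print(f'd = {d:3d}: edge length covering 10% of the data = {edge:.3f}')
```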
</div>
#### Overfitting and underfitting
<div style="text-align: justify">
Take the example of linear regression when we increase the polynomial order: a low order tends to underfit the data, while a very high order fits the training points almost perfectly but generalizes poorly (overfitting). A small sketch is given below.
</div>
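A minimal sketch (my own, not code from the lecture), using scikit-learn's `PolynomialFeatures` to compare polynomial degrees on noisy data:
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)
x_train, x_test = x[:20].reshape(-1, 1), x[20:].reshape(-1, 1)
y_train, y_test = y[:20], y[20:]

for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    # expect: degree 1 underfits (both errors high); degree 15 fits the training noise,
    # so its train error shrinks while its test error grows
    print(f'degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}')
```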
## Tuesday 16.6.2020
**Mathematics for Machine learning**
Some helper functions that draw vectors, projections, determinant areas, ...
```python
# Helper functions
import numpy as np
import matplotlib.pyplot as plt

def draw_grid(x_lim=np.array([-4, 4]), y_lim=np.array([-4, 4])):
    """Draw an empty grid"""
    ax = plt.gca()
    # Draw ticks and grid
    for i in range(int(x_lim.min()), int(x_lim.max())):
        ax.axvline(i, linestyle='--', color='#ecf0f1', zorder=0)
        ax.plot([i, i], [0.05, -0.05], color='#2c3e50')
    for i in range(int(y_lim.min()), int(y_lim.max())):
        ax.axhline(i, linestyle='--', color='#ecf0f1', zorder=0)
        ax.plot([0.05, -0.05], [i, i], color='#2c3e50')
    # x and y axis
    ax.axhline(0, color='#2c3e50', zorder=0)
    ax.axvline(0, color='#2c3e50', zorder=0)
    ax.scatter([0], [0], color='#c0392b', zorder=0)
    ax.grid(False)
    ax.set_xlim(x_lim)
    ax.set_ylim(y_lim)
```
```python
def draw_vectors(vectors, origin='origin', cmap=None, labels=None):
    """Draw 2-D vectors as arrows, optionally with labels and custom colors."""
    if cmap:
        colors = cmap
    else:
        colors = ['#3498db', '#e67e22', '#f1c40f', '#2ecc71', '#1abc9c']
    if origin == 'origin':
        x_0 = np.zeros(len(vectors))
        y_0 = np.zeros(len(vectors))
    else:
        x_0, y_0 = origin[:, 0], origin[:, 1]
    ax = plt.gca()
    if labels:
        for i in range(len(vectors)):
            ax.text(vectors[i][0]+0.1, vectors[i][1]+0.2, labels[i],
                    {'color': 'black', 'fontsize': 14, 'ha': 'center', 'va': 'center',
                     'bbox': dict(boxstyle="round", fc="white", alpha=0)})
    ax.quiver(x_0, y_0, vectors[:, 0], vectors[:, 1],
              angles='xy', scale_units='xy', scale=1, color=colors)
```
```python
def projection(x, y):
    """Draw the orthogonal projection of vector x onto vector y."""
    color = '#95a5a6'
    ax = plt.gca()
    dot_product = np.dot(x, y)
    y_norm = y / np.dot(y, y)             # y scaled by 1 / ||y||^2
    x_projection = y_norm * dot_product   # proj_y(x) = (x . y / y . y) * y
    ax.plot((x[0], x_projection[0]), (x[1], x_projection[1]), linestyle='--', linewidth=3, c=color)
    draw_vectors(np.array([x_projection]), cmap=['#bdc3c7'])
```
```python
def determinant_area(i_hat, j_hat):
    """Shade the parallelogram spanned by i_hat and j_hat and print its area."""
    ax = plt.gca()
    sum_vector = i_hat + j_hat
    polygon = plt.Polygon([(0, 0), i_hat, sum_vector, j_hat], fill=True, color='#f39c12', alpha=0.3)
    print('Determinant Area:', np.abs(i_hat[0]*j_hat[1] - i_hat[1]*j_hat[0]))
    ax.add_patch(polygon)  # a Polygon is a patch, not a line
```
### 1. Linear Algebra for Machine Learning
<div style="text-align: justify">
- Matrix computation
- The span space - Linear dependence and Linear independence.
- Matrix-matrix, matrix-vector multiplication.
- Linear Transformations.
  - Clockwise rotation by $\theta$ (a sketch using the helpers above follows this list).
- Linear systems of equations.
- The determinant of a matrix.
- Eigenvalues - Eigenvectors.
</div>
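As a small usage sketch (my own, assuming the helper functions `draw_grid` and `draw_vectors` defined above are in scope), apply a clockwise rotation by $\theta$ to the standard basis vectors:
```python
import numpy as np
import matplotlib.pyplot as plt

# Clockwise rotation by theta as a linear transformation
theta = np.pi / 4
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

basis = np.array([[1.0, 0.0], [0.0, 1.0]])  # i_hat, j_hat as rows
rotated = basis @ R.T                        # apply R to each basis vector

plt.figure(figsize=(6, 6))
draw_grid()
draw_vectors(np.vstack([basis, rotated]),
             cmap=['#3498db', '#e67e22', '#2ecc71', '#1abc9c'],
             labels=['i', 'j', 'Ri', 'Rj'])
plt.show()
```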
### 2. Calculus for Machine Learning
<div style="text-align: justify">
- Definition of derivative.
- Derivative of the sigmoid function $\sigma$; the saturation problem (see the sketch below)!
</div>
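A small sketch (my own) of the derivative $\sigma'(x) = \sigma(x)\,(1-\sigma(x))$, illustrating saturation: the derivative peaks at $0.25$ for $x = 0$ and is nearly $0$ for large $|x|$, so gradients become tiny.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # sigma'(x) = sigma(x) * (1 - sigma(x))

# Saturation: the derivative is tiny away from 0, so gradient updates barely move
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}: sigma'(x) = {sigmoid_derivative(x):.6f}")
```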
#### GRADIENT DESCENT ALGORITHM
<div style="text-align: justify">
**Mathematical explanation**
1. What is gradient descent algorithm?
Gradient descent is an algorithm for finding a minimizer; in other words, it is an optimization algorithm. Given a function $f: \mathbb{R}^d \to \mathbb{R}$, we want to find a point $x\in \mathbb{R}^d$ that is a local minimum of $f$. The output depends heavily on the initial starting point (the initial guess). Therefore, this type of algorithm is not the most robust one!
2. Why do we choose the gradient direction?
In order to clarify the role of the **gradient** vector in this algorithm, we need to distinguish between two concepts of multidimensional differentiation:
- **The directional derivative**: it tells us how our function $f$ changes in the direction of $\vec{v}$, i.e. the rate of change of $f$ when we move $x$ along the direction vector $\vec{v}$. Note that the directional derivative is a scalar value! More precisely, it is defined as:
$$
\nabla_v f(x) = \lim_{h \to 0} \frac{f(x+hv)-f(x)}{h}
$$
- **The gradient** is different. First of all, it is a **vector**. This vector points in the direction of steepest ascent, that is, the direction in which our function increases the most!
Here is the explanation:
As defined above, the directional derivative gives us the rate of change of the function $f$ along the direction $v$, but by definition we also have
$$
\nabla_v f(x) = \nabla f \cdot v
$$
where $\cdot$ denotes the dot product between two vectors. Therefore,
$$
\nabla_v f(x) = \nabla f \cdot v = \|\nabla f\| \|v\| \cos(\theta)
$$
Since one can normalize and choose the direction vector $v$ as a unit vector, we then have
$$
\nabla_v f(x) = \|\nabla f\| \cos(\theta)
$$
Then the directional derivative $\nabla_v f$ varies from $-\|\nabla f\|$ to $\|\nabla f\|$, so the highest rate of change (the maximum value of $\nabla_v f$) is attained when $\cos(\theta) = 1$, i.e. when $\vec{v}$ and $\nabla f$ point in the same direction. Hence, we conclude that $\vec{v} = C \nabla f$ with $C > 0$ is the steepest ascent direction.
Therefore, if $v = C\nabla f$ (with $C > 0$) is the direction of steepest ascent, what is the direction of steepest descent? It is the opposite! To obtain a descent direction, we choose a vector $p$ such that
$$
\vec{p} \cdot \nabla f < 0
$$
In the gradient descent algorithm, the direction $p$ is chosen to be $-\alpha\nabla f$ with some step length $\alpha > 0$. To avoid confusion at this point, think of $p$ as the direction in which we want to move $x$, not the rate of change of the function $f$. Another point to take away is the difference between the **directional derivative** and the **gradient vector**.
In 1-dimensional space there are only two directions, $-1$ and $+1$. Let's consider the parabola $f(x) = x^2$ on the interval $[-1,1]$. For $x\in [-1,0]$, $f(x)$ is decreasing and the derivative is negative, which means it points to the left, which is the ascent direction. The gradient vector and the directional derivative coincide in 1-dimensional space, so the intuition behind the gradient descent algorithm is harder to see there.
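A quick numerical sanity check (my own, not from the lecture) of the identity $\nabla_v f(x) = \nabla f(x) \cdot v$:
```python
import numpy as np

# Check grad_v f(x0) = grad f(x0) . v for f(x, y) = x**2 + 3*y
def f(p):
    x, y = p
    return x**2 + 3*y

x0 = np.array([1.0, 2.0])
grad_f = np.array([2*x0[0], 3.0])        # analytic gradient (2x, 3)
v = np.array([3.0, 4.0]) / 5.0           # a unit direction vector

h = 1e-6
directional = (f(x0 + h*v) - f(x0)) / h  # finite-difference directional derivative
print(directional, grad_f @ v)           # both are approximately 3.6
```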
</div>
## Wednesday 17.6.2020
<div style="text-align: justify">
**Solutions to the "Total improvements after 1 year"**
```python
# Your code here
import numpy as np
import scipy.stats as stats
import seaborn as sns

def get_daily_improvements():
    '''Return 365 daily improvements'''
    improvements = stats.norm.rvs(loc=0.5, scale=0.15, size=365)
    improvements[improvements < 0.1] = 0.1
    improvements[improvements > 1] = 1
    return improvements

def get_total_improvement(improvements):
    '''Return the total improvement after 1 year'''
    total_improvement = (1 + improvements/100).prod() * 100
    return total_improvement

n_samples = 10000
results = np.array([])
for i in range(n_samples):
    daily_improvements = get_daily_improvements()
    results = np.append(results, get_total_improvement(daily_improvements))

# Summary statistics and plot
print(f'Mean: {results.mean()}')
print(f'Std: {results.std()}')
print(f'Min: {results.min()}')
print(f'Max: {results.max()}')
sns.distplot(results)
```
**Gradient descent algorithm**
```python
# Your code here
import numpy as np
import matplotlib.pyplot as plt

# Objective function and a numerical (forward-difference) derivative
def f(x):
    return 2*x**2 - 15*np.sin(x)

def df(x):
    epsilon = 0.000001
    return (f(x+epsilon) - f(x)) / epsilon

# Initialize x with a random starting point in [-10, 10]
def initialize():
    return np.random.random()*20 - 10

# Update x in one step
def update(x_curr, learning_rate):
    x_next = x_curr - learning_rate*df(x_curr)
    diff = f(x_next) - f(x_curr)
    return x_next, f(x_next), abs(diff)

# Train function
def train(iterations, learning_rate):
    x_next = initialize()
    x_t = [x_next]
    y_t = [f(x_next)]
    n_steps = iterations
    print('x_init = ', x_next)
    for t in range(iterations):
        x_next, y_next, diff = update(x_next, learning_rate)
        x_t.append(x_next)
        y_t.append(y_next)
        if diff < 10e-6:  # stop when the change in f falls below the tolerance
            n_steps = t
            break
    return x_t, y_t, n_steps

x_t, y_t, n_steps = train(10000, 10e-4)
print('Number of steps:', n_steps)
print('Minima:', x_t[-1], y_t[-1])

# Plot the function and the iterates
x_plot = np.linspace(-10, 10, 500)
y_plot = f(x_plot)
plt.figure(figsize=(10, 6))
plt.plot(x_plot, y_plot)
plt.scatter(x_t, y_t, c='r')
plt.xlim(-10, 10)
plt.show()
```
- Try changing the learning rate $\alpha$ to see how it speeds up the process or makes the iteration diverge (see the small experiment below). The choice of learning rate $\alpha$ may depend on the specific problem!
- The result depends on the starting position: the algorithm may end up in a local rather than the global minimum.
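A hypothetical quick experiment (assuming the `train` function above is defined) to compare a few learning rates:
```python
# Compare a few learning rates with the train() function defined above.
# Note: initialize() is random, so the exact results vary between runs.
for lr in [1e-4, 1e-3, 1e-2, 1e-1]:
    x_t, y_t, n_steps = train(10000, lr)
    print(f'learning rate {lr:g}: stopped after {n_steps} steps at x = {x_t[-1]:.4f}')
```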
</div>
### Linear Regression
<div style="text-align: justify">
We convert the categorical information to $0$ or $1$ so that all the values are numerical.
```python
import pandas as pd
import seaborn as sns

tips = sns.load_dataset('tips')
D = tips.replace({'sex': {'Female': 1, 'Male': 0},
                  'smoker': {'Yes': 1, 'No': 0},
                  'time': {'Dinner': 1, 'Lunch': 0}})
D = pd.get_dummies(D, columns=['day'], drop_first=True)
```
Since there are only four days in the dataset, we use ```get_dummies``` and drop "Thursday": when the values of "Friday", "Saturday", and "Sunday" are all $0$, we can infer that the day is "Thursday".
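As a possible continuation (my own sketch, not the lecture's code), one could fit a linear regression on the encoded data, e.g. predicting the tip:
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical target choice: predict the tip from the remaining (now numeric) columns
X = D.drop(columns=['tip'])
y = D['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)
print('R^2 on the test set:', model.score(X_test, y_test))
```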
- early stopping error.
</div>
## Thursday 18.6.2020
<div style="text-align: justify">
- Hyperplane.
- The accuracy score alone is not enough to judge whether a model is good or bad.
- Understand the problem and choose how to evaluate your model! Decide which matters more, type 1 or type 2 errors, and accordingly whether to focus on "precision" or "recall" (see the sketch below).
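A tiny illustration (my own sketch) of why accuracy alone can be misleading on imbalanced data:
```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Imbalanced toy labels: 9 negatives, 1 positive; the model predicts "negative" everywhere
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print('accuracy :', accuracy_score(y_true, y_pred))                    # 0.9, looks good
print('precision:', precision_score(y_true, y_pred, zero_division=0))  # 0.0
print('recall   :', recall_score(y_true, y_pred))                      # 0.0, the positive case is missed
print(confusion_matrix(y_true, y_pred))
```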
</div>
## Friday 19.06.2020
Sentiment analysis of tweets.
- Bag of words (one-hot encoding).
- One word is called a "token".
- Splitting a sentence into a list of words is called tokenizing.
- Sort the dictionary first and then encode it.
- Drawback: the order of the words in the sentence is lost!
- The dataset is labeled with "positive" and "negative"; based on that we can build a model to classify a new input sentence.
- Search and sort algorithms work faster if the input is sorted beforehand.
- **Stopwords**: this, that, is, are, ... These words appear a lot but do not give much information. One solution is to reduce the weight of a word if it occurs too many times in the data.
- How relevant are words? Term frequency-inverse document frequency.
We could use these raw term frequencies to score the words in our algorithm. There is a problem, though: if a word is widespread in _all_ documents, then it probably doesn't carry a lot of information. To tackle this problem, we can use **term frequency-inverse document frequency**, which reduces a word's score the more frequently it appears across all tweets. It is calculated like this:
$$
\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)
$$
_tf(t,d)_ is the raw term frequency described above. _idf(t,d)_ is the inverse document frequency, which can be calculated as follows:
$$
\text{idf}(t,d) = \log \frac{n_d}{1+\text{df}\left(d,t\right)}
$$
where $n_d$ is the total number of documents and _df(d,t)_ is the number of documents in which the term $t$ appears.
The `1` added to the denominator is there so that terms which appear in all documents are not entirely ignored, and the `log` ensures that low-frequency terms do not get too much weight.
Fortunately for us, `scikit-learn` does all those calculations for us:
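For instance (my own sketch; note that scikit-learn's default idf uses a slightly different smoothing than the formula above), `TfidfVectorizer` combines counting and tf-idf weighting in one step:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the sun is shining',
        'the weather is sweet',
        'the sun is shining and the weather is sweet']

# TfidfVectorizer = raw term counts + tf-idf re-weighting in one step
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

print(tfidf.vocabulary_)       # word -> column index mapping
print(X.toarray().round(2))    # words common to all docs ('is', 'the') get lower weights
```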