# Stochastic methods in training

<!-- Put the link to this slide here so people can follow -->

slide: https://hackmd.io/@ccornwell/stochastic

---

<h3>The idea of a stochastic method</h3>

- <font size=+2>Have training data $S$ ("$\bf x$ points" and their labels, $y$), and a procedure to train an ML model $h_S$ from $S$ in an iterative way. Examples:</font>
    - <font size=+2>logistic regression, $h_S({\bf x}) = \frac{1}{1+e^{-({\bf w}\cdot{\bf x}+b)}}$; the procedure learns ${\bf w}$ and $b$ from $S$.</font>
    - <font size=+2>decision tree, $h_S({\bf x})=$ the label for the leaf of the tree corresponding to the region ${\bf x}$ falls into; the splits are learned from $S$.</font>
- <font size=+2>*(Mini-batch) Stochastic method*: on the $t^{th}$ iteration, use only a randomly selected subset $S_t\subset S$ for that step.</font>

---

<h3>Stochastic method, example</h3>

- <font size=+2>Gradient descent on the log-loss function: recall, the gradient is an average of contributions from each instance ${\bf x_i}$.</font>
- <font size=+2>Instead, use the average of contributions over $S_t$: for the update, the $j^{th}$ component of ${\bf v_t}$ is $\displaystyle v_{t,j} = \frac{1}{|S_t|}\sum_{{\bf x_i}\in S_t}x_{i,j}\,\big(\sigma({\bf w}\cdot{\bf x_i}+b) - \overset{\circ}{y_i}\big)$; subtract $\eta$ times ${\bf v_t}$ from the current weight vector (a code sketch follows on the next slide).</font>
- <font size=+2>Draw $S_t$ from $S$, via a random process, such that $\mathbb E[{\bf v}_t]$ equals the gradient of the log-loss function (over all of $S$).</font>
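
----

<h3>Stochastic method, example (code sketch)</h3>

<font size=+2>A minimal NumPy sketch of one mini-batch step for logistic regression with log-loss, under the update rule above; the function and array names here are illustrative, not from the slides.</font>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_step(w, b, X, y, batch_size=32, eta=0.1, rng=None):
    """One stochastic update: estimate the log-loss gradient on a random
    mini-batch S_t and step against it with learning rate eta."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)   # draw S_t from S
    Xt, yt = X[idx], y[idx]
    err = sigmoid(Xt @ w + b) - yt            # sigma(w.x_i + b) - y_i  for x_i in S_t
    v = Xt.T @ err / batch_size               # v_t: average per-instance contribution
    return w - eta * v, b - eta * err.mean()  # subtract eta * v_t (and the bias analogue)
```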
---

<h3>Purpose of stochastic methods</h3>

- <font size=+2>**Computational efficiency**: some update computations are expensive on all of $S$, particularly if $|S|$ is large (a quick timing check follows on the next slide).</font>
- <font size=+2>**Non-convex optimization**: the "noisy" update ${\bf v_t}$ is effective at avoiding saddle critical points, and even non-optimal local minima of the loss function.</font>
- <font size=+2>**Generalization error**: the stochastic method helps to lower error on unseen data ("generalization error"), even if error on the training data is slightly higher.</font>
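
----

<h3>Computational efficiency, quick check</h3>

<font size=+2>A small illustration of the efficiency point (not from the original slides): timing one full-batch gradient evaluation against one mini-batch evaluation on synthetic data. The sizes are arbitrary illustrative choices.</font>

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
n, d, k = 100_000, 20, 64                   # |S|, number of features, mini-batch size
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n).astype(float)
w = np.zeros(d)

def grad(Xs, ys):
    """Average log-loss gradient contribution over the rows of Xs."""
    p = 1.0 / (1.0 + np.exp(-(Xs @ w)))
    return Xs.T @ (p - ys) / len(ys)

idx = rng.choice(n, size=k, replace=False)  # one mini-batch S_t
t_full = timeit(lambda: grad(X, y), number=10)
t_mini = timeit(lambda: grad(X[idx], y[idx]), number=10)
print(f"full-batch: {t_full:.4f}s   mini-batch (k={k}): {t_mini:.6f}s")
```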
---

<h3>Examples of using a stochastic method</h3>

- <font size=+2>Training of neural nets uses stochastic gradient descent on the loss function.</font>
- <font size=+2>For mini-batch size $k$, a random permutation "shuffles" the training set $S$; then mini-batch 1 is points $0,\ldots,k-1$ in shuffled order; mini-batch 2 is points $k,\ldots,2k-1$; etc.</font>
- <font size=+2>Once through all of $S$ (one "epoch"), shuffle $S$ again and start over (see the first sketch below).</font>
- <font size=+2>Training random forests (:thought_balloon: an average of decision trees for prediction).</font>
    - <font size=+2>Each decision tree in the forest learns on a sample from $S$. Classically, a *bootstrap* sample (same size as the data, but drawn with replacement).</font>
    - <font size=+2>Yet one can use samples of much smaller size for each decision tree, in line with the stochastic method. With sklearn's `RandomForestClassifier`, set the option `max_samples` to the $|S_t|$ you want (see the second sketch below).</font>
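
----

<h3>Code sketch: shuffled mini-batches</h3>

<font size=+2>A minimal sketch of the shuffle-then-slice scheme described above; the generator name and arguments are illustrative.</font>

```python
import numpy as np

def minibatch_indices(n, k, n_epochs, rng=None):
    """Yield index arrays: shuffle 0..n-1 each epoch, then slice the shuffled
    order into consecutive mini-batches of size k (the last may be smaller)."""
    rng = rng or np.random.default_rng()
    for _ in range(n_epochs):
        order = rng.permutation(n)          # "shuffle" the training set
        for start in range(0, n, k):
            yield order[start:start + k]    # points start, ..., start+k-1 in shuffled order

# usage (X, y assumed to be the training arrays):
# for idx in minibatch_indices(len(X), k=32, n_epochs=5):
#     update the model using X[idx], y[idx]
```

----

<h3>Code sketch: smaller samples per tree</h3>

<font size=+2>A small example of the `max_samples` option; the synthetic dataset and the numbers chosen are illustrative.</font>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Each tree is trained on a bootstrap sample of only 500 points, not all 5000.
forest = RandomForestClassifier(n_estimators=200, max_samples=500,
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))
```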
---

<h3>Mini-batch versus single-instance stochastic</h3>

- <font size=+2>By using a mini-batch of size $k$, you reduce the variance of the update by a factor of $1/k$: since ${\bf v_t}$ averages $k$ independent per-instance contributions, its variance is $1/k$ times that of a single-instance update. Too much variance in ${\bf v_t}$ can make the model hard to train.</font>
- <font size=+2>So, mini-batch is a compromise: get the benefits of a stochastic method, but don't lose the ability to easily train the model (a quick empirical check appears below).</font>

----

<h3>Mini-batch versus single-instance stochastic</h3>

<font size=+2>Discussion</font>

<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
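
----

<h3>Mini-batch variance, quick check</h3>

<font size=+2>A small simulation (not from the original slides) to make the $1/k$ claim concrete: it estimates the variance of ${\bf v_t}$ for several batch sizes on synthetic logistic-regression data, with ${\bf w}={\bf 0}$ and the data sizes chosen arbitrarily.</font>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n).astype(float)
w = np.zeros(d)

# Per-instance gradient contributions g_i = x_i * (sigma(w.x_i) - y_i), shape (n, d).
p = 1.0 / (1.0 + np.exp(-(X @ w)))
G = X * (p - y)[:, None]

for k in (1, 4, 16, 64):
    # Variance (summed over coordinates) of v_t over many random mini-batches of size k.
    vts = np.array([G[rng.choice(n, size=k, replace=False)].mean(axis=0)
                    for _ in range(2000)])
    print(k, vts.var(axis=0).sum())   # shrinks roughly like 1/k
```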
{"metaMigratedAt":"2023-06-15T23:44:44.603Z","metaMigratedFrom":"YAML","title":"Stochastic methods","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"da8891d8-b47c-4b6d-adeb-858379287e60\",\"add\":10986,\"del\":730}]"}