---
tags: machine-learning
---
# Learning with large datasets
<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/learning-with-large-datasets/1.png" height="100%" width="70%">
</div>
> These are my personal notes taken for the course [Machine learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).
# I) Introduction
The popularity of machine learning techniques has increased in the recent past. One of the reasons behind this trend is the exponential growth in the amount of data available to learn from. A large dataset coupled with a high-variance model has the potential to perform well. But as datasets grow, they pose problems in terms of the space and time complexity of the algorithms.
Datasets can often reach sizes such as $m = 100,000,000$. In this case, each step of **gradient descent** (also called **batch gradient descent**) has to compute a summation over all one hundred million examples. We will see some approaches for avoiding this.
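For contrast, here is a minimal NumPy sketch of a single batch gradient descent step for linear regression; the names (`batch_gradient_step`, `X`, `y`, `theta`, `alpha`) are illustrative and not from the course code. The point is that the gradient requires a sum over all $m$ examples:

```python
import numpy as np

def batch_gradient_step(theta, X, y, alpha):
    """One batch gradient descent step: the gradient sums over all m examples."""
    m = len(y)
    predictions = X @ theta                   # h_theta(x) for every example
    gradient = (X.T @ (predictions - y)) / m  # average over the full dataset
    return theta - alpha * gradient

# Toy usage: with m = 100,000,000 rows, this single matrix product
# would already be expensive, which motivates the alternatives below.
X = np.c_[np.ones(5), np.arange(5)]           # m = 5 examples, with intercept column
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
theta = batch_gradient_step(np.zeros(2), X, y, alpha=0.1)
```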
# II) Stochastic Gradient Descent
Stochastic gradient descent is an alternative to classic gradient descent (batch gradient descent) and is more efficient and scalable to large data sets.
It uses $1$ example in each iteration, whereas batch gradient descent uses all $m$ examples in each iteration.
The cost function for stochastic gradient descent is written out in a different but similar way, for a single training example:
$$cost(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2$$
The only difference from the batch cost function is that the $m$ constant has been removed from the $\frac{1}{2m}$ factor: the per-example cost keeps only $\frac{1}{2}$, and the average over $m$ reappears in $J_{train}$:
$$J_{train}(\theta) = \frac{1}{m} \sum_{i=1}^{m}cost(\theta, (x^{(i)}, y^{(i)}))$$
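A minimal sketch of these two quantities in NumPy, assuming a linear hypothesis $h_\theta(x) = \theta^T x$ (the function names are mine, not the course's):

```python
import numpy as np

def cost_single(theta, x_i, y_i):
    """cost(theta, (x_i, y_i)) = 1/2 * (h_theta(x_i) - y_i)^2 for a single example."""
    return 0.5 * (x_i @ theta - y_i) ** 2

def j_train(theta, X, y):
    """J_train(theta): the average of the per-example costs over all m examples."""
    return np.mean([cost_single(theta, X[i], y[i]) for i in range(len(y))])
```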
The algorithm is as follows:
1. Randomly 'shuffle' the dataset.
2. For $i = 1, \dots, m$:
$$\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}_j \quad \text{(simultaneously for every } j\text{)}$$
This algorithm will only try to fit one training example at a time. This way we can make progress in gradient descent without having to scan all $m$ training examples first.
Stochastic gradient descent is unlikely to converge at the global minimum and will instead wander around it randomly, but it usually yields a result that is close enough. Stochastic gradient descent will usually take $1 \sim 10$ passes through your dataset to get near the global minimum.
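Putting the two steps together, here is a minimal NumPy sketch of stochastic gradient descent for linear regression (names and default values are illustrative):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=10):
    """SGD for linear regression: shuffle, then update theta one example at a time."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):                  # typically 1-10 passes are enough
        order = np.random.permutation(m)     # step 1: randomly shuffle the dataset
        for i in order:                      # step 2: for i = 1..m
            error = X[i] @ theta - y[i]      # h_theta(x_i) - y_i
            theta -= alpha * error * X[i]    # update every theta_j simultaneously
    return theta
```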
To make sure that stochastic gradient descent is converging and not diverging, we can plot the average cost of the hypothesis over, say, the last $1000$ examples as a function of the number of iterations.
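One way to produce such a plot is sketched below; the stream of per-example costs is a random placeholder here, but in practice it would be recorded just before each parameter update:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder: in practice, record cost(theta, (x_i, y_i)) right before each update.
example_costs = np.random.rand(10_000)

window = 1000
trimmed = example_costs[: len(example_costs) // window * window]
averaged = trimmed.reshape(-1, window).mean(axis=1)   # mean cost of each block of 1000 updates

plt.plot(np.arange(1, len(averaged) + 1) * window, averaged)
plt.xlabel("iteration")
plt.ylabel("average cost over the last 1000 examples")
plt.show()
```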
While the learning rate $\alpha$ is kept constant in most implementations of stochastic gradient descent, in practice it often helps to taper off the learning rate as the iterations proceed. This can be done as follows:
$$\alpha = \frac {constant_1} {iteration\_number + constant_2}$$
<ins>**Remark:**</ins>
- A **large value** of $\alpha$ will overshoot the global minimum.
- A **small value** of $\alpha$ will lead to slow convergence.
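A one-line version of the decay schedule above (the constants are tuning knobs chosen by hand; the values below are only placeholders):

```python
def decayed_alpha(iteration_number, const1=4.0, const2=50.0):
    """Taper the learning rate as iterations proceed: alpha = const1 / (iteration_number + const2)."""
    return const1 / (iteration_number + const2)

# Example: alpha shrinks from roughly 0.08 at iteration 1 toward 0 as training continues.
alphas = [decayed_alpha(t) for t in range(1, 5001, 1000)]
```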
# III) Mini-batch Gradient Descent
**Mini-batch gradient descent** can sometimes be even faster than stochastic gradient descent. Instead of using all $m$ examples like in batch gradient descent, and instead of using only $1$ example as in stochastic gradient descent, we will use some in-between number of examples $b$. Typical values for $b$ range from $2 \sim 100$ or so.
For example, with $b=10$ and $m=1000$:
For $i = 1, 11, 21, 31, \dots, 991$:
$$\theta_j := \theta_j - \alpha \frac{1}{b} \sum_{k=i}^{i+b-1} \left( h_{\theta}(x^{(k)}) - y^{(k)} \right) x_j^{(k)}$$
We're simply summing over ten examples at a time. The advantage of computing more than one example at a time is that we can use vectorized implementations over the $b$ examples.
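Here is a minimal vectorized NumPy sketch of the mini-batch update (names and defaults are illustrative):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, epochs=10):
    """Mini-batch gradient descent with a vectorized update over each batch of b examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for start in range(0, m, b):         # i = 1, 1+b, 1+2b, ... (0-based here)
            Xb = X[start:start + b]
            yb = y[start:start + b]
            errors = Xb @ theta - yb         # vectorized over the b examples
            theta -= alpha * (Xb.T @ errors) / len(yb)
    return theta
```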
# IV) Online Learning
Online learning is a form of learning in which the system has a continuous stream of training data. It runs stochastic gradient descent indefinitely on the incoming stream, discarding each example once the parameter update has been made with it.
It is observed that such an online learning setting is capable of adapting to changing trends in the data stream.
Typical domains where online learning can be successfully applied include search engines (predicting the click-through rate, i.e. CTR), recommendation websites, etc.
Many of these problems could be modeled as standard learning problems with a fixed dataset, but such data streams are often so abundant that there is no point in storing the data: it is better to implement an online learning system.
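As a sketch, an online logistic-regression learner for a CTR-style problem might look like the following; the `fake_stream` generator is purely illustrative and stands in for a real stream of (features, clicked) events:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_learning(event_stream, n_features, alpha=0.05):
    """Update logistic-regression parameters from a stream, discarding each example after use."""
    theta = np.zeros(n_features)
    for x, y in event_stream:                # x: feature vector, y: 1 if clicked else 0
        error = sigmoid(x @ theta) - y
        theta -= alpha * error * x           # one SGD step, then the example is dropped
    return theta

# Illustrative stream: random features with a click probability that drifts over time.
def fake_stream(n=5000, n_features=3, seed=0):
    rng = np.random.default_rng(seed)
    for t in range(n):
        x = rng.normal(size=n_features)
        y = int(rng.random() < sigmoid(x[0] + 0.001 * t))   # the trend changes slowly
        yield x, y

theta = online_learning(fake_stream(), n_features=3)
```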
# V) Map Reduce and Parallelism
Map-Reduce is a technique used in large-scale learning when a single machine is not enough to train the required models. Under this training paradigm, the **summation operations are parallelized over a set of slave machines by splitting the training data** (a batch or the entire set) across the machines, which each compute on a smaller dataset and feed the results to the **master machine that aggregates the results** from all the slaves and combines them. This parallelized implementation speeds up the algorithm.
<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/learning-with-large-datasets/2.png">
</div>
<br>
<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/learning-with-large-datasets/3.png">
</div>
<br>
If network latencies were negligible, one could expect a speed-up of up to $n$ times by using a pool of $n$ machines. In practice, when the machines communicate over a network, the speed-up is slightly less than $n$ times.
Only algorithms that can be expressed as a summation over the training set can be parallelized using map-reduce.
Besides a pool of computers, parallelization also works on multi-core machines, with the added benefit of near-zero network latencies and hence faster communication.
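As a sketch, here is how the gradient summation for linear regression could be split across worker processes on a multi-core machine and recombined by the master, using Python's `multiprocessing` (the names and the 4-way split are illustrative):

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """'Map' step: each worker sums the gradient terms over its own slice of the data."""
    theta, X_chunk, y_chunk = args
    return X_chunk.T @ (X_chunk @ theta - y_chunk)

def mapreduce_gradient(theta, X, y, n_workers=4):
    """'Reduce' step: the master adds the partial sums and divides by m."""
    X_chunks = np.array_split(X, n_workers)
    y_chunks = np.array_split(y, n_workers)
    with Pool(n_workers) as pool:
        partials = pool.map(partial_gradient,
                            [(theta, Xc, yc) for Xc, yc in zip(X_chunks, y_chunks)])
    return sum(partials) / len(y)

if __name__ == "__main__":
    X = np.c_[np.ones(400), np.random.randn(400, 2)]
    y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * np.random.randn(400)
    grad = mapreduce_gradient(np.zeros(3), X, y)   # same value as the single-machine sum
```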
<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/learning-with-large-datasets/4.png">
</div>
<br>