# A2 Overview
[# A2 overview](https://gkrgyfrnht.joplinusercontent.com/shares/f2HiMjpBpxuANCOT4zsOTK)
# part 1
- in A1 we did:
- fully connected layer - every neuron connects to every neuron in the next layer - dense connections
- in tensorflow, it's called a dense layer
- in pytorch, it's called a linear layer
- but this leads to a lot of parameters
- input is 28x28 = 784, so even a single output layer to 10 classes needs 784 x 10 params
- for eg with a 256-unit hidden layer: 784 x 256 and 256 x 10
- motivations for cnn:
- can we organize our weights into a smaller set of weights and convolve or cross-correlate it? reduces the number of params a lot!
- and that motivates learning about CNNs.
TLDR: use numpy and do forward and backward pass
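A rough sketch of the parameter-count motivation above (the layer sizes are the ones from the notes; the conv filter count is an illustrative assumption):

```python
# Dense net on 28x28 inputs vs. a single small conv layer (weights only).
dense_params = 784 * 256 + 256 * 10   # hidden layer + output layer
conv_params = 8 * (3 * 3 * 1)         # 8 assumed filters of 3x3 on 1 channel

print(dense_params)  # 203264
print(conv_params)   # 72
```

The shared, convolved kernel reuses the same few weights across the whole image, which is where the huge reduction comes from.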
# part 2
use torch's autograd and leverage the library to compute the gradients
- we compose the forward function
- then when backward method is called, the gradient is computed
- gradient can be used in optimization process
- do some experiments, play around with architecture
- tldr: build CNNs in pytorch and run some experiments
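The compose-forward / call-backward flow above, as a minimal sketch (values are illustrative):

```python
import torch

# Compose the forward function; autograd records the graph.
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # forward pass

# When backward is called, the gradient is computed...
y.backward()

# ...and can be read off for use in the optimization step.
print(x.grad)        # tensor([4., 6.]) since dy/dx = 2x
```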
# part 3 and 4
we use an imbalanced dataset and discuss some ways to deal with class imbalance
- read papers
- synthesize the implementation from those papers
- and run the experiments
# deliverables
- a2 code: 72 pts
- report: 38 pts
- results + exp
- theory question
- paper review
# part 1 in numpy
implement modules / layers:
- the forward and backward
- convolution
- max pooling
- relu (familiar from a1)
- linear (familiar from a1)
in a1, the architecture was hard-coded, but in a2 we're going to write generalized modules and chain them together
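A minimal sketch of that generalized-module pattern (the class and method names are assumptions, not the assignment's actual API): each layer implements `forward` and `backward`, and the network is just a chain.

```python
import numpy as np

class ReLU:
    def forward(self, x):
        self.mask = x > 0            # cache for the backward pass
        return x * self.mask
    def backward(self, grad_out):
        return grad_out * self.mask

class Linear:
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * 0.01
        self.b = np.zeros(out_dim)
    def forward(self, x):
        self.x = x                   # cache input for the backward pass
        return x @ self.W + self.b
    def backward(self, grad_out):
        self.dW = self.x.T @ grad_out
        self.db = grad_out.sum(axis=0)
        return grad_out @ self.W.T   # gradient w.r.t. the input

# chain modules: forward in order, backward in reverse
layers = [Linear(4, 3), ReLU(), Linear(3, 2)]
x = np.random.randn(5, 4)
for layer in layers:
    x = layer.forward(x)
grad = np.ones_like(x)
for layer in reversed(layers):
    grad = layer.backward(grad)
```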
implement optimizer:
- sgd with momentum (slight extension to a1)
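One common form of the momentum update, as a sketch (sign conventions vary between references, so this is not necessarily A2's exact spec): keep a velocity `v`, decay it by `mu`, add the scaled negative gradient, then step.

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=0.1, mu=0.9):
    # v <- mu * v - lr * grad ; w <- w + v
    v = mu * v - lr * grad
    w = w + v
    return w, v

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgd_momentum_step(w, np.array([0.5, -0.5]), v)
print(w)  # [ 0.95 -1.95]
```

With `mu=0` this reduces to plain SGD from A1, which is why it's only a slight extension.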
# convolution
![85f69911396f64ad9a2e64ccdddebac3.png](:/0aa17014f3c343cebaace3f8041806c2)
for eg, the first output value:
- multiply and sum: 1\*1 + 1\*0 + 1\*1 (first row) + 0\*0 + 1\*1 + 1\*0 (second row) + 0\*1 + 0\*0 + 1\*1 (third row) = 4
- 9 multiplications and summations per output position
- 9 different linear transforms - the feature map has 9 values (a 3x3 output)
- we can use different filters/kernels to extract certain type of features
- vertical or horizontal edges
- dots
- the weights are repeatedly shared across the image
- and important to note we can have multiple kernels/filters
- the images in the dataset are 3 channel: 32 x 32 x 3
- a linear layer (fully connected) on this will be huge
- the kernel depth matches the depth of the input
- if input is 3 channel, our kernel is 3x3x3 --> the depth of kernel == depth of input
- motivated by parameter sharing and notions from signal processing, we create this convolution operation that is applied to images. This is the FORWARD PASS.
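The forward pass above, looped naively for a single channel with stride 1 and no padding (function and variable names are my own, not the assignment's):

```python
import numpy as np

def conv2d_naive(image, kernel):
    # Cross-correlation: slide the kernel, multiply and sum at each position.
    h, w = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

# the worked example from the slide: first output value is 4
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(conv2d_naive(image, kernel)[0, 0])  # 4.0
```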
# Vectorizing
- it's okay to use loops for a2
- vectorizing is not necessary, but it's important in practice
- we know a priori from kernel size, image size and stride:
- how many operations we will do
- and what these feature views look like
- using this a priori knowledge we can use fancy indexing to take each of these views and create them all at once
- we can take copies of the kernels and broadcast it over each of the views
- my initial hypothesis: some kernel matrix multiplied over different indexes in a for loop? vfunc numpy with indexes?
## Vectorization: Stride tricks (1 of 2)
![426a8c3acbd2e52435e5c76a9ff445fa.png](:/52855b63789848db92ac6bb62f679865)
- Goal: create views of image the same size as filters we're applying:
- (c, h, w) -> (h_new, w_new, c, k, k) -> h_new * w_new views, each of shape c x k x k (eg 3 x 3 kernels)
- Image A as array:
- A.shape = (c,h,w) = (3, 45, 40) = channel * height * width
- `A.strides` = (s_c, s_h, s_w), eg: (14400, 320, 8)
- strides represent how many bytes of offset it takes to get to the next value in that dimension
- (14400, 320, 8)
- 14400 = 45 x 40 x 8 bytes
- number of bytes between CHANNELS
- between the R, G, B channels
- 320 = 40 x 8 bytes
- from any element, jumping to the same position in the next row (height dimension) means jumping 320 bytes
- 8 = 8 bytes per element
- jumping to the next element in the width dimension is just 8 bytes
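The stride numbers above can be checked directly: a float64 array has 8-byte elements, so NumPy reports exactly those offsets.

```python
import numpy as np

# (channel, height, width) = (3, 45, 40), dtype float64 (8 bytes per element)
A = np.zeros((3, 45, 40))
print(A.strides)  # (14400, 320, 8)
# 14400 = 45 * 40 * 8 bytes to the next channel
# 320   = 40 * 8 bytes to the next row
# 8     = one element over in the width dimension
```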
## Vectorization: Stride tricks (2 of 2)
![d7f39d74940fcf0b6265cf940f79a7ed.png](:/b5ea2a84f6854a0598ea856210377bdc)
The strides from trick (1 of 2) are passed into the function above (`np.lib.stride_tricks.as_strided`).
- A is the input image tensor
- shape = the desired shape
- strides: how we want to index it
- s_h*2: a stride of 2 in h; similarly s_w*2 for w
- s_c: we're not striding over the channel dimension
- writeable=False --> the view is read-only; check the np docs.
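Putting the two tricks together, a sketch of building the `(h_new, w_new, c, k, k)` views without copying (the concrete sizes here are illustrative, not from the slide):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

c, h, w, k, stride = 3, 9, 9, 3, 2
A = np.arange(c * h * w, dtype=np.float64).reshape(c, h, w)
s_c, s_h, s_w = A.strides

h_new = (h - k) // stride + 1
w_new = (w - k) // stride + 1
views = as_strided(
    A,
    shape=(h_new, w_new, c, k, k),
    strides=(s_h * stride, s_w * stride, s_c, s_h, s_w),
    writeable=False,   # the views alias A's memory, so keep them read-only
)

# sanity check against explicit slicing
assert np.array_equal(views[1, 2], A[:, 2:5, 4:7])
```

Each `views[i, j]` is exactly the `c x k x k` patch the kernel would see at output position `(i, j)`.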
# Vectorization: Multiplying tensors
![e848f53bb98666a092890f17639affc1.png](:/e7f5e723b5f849d4825ad9c233fbe347)
Both tensordot and einsum are fine. Tensordot is a tiny bit faster.
- We have a tensor A and kernel B.
- With einsum, you can do many things.
- We put in the string and that tells einsum what operation we intend to do.
- Dot products
- Transposes
- Diagonal etc
- strings:
- ('i,i', a, b): dot product
- ('ij,jk', a, b): matmul
- ('nchw,chwk->nk', a, b): a reduction across c, h, w, leaving an n by k matrix.
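The `'nchw,chwk->nk'` case from the slide, shown both ways (shapes are illustrative):

```python
import numpy as np

n, c, h, w, k = 4, 3, 5, 5, 8
A = np.random.randn(n, c, h, w)   # batch of feature tensors
B = np.random.randn(c, h, w, k)   # something to contract against

# einsum: the string spells out the reduction over c, h, w
out_einsum = np.einsum('nchw,chwk->nk', A, B)

# tensordot: name the contracted axes of each operand explicitly
out_tensordot = np.tensordot(A, B, axes=([1, 2, 3], [0, 1, 2]))

assert out_einsum.shape == (n, k)
assert np.allclose(out_einsum, out_tensordot)
```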
# Vectorization: Broadcasting
![a29a43c5de643447542e2442f7d64e10.png](:/f63bef38458540ab8391b48e16068086)
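A sketch of the broadcasting step: copies of `f` kernels are (virtually) broadcast over every window view, so all output positions and all filters are computed at once (the sizes here are illustrative).

```python
import numpy as np

f, c, k = 8, 3, 3            # f kernels of shape (c, k, k)
h_new, w_new = 4, 4          # number of window views
views = np.random.randn(h_new, w_new, c, k, k)
kernels = np.random.randn(f, c, k, k)

# align dims: views -> (h_new, w_new, 1, c, k, k)
#             kernels -> (1, 1, f, c, k, k)
prod = views[:, :, None] * kernels[None, None]
out = prod.sum(axis=(-3, -2, -1))   # reduce over c, k, k
assert out.shape == (h_new, w_new, f)
```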