# A2 Overview

[A2 overview](https://gkrgyfrnht.joplinusercontent.com/shares/f2HiMjpBpxuANCOT4zsOTK)

# part 1

- in a1 we did:
    - fully connected layer
        - every neuron connects to every neuron in the next layer
        - dense connections
        - in tensorflow it's called a dense layer
        - in pytorch it's called a linear layer
    - but this leads to a lot of parameters
        - input is 28x28 = 784, so a single output layer already has 784 x 10 params
        - e.g. with a 256-unit hidden layer: 784 x 256 and 256 x 10
- motivations for cnn:
    - can we organize our weights into a smaller set of weights and convolve or cross-correlate them? reduces the number of params a lot!
    - and that motivates learning about CNNs.

TLDR: use numpy and do the forward and backward pass

# part 2

uses torch's autograd and leverages the library to compute the gradients

- we compose the forward function
- then when the backward method is called, the gradient is computed
- the gradient can be used in the optimization process (a torch sketch appears after the Vectorizing section below)
- do some experiments, play around with the architecture
- tldr: implement CNNs in pytorch and run some experiments

# part 3 and 4

we use an imbalanced dataset and discuss some ways to deal with class imbalance

- read papers
- synthesize the implementation from those papers
- and run the experiments

# deliverables

- a2 code: 72 pts
- report: 38 pts
    - results + experiments
    - theory question
    - paper review

# part 1 in numpy

implement modules / layers (the forward and backward for each):

- convolution
- max pooling
- relu (familiar from a1)
- linear (familiar from a1)

in a1 the architecture was hard coded, but in a2 we're going to write generalized modules and chain them together

implement optimizer:

- sgd with momentum (slight extension to a1; see the sketch after the Vectorizing section below)

# convolution

![85f69911396f64ad9a2e64ccdddebac3.png](:/0aa17014f3c343cebaace3f8041806c2)

- multiply and sum, e.g. for the first output: 1\*1 + 1\*0 + 1\*1 (first row) + 0\*0 + 1\*1 + 1\*0 (second row) + 0\*1 + 0\*0 + 1\*1 (third row) = 4
- 9 multiplications and summations per output value:
    - the same 3x3 linear transform applied at 9 different positions
    - so 1 feature map has 9 values
- we can use different filters/kernels to extract certain types of features
    - vertical or horizontal edges
    - dots
- the weights are shared repeatedly across the image
- and important to note: we can have multiple kernels/filters
- the images in the dataset are 3 channel: 32 x 32 x 3
    - a linear layer (fully connected) on this would be huge
- the kernel has the same depth as the input
    - if the input is 3 channel, our kernel is 3x3x3 --> depth of kernel == depth of input
- motivated by parameter sharing and notions from signal processing, we create this convolution operation and apply it to images. this is the FORWARD PASS. (a loop-based sketch appears after the Vectorizing section below)

# Vectorizing

- it's okay to use loops for a2
- vectorizing is not necessary here, but it's important in practice
- we know a priori, from the kernel size, image size and stride:
    - how many operations we will do
    - and what these feature views look like
- using this, we can use fancy indexing to take each of these views and create them all at once
- we can take copies of the kernels and broadcast them over each of the views
- my initial hypothesis: some kernel matrix multiplied over different indexes in a for loop? a vfunc in numpy with indexes?
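Since loops are okay for a2, here's a minimal single-channel sketch of the convolution forward pass (the function name and interface are my own, not the assignment's required API). The input and kernel are chosen to match the multiply-and-sum arithmetic in the convolution section, so the first output is 4:

```python
import numpy as np

def conv2d_forward_naive(x, k, stride=1):
    """Valid cross-correlation with explicit loops (fine for a2).
    x: (H, W) single-channel input, k: (kh, kw) kernel."""
    kh, kw = k.shape
    h_new = (x.shape[0] - kh) // stride + 1
    w_new = (x.shape[1] - kw) // stride + 1
    out = np.zeros((h_new, w_new))
    for i in range(h_new):
        for j in range(w_new):
            # multiply-and-sum one kh-by-kw view of the input
            view = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(view * k)
    return out

x = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]], dtype=float)
k = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]], dtype=float)
print(conv2d_forward_naive(x, k))  # out[0, 0] is 4.0, matching the worked arithmetic
```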
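A minimal sketch of the part-1 optimizer, assuming the classic velocity formulation v <- mu\*v - lr\*grad, w <- w + v (the class name and interface are illustrative; pytorch's variant updates the buffer first and applies lr at the end):

```python
import numpy as np

class SGDMomentum:
    # Hypothetical interface: `params` is a list of numpy arrays updated in
    # place; the `grads` passed to step() line up with them one-to-one.
    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = params
        self.lr = lr
        self.momentum = momentum
        # one velocity buffer per parameter, initialized to zero
        self.velocities = [np.zeros_like(p) for p in params]

    def step(self, grads):
        for p, g, v in zip(self.params, grads, self.velocities):
            v *= self.momentum   # v <- mu * v
            v -= self.lr * g     # v <- v - lr * grad
            p += v               # w <- w + v

# usage on a toy quadratic: minimize 0.5 * ||w||^2, whose gradient is w
w = np.array([5.0, -3.0])
opt = SGDMomentum([w], lr=0.1, momentum=0.9)
for _ in range(100):
    opt.step([w.copy()])  # gradient evaluated before the update
print(w)  # close to [0, 0]
```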
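For the part-2 flow (compose the forward, call backward, use the gradients in optimization), a tiny pytorch sketch; the layer sizes and batch are arbitrary:

```python
import torch

x = torch.randn(4, 3, 32, 32)               # a fake batch of 3-channel images
conv = torch.nn.Conv2d(3, 8, kernel_size=3)

loss = conv(x).relu().sum()                  # compose the forward function
loss.backward()                              # autograd computes the gradients

# the gradient is now available for an optimization step
print(conv.weight.grad.shape)                # torch.Size([8, 3, 3, 3])
```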
## Vectorization: Stride tricks (1 of 2)

![426a8c3acbd2e52435e5c76a9ff445fa.png](:/52855b63789848db92ac6bb62f679865)

- Goal: create views of the image the same size as the filters we're applying:
    - (c, h, w) -> (h_new, w_new, c, k, k): height_new * width_new * channels * 3 x 3 views (for a 3x3 kernel)
- Image A as an array:
    - A.shape = (c, h, w) = (3, 45, 40) = channels * height * width
    - `A.strides` = (s_c, s_h, s_w), e.g. (14400, 320, 8)
    - strides are how many bytes of offset it takes to reach the next value along each dimension:
        - 14400 = 45 x 40 x 8 bytes: the step between CHANNELS, i.e. between the R, G, B planes
        - 320 = 40 x 8 bytes: from any element, jumping to the pixel below it (next step in the height dimension) is 320 bytes
        - 8 bytes: jumping to the next element in the width dimension is just 8 bytes (one float64)

## Vectorization: Stride tricks (2 of 2)

![d7f39d74940fcf0b6265cf940f79a7ed.png](:/b5ea2a84f6854a0598ea856210377bdc)

The strides from trick #1 go into the function above, `np.lib.stride_tricks.as_strided` (sketch at the end of these notes):

- A is the input image tensor
- shape = the desired output shape
- strides: how we want to index it
    - s_h * 2: stride of 2 in h, similarly for s_w
    - s_c: we're not striding over the channel dimension
- writeable=False --> check the np docs (the views alias A's memory, so keep them read-only)

# Vectorization: Multiplying tensors

![e848f53bb98666a092890f17639affc1.png](:/e7f5e723b5f849d4825ad9c233fbe347)

Both tensordot and einsum are fine; tensordot is a tiny bit faster (sketches at the end of these notes).

- We have a tensor A and kernel B.
- With einsum, you can do many things. You pass in a string that tells einsum what operation you intend to do:
    - dot products
    - transposes
    - diagonal etc.
- example strings:
    - ('i,i', a, b): dot product
    - ('ij,jk', a, b): matmul
    - ('nchw,chwk->nk', a, b): reduce across c, h, w and we're left with an n by k matrix

# Vectorization: Broadcasting

![a29a43c5de643447542e2442f7d64e10.png](:/f63bef38458540ab8391b48e16068086)
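A sketch of the stride tricks above, using `np.lib.stride_tricks.as_strided` with the (3, 45, 40) image from the example; the 3x3 kernel and stride of 2 are assumed from the s_h * 2 / s_w * 2 notes:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

A = np.zeros((3, 45, 40))            # (c, h, w), float64 -> itemsize 8
s_c, s_h, s_w = A.strides            # (14400, 320, 8), as in the notes

k, stride = 3, 2                     # kernel size and stride (assumed)
h_new = (45 - k) // stride + 1
w_new = (40 - k) // stride + 1

# (h_new, w_new, c, k, k): one k-by-k, all-channel view per output position
views = as_strided(
    A,
    shape=(h_new, w_new, 3, k, k),
    strides=(s_h * stride, s_w * stride, s_c, s_h, s_w),
    writeable=False,                 # views alias A's memory; keep read-only
)
print(views.shape)                   # (22, 19, 3, 3, 3)
```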
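For the multiplying-tensors step, a sketch that contracts the (h_new, w_new, c, k, k) views against a stack of filters. The 'hwcij,fcij->fhw' string is my adaptation of the slide's 'nchw,chwk->nk' pattern to these shapes; all shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
views = rng.standard_normal((22, 19, 3, 3, 3))  # stand-in for the strided views
filters = rng.standard_normal((8, 3, 3, 3))     # 8 filters of shape (c, k, k)

# einsum: reduce over the shared (c, i, j) axes, keep (f, h, w)
out_e = np.einsum('hwcij,fcij->fhw', views, filters)

# tensordot: same reduction, pairing view axes (2,3,4) with filter axes (1,2,3)
out_t = np.tensordot(views, filters, axes=([2, 3, 4], [1, 2, 3]))  # (h, w, f)
out_t = out_t.transpose(2, 0, 1)                                   # -> (f, h, w)

print(out_e.shape, np.allclose(out_e, out_t))  # (8, 22, 19) True
```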
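And a broadcasting version of the same contraction (a generic illustration, since the slide here is only an image): insert a length-1 filter axis so numpy broadcasts every filter over every view, then sum out the reduced axes:

```python
import numpy as np

rng = np.random.default_rng(0)
views = rng.standard_normal((22, 19, 3, 3, 3))   # (h, w, c, k, k) windows
filters = rng.standard_normal((8, 3, 3, 3))      # (f, c, k, k) kernels

# views[:, :, None] has shape (22, 19, 1, 3, 3, 3); the filters broadcast
# against the length-1 axis, and summing (c, k, k) leaves (h, w, f)
out = (views[:, :, None] * filters).sum(axis=(3, 4, 5))
print(out.shape)                                 # (22, 19, 8)
```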