# NNs from scratch

Let's start with a super basic, single-threaded implementation. Pretty much all we need is a matrix object and a differentiable function object. The first performance log:

```
Perf of 3 layer NN with input dim 784 hidden dim 256 and output dim 1
Average time of forward pass: 3 milliseconds
Average time to compute gradients (forward mode): 657148 milliseconds
Average time to compute gradients (backward mode): 1379 milliseconds
Time to multiply two 1000x1000 empty matrices 11011 milliseconds
Time to fill a 1000x1000 uniform random matrix 106 milliseconds
Time to multiply two 1000x1000 uniform random matrices 11199 milliseconds
Time to add two 1000x1000 uniform random matrices 5 milliseconds
```

Holy fuck. Computing a single gradient on an MNIST image using forward mode takes around 11 minutes. Using PyTorch we could fully train this exact same network on MNIST multiple times before this single gradient computation finished. At 1.4 seconds, even the backward mode gradient is too slow to bother launching a training run yet.

There are lots of things we could do to make the code run faster. Profiling. Improving the cache hit rates. Multithreading. Maybe we will even try writing some CUDA kernels and running the code on GPUs. But let's start with the easiest possible thing: let's play around with compilation options.

Before, I compiled using the command `g++-11 perf.cpp -o ./perf`. After a bit of googling I see that the flag `-Ofast` is the one that tells the compiler to optimize shit as much as possible; we definitely want that! And the flag `-march=native` tells it to use any instructions specific to the architecture of my machine. Let's see how compiling with `g++-11 perf.cpp -o ./perf -Ofast -march=native` affects performance:

```
Perf of 3 layer NN with input dim 784 hidden dim 256 and output dim 1
Average time of forward pass: 485 microseconds
Average time to compute gradients (forward mode): 217398 milliseconds
Average time to compute gradients (backward mode): 945 milliseconds
Time to multiply two 1000x1000 empty matrices 232 milliseconds
Time to fill a 1000x1000 uniform random matrix 10 milliseconds
Time to multiply two 1000x1000 uniform random matrices 922 milliseconds
Time to add two 1000x1000 uniform random matrices 1 milliseconds
```

That is some progress! The forward mode gradient is 3X faster. The backward mode gradient is 1.5X faster. Interestingly, before, multiplying matrices took the exact same amount of time whether they were filled with 0s or with random numbers. But after the compiler optimizations, multiplying 0 matrices is close to 4X faster. I have no idea why.

There are many other compiler options to play around with, but there is one part of the code that is really bugging me. See, this little function is the core of our neural network implementation:

```
Mat operator*(Mat& A) {
    assert(w == A.h);
    Mat M(h, A.w);
    // i-j-k order: compute each output element to completion before moving on
    for (int i = 0; i < M.h; i++) {
        for (int j = 0; j < M.w; j++) {
            for (int k = 0; k < w; k++) {
                M.ind(i, j) += ind(i, k) * A.ind(k, j);
            }
        }
    }
    return M;
}
```

It's a method of `class Mat` that implements matrix multiplication. The issue here is that, even though we are going to use every element of the two matrices multiple times, we are iterating in such a way that there is a long interval between two usages of the same value.
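Quick aside before fixing it: `operator*` leans on a few `Mat` members that haven't been shown here (`h`, `w`, the constructor, and `ind`). A minimal sketch of what they might look like is below; the `float` element type, the `std::vector` storage, and the row-major layout are my assumptions rather than the actual implementation, but the zero-initialization is needed either way for the `+=` accumulation above to be correct.

```
#include <cassert>
#include <vector>

class Mat {
public:
    int h, w;                 // rows, columns
    std::vector<float> data;  // assumed: row-major, zero-initialized storage

    Mat(int h, int w) : h(h), w(w), data((size_t)h * w, 0.0f) {}

    // Reference to the element at row i, column j.
    float& ind(int i, int j) { return data[(size_t)i * w + j]; }

    Mat operator*(Mat& A);    // the multiplication method shown above
};
```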
Now, if we change the loop order like this:

```
Mat operator*(Mat& A) {
    assert(w == A.h);
    Mat M(h, A.w);
    // k-i-j order: ind(i, k) is reused across the entire inner j loop
    for (int k = 0; k < w; k++) {
        for (int i = 0; i < M.h; i++) {
            for (int j = 0; j < M.w; j++) {
                M.ind(i, j) += ind(i, k) * A.ind(k, j);
            }
        }
    }
    return M;
}
```

The behavior is exactly the same, but now `ind(i, k)` is used `M.w` times in a row, so it can stay in a very fast register. Let's see how that affects performance.

```
Perf of 3 layer NN with input dim 784 hidden dim 256 and output dim 1
Average time of forward pass: 565 microseconds
Average time to compute gradients (forward mode): 7279 milliseconds
Average time to compute gradients (backward mode): 237 milliseconds
Time to multiply two 1000x1000 empty matrices 109 milliseconds
Time to fill a 1000x1000 uniform random matrix 9 milliseconds
Time to multiply two 1000x1000 uniform random matrices 137 milliseconds
Time to add two 1000x1000 uniform random matrices 1 milliseconds
```

Bam! The forward mode gradient is close to 100X faster than the original. The backward mode gradient is 6X faster. The 1000x1000 matmul is 80X faster.
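For completeness, numbers like the matmul timings in these logs can be reproduced with a harness along these lines. This is a sketch rather than the actual `perf.cpp`, and it assumes the `Mat` interface sketched earlier (with `operator*` defined) is visible in the same file:

```
#include <chrono>
#include <cstdio>
#include <random>

// Assumes class Mat from above, with operator* defined.

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);

    // Fill two 1000x1000 matrices with uniform random values.
    Mat A(1000, 1000), B(1000, 1000);
    for (int i = 0; i < A.h; i++)
        for (int j = 0; j < A.w; j++) {
            A.ind(i, j) = dist(rng);
            B.ind(i, j) = dist(rng);
        }

    // Time a single 1000x1000 matrix multiplication.
    auto t0 = std::chrono::steady_clock::now();
    Mat C = A * B;
    auto t1 = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("Time to multiply two 1000x1000 uniform random matrices %lld milliseconds\n",
                (long long)ms);

    // Use the result so the whole multiply can't be dropped as dead code under -Ofast.
    std::printf("checksum: %f\n", C.ind(0, 0));
    return 0;
}
```

Compiling this with and without `-Ofast -march=native`, and with each loop order, is enough to reproduce the before/after comparisons above.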