# **Neural Network & Deep Learning**
[TOC]



## Logistic Regression as a Neural Network
### Binary Classification


### Logistic Regression

The sigmoid function is used so that the output ŷ lies in the range 0 < ŷ < 1.

Why not use the Sum of Squared Errors (SSE)?
If we have a dataset of, say, 100 points, the SSE might be 200. If we increase the number of data points to 500, the SSE grows, because the squared errors of 500 points are now being added up; say it becomes 800.
Intuitively, the error should decrease as we gather more data, since the estimated distribution of the data becomes narrower and narrower: the more data we have, the lower the error should be. With SSE the complete opposite happens, because it sums over examples instead of averaging.
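
A minimal sketch of the logistic regression forward pass and the averaged cross-entropy cost; the names `w`, `b`, `X`, `Y` follow the course convention, everything else here is illustrative:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(w, b, X, Y):
    """Cross-entropy cost for logistic regression.

    X: (n_x, m) input matrix, Y: (1, m) labels in {0, 1}.
    The 1/m average keeps the cost comparable across dataset sizes,
    unlike a raw sum of errors.
    """
    A = sigmoid(np.dot(w.T, X) + b)   # predictions y_hat, shape (1, m)
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return cost
```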


### Computing Derivatives


### Vectorization


[Week 2 Lectures in pdf](https://d3c33hcgiwev3.cloudfront.net/uGak6NjqS0empOjY6ltHVg_e692585b60ad4fe984d1e41657cac1a1_C1W2.pdf?Expires=1630627200&Signature=TyeC23qQaqIvo0mi8S-zIbZc5KXr0-XLJyAmGMrK1gPxxUCWUBV4AXbg5NRF75UPnEDXAd2fFxkyH~-sQMlK1J~Y7KedJPsIw8lTAbAV4OxjE9FWdI15Neq9SL59B4lUUPScCP9wGcbYkvI~Cbdc1jRLlyd0jwcvk4HXtT1-zic_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A)
### Cost function

## Overview and Representation

### Shallow NN





* Sigmoid: use only for binary classification (in the output layer).
* Tanh: output is zero-mean, which helps center the data for the next layer.
* ReLU: the most commonly used activation for hidden layers.

If you use linear (identity) activation functions, the neural network just outputs a linear function of the input, no matter how many layers it has.
A linear activation function is only useful in the output layer, for example when predicting a housing price, i.e., a real value from 0 up to some x.
A linear output can take any real value between -inf and +inf.
To avoid negative output values you would still use ReLU in the output layer.
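
A quick sketch of the activation functions mentioned above (a hypothetical helper module, not the course's code):

```python
import numpy as np

def sigmoid(z):
    # output in (0, 1): used in the output layer for binary classification
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # output in (-1, 1), zero-mean: usually preferable to sigmoid in hidden layers
    return np.tanh(z)

def relu(z):
    # max(0, z): the default choice for hidden layers, never negative
    return np.maximum(0, z)

def linear(z):
    # identity: only sensible in the output layer of a regression network;
    # using it in hidden layers collapses the whole network to a linear model
    return z
```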
### Derivatives of Activation Functions





- [ ] How is dz^[1] derived? (see the sketch below)
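
A sketch of the activation derivatives and of how dz^[1] is computed in the two-layer backprop from the course; the shapes and names (`W2`, `dz2`, `a1`) are assumptions for illustration:

```python
import numpy as np

def sigmoid_prime(a):
    # derivative of sigmoid, written in terms of the activation a = sigmoid(z)
    return a * (1 - a)

def tanh_prime(a):
    # derivative of tanh, written in terms of the activation a = tanh(z)
    return 1 - a ** 2

def relu_prime(z):
    # derivative of ReLU: 1 where z > 0, 0 elsewhere
    return (z > 0).astype(float)

def dz1_from_dz2(W2, dz2, a1):
    # backprop through the hidden layer of a 2-layer network (course notation):
    # dz[1] = W[2]^T . dz[2]  *  g[1]'(z[1]), element-wise product,
    # here assuming the hidden activation g[1] is tanh
    return np.dot(W2.T, dz2) * tanh_prime(a1)
```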
Why not initialize the weights to zero?

By induction, after two iterations, three iterations, and so on, no matter how long you train the neural network, both hidden units are still computing exactly the same function. So there is really no point in having more than one hidden unit, because they are all computing the same thing: the symmetry is never broken.

The random initial values should not be very large: with sigmoid or tanh activations, large values land in the flat regions of the curve where the slope (gradient) is very small, meaning gradient descent, and therefore learning, will be very slow.
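
A minimal sketch of the random initialization recommended in the course for a shallow network; the layer sizes `n_x`, `n_h`, `n_y` are placeholders:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    # small random weights break the symmetry between hidden units;
    # the 0.01 factor keeps tanh/sigmoid out of their flat, slow-gradient regions
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))              # biases can safely start at zero
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```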
[Week 3 Lectures in pdf](https://d3c33hcgiwev3.cloudfront.net/loTqX50UQlSE6l-dFAJU_g_fd20e35d1c814792a6788105e09954a1_C1W3.pdf?Expires=1630627200&Signature=YyCSs6N2FrMsmwwlMf3MpOt3tuTyJ8gPZJ2UCdSs4bujzg9wu-Rhx7i4u1VUG-0ZHFgZgK1~1YTkhVn2HJVA7TvBPd1sWrwFE57BeVLyVbUD6dsoapKS8xzmkUYhugUbYBcWxY3oMacM~JkLZReqP3sGHU-HQaPoqP0f29g7CD4_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A)
## Deep Learning

Why deep learning, i.e., why many hidden layers?






### Parameters and Hyperparameters
By changing the hyperparameters (learning rate, number of iterations, number of hidden layers and units, choice of activation) you indirectly change the parameters (W, b) that the network ends up learning.
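
A small sketch contrasting hyperparameters (chosen by you before training) with parameters (learned by gradient descent); the names and values are illustrative only:

```python
# hyperparameters: set before training, they control how the parameters are learned
hyperparameters = {
    "learning_rate": 0.01,    # alpha
    "num_iterations": 1000,
    "hidden_units": 4,        # size of the hidden layer
    "activation": "relu",     # activation for the hidden layer
}

# parameters: the weights and biases that gradient descent actually learns;
# changing any hyperparameter above changes the final values of W and b
# parameters = {"W1": ..., "b1": ..., "W2": ..., "b2": ...}
```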
