# Objective Functions
###### tags: `Data Science`, `Notes`
## Regression Objective Functions
Suppose $x_i$ denotes the feature of sample $i$ (for simplicity, the feature vector $\pmb{x}$ is 1D). Given that we have already found the parameters $w$ and $b$ that yield the best prediction for sample $i$, $\hat{y_i}=f^*(x_i|w, b)$ (abbreviated as $f(x_i)$ in what follows), we can use objective functions to evaluate our model.

### Absolute Error (L1 Loss)
* Equation
\begin{align}L(f(\pmb{x}), \pmb{y})=\underset{i}\sum|f(x_i) - y_i|\end{align}
* Derivative
\begin{align}\frac{\partial L(f(\pmb{x}), \pmb{y})}{\partial f(\pmb{x})}=\begin{cases}
1\;&\text{if }f(\pmb{x}) > \pmb{y}\\
-1&\text{otherwise}\end{cases}\end{align}
* Properties
* Less sensitive to outliers (samples with a large residual between prediction and actual value) than square error; see the sketch below.
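A minimal NumPy sketch of the L1 loss and its subgradient; the names `l1_loss` and `l1_grad` are illustrative, not from any particular library:

```python
import numpy as np

def l1_loss(y_pred, y_true):
    """Sum of absolute errors over all samples."""
    return np.sum(np.abs(y_pred - y_true))

def l1_grad(y_pred, y_true):
    """Subgradient of the L1 loss w.r.t. the predictions:
    +1 where the prediction overshoots the target, -1 otherwise."""
    return np.where(y_pred > y_true, 1.0, -1.0)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.0, 10.0])   # the last sample is an outlier
print(l1_loss(y_pred, y_true))        # 8.5
print(l1_grad(y_pred, y_true))        # [ 1. -1.  1.]
```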
### Square Error (L2 Loss)
* Equation
\begin{align}L(f(\pmb{x}), \pmb{y})=\underset{i}\sum(f(x_i) - y_i)^2\end{align}
* Derivative
\begin{align}
\frac{\partial L(f(\pmb{x}), \pmb{y})}{\partial f(\pmb{x})}
=2(f(\pmb{x})-\pmb{y})\end{align}
* Properties
* More sensitive to outliers compared to absolute error (preferred when outliers are of interest)
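For comparison, the same toy data run through the L2 loss shows how the single outlier dominates both the loss and the gradient (a sketch with illustrative names):

```python
import numpy as np

def l2_loss(y_pred, y_true):
    """Sum of squared errors over all samples."""
    return np.sum((y_pred - y_true) ** 2)

def l2_grad(y_pred, y_true):
    """Gradient of the L2 loss with respect to the predictions."""
    return 2.0 * (y_pred - y_true)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.0, 10.0])   # same outlier as in the L1 example
print(l2_loss(y_pred, y_true))        # 50.25, of which 49.0 comes from the outlier
print(l2_grad(y_pred, y_true))        # [ 1. -2. 14.]
```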
### Huber Loss
* Equation
\begin{align}L(f(\pmb{x}), \pmb{y})=\underset{i}\sum
\begin{cases}\frac{1}{2}(f(x_i)-y_i)^2\;\;\;&\text{if}\;|f(x_i)-y_i|\leq\delta\\
\delta(|f(x_i)-y_i|-\frac{1}{2}\delta)&\text{otherwise}\end{cases}\end{align}where $\delta$ is a hyperparameter.
* Derivative
\begin{align}\frac{\partial L(f(\pmb{x}), \pmb{y})}{\partial f(\pmb{x})}=
\begin{cases}f(\pmb{x})-\pmb{y}\;\;\;&\text{if}\;|f(\pmb{x})-\pmb{y}|\leq\delta\\
\delta&\text{if}\;f(\pmb{x})-\pmb{y}>\delta\\
-\delta&\text{otherwise}\end{cases}\end{align}
* Properties
* The hyperparameter $\delta$ controls how much penalty is given to samples with large residuals; see the sketch below.
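A sketch of the Huber loss under the convention above (quadratic inside $\delta$, linear outside); `huber_loss` and `huber_grad` are illustrative names:

```python
import numpy as np

def huber_loss(y_pred, y_true, delta=1.0):
    """Quadratic for residuals within delta, linear for larger residuals."""
    r = y_pred - y_true
    small = np.abs(r) <= delta
    return np.sum(np.where(small, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta)))

def huber_grad(y_pred, y_true, delta=1.0):
    """Gradient w.r.t. the predictions: the residual inside the quadratic zone,
    +/- delta outside of it (i.e. the gradient is clipped at delta)."""
    return np.clip(y_pred - y_true, -delta, delta)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.0, 10.0])
print(huber_loss(y_pred, y_true, delta=1.0))   # 0.125 + 0.5 + 6.5 = 7.125
print(huber_grad(y_pred, y_true, delta=1.0))   # [ 0.5 -1.   1. ]
```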
## Binary Classification
### Cross Entropy (of Binomial Distribution)
* Description
Suppose $y_i=1$ if sample $i$ is labeled as class A and $y_i=0$ if it is labeled as class B. $f(x_i)$ is the network's predicted probability that the sample belongs to class A, and $1-f(x_i)$ is the predicted probability that it belongs to class B.
* Equation
\begin{align}H(f(\pmb{x}), \pmb{y})=-[\underset{i}\sum y_i\ln f(x_i)+(1-y_i)\ln(1-f(x_i))]\end{align}
* Derivative
\begin{align}\frac{\partial H(f(\pmb{x}), \pmb{y})}{\partial f(\pmb{x})}=-\frac{\pmb{y}}{f(\pmb{x})}+\frac{1-\pmb{y}}{1-f(\pmb{x})}\end{align}
* Properties
* Data points far away from the decision boundary also contribute to the loss function.
* The inferred probability $f(x_i)$ is more accurate compared to hinge loss.
* __Sigmoid__ or __softmax__ activation functions are preferred in the output layer for $f(x_i)$.
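A minimal sketch of the binary cross entropy and its gradient; the clipping by a small `eps` only avoids `log(0)` and is an implementation detail, not part of the definition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(p, y):
    """Cross entropy of a binomial model; p = f(x) is the predicted P(class A), y in {0, 1}."""
    eps = 1e-12
    p = np.clip(p, eps, 1.0 - eps)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def binary_cross_entropy_grad(p, y):
    """Gradient of the cross entropy with respect to the predicted probabilities p."""
    eps = 1e-12
    p = np.clip(p, eps, 1.0 - eps)
    return -y / p + (1.0 - y) / (1.0 - p)

y = np.array([1.0, 0.0, 1.0])
p = sigmoid(np.array([2.0, -1.0, 0.1]))   # raw network outputs squashed by a sigmoid
print(binary_cross_entropy(p, y))
print(binary_cross_entropy_grad(p, y))
```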
### Hinge Loss
* Description
Suppose $y_i=1$ if the sample is labeled as class A and $y_i=-1$ if it is labeled as class B. $f(x_i)>0$ means the network __predicts__ class A, and $f(x_i)<0$ means it __predicts__ class B.
* Equation
\begin{align}L(f(\pmb{x}), \pmb{y})=\underset{i}\sum\max(0, 1-y_if(x_i))\end{align}
* Derivative
\begin{align}\frac{\partial L(f(\pmb{x}), \pmb{y})}{\partial f(\pmb{x})}=
\begin{cases}-\pmb{y}\;&\text{if }1-\pmb{y}\otimes f(\pmb{x})>0\\0&\text{otherwise}\end{cases}\end{align}
* Properties
* Data points far away from the decision boundary do not contribute to the loss function (only samples with $y_if(x_i)<1$ do).
* The inferred probability $f(x_i)$ is less accurate compared to cross entropy.
* __Linear__ or __hyperbolic tangent__ activation functions are preferred in the output layer for $f(x_i)$.
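A sketch of the binary hinge loss and its subgradient, assuming labels in $\{-1, +1\}$ as above:

```python
import numpy as np

def hinge_loss(f, y):
    """Binary hinge loss; y in {-1, +1}, f is the raw (linear or tanh) output."""
    return np.sum(np.maximum(0.0, 1.0 - y * f))

def hinge_grad(f, y):
    """Subgradient w.r.t. f: -y where the margin is violated (1 - y*f > 0), 0 elsewhere."""
    return np.where(1.0 - y * f > 0.0, -y, 0.0)

y = np.array([1.0, -1.0, 1.0])
f = np.array([2.5, -0.3, 0.4])
print(hinge_loss(f, y))   # 0 + 0.7 + 0.6 = 1.3; the confident first sample adds nothing
print(hinge_grad(f, y))   # [ 0.  1. -1.]
```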
## Multiclass Classification
### Cross Entropy (of Multinomial Distribution)
* Description
Suppose $y_{ik}=1$ and $y_{il}=0\;(l\ne k)$ if sample $i$ is labeled as class $k$, and $f_c(x_i)$ is the network's predicted probability that the sample belongs to class $c$. Note that $\underset{c}\sum f_c(x_i)=1$ (which can be enforced by the __softmax__ function).
* Equation
\begin{align}H(f(\pmb{x}), \pmb{y})=-\underset{i}\sum \underset{c}\sum y_{ic}\ln f_c(x_i)\end{align}
* Derivative
\begin{align}\frac{\partial H(f(\pmb{x}), \pmb{y})}{\partial f(\pmb{x})}=-\frac{\pmb{y}}{f(\pmb{x})}\end{align}
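A sketch of the multinomial cross entropy with a softmax output layer; `softmax` and `multiclass_cross_entropy` are illustrative names:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax so the class probabilities of each sample sum to 1."""
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(p, y_onehot):
    """Cross entropy of a multinomial model; p[i, c] = f_c(x_i), y_onehot[i, c] = y_ic."""
    eps = 1e-12                            # avoid log(0)
    return -np.sum(y_onehot * np.log(np.clip(p, eps, 1.0)))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
y_onehot = np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
p = softmax(logits)
print(multiclass_cross_entropy(p, y_onehot))
```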
### Hinge Loss (_Weston and Watkins_)
* Description
Suppose $y_{ik}=1$ and $y_{il}=-1\;(l\ne k)$ if sample $i$ is labeled as class $k$, and $f_k(x_i)>0$ means the network __predicts__ class $k$.
* Equation
\begin{align}L(f(\pmb{x}), \pmb{y})=\underset{i}\sum\;\underset{l\ne k}\sum\max(0, 1+f_l(x_i)-f_k(x_i))\end{align}where $k$ denotes the true class of sample $i$.
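A sketch of the Weston and Watkins hinge loss; for convenience the true class of each sample is passed as an index array rather than as $\pm1$ labels:

```python
import numpy as np

def weston_watkins_hinge(f, k):
    """Multiclass hinge loss (Weston and Watkins).
    f[i, c] is the score f_c(x_i); k[i] is the index of the true class of sample i."""
    n = f.shape[0]
    true_scores = f[np.arange(n), k][:, None]           # f_k(x_i), one per sample
    margins = np.maximum(0.0, 1.0 + f - true_scores)    # 1 + f_l(x_i) - f_k(x_i)
    margins[np.arange(n), k] = 0.0                      # drop the l == k terms
    return np.sum(margins)

f = np.array([[3.0, 1.2, 2.5],
              [0.2, 2.0, 1.9]])
k = np.array([0, 1])                                    # true class indices
print(weston_watkins_hinge(f, k))                       # 0 + 0.5 + 0 + 0.9 = 1.4
```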
## Why are MSE and MAE not suitable for classification?
MSE and MAE are not suitable objective functions for classification networks whose output layer uses a saturating activation such as the sigmoid. To see this, first decompose a classification network with a sigmoid output:
\begin{align}f(\pmb{x})=\sigma(w\pmb{x}+b)=\sigma(\pmb{\hat{x}})\end{align}where $\pmb{\hat{x}}=w\pmb{x}+b$. Now, consider the partial derivative of square error:
\begin{align}
\frac{\partial L(f(\pmb{x}), \pmb{y})}{\partial f(\pmb{x})}
&=2(f(\pmb{x})-\pmb{y})\\
&=2(\sigma(\pmb{\hat{x}})-\pmb{y})\end{align}And the derivative of sigmoid function:
\begin{align}\frac{\partial\sigma(\pmb{\hat{x}})}{\partial \pmb{\hat{x}}}=\sigma(\pmb{\hat{x}})(1-\sigma(\pmb{\hat{x}}))\end{align}We can then compute the derivative of $L(f(\pmb{x}), \pmb{y})$ with respect to $w$:
\begin{align}\frac{\partial L(f(\pmb{x}), \pmb{y})}{\partial w}
&=\frac{\partial L(f(\pmb{x}), \pmb{y})}{\partial f(\pmb{x})}\frac{\partial f(\pmb{x})}{\partial w}\\
&=\frac{\partial L(f(\pmb{x}), \pmb{y})}{\partial f(\pmb{x})}\frac{\partial \sigma(\pmb{\hat{x}})}{\partial w}\\
&=\frac{\partial L(f(\pmb{x}), \pmb{y})}{\partial f(\pmb{x})}\frac{\partial \sigma(\pmb{\hat{x}})}{\partial \pmb{\hat{x}}}\frac{\partial \pmb{\hat{x}}}{\partial w}\\
&=2(\sigma(\pmb{\hat{x}})-\pmb{y})\sigma(\pmb{\hat{x}})(1-\sigma(\pmb{\hat{x}}))\pmb{x}
\end{align}Therefore $\frac{\partial L(f(\pmb{x}), \pmb{y})}{\partial w}\to0$ when $f(x)=\sigma(\hat{x})\to1$ or $f(x)=\sigma(\hat{x})\to0$ (__when the model is confident about its prediction, the weights barely update, even if the prediction is wrong__). We can observe a similar result for a network using __softmax__ as the output layer, or using the MAE objective function. On the other hand, one can easily see that cross entropy does not have this problem: substituting the first term with the derivative of cross entropy, the $\sigma(\pmb{\hat{x}})(1-\sigma(\pmb{\hat{x}}))$ factor cancels and the gradient becomes $(\sigma(\pmb{\hat{x}})-\pmb{y})\pmb{x}$.
If the activation function of the output layer is __linear__ or __hyperbolic tangent__ (in this case, $y_i=1$ for a sample labeled as class A, $y_i=-1$ for class B, and $f(x_i)>0$ means the network __predicts__ class A), the predictor makes a correct prediction whenever $y_if(x_i)>0$. However, MSE and MAE keep increasing as $y_if(x_i)$ grows beyond $1$, so they penalize predictions that are confidently correct; hinge loss does not.
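A small numerical check of this argument: for a single, confidently wrong prediction, the squared-error gradient with respect to $w$ is tiny while the cross-entropy gradient stays large (the numbers are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A confidently *wrong* prediction: the true label is y = 1,
# but the pre-activation w*x + b is strongly negative.
x, y, w, b = 1.0, 1.0, -5.0, 0.0
p = sigmoid(w * x + b)                # ~0.0067: the model is almost certain the label is 0

# Squared-error gradient w.r.t. w (derived above): 2*(p - y) * p*(1 - p) * x
grad_mse = 2.0 * (p - y) * p * (1.0 - p) * x
# Cross-entropy gradient w.r.t. w simplifies to (p - y) * x
grad_ce = (p - y) * x

print(grad_mse)                       # ~ -0.013: almost no update despite a bad error
print(grad_ce)                        # ~ -0.99: a large corrective update
```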