{%hackmd SybccZ6XD %}
# SKIP CONNECTIONS ELIMINATE SINGULARITIES
###### tags: `paper`
## INTRODUCTION
- Skip connections are extra connections between nodes in different layers of a neural network that skip one or more layers of processing
- They have been shown to improve the training of ==very deep neural networks==
## RESULTS
### SINGULARITIES IN FULLY-CONNECTED LAYERS AND HOW SKIP CONNECTIONS BREAK THEM
:::warning
**SINGULARITIES**
Degenerate points in the loss landscape caused by non-identifiability of the model; at these points the Hessian of the loss becomes singular and the gradient vanishes along the degenerate directions.
:::
:::warning
**Independent identifiability**
The ability to distinguish the contribution of each unit or connection, i.e., to uniquely determine each unit's or connection's effect on the loss from the network's behavior.
:::
**Type 1: Elimination singularities**
A unit is effectively eliminated when its incoming or outgoing weights are zero; the remaining weights of that unit then become non-identifiable.

**Type 2: Overlap singularities**
Two units with identical incoming weights overlap; due to the permutation symmetry of the hidden units, only the sum of their outgoing weights is identifiable.

**Type 3: Linear dependence singularities**
A subset of the hidden units in a layer becomes linearly dependent.
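A small numerical sketch (a hypothetical toy network, not an example from the paper) of the non-identifiability at an overlap singularity: when two hidden units share identical incoming weights, only the sum of their outgoing weights affects the output, so the individual outgoing weights cannot be identified.

```python
import numpy as np

# Toy one-hidden-layer net: y = v @ tanh(W @ x).
rng = np.random.default_rng(0)
x = rng.normal(size=3)

W = rng.normal(size=(2, 3))
W[1] = W[0]                      # unit 1 overlaps unit 0 (identical incoming weights)
v = rng.normal(size=2)

y = v @ np.tanh(W @ x)

# Redistribute the outgoing weights while keeping their sum fixed:
# the network's output is unchanged, so v[0] and v[1] are not
# individually identifiable at the overlap singularity.
v2 = np.array([v[0] + 0.3, v[1] - 0.3])
y2 = v2 @ np.tanh(W @ x)
assert np.allclose(y, y2)
```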

### WHY ARE SINGULARITIES HARMFUL FOR LEARNING?
- Gradient-based learning is harmed near singularities: the gradient vanishes along the degenerate directions, so learning slows down or stalls

**Type 1: Elimination singularities**
- Example: h = 0 and z = ±1
- Even when the parameters are only in the neighborhood of a singularity (not exactly on it), SGD slows down
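As a minimal sketch of why elimination singularities stall gradient-based learning (a hypothetical one-unit model, not the paper's exact setup): when the outgoing weight z of a unit is zero, the loss gradient with respect to its incoming weight w vanishes for every input, so w becomes non-identifiable and gradient descent cannot move it.

```python
import numpy as np

# One hidden unit: y = z * tanh(w * x), squared-error loss against target t.
def grads(w, z, x, t):
    h = np.tanh(w * x)
    y = z * h
    e = y - t                      # residual
    dw = e * z * (1 - h**2) * x    # dL/dw -- proportional to z
    dz = e * h                     # dL/dz
    return dw, dz

# At the elimination singularity z = 0, dL/dw is exactly zero
# regardless of w, x, or the target.
dw, dz = grads(w=0.7, z=0.0, x=1.5, t=1.0)
print(dw, dz)   # dw is zero; dz is not
```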

### PLAIN NETWORKS ARE MORE DEGENERATE THAN NETWORKS WITH SKIP CONNECTIONS
:::warning
**Negative eigenvalues**
- Directions in parameter space along which the Hessian of the loss has negative eigenvalues, i.e., the loss curves downward
- Gradient descent pushes the parameters away along these directions, which can affect learning and final performance
- Regularization, careful weight initialization, and gradient clipping can help alleviate the impact of negative eigenvalues
:::
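A small numerical sketch (a hypothetical two-parameter example, not from the paper) of negative Hessian eigenvalues: at a saddle point the gradient vanishes but the Hessian has a negative eigenvalue, and gradient descent is repelled from the saddle along the corresponding direction.

```python
import numpy as np

# Loss of a tiny net y = z * tanh(w) on one example with target 1.
def loss(p):
    w, z = p
    return 0.5 * (z * np.tanh(w) - 1.0) ** 2

# Central-difference Hessian of a scalar function f at point p.
def hessian(f, p, eps=1e-4):
    p = np.asarray(p, float)
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.eye(n)[i] * eps
            ej = np.eye(n)[j] * eps
            H[i, j] = (f(p + ei + ej) - f(p + ei - ej)
                       - f(p - ei + ej) + f(p - ei - ej)) / (4 * eps**2)
    return H

# (w, z) = (0, 0) is a saddle: the gradient vanishes there, but the
# Hessian has one negative and one positive eigenvalue (±1).
eig = np.linalg.eigvalsh(hessian(loss, [0.0, 0.0]))
print(eig)
```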

### TRAINING ACCURACY IS RELATED TO DISTANCE FROM DEGENERATE MANIFOLDS
- Under SGD, training performance varied widely across runs
- Runs that ended up closer to the degenerate manifolds achieved lower training accuracy
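One illustrative way to quantify "distance from the overlap manifold" (a hypothetical proxy measure, not necessarily the paper's exact metric): the smallest pairwise distance between the normalized incoming weight vectors of a layer's hidden units. Small values mean some units nearly overlap, i.e., the parameters are close to a degenerate manifold.

```python
import numpy as np

def min_unit_distance(W):
    """Smallest pairwise distance between normalized rows of W.

    W: (n_units, n_inputs) incoming weight matrix of one layer.
    """
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    d = np.linalg.norm(Wn[:, None, :] - Wn[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)    # ignore self-distances
    return d.min()

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
d_far = min_unit_distance(W)       # random units are well separated
W[1] = W[0] + 1e-3                 # make two units nearly overlap
d_near = min_unit_distance(W)      # now close to zero
print(d_far, d_near)
```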

### BENEFITS OF SKIP CONNECTIONS AREN’T EXPLAINED BY GOOD INITIALIZATION ALONE
To investigate whether the benefits of skip connections can be explained by favorable initialization of the parameters alone, the authors introduced a malicious initialization scheme for the residual network: subtracting the identity matrix from the initial weight matrices (so that the skip connection is cancelled at initialization).
- The malicious initialization had only a small adverse effect on the performance of the residual network
- In deep **linear** networks, skip connections do not eliminate the singularities; they only shift the landscape so that typical initializations are farther from them
- In **nonlinear** networks, skip connections genuinely eliminate the singularities
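A minimal sketch (a hypothetical single-unit example) of why an identity skip connection breaks the elimination singularity: in a plain unit, zero incoming weights make the unit's output constant, so its outgoing weight no longer affects the input-output map and becomes non-identifiable; with a skip connection the unit still passes its input through, so the outgoing weight stays identifiable.

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])   # a batch of inputs
w = 0.0                           # incoming weight at the elimination singularity

h_plain = np.tanh(w * x)          # constant (all zeros): the outgoing
                                  # weight is non-identifiable
h_resid = np.tanh(w * x) + x      # identity skip: output still varies
                                  # with x, so the outgoing weight
                                  # remains identifiable
print(h_plain)
print(h_resid)
```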
### NON-IDENTITY SKIP CONNECTIONS