# ML sequential data: Recurrent Neural Networks (RNN)
### one-to-one, one-to-many, many-to-one, and many-to-many LSTM neural networks
**one-to-one** : a single input ( image, word, ... ) maps to a single output, e.g. binary classification: is this a bird or not?
**one-to-many** : a single input maps to a sequence of outputs, e.g. captioning one image with many words; each output is a **one-hot encoded vector** over the classes
**many-to-one** : a sequence of inputs ( images, words, ... ) maps to a single output ( e.g. binary classification of a whole sequence )
**many-to-many** : a sequence of inputs maps to a sequence of outputs ( e.g. translation )
* translation system: a feed-forward network won't work:
  - sequences have variable length,
  - a feed-forward net learns exact (absolute) positions,
  - it would need to see all possible word/sentence combinations during training (weights tied to every position)
* remove the position-specific weights and share the same weights at every position:
  - gives individual per-position predictions, but with no past/context,
  - so how can the past be included?
Recurrent Neural Network:
Process a sequence in order and share the same parameters at each time step
For a sequence $X_t$ for $t = 1, 2, 3, \ldots, T$:

$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h)$

**output** $O_t = H_t W_{hq} + b_q$

Q: Explain the symbols?
- $H_t$: activation at the current time step $t$
- $\phi$: non-linearity (often a tanh, range in $(-1, 1)$)
- $X_t$: input at time $t$
- $W_{xh}$: learned weights for the input at time $t$
- $H_{t-1}$: activation at the previous time step $t-1$
- $W_{hh}$: learned weights: how to use the previous information from $t-1$
- $W_{hq}$: learned weights for the output at time $t$
- $b_h$, $b_q$: learned bias terms
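A minimal NumPy sketch of this recurrence (the sizes, weight names, and random initialization below are illustrative assumptions, not from the notes):

```python
# Vanilla RNN: same weights W_xh, W_hh, W_hq reused at every time step
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 8, 3, 5          # input, hidden, output sizes; sequence length

W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hq = rng.normal(scale=0.1, size=(d_h, d_out))
b_h = np.zeros(d_h)
b_q = np.zeros(d_out)

X = rng.normal(size=(T, d_in))            # one sequence X_1 .. X_T
H = np.zeros(d_h)                         # H_0 = 0
outputs = []
for t in range(T):
    # H_t = tanh(X_t W_xh + H_{t-1} W_hh + b_h)
    H = np.tanh(X[t] @ W_xh + H @ W_hh + b_h)
    # O_t = H_t W_hq + b_q
    outputs.append(H @ W_hq + b_q)

print(np.stack(outputs).shape)            # (T, d_out): one output per time step
```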
RNNs are sequential: each step depends on the previous hidden state, so they are difficult to train in parallel.
Teacher forcing: instead of feeding the model's previous output back in during training, cut that dependency and feed the previous ground-truth token.
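A minimal sketch of teacher forcing vs. free running for a next-token decoder; the `step` function and token values are placeholders, not a real model:

```python
# step() stands in for one decoder step of a real RNN: (prev_token, hidden) -> (prediction, hidden)
def step(prev_token, h):
    return (prev_token + 1) % 10, h       # dummy prediction

targets = [3, 4, 5, 6]                    # ground-truth sequence
start = 2                                 # assumed start token

# Teacher forcing (training): feed the ground-truth previous token,
# so errors do not compound and every step is supervised independently.
h, prev = None, start
for y in targets:
    pred, h = step(prev, h)
    # loss += criterion(pred, y)          # compare prediction with ground truth
    prev = y                              # <-- ground truth, not the model's own prediction

# Free running (inference): feed the model's own previous prediction back in.
h, prev, generated = None, start, []
for _ in range(len(targets)):
    pred, h = step(prev, h)
    generated.append(pred)
    prev = pred                           # <-- model output feeds the next step
print(generated)
```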
Concatenation trick: $AB + CD = [A \mid C]\,[B ; D]$ (put $A$ and $C$ side by side, stack $B$ on top of $D$), so $X_t W_{xh} + H_{t-1} W_{hh}$ can be computed as a single matrix multiplication.
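A quick numerical check of this identity in NumPy (shapes chosen arbitrarily):

```python
# A·B + C·D == [A | C] · [B ; D]  -- one fused matmul instead of two
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 4))   # e.g. X_t
B = rng.normal(size=(4, 8))   # e.g. W_xh
C = rng.normal(size=(2, 6))   # e.g. H_{t-1}
D = rng.normal(size=(6, 8))   # e.g. W_hh

lhs = A @ B + C @ D
rhs = np.concatenate([A, C], axis=1) @ np.concatenate([B, D], axis=0)
print(np.allclose(lhs, rhs))  # True
```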
Q: How is an RNN different from a feed-forward net?
Feed-forward: $z = f_3(f_2(f_1(x, w_1), w_2), w_3)$
A: RNNs share weights:
$z = f_3(f_2(f_1(x, w), w), w)$
RNNs have difficulty backpropagating long-range relations: the vanishing gradient problem.
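A tiny NumPy illustration of why this happens: backpropagating through many tanh steps multiplies many Jacobians, and with typical weights their product shrinks toward zero (the weight scale and sizes below are arbitrary assumptions; with large weights the product can explode instead):

```python
import numpy as np

rng = np.random.default_rng(2)
d_h = 8
W_hh = rng.normal(scale=0.3, size=(d_h, d_h))
h = rng.normal(size=d_h)

grad = np.eye(d_h)                        # dH_t / dH_t at the starting step
for t in range(50):                       # walk 50 time steps
    h = np.tanh(h @ W_hh)
    J = (1 - h**2)[:, None] * W_hh.T      # Jacobian dH_{t+1} / dH_t for H_{t+1} = tanh(H_t W_hh)
    grad = J @ grad                       # chain the Jacobians, as backprop does
    if (t + 1) % 10 == 0:
        print(t + 1, np.linalg.norm(grad))  # norm keeps shrinking toward 0
```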
Improve RNNs for long-range relationship learning:
* Reset for different logical chunks in the input
* Do not update for uninformative input
Gate: a vector with entries in $(0, 1)$
**Reset gate $R_t$**
causes the network to ignore previous memories $H_{t-1}$:
multiply the gate $R_t$ element-wise with $H_{t-1}$.
**Update gate $Z_t$**
controls how much the candidate (already updated) hidden state $\tilde{H}_t$ replaces the previous hidden state $H_{t-1}$:
$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$, where $\tilde{H}_t$ is the already updated (candidate) state.
Sigmoid activation forces the gates $R_t$ and $Z_t$ into $(0, 1)$:
$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$
$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$
How a GRU differs from a standard RNN:
GRU adds a reset gate:
- Candidate state $\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$
GRU adds an update gate:
- Current state $H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$, where $\tilde{H}_t$ is the candidate state.
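A minimal NumPy sketch of one GRU step built from these equations (parameter names and sizes are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
d_in, d_h = 4, 8

# Reset gate, update gate, and candidate-state parameters
W_xr, W_hr, b_r = rng.normal(scale=0.1, size=(d_in, d_h)), rng.normal(scale=0.1, size=(d_h, d_h)), np.zeros(d_h)
W_xz, W_hz, b_z = rng.normal(scale=0.1, size=(d_in, d_h)), rng.normal(scale=0.1, size=(d_h, d_h)), np.zeros(d_h)
W_xh, W_hh, b_h = rng.normal(scale=0.1, size=(d_in, d_h)), rng.normal(scale=0.1, size=(d_h, d_h)), np.zeros(d_h)

def gru_step(x_t, h_prev):
    r_t = sigmoid(x_t @ W_xr + h_prev @ W_hr + b_r)              # reset gate in (0, 1)
    z_t = sigmoid(x_t @ W_xz + h_prev @ W_hz + b_z)              # update gate in (0, 1)
    h_cand = np.tanh(x_t @ W_xh + (r_t * h_prev) @ W_hh + b_h)   # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_cand                   # blend old state and candidate

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):    # run a short sequence
    h = gru_step(x_t, h)
print(h.shape)
```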
**LSTM**