# ML sequential data: Recurrent network (RNN)

### One-to-one, one-to-many, many-to-one, and many-to-many LSTM neural networks

**one-to-one**: a single image (or word, ...) is classified into a single class (binary classification), e.g. is this a bird or not

**one-to-many**: a single image (or word, ...) is classified into multiple classes; "many" means the output is a **one-hot encoded vector**

**many-to-one**: a sequence of images (or words, ...) is classified into a single class (binary classification of a sequence)

**many-to-many**: a sequence of images (or words, ...) is classified into multiple classes

* Translation system: a feed-forward network won't work. Sequences have variable length, and a feed-forward net learns position-specific weights (a separate weight tied to every position), so it would need to see every possible word-order option during training.
  - Remove the position-specific weights
  - Share the same weights: individual predictions, but no past/context

How can the past be included?

**Recurrent Neural Network**: process a sequence in order and share the same parameters at each time step.

For a sequence $X_t$ for $t = 1, 2, 3, \dots, T$:

$$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h), \qquad \text{output: } O_t = H_t W_{hq} + b_q$$

Q: Explain the symbols?
* $H_t$: activation at the current time step $t$
* $\phi$: non-linearity (often a tanh, range in $(-1, 1)$)
* $X_t$: input at time $t$
* $W_{xh}$: learned weights for the input at time $t$
* $H_{t-1}$: activation at the previous time step $t-1$
* $W_{hh}$: learned weights: how to use the previous information from $t-1$
* $W_{hq}$: learned weights for the output at time $t$
* $b_h$, $b_q$: learned bias terms

(A minimal sketch of this forward pass is given at the end of these notes.)

Sequential: an RNN is difficult to train in parallel.

**Teacher forcing**: during training, cut the dependency on the model's previous output and feed in the previous ground truth instead (see the sketch at the end of these notes).

Concatenation trick: $AB + CD = [A \mid C]\begin{bmatrix} B \\ D \end{bmatrix}$, i.e. the two matrix products collapse into one product of the concatenated inputs with the stacked weights, so $X_t W_{xh} + H_{t-1} W_{hh}$ can be computed as a single matrix multiplication (checked numerically at the end of these notes).

Q: How is an RNN different from a feed-forward net $z = f_3(f_2(f_1(x, w_1), w_2), w_3)$?
A: RNNs share weights: $z = f_3(f_2(f_1(x, w), w), w)$

RNNs have difficulty backpropagating long-range relations: the vanishing gradient problem.

Improve RNNs for long-range relationship learning:
* Reset for different logical chunks in the input
* Do not update for uninformative input

Gate: a vector with entries in $(0, 1)$.

**Reset gate $R_t$**: causes the network to ignore the previous memories $H_{t-1}$; multiply the gate $R_t$ element-wise with $H_{t-1}$.

**Update gate $Z_t$**: how much the candidate (already updated) hidden state $\tilde{H}_t$ replaces the previous hidden state $H_{t-1}$: $H_{t-1} \odot Z_t + (1 - Z_t) \odot \tilde{H}_t$, where $\tilde{H}_t$ is the candidate state.

Sigmoid activation forces the gates $R_t$ and $Z_t$ into $(0, 1)$:

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$$
$$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$$

How a GRU differs from a standard RNN:

GRU adds a reset gate:
* Candidate state $\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$

GRU adds an update gate:
* Current state $H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$, where $\tilde{H}_t$ is the candidate state (a GRU step sketch is given at the end of these notes).

**LSTM**
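A minimal NumPy sketch of the vanilla RNN forward pass defined above: $H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h)$ with output $O_t = H_t W_{hq} + b_q$. The dimensions and the function name `rnn_forward` are illustrative assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, T = 4, 8, 3, 5          # assumed toy feature sizes and sequence length

# Shared parameters, reused at every time step (the weight sharing of an RNN)
W_xh = rng.normal(scale=0.1, size=(d_in, d_hidden))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b_h  = np.zeros(d_hidden)
W_hq = rng.normal(scale=0.1, size=(d_hidden, d_out))
b_q  = np.zeros(d_out)

def rnn_forward(X_seq, H0):
    """Process a sequence in order; X_seq has shape (T, batch, d_in)."""
    H = H0
    outputs = []
    for X_t in X_seq:                              # one step at a time, hence hard to parallelise
        H = np.tanh(X_t @ W_xh + H @ W_hh + b_h)   # new hidden state from input and previous state
        outputs.append(H @ W_hq + b_q)             # per-step output O_t
    return np.stack(outputs), H

X_seq = rng.normal(size=(T, 2, d_in))              # toy sequence, batch of 2
O, H_T = rnn_forward(X_seq, H0=np.zeros((2, d_hidden)))
print(O.shape, H_T.shape)                          # (5, 2, 3) (2, 8)
```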
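A quick numerical check of the concatenation trick mentioned above, under the same assumed toy dimensions: the sum of two matrix products equals one product of the concatenated inputs with the stacked weights.

```python
import numpy as np

rng = np.random.default_rng(1)
X  = rng.normal(size=(2, 4))    # stands in for X_t (batch 2, d_in 4)
H  = rng.normal(size=(2, 8))    # stands in for H_{t-1}
Wx = rng.normal(size=(4, 8))    # stands in for W_xh
Wh = rng.normal(size=(8, 8))    # stands in for W_hh

two_products = X @ Wx + H @ Wh
one_product  = np.concatenate([X, H], axis=1) @ np.concatenate([Wx, Wh], axis=0)
print(np.allclose(two_products, one_product))   # True
```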
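A toy sketch contrasting free-running decoding with teacher forcing as described above: teacher forcing feeds the previous ground-truth token instead of the model's own previous prediction, cutting the dependency on earlier mistakes. The vocabulary, targets, and `predict_next` stand-in are all hypothetical.

```python
import numpy as np

vocab = ["<bos>", "the", "cat", "sat"]
targets = [1, 2, 3]                      # assumed ground-truth next tokens: "the cat sat"

def predict_next(prev_token_id, rng):
    """Toy stand-in for an RNN step: returns a (possibly wrong) next-token id."""
    return int(rng.integers(len(vocab)))

rng = np.random.default_rng(3)

# Free-running: the (possibly wrong) prediction at step t becomes the input at step t+1
token = 0
free_inputs = []
for t in range(len(targets)):
    free_inputs.append(token)
    token = predict_next(token, rng)

# Teacher forcing: the input at step t+1 is the ground-truth token, not the prediction
forced_inputs = [0] + targets[:-1]

print("free-running inputs :", [vocab[i] for i in free_inputs])
print("teacher-forced inputs:", [vocab[i] for i in forced_inputs])
```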
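A minimal sketch of one GRU step, following the reset-gate, update-gate, and candidate-state equations above. The parameter initialisation and the name `gru_step` are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
d_in, d_hidden = 4, 8                               # assumed toy sizes
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in {
    "W_xr": (d_in, d_hidden), "W_hr": (d_hidden, d_hidden), "b_r": (d_hidden,),
    "W_xz": (d_in, d_hidden), "W_hz": (d_hidden, d_hidden), "b_z": (d_hidden,),
    "W_xh": (d_in, d_hidden), "W_hh": (d_hidden, d_hidden), "b_h": (d_hidden,),
}.items()}

def gru_step(X_t, H_prev, p):
    R = sigmoid(X_t @ p["W_xr"] + H_prev @ p["W_hr"] + p["b_r"])   # reset gate in (0, 1)
    Z = sigmoid(X_t @ p["W_xz"] + H_prev @ p["W_hz"] + p["b_z"])   # update gate in (0, 1)
    H_cand = np.tanh(X_t @ p["W_xh"] + (R * H_prev) @ p["W_hh"] + p["b_h"])  # candidate state
    return Z * H_prev + (1 - Z) * H_cand            # keep old state where Z ~ 1, take candidate where Z ~ 0

X_t = rng.normal(size=(2, d_in))
H = gru_step(X_t, np.zeros((2, d_hidden)), params)
print(H.shape)                                      # (2, 8)
```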