MT & Transformer

## MT

* Encoder-Decoder
    * RNN
    * Problem: (same as the RNN problem)
    * Fix: attention mechanism
* Attention:
    * Takes information from the source and passes it through a softmax, so information that is not needed at the current decoder position is given less weight => no information overload
    * Input: all encoder states with 1 decoder state
* Cons: slow because of recurrence (sequential), cannot be processed in parallel

## Transformer

* Pros: Faster learning
* All using attention (encoder, decoder, encoder-decoder interaction)
* Structure:
    * Query: encode "What am I looking for"
    * Key: encode "What can I offer"
    * Value: encode "What I actually offer during attention"
    * Self attention
        * Each word is converted into query, key and value vectors, which are used to create vectors that better understand context
    * Cross attention
    * Masked self-attention: at each position, the words whose information is not available yet (the following words) are masked, so a position only attends to previous words
    * Feed-forward blocks
    * Residual connections
    * Positional encoding: because all words are processed at the same time, the model needs to know the position of each word; this layer encodes position information as a vector

### Encoder-only

BERT

Masked LM

### Decoder-only

GPT-2, GPT-3,...

Causal LM: generates words
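
The query/key/value idea and the masking used by causal (decoder-only) models in these notes can be sketched in a few lines of plain Python. This is a minimal illustration, not how any library implements it: the function names `softmax` and `attention` are made up for this note, the vectors are plain lists, and the learned projections that a real Transformer applies to produce queries, keys and values are left out (the toy example reuses the same vectors for all three). The `causal` flag assumes self-attention, where query `i` and key `i` refer to the same position.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors.

    queries: n_q vectors of size d, keys: n_k vectors of size d,
    values: n_k vectors. With causal=True, query i only sees keys 0..i.
    Returns (outputs, weights).
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # masked: future position
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d))
        weights = softmax(scores)  # masked positions get weight 0
        # Output = weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy self-attention: the same vectors serve as queries, keys and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, weights = attention(tokens, tokens, tokens, causal=True)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only see itself
```

With `causal=False` every token attends to every other token, which is the encoder-style (BERT-like) setting. For cross attention, the queries would come from the decoder and the keys and values from the encoder states.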