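# 5. HUNG-YI LEE 2022 ML - seq2seq

* Using the Transformer as an example

###### tags: `Machine Learning`

[【機器學習2021】Transformer (上)](https://www.youtube.com/watch?v=n9TlOhRjYoc&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=12&ab_channel=Hung-yiLee)
[【機器學習2021】Transformer (下)](https://www.youtube.com/watch?v=N6aRv06iv2g&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=13&ab_channel=Hung-yiLee)

## Intro.

* The Transformer is a **sequence to sequence model (seq2seq)**
    * Input length = output length: hw2
    * Output is a single item: hw4
    * The machine decides the output length: hw5 (transformer!)
        1. Speech recognition
            * Chinese speech > Chinese text
        2. Machine translation
            * English text > Chinese text
        3. Speech translation
            * English speech > Chinese text
            * Why not chain 1 + 2 to get 3? Because some languages have no written form
* seq2seq applications
    * Taiwanese speech recognition
    * Taiwanese speech synthesis, TTS (text-to-speech synthesis)
        * Input text > output speech signal
        * Uses a transformer-like network: the Tacotron model
    * Chatbot
    * Document summarization
    * Sentiment analysis model
        * Decides whether a sentence is positive or negative
* seq2seq can be used for multi-label classification
    * Multi-class classification means there are more than two classes and each item gets exactly one; multi-label classification means the same item can belong to several classes at once, e.g. one article can carry several tags
* seq2seq for object detection
    * https://arxiv.org/abs/2005.12872

## seq2seq components

2014/09: [the origin of seq2seq](https://arxiv.org/pdf/1409.3215.pdf)
![](https://i.imgur.com/7jbrtJ7.png)

The commonly cited transformer architecture: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
![](https://i.imgur.com/7Gv1k3Q.png)

### Encoder in transformer

* In short: it takes in a sequence of vectors and outputs a sequence of vectors of the same length
    * Can be implemented with an RNN or a CNN
    * The transformer's encoder uses the self-attention mechanism
* To give the input position information, positional encoding is applied first
    * See the self-attention lecture
* Residual connection
    * The sublayer's **input is added to its output** before the result is passed on
* Norm
    * Layer Norm
        * Simpler than batch norm
* Fully connected feed-forward network
* BERT = transformer's Encoder
    * To be studied...
* ![](https://i.imgur.com/Mos4MN5.png)

### Decoder in transformer

There are two kinds of decoder:

#### 1. Autoregressive (AT)

* Using speech recognition as the example
    * The input vector sequence (the audio signal) first goes through the Encoder, which outputs a vector sequence; this then **goes through the Decoder, which produces the recognition result**
* How does the AT decoder produce text?
  ![](https://i.imgur.com/OAF45qp.png)
    * **It produces tokens one at a time, each conditioned on the tokens produced so far, until it emits END**, because the output length cannot be known in advance
* How does the Decoder **read** the Encoder's output?
    * Covered later
* Applications
    * Speech synthesis: the Tacotron model uses an AT decoder

#### 2. Non-autoregressive (NAT)

* How does the NAT decoder produce text?
    * It produces all outputs at once. Since the output length cannot be known in advance, we can either
        1. Train a classifier as a predictor of the output length
        2. Output a very long sequence and ignore the tokens after END
* Advantages
    * **Parallelization**
        * The whole sentence is produced in one step, so it can be faster than an AT decoder
        * When only RNN/LSTM were available nobody used NAT; thanks to self-attention and the transformer, NAT has become a popular research topic
    * Easier to control the output length
* Disadvantages
    * NAT decoders currently perform worse than AT decoders
        * [multi-modality problem](https://www.youtube.com/watch?v=jvyKmU4OM3c&ab_channel=Hung-yiLee)
* Applications
    * Speech synthesis: the FastSpeech model uses a NAT decoder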
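
The difference between the AT and NAT decoding styles is mostly a difference in control flow, which a small sketch can make concrete. Everything below is a toy assumption, not the lecture's model: `toy_next_token` and `toy_token_at` are hypothetical stand-ins for a trained decoder, and the "encoder output" is simply the list of target tokens, so the only thing on display is the loop-until-END pattern versus emitting all positions at once.

```python
BOS, END = "<BOS>", "<END>"

def toy_next_token(encoder_out, prefix):
    """Hypothetical AT decoder step: next token given everything generated so far."""
    pos = len(prefix) - 1  # prefix[0] is BOS, so this is the next output position
    return encoder_out[pos] if pos < len(encoder_out) else END

def at_decode(encoder_out, max_len=20):
    """Autoregressive: one token per step, each output fed back in, stop at END."""
    prefix = [BOS]
    while len(prefix) <= max_len:  # safety cap in case END never appears
        token = toy_next_token(encoder_out, prefix)
        if token == END:
            break
        prefix.append(token)
    return prefix[1:]

def toy_token_at(encoder_out, pos):
    """Hypothetical NAT decoder output at one position, independent of other outputs."""
    return encoder_out[pos] if pos < len(encoder_out) else END

def nat_decode_with_length(encoder_out, predict_length):
    """NAT option 1: a separate predictor chooses the length, then all positions are emitted."""
    n = predict_length(encoder_out)
    return [toy_token_at(encoder_out, i) for i in range(n)]  # no dependency between positions

def nat_decode_truncate(encoder_out, max_len=20):
    """NAT option 2: emit a long sequence, then ignore everything from the first END on."""
    raw = [toy_token_at(encoder_out, i) for i in range(max_len)]
    return raw[:raw.index(END)] if END in raw else raw

source = ["ji", "qi", "xue", "xi"]
print(at_decode(source))
print(nat_decode_with_length(source, len))
print(nat_decode_truncate(source))
```

In `at_decode` each step needs the previous step's output, so the steps cannot run in parallel; in both NAT variants the per-position calls are independent of each other. Passing a different `predict_length` function changes the output length directly, which is one way to read the "easier to control the output length" point.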
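
Going back to the "Encoder in transformer" section, the residual connection, layer norm, and feed-forward steps listed there can be sketched in plain Python. This is a minimal sketch under stated assumptions: the sublayer order follows the figure in "Attention Is All You Need" (add the residual, then normalize), the learnable scale and shift of layer norm are omitted, and `toy_attention` and `toy_ffn` are hypothetical placeholders for the real self-attention and feed-forward layers.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize ONE vector across its own features; no batch statistics are needed."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection (input + sublayer output) followed by layer norm."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def encoder_block(xs, self_attention, feed_forward):
    """xs: list of vectors, assumed to already include positional encoding."""
    attended = self_attention(xs)  # mixes information across positions
    hs = [add_and_norm(x, a) for x, a in zip(xs, attended)]
    # The feed-forward network is applied to each position separately.
    return [add_and_norm(h, feed_forward(h)) for h in hs]

def toy_attention(xs):
    """Hypothetical stand-in: every position attends equally to all positions."""
    dim = len(xs[0])
    avg = [sum(x[d] for x in xs) / len(xs) for d in range(dim)]
    return [avg for _ in xs]

def toy_ffn(h):
    """Hypothetical stand-in for the fully connected layers: scale, then ReLU."""
    return [max(0.0, 2.0 * v) for v in h]

out = encoder_block([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]], toy_attention, toy_ffn)
print(len(out), len(out[0]))  # same number of vectors and same dimension as the input
```

Because `layer_norm` only looks at a single vector, it behaves the same for any batch size, which is one way to read the note that it is simpler than batch norm.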