# HW15 Solution
:::warning
Multiple-answer questions ([2.](#Prob-2) [20.](#Prob-20) [21.](#Prob-21) [22.](#Prob-22) [24.](#Prob-24)) are credited only when every choice is correct!
[Prob. 22](#Prob-22): AB, AD, BD, and ABD all receive credit.
HW15 quiz link: https://cool.ntu.edu.tw/courses/4793/quizzes/12092
:::
## Basic Concepts
###### \[Prob. 1\]
#### What does meta learning want to learn?
:white_circle: data distribution
:white_circle: mapping function for a task
:white_circle: mapping function for many tasks
:black_circle: learning how to learn
###### \[Prob. 2\]
#### Which of the following can be meta learned? (select 5 of them)
:white_square_button: model initialization
:white_square_button: model structure
:white_square_button: model optimizer
:white_square_button: the whole learning algorithm
:white_large_square: model name
:white_square_button: feature extractor
:white_large_square: implementation toolkit
> The correct answers are all options except "model name" and "implementation toolkit".
> Those two options are pure distractors:
> a model's name is just whatever people decide to call it, and a toolkit is merely what we use to implement models and algorithms; neither has anything to do with what is being learned.
###### \[Prob. 3\]
#### Which is NOT the reason we need meta learning?
:white_circle: There may be too many tasks to learn.
:white_circle: We hope to customize the model into various scenarios.
:black_circle: It is a powerful tip that does not require training and therefore should be applied everywhere.
:white_circle: The amount of data may not be sufficient to apply the general learning algorithm.
###### \[Prob. 4\]
#### Which of the following uses meta learning?
:white_circle: Reptile \[Nichol et al., 2018\] (https://arxiv.org/abs/1803.02999)
:white_circle: DARTS \[Liu et al., 2018\] (https://arxiv.org/abs/1806.09055)
:white_circle: Siamese Network \[Koch et al., 2015\] (https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf)
:black_circle: All of the above
> Reptile, DARTS, and the Siamese Network are all **models/algorithms designed around meta-learning ideas**:
> - Reptile learns to initialize (cf. "model initialization" in Prob. 2); MAML and others belong to the same family.
> - DARTS learns a model structure (cf. "model structure" in Prob. 2); the whole NAS (Neural Architecture Search) line belongs to the same family.
> - The Siamese Network learns to compare (cf. "feature extractor" in Prob. 2); Prototypical Networks, Matching Networks, and others belong to the same family.
###### \[Prob. 5\]
#### In the meta-training phase, we adapt the parameters of the model on the support set in the inner-loop. Which loss is used to update the parameters in the outer-loop?
:black_circle: The loss tested on the query set of the training tasks with the adapted parameters.
:white_circle: The loss tested on the support set of the training tasks with the adapted parameters.
:white_circle: The loss tested on the query set of the testing tasks with the adapted parameters.
:white_circle: The loss tested on the support set of the testing tasks with the adapted parameters.
> Use **meta-loss**, which is calculated from the query set of the training tasks. The testing tasks are never involved in the training process.
###### \[Prob. 6\]
#### What does it mean that a classification problem is 𝑛 way 𝑘 shot?
:white_circle: In a task, there are 𝑘 classes and 𝑛 training data for each class.
:black_circle: In a task, there are 𝑛 classes and 𝑘 training data for each class.
:white_circle: In a dataset, there are 𝑛 tasks and 𝑘 classes for each task.
:white_circle: In a dataset, there are 𝑘 tasks and 𝑛 classes for each task.
> $k$-shot $\equiv$ $k$ examples. Recall one-shot, few-shot, zero-shot.
> $n$-way $\equiv$ $n$ classes (ways of choosing).
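To make the definition concrete, below is a minimal, hypothetical Python sketch of sampling one such task; the function and variable names (`sample_task`, `examples_by_class`, etc.) are ours, not from the sample code.

```python
import random

def sample_task(examples_by_class, n_way=5, k_shot=1, query_per_class=15):
    """Build one n-way k-shot task: sample n classes, then k support examples
    and some query examples per class (labels are re-indexed 0..n-1 per task)."""
    classes = random.sample(list(examples_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        chosen = random.sample(examples_by_class[cls], k_shot + query_per_class)
        support += [(x, label) for x in chosen[:k_shot]]
        query += [(x, label) for x in chosen[k_shot:]]
    return support, query
```

For example, a 5-way 1-shot task built this way has 5 classes with 1 support example per class, plus some query examples per class for evaluation.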
## Implementation
###### \[Prob. 7\]
#### In the [block of code](https://colab.research.google.com/drive/1N73PnZmZ3jimFe8023lX2w9yNLJOuO5i#scrollTo=KjNxrWW_yNck&line=3&uniqifier=1) of the `OriginalMAML()` function in the sample code, which part of the code performs the updating of the parameters in the outer-loop? (Refer to the image)
:::spoiler (View image)

:::
:white_circle: Line 6 ~ 7
:white_circle: Line 12 ~ 23
:white_circle: Line 24 ~ 32
:black_circle: Line 34 ~ 41
> Line 6 ~ 7: Getting data
> Line 12 ~ 23: inner loop update
> Line 24 ~ 32: collecting meta-test loss
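For reference, here is a minimal PyTorch sketch of the structure these line ranges describe. It is **not** the sample code itself: the function and variable names, the use of `torch.func.functional_call` (PyTorch ≥ 2.0), and the single inner step are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def maml_outer_step(model, meta_optimizer, tasks, inner_lr=0.4, inner_steps=1):
    """One meta-batch: adapt on each support set in the inner loop, collect the
    meta-loss on each query set, then take one outer-loop step on the meta parameters."""
    meta_loss = 0.0
    for support_x, support_y, query_x, query_y in tasks:              # getting data
        fast_weights = dict(model.named_parameters())                 # start from theta
        for _ in range(inner_steps):                                  # inner-loop update
            logits = torch.func.functional_call(model, fast_weights, (support_x,))
            loss = F.cross_entropy(logits, support_y)
            grads = torch.autograd.grad(loss, list(fast_weights.values()),
                                        create_graph=True)            # keep the graph for MAML
            fast_weights = {name: p - inner_lr * g
                            for (name, p), g in zip(fast_weights.items(), grads)}
        logits = torch.func.functional_call(model, fast_weights, (query_x,))
        meta_loss = meta_loss + F.cross_entropy(logits, query_y)      # collect meta-loss (query set)
    meta_optimizer.zero_grad()                                        # outer-loop update
    (meta_loss / len(tasks)).backward()
    meta_optimizer.step()
```

The detail that matters for Probs. 10 ~ 12 is the `create_graph=True` flag in the inner loop: it keeps the computation graph alive so that the outer-loop gradient can flow back to the meta parameters.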
###### \[Prob. 8\]
#### In the [block of code](https://colab.research.google.com/drive/1N73PnZmZ3jimFe8023lX2w9yNLJOuO5i#scrollTo=KjNxrWW_yNck&line=3&uniqifier=1) of the `OriginalMAML()` function in the sample code, which part of the code performs the updating of the parameters in the inner-loop? (Refer to the image)
:::spoiler (View image)

:::
:white_circle: Line 6 ~ 7
:black_circle: Line 12 ~ 23
:white_circle: Line 24 ~ 32
:white_circle: Line 34 ~ 41
###### \[Prob. 9\]
#### In the ANIL (Almost No Inner Loop) algorithm, what parameters are updated in the inner-loop?

:white_circle: all the model parameters
:black_circle: the task specific parameters
:white_circle: all the model parameters except for the task specific parameters
:white_circle: none of the above
> According to [the paper proposing ANIL](https://arxiv.org/pdf/1909.09157.pdf), only the head parameters are updated, which makes training faster and more effective.
<span id="tocheck"></span>
:::danger
Prob. 10 ~ 12 essentially show that
**replacing Lines 17~23 with something like `inner_update_alt1` is all it takes to get FOMAML**, i.e., setting `create_graph` to `False`.
(Below, let the meta parameters be $\theta$ and the inner-loop parameters be $\phi$.)
With this change, $\frac{\partial L}{\partial \theta}$ is approximated by $\frac{\partial L}{\partial \phi}$.
The original MAML uses `create_graph`, so the parameters the inner-loop gradients are taken with respect to differ from the ones the outer loop differentiates, which is what distinguishes $\theta$ from $\phi$.
Without `create_graph`, we get $\frac{\partial L}{\partial \theta} \approx \frac{\partial L}{\partial \phi}$, because the derivative is taken with respect to the same copy of the parameters.
Apart from that, the ==way== the outer-loop update is performed is **exactly the same**; the only thing that changes is ==what== the derivative is taken with respect to (the denominator of the partial). A hypothetical sketch of this one-flag change follows after this block.
:::
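As a concrete (and hypothetical) illustration of that change, an `inner_update_alt1`-style first-order step could look like the following, reusing the conventions of the sketch under Prob. 7:

```python
import torch
import torch.nn.functional as F

def inner_update_first_order(model, fast_weights, support_x, support_y, inner_lr=0.4):
    """One first-order (FOMAML-style) inner step: identical to the MAML inner step
    above except that create_graph=False, so the inner gradients are treated as
    constants and dL/d(theta) is effectively approximated by dL/d(phi)."""
    logits = torch.func.functional_call(model, fast_weights, (support_x,))
    loss = F.cross_entropy(logits, support_y)
    grads = torch.autograd.grad(loss, list(fast_weights.values()),
                                create_graph=False)   # drop the second-order terms
    return {name: p - inner_lr * g
            for (name, p), g in zip(fast_weights.items(), grads)}
```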
###### \[Prob. 10\]
#### In the [block of code](https://colab.research.google.com/drive/1N73PnZmZ3jimFe8023lX2w9yNLJOuO5i#scrollTo=KjNxrWW_yNck&line=3&uniqifier=1) of the `OriginalMAML()` function in the sample code, which part of the code should we modify to achieve the first-order approximation version (FOMAML)? (Refer to the image)
<!-- :::spoiler (View image) -->

<!-- ::: -->
:white_circle: Between line 4 ~ 5
:black_circle: Lines 17 ~ 23
:white_circle: Between line 25 ~ 27
:white_circle: Line 39
###### \[Prob. 11\]
#### The [block of code](https://colab.research.google.com/drive/1N73PnZmZ3jimFe8023lX2w9yNLJOuO5i#scrollTo=KjNxrWW_yNck&line=3&uniqifier=1) of the `OriginalMAML()` function in the sample code can be abstracted into the `MetaAlgorithm()` function in `MetaAlgorithmGenerator()`. <br /> Which function in the code should we modify to achieve the first-order approximation version (FOMAML)? (Refer to the image)
:::spoiler (View image)

:::
:black_circle: inner_update
:white_circle: collect_gradients
:white_circle: calculate_accuracy
:white_circle: outer_update
###### \[Prob. 12\]
#### The [block of code](https://colab.research.google.com/drive/1N73PnZmZ3jimFe8023lX2w9yNLJOuO5i#scrollTo=KjNxrWW_yNck&line=3&uniqifier=1) of the `OriginalMAML()` function in the sample code can be abstracted into the `MetaAlgorithm()` function in `MetaAlgorithmGenerator()` .
#### Which function can we use to achieve FOMAML?
:::spoiler (View image)

:::
:black_circle: inner_update_alt1
:white_circle: inner_update_alt2
:white_circle: collect_gradients_alt
:white_circle: outer_update_alt
:::danger
Prob. 13 ~ 15 essentially show that
**replacing Lines 17~23 with something like `inner_update_alt2` is all it takes to get ANIL**, i.e., **only the last two layers of the model are updated while everything else stays fixed**, which achieves feature reuse.
Apart from that, the algorithm proceeds through exactly the same steps as MAML. (A hypothetical sketch follows after this block.)
:::
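Again as a hypothetical illustration in the style of the earlier sketches, an `inner_update_alt2`-style ANIL step could look like the following. The hint above takes the last two layers; this sketch instead picks the head by parameter name, which is just one way to express the same idea (the `head_keyword` is an assumption and must match the actual model's layer names).

```python
import torch
import torch.nn.functional as F

def inner_update_anil(model, fast_weights, support_x, support_y,
                      inner_lr=0.4, head_keyword="classifier"):
    """One ANIL-style inner step: only the head parameters (selected here by a
    hypothetical name filter) are adapted; the feature extractor is reused as-is."""
    logits = torch.func.functional_call(model, fast_weights, (support_x,))
    loss = F.cross_entropy(logits, support_y)
    head = {name: p for name, p in fast_weights.items() if head_keyword in name}
    grads = torch.autograd.grad(loss, list(head.values()), create_graph=True)
    adapted = {name: p - inner_lr * g for (name, p), g in zip(head.items(), grads)}
    # Everything outside the head is passed through unchanged (feature reuse).
    return {name: adapted.get(name, p) for name, p in fast_weights.items()}
```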
###### \[Prob. 13\]
#### In the [block of code](https://colab.research.google.com/drive/1N73PnZmZ3jimFe8023lX2w9yNLJOuO5i#scrollTo=KjNxrWW_yNck&line=3&uniqifier=1) of the `OriginalMAML()` function in the sample code, which part of the code should we modify to achieve the ANIL algorithm? (Refer to the image)
:::spoiler (View image)

:::
:white_circle: Between line 4 ~ 5
:black_circle: Lines 17 ~ 23
:white_circle: Between line 25 ~ 27
:white_circle: Line 39
###### \[Prob. 14\]
#### The [block of code](https://colab.research.google.com/drive/1N73PnZmZ3jimFe8023lX2w9yNLJOuO5i#scrollTo=KjNxrWW_yNck&line=3&uniqifier=1) of the `OriginalMAML()` function in the sample code can be abstracted into the `MetaAlgorithm()` function in `MetaAlgorithmGenerator()`.
#### Which function in the code should we modify to achieve the ANIL algorithm? (Refer to the image)
:::spoiler (View image)

:::
:black_circle: inner_update
:white_circle: collect_gradients
:white_circle: calculate_accuracy
:white_circle: outer_update
###### \[Prob. 15\]
#### The [block of code](https://colab.research.google.com/drive/1N73PnZmZ3jimFe8023lX2w9yNLJOuO5i#scrollTo=KjNxrWW_yNck&line=3&uniqifier=1) of the `OriginalMAML()` function in the sample code can be abstracted into the `MetaAlgorithm()` function in `MetaAlgorithmGenerator()` .
#### Which function can we use to achieve the ANIL algorithm?
:::spoiler (View image)

:::
:white_circle: inner_update_alt1
:black_circle: inner_update_alt2
:white_circle: collect_gradients_alt
:white_circle: outer_update_alt
:::danger
A note on what `collect_gradients_alt` and `outer_update_alt` are for: implementing Reptile.
Because Reptile replaces the gradient with the difference between model parameters, `collect_gradients_alt` is needed to obtain the difference between the model before and after the inner update, and this "gradient-like vector" is then handed to the outer loop for the update.
If you are interested, think about how you would implement the Reptile algorithm; a rough sketch follows after this block.
Also, because the inputs and outputs of these functions do not correspond, forcing together functions with mismatched variables may simply break the program. For example,
```python
inner_update = outer_update_alt
```
is not a recommended move (the two functions' variables are almost entirely unrelated).
:::
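Here is the promised rough sketch of the Reptile idea itself, written to be self-contained and independent of the sample code; the plain-SGD inner loop and every name in it are assumptions for illustration only.

```python
import copy
import torch
import torch.nn.functional as F

def reptile_outer_step(model, tasks, inner_lr=0.02, inner_steps=5, meta_lr=0.1):
    """Reptile: run plain SGD on each task, then use the parameter difference
    (theta - phi) as a surrogate gradient for the outer-loop update."""
    surrogate_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support_x, support_y in tasks:
        task_model = copy.deepcopy(model)                     # phi starts at theta
        optimizer = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                          # ordinary inner-loop SGD
            optimizer.zero_grad()
            F.cross_entropy(task_model(support_x), support_y).backward()
            optimizer.step()
        # "Gradient-like vector": difference between the parameters before and
        # after adaptation, accumulated over the tasks in the meta-batch.
        for g, p_before, p_after in zip(surrogate_grads,
                                        model.parameters(), task_model.parameters()):
            g += p_before.detach() - p_after.detach()
    with torch.no_grad():                                     # outer-loop update
        for p, g in zip(model.parameters(), surrogate_grads):
            p -= meta_lr * g / len(tasks)
```

Note that nothing here needs second-order gradients: the "gradient" fed to the outer loop is just the parameter difference, averaged over the meta-batch.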
###### \[Prob. 16\]
#### During the meta training phase, the learning rate of the inner loop and outer loop must be the same.
:white_circle: True   :black_circle: False
###### \[Prob. 17\]
#### The adaptation steps of the inner loop during the meta training phase have to be the same as the adaptation steps during the meta testing phase.
:white_circle: True   :black_circle: False
> Here, *adaptation step* means **the number of times the model is updated** ("the number of steps" might be more precise, but this should be unambiguous, and the name matches the conventional variable name `inner_step`).
> To parse the statement, focus on the underlined parts:
> > The adaptation ++steps of++ the inner loop during the ++meta training++ phase **++have to be the same++**++ as++ the adaptation steps during the ++meta testing++ phase.
> >
> The point is that during meta-training the number of updates is often kept small to avoid overfitting, whereas at meta-testing time the meta-algorithm has already been learned, so more update steps can be taken to get better accuracy. Nothing forces meta-testing to use only one update just because meta-training did.
>
## Advanced Tips
> There are various modifications to MAML, for example MAML++. The following questions will be based on [this paper](https://arxiv.org/pdf/1810.09502.pdf).
###### \[Prob. 18\]
#### Which one is not an issue of MAML?
:white_circle: Different neural network architectures may lead to wide variances of results.
:white_circle: It requires updating parameters during inference. This sometimes leads to huge computational costs compared to non-adaptive methods.
:black_circle: It usually overfits when adapting the model with the support set.
:white_circle: Sometimes it requires difficult hyperparameter searches to stabilize training.
> The answer is in the abstract of the paper.
> 
###### \[Prob. 19\]
#### MAML++ is an improved variant of the MAML framework that offers the flexibility of MAML along with many improvements. Which of the following is not an improvement of MAML++?
:white_circle: robust and stable training
:white_circle: automatic learning of the inner loop hyperparameters
:black_circle: improved computational efficiency in the inference phase, but not the training phase
:white_circle: improves generalization performance
> The answer is in the last paragraph of the paper's introduction.
> 
###### \[Prob. 20\]
#### The paper of MAML++ points out several problems of the original MAML. Which of the following is/are correct? (select 3 choices)
:white_square_button: It may be very unstable during training.
:white_large_square: The first order derivative computation requires a lot more time compared to the second order derivative computation.
:white_square_button: The way that batch normalization is used may affect the generalization performance of MAML.
:white_large_square: In every inner loop step, the learning rate for adaptation varies.
:white_square_button: In every outer loop step, the learning rate for adaptation is fixed.
> Section 3.1
###### \[Prob. 21\]
#### Batch Normalization plays a key role in batch processing and calculation. Which of the following is/are correct? (select 4 choices)
:white_square_button: Only using the statistics of a current batch for batch normalization may affect the generalization performance of MAML.
:white_large_square: Applying accumulating running statistics in batch normalization results in batch normalization being less effective, since the biases learned have to accommodate a variety of different means and standard deviations instead of a single mean and standard deviation.
:white_square_button: Batch normalization using accumulated running statistics will eventually converge to some global mean and standard deviation.
:white_square_button: Using accumulating running statistics instead of batch statistics may greatly increase convergence speed, stability, and generalization performance, as the normalized features will result in a smoother optimization landscape.
:white_square_button: Batch normalization in MAML has a problem that batch normalization biases are not updated in the inner-loop; instead, the same biases are used throughout all iterations of base-models.
> The answer is in Section 3.1 of the paper; option (b) describes what happens when accumulating running statistics are *not* used.
###### \[Prob. 22\]
#### Section 4 is about a stable, automated and improved MAML training method. Which of the following is/are correct? (Select ~~2~~ ==3== choices)
:white_square_button: For a meta-batch, MAML updates the model parameters in the outer-loop after it has finished all the inner-loop steps. But with multi-Step loss optimization, it minimizes the target set loss computed by the base-network after every step towards a support set task.
:white_square_button: Learning a set of biases per-step within the inner-loop update process may help fix the shared (across step) batch normalization bias problem of the original MAML training method.
:white_large_square: Sharing running batch statistics across all update steps of the inner-loop is better than collecting statistics in a per-step regime because the latter leads to optimization issues and potentially slow down or altogether halt optimization.
:white_square_button: Learning a learning rate and direction for each layer in the network is better than learning a learning rate and gradient direction for each parameter in the base-network since the latter causes increased number of parameters and increased computational overhead.
> Option \(c\) is reversed.
> You can get full points if you choose at least 2 of them.
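To make the per-step idea from the last two problems concrete, here is a minimal sketch of a batch-norm module that keeps a separate set of running statistics (and affine parameters) for every inner-loop step; the class name and interface are ours, not from the MAML++ code.

```python
import torch.nn as nn

class PerStepBatchNorm2d(nn.Module):
    """Per-step batch normalization: one set of running statistics (and affine
    weight/bias) for every inner-loop step, instead of one shared set."""
    def __init__(self, num_features, num_inner_steps, momentum=0.1, eps=1e-5):
        super().__init__()
        self.norms = nn.ModuleList(
            [nn.BatchNorm2d(num_features, eps=eps, momentum=momentum)
             for _ in range(num_inner_steps)]
        )

    def forward(self, x, step):
        # Normalize with the statistics accumulated for this particular inner step.
        return self.norms[step](x)
```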
## Applications
:::info
Speech separation is a long-standing problem, also known as the cocktail party problem. (For more information, please refer to these two videos: [ss1](https://youtu.be/tovg5ZxNgIo), [ss2](https://youtu.be/G0O1A7lONSY).)
The following questions will be based on [this paper](https://www.researchgate.net/publication/346089726_One_Shot_Learning_for_Speech_Separation).
:::
:::warning
###### \[Prob. 23\]
#### In section 3.1, how is a meta-task constructed?
:white_circle: Mix two utterances spoken by the same speaker to form mixture speech. Take 9 mixtures and sample a support set and a query set from them.
:black_circle: Mix two utterances spoken by two different speakers to form mixture speech. Take 9 mixtures and sample a support set and a query set from them.
:white_circle: Mix two utterances spoken by the same speaker to form mixture speech. Moreover, also mix two utterances spoken by two different speakers to form mixture speech. Take 9 mixtures of these two kinds and sample a support set and a query set from them.
:white_circle: Mix two utterances spoken by the same speaker to form mixture speech. Sample some mixtures of this kind for the support set. Mix two utterances spoken by two different speakers to form mixture speech. Sample some mixtures of this kind for the query set.
:::
:::info
Speech recognition is an important task and can also be trained with meta-learning.
(For more information on speech recognition, please refer to these videos: [asr1](https://youtu.be/AIKu43goh-8), [asr2](https://youtu.be/BdUeBa6NbXA), [asr3](https://youtu.be/CGuLuBaLIeI), [asr4](https://youtu.be/XWTGY_PNABo), [asr5](https://youtu.be/5SSVra6IJY4), [asr6](https://youtu.be/L519dCHUCog), [asr7](https://youtu.be/dymfkWtVUdo).)
The following questions will be based on [this paper](https://arxiv.org/pdf/1910.12094.pdf). There is also a short presentation for [this paper (Meta-ASR)](https://youtu.be/goav0eXKPwg).
Feel free to take a look.
:::
:::warning
###### \[Prob. 24\]
#### The idea of MAML is to learn initialization parameters from a set of tasks. In the setting of meta learning for low-resource ASR mentioned in section 2.2, which of the following is/are correct? (select 2 choices)
:white_square_button: Different tasks have different language utterances.
:white_large_square: The initialization parameters obtained by MAML should only adapt to one language well. This helps the adapted model to achieve high performance on a specific language.
:white_large_square: In one meta-task, the support set and query set can have different languages of spoken utterances.
:white_square_button: MultiASR optimizes the model on all source languages directly, without considering how learning happens on the unseen language.
:::
:::info
Neural machine translation is a well-known task in the NLP field. Given an input sentence, a model translates the sentence into another language. The following question is based on [this paper (MetaNMT)](https://arxiv.org/abs/1808.08437).
:::
:::warning
###### \[Prob. 25\]
#### Which of the following statements is most likely not true?
:white_circle: The machine translation task in this paper aims to translate sentences of different languages to English.
:black_circle: According to the experiments in section 5 (vs. Multilingual Transfer Learning), during the meta testing phase, a subset of the source tasks is sampled for fine-tuning the model.
:white_circle: According to the experiments in section 5 (Impact of Validation Tasks), the choice of a validation task affects the final performance.
:white_circle: According to the experiments in section 5 (Impact of Source Tasks), the choice of source languages has different implications for different target languages.
:::