# Meta-Learning (lilog)

Model-based approach: the model-based approach aims to design an internal architecture specifically for fast learning. Unlike the metric-based approach, it does not model the classifier directly and thus makes no assumption about p(y|x).

- memory-augmented neural networks (e.g. Neural Turing Machines, Memory Networks, MANN)
- meta networks

### Meta Network

- uses neural networks to generate network weights (fast weights).
- loss gradients (meta information) are used to populate models that learn fast weights.
- **MANN** stands for Memory-Augmented Neural Network. Note that its external memory is different from the *internal memory* of a vanilla RNN or LSTM. It leverages explicit memory storage and retrieval to achieve **fast learning**.
- the idea is to encode new information quickly and thus adapt to new tasks after having seen only a few samples.
- consists of a **base learner** and a **meta learner**. The base learner learns the actual task, whereas the meta learner learns the embedding of a pair of inputs, like the verification task in a Siamese Network (representation learning).
- the output consists of two parts, from the meta learner and the base learner: the "key memory" and the "value memory," respectively.
- key memory and value memory are the task-level fast weights and the example-level fast weights, respectively.

### MANN

- The **NTM** has a controller that learns to read and write memory rows via soft attention, while the memory serves as a knowledge repository.
- The attention mechanisms include (1) content-based and (2) location-based. They form the read and write operations in MANN and may be called the **read attention** and the **addressing attention**, respectively.
- Reading from memory uses content-based attention, which computes the similarity between the query vector and each memory row and normalizes it with a softmax.
- Writing to memory follows a **cache replacement policy** (LRUA, Least Recently Used Access, in MANN). It prefers to write new content to either the **least used memory location** or the **most recently used memory location**.
- LRUA can be carried out as follows (see the sketch after this list):
  (1) usage weight: a decayed version of the previous usage weight plus the current read and write weights.
  (2) read weight: softmax over the cosine similarity between the feature (key) vector and each memory row.
  (3) write weight: an interpolation between the previous read weight and the previous least-used weight.
  (4) least-used weight: a binary vector that is 1 if the usage weight is not greater than the n-th smallest usage weight in the vector, and 0 otherwise.
- use (2) and (3) to compute (1), and then use (1) to compute (4).
- every memory row in the memory module (previous timestep) is updated with the write weights (new timestep).
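
Below is a minimal NumPy sketch of one LRUA read/write step as outlined in the list above. The function and variable names, the decay rate `gamma`, the gate `alpha`, and `n_reads` are illustrative assumptions, not from the original notes; in the MANN paper the interpolation gate is a learned sigmoid, whereas here it is fixed for simplicity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cosine_similarity(key, memory):
    # Similarity between one key vector (d,) and every memory row (N, d).
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    return memory @ key / norms

def lrua_step(memory, key, prev_read_w, prev_usage_w, prev_lu_w,
              gamma=0.95, alpha=0.5, n_reads=1):
    """One LRUA read/write step (sketch).

    memory:       (N, d) memory matrix from the previous timestep
    key:          (d,)   feature vector produced by the controller
    prev_read_w:  (N,)   read weights from the previous timestep
    prev_usage_w: (N,)   usage weights from the previous timestep
    prev_lu_w:    (N,)   least-used weights from the previous timestep
    """
    # (2) read weights: softmax over cosine similarity (content-based attention)
    read_w = softmax(cosine_similarity(key, memory))

    # (3) write weights: interpolation between the previous read weights and
    #     the previous least-used weights (alpha is a fixed stand-in for a
    #     learned sigmoid gate)
    write_w = alpha * prev_read_w + (1.0 - alpha) * prev_lu_w

    # (1) usage weights: decayed previous usage plus current read and write weights
    usage_w = gamma * prev_usage_w + read_w + write_w

    # (4) least-used weights: 1 where usage is at most the n-th smallest value
    nth_smallest = np.sort(usage_w)[n_reads - 1]
    lu_w = (usage_w <= nth_smallest).astype(float)

    # memory update: each row receives the key in proportion to its write weight
    new_memory = memory + np.outer(write_w, key)

    # read vector returned to the controller
    read_vec = read_w @ new_memory
    return new_memory, read_vec, read_w, usage_w, lu_w
```

In the original MANN formulation the least-used memory row is also erased (zeroed) before the write is applied; that step is omitted here to keep the sketch short.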