# Machine Learning & Deep Learning Questions
szuyul@andrew.cmu.edu
linkedin.com/in/szuyul
## Classic methods
### PCA
Principal component analysis
Eigenvectors of the covariance matrix: the principal directions (unit vectors). Perpendicular to each other (i.e. inner product = 0).
Eigenvalues: the variance of the data along each direction.
Compute the covariance matrix of the centered data --> keep the components with the largest eigenvalues (largest variance), discard the rest.
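A minimal NumPy sketch, assuming `X` is an (n_samples, n_features) matrix and we keep the top `k` components:
```
import numpy as np

def pca(X, k):
    # center the data
    Xc = X - X.mean(axis=0)
    # covariance matrix (features x features)
    cov = np.cov(Xc, rowvar=False)
    # eigendecomposition; eigh since the covariance matrix is symmetric
    eigvals, eigvecs = np.linalg.eigh(cov)
    # sort by eigenvalue (variance) in descending order, keep top k
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]   # principal directions
    return Xc @ components           # projected data
```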
### Regression
Fit a line or polynomial to the training samples, so that the sum of squared errors (distances) is minimized.
Predicts the continuous variable, given a set of independent variables.
$y = a_0 + a_1 x + \epsilon$ where $\epsilon$ is the error term.
### Logistic regression
Predicts a categorical variable. Output is between 0 and 1 (interpreted as a probability).
Fit by maximum likelihood estimation.
$\log \frac{y}{1-y} = b_0 + b_1x_1 + ... + b_nx_n$
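A minimal sketch of the prediction step, assuming already-fitted coefficients `b` with `b[0]` as the intercept:
```
import numpy as np

def predict_proba(x, b):
    # linear combination, then squash through the sigmoid
    z = b[0] + np.dot(b[1:], x)
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(x, b, threshold=0.5):
    return int(predict_proba(x, b) >= threshold)
```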
### Majority vote & weighted majority vote
No trainable weights. Decides the output label based on majority of ground truth labels (of each sample in the sample space).
### Decision Tree
Decide the traversing direction onto children nodes depending on the current sample's attributes. Leaves are predicted labels.
Prevent overfitting: don't expand the tree beyond a certain depth, or stop splitting when the splitting criterion falls below a threshold (pruning).
### Bayesian Models & Markov Property
Bayesian models: Graph structure of a sequence of "events" assigned with probabilities.
Markov assumption / property: Assume the current $\hat{y}(t) = f(x, \hat{y}(t-1))$, i.e. the current state depends only on the previous state.
### RANSAC (Random Sample Consensus)
```
best_err, best_model = 1e10, None
for _ in range(iters):
    # randomly pick the minimal sample needed to fit a model
    # (e.g. k = 3 points to fit a plane)
    maybe_inliers = random.sample(data, k)
    temp_model = fit(maybe_inliers)
    also_inliers = []
    for point in data:
        if point not in maybe_inliers and fits(temp_model, point):
            also_inliers.append(point)
    # refit on all inliers; keep the best model seen so far
    new_model, new_err = fit_with_error(maybe_inliers + also_inliers)
    if new_err < best_err:
        best_err, best_model = new_err, new_model
return best_model
```
### K Nearest Neighbors
Determine the label of input sample based on majority vote result of labels of its k nearest neighbors.
Weighted: for example weight majority votes based on distance to input sample.
### K-means clustering
k: number of centroids in the dataset. Randomly selected for initialization.
For each iteration, each data point is assigned to the closest centroid, and the updated centroid is given by re-calculating the center of current cluster members.
This is an unsupervised learning technique.
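A minimal NumPy sketch, assuming `X` of shape (n_samples, n_features); empty clusters are not handled here:
```
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its members
        centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```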
### Perceptron
Example of an "online learning algorithm": the order of the training data matters.
It does not optimize a global objective; it only adjusts the weights to fit the most recent input.
Prediction:
$\hat{y}^{(i)}=\mathrm{sign}(w^Tx^{(i)}+b)\; \forall i$
$=\mathrm{sign}(b+\sum^N_{n=1}w_nx_n^{(i)})\in \{1, -1\}$
$y^{(i)}\in \{1, -1\}$
```
if y_hat == y:
    pass                  # correct prediction: no update
elif y_hat == 1 and y == -1:
    w -= x; b -= 1        # predicted +1, actually -1
elif y_hat == -1 and y == 1:
    w += x; b += 1        # predicted -1, actually +1
```
### Support Vector Machine
In contrast to the perceptron, it finds the hyperplane which separates the (linearly separable) data with the largest margin $\gamma$.
The closest points (normal distance to hyperplane is the margin) are called the support vectors.
Prediction:
$\hat{y}^{(i)}=w^Tx^{(i)}+b\; \forall i$
$y^{(i)}\in (1, -1)$
Objective function:
$\max\;\gamma\quad s.t.\; y^{(i)}(w^Tx^{(i)} + b)\ge 1$
Use SGD to optimize $w, b$.
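A minimal NumPy sketch of optimizing the soft-margin (hinge-loss) form with SGD; labels are assumed to be in {-1, +1}, and `lam`, `lr`, `epochs` are illustrative hyperparameters:
```
import numpy as np

def svm_sgd(X, y, lam=0.01, lr=0.001, epochs=100):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # hinge loss is active: push the point past the margin
                w -= lr * (lam * w - y[i] * X[i])
                b += lr * y[i]
            else:
                # only the regularizer contributes
                w -= lr * lam * w
    return w, b
```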
---
## Neural Networks
### Layers
* Linear
Linear combination of the elements of the input vector.
Usage: Generate hidden vectors or act as a classifier.
```
Arguments: input_dim (in_c), output_dim (out_c)
#weights: output_dim * input_dim
#biases: output_dim
Forward pass:
// matmul: z = W x + b, with W of shape (out_c, in_c)
vector<float> z(out_c);
for (int i = 0; i < out_c; i++){      // output dimensions
    float temp = 0;
    for (int j = 0; j < in_c; j++){   // input dimensions
        temp += W[i][j] * input[j];
    }
    temp += b[i];
    z[i] = temp;
}
return z;
Backprop gradients (dz = dL/dz from the next layer):
W: dL/dW[i][j] = dz[i] * x[j]          (outer product dz x^T)
b: dL/db[i]    = dz[i]
x: dL/dx[j]    = sum_i W[i][j] * dz[i] (i.e. W^T dz)
```
* Embedding
Similar to linear layer, but designed for discrete inputs (indices).
Extract a feature vector for specified input index i (fetches the ith row).
Linear(F.one_hot(x), bias=False) = Embedding(x)
```
Arguments: n_vocab, embed_dim
#weights: n_vocab * embed_dim
#biases: 0
Forward pass: z = weights[i]
```
* Convolution
Slides a kernel through the spatial domain of the input. For each step, the output (at the corresponding spatial location) is the summed element-wise product of the kernel weights and the receptive field. (True convolution flips the kernel in each direction; deep learning layers actually implement cross-correlation, i.e. no flip.)
"Filters": Kernel for each output channel.
Each kernel is applied to all input channels, and we sum up all kernel products of each input channel to get the output (for that output channel). Finally add bias.
Kernels are small, so it is relatively lightweight compared to dense layers.
The "receptive field" gives convolution the capability of extracting features in a neighborhood in spatial domain.
(e.g. MNIST: Flatten -> linear approach vs convolution approach).
```
Arguments: in_c, out_c, (kx, ky), stride, padding_size
#weights: in_c * out_c * kx * ky
#biases : out_c
output size: (input_size - k + 2*padding) // stride + 1
Forward pass (1-d, 1 channel):
int in_size = input.size();
int out_size = (in_size - k + 2 * p) / s + 1;
vector<float> z(out_size);
for (int t = 0; t < out_size; t++){
    for (int tt = 0; tt < k; tt++){
        int idx = t * s + tt - p;                 // shift by padding
        if (idx < 0 || idx >= in_size) continue;  // zero padding
        z[t] += W[tt] * input[idx];
    }
    z[t] += b[0];
}
return z;
```
* Depth-wise convolution and depthwise separable convolution
Depth-wise conv: Each filter is applied to exactly one input channel (a spatial convolution per channel, with no mixing across channels). Intuition: extract features from each feature map independently.
Depthwise separable convolution: A depth-wise convolution followed by a 1x1 (pointwise) convolution that mixes the channels -- much cheaper than a standard convolution.
* Maxpool and Avgpool
1D maxpool: Leetcode 239 (efficient way).
Returns the max / average element in the receptive field (see the 1-D sketch below).
Intuition: Reduce spatial resolution while keeping the most dominant feature (max) or the average response (avg).
```
Arguments: kernel_size, stride, padding
output_size = (input_size - kernel_size + 2*padding) // stride + 1
```
* 1x1 pool vs 1x1 convolution
* 1x1 pool (kernel_size = input_size, i.e. global pooling -- the spatial output is 1x1): Takes the average/max value of the entire feature map. Depth is preserved.
* 1x1 convolution (kernel_size = 1x1): Only changes the depth of the input tensor, does not affect the spatial resolution.
Each output feature map is a linear combination of all input channels.
* BatchNorm
Normalizes the inputs of a layer (per channel, using batch statistics). Stabilizes and speeds up network training. Usually placed right before the activation (ReLU / Sigmoid / Tanh).
Has trainable scale and shift parameters (gamma, beta), plus running mean and variance buffers that are updated (not trained) during training.
Uses batch statistics during training; switches to the running statistics during validation / inference.
* Dropout
For a given probability (dropout rate, 0 = no dropout, 1 = drop everything), randomly deactivate nodes during training. A form of regularization (each forward pass trains a random sub-network).
Prevents overfitting so that the model generalizes better.
Like batchnorm, dropout is turned off during validation / inference (with inverted dropout, no rescaling is needed at inference).

```
# inverted dropout: drop with probability p, scale survivors by 1/(1-p)
if random.random() < p:
    x = 0
else:
    x = x / (1 - p)
```
* RNN
* LSTM
---
### Activations
* ReLU
Most commonly used activation function. Creates sparsity from dense features and also delays saturation (it only saturates in one direction).
Simple to compute & accelerates training.
Advantage over Sigmoid & Tanh: the latter two saturate at both ends of their limited output range, which results in vanishing gradients.
```
x *= (x > 0)
```
* LeakyReLU
Instead of setting negative terms to 0, we give it a slope (scalar) to allow it to have small negative values.
Addresses the problem when by chance all neurons have negative outputs, the network is deactivated (dying ReLU problem).
```
if x > 0:
    x = x
else:
    x = slope * x
```
* Sigmoid
S-shape activation function to scale all inputs to between 0 and 1.
Used to predict a probability between 0 and 1.
$\sigma(x) = \frac{1}{1+e^{-x}}$
* Tanh (Hyperbolic Tangent)
In contrast to sigmoid, tanh scales the inputs to be between -1 and 1.
Mostly used in recurrent models.
$Tanh(x) = \frac{e^{2x}-1}{e^{2x}+1}$
* Softmax
For an input vector, it scales all elements to be between 0 and 1, and also sum up to 1.
Used when calculating a probability distribution within a vector, e.g. classifier.
$x \in R^N, \;softmax(x_i) = \frac{e^{x_i}}{\sum^N_{j=1}e^{x_j}} \forall i$
* bmm (batch matrix-matrix product)
`input 1`$(b \times n \times m)$, `input 2`$(b \times m \times p)$, `output` $(b \times n \times p)$
---
### Loss
* L1 and L2 regularization
* Problem type: General
* Sum of L1 or L2 norm of all magnitudes of weights & biases in the model. Add to final loss value to penalize large weights, in order to prevent overfitting.
```
l1, l2 = 0.0, 0.0
for param in model.parameters():
    l1 = l1 + param.abs().sum()    # use param, not param.data,
    l2 = l2 + (param ** 2).sum()   # so the penalty is backpropagated
# add lambda1 * l1 and/or lambda2 * l2 to the loss before backward()
```
* MSE Loss
* Problem type: Regression
* The mean squared loss between target and prediction.
* Hinge Loss
* Cross Entropy Loss (CE, XE)
* Problem type: Multi-class classification (a single label per sample)
* Combination of Softmax and NLL (Negative Log Likelihood) Loss.
* $loss (x, class_i)=-\log\frac{\exp(x[class_i])}{\sum_j \exp(x[j])}$
$=-x[class_i]+\log(\sum_j \exp(x[j]))$
i.e. maximize the log likelihood of the ground truth class (usage sketch below).
* where the target $y_i=1 \;if\; i=class_i$, else 0 (one-hot vector)
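A minimal PyTorch usage sketch (tensor values are illustrative); `nn.CrossEntropyLoss` takes raw logits and class indices and applies the softmax internally:
```
import torch
import torch.nn as nn

logits = torch.randn(4, 10)            # (batch_size, num_classes), raw scores
targets = torch.tensor([1, 0, 9, 3])   # (batch_size,), ground truth class indices
loss = nn.CrossEntropyLoss()(logits, targets)
```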
* Binary Cross Entropy Loss (BCE)
* Problem type: Binary / multi-label classification
* There can be multiple labels present (1) in a training sample (i.e. multihot vector).
* $l_n=-(y_n\log(\sigma(x_n))+(1-y_n)\log(1-\sigma(x_n))) \forall \; n\in\{0, 1, ..., n\_classes-1\}$
i.e. maximize the log likelihood for the correct class prediction (presence, 0 or 1).
* Siamese Loss
* Problem type: Determine if two instances are the same
* This is a form of "metric learning".
* A pair of samples (e.g. photos) are passed into the feature extraction network and produce a pair of feature vectors, respectively. Based on the distance we determine if they are a positive pair (e.g. same identity/person) or a negative pair.
* Weakness: This only minimizes the intraclass distance (between instances of the same class / identity) but does not explicitly control the interclass (different classes) distance.
* Triplet Loss
* Problem type: Determine if two instances are the same
* Designed to tackle the problem of siamese loss, which is to minimize intraclass distance & maximize interclass distance.
* Therefore, during training, the network takes in a triplet (anchor, positive, negative), i.e. 1 positive pair (same person) and 1 negative pair sharing the anchor.
* Prerequisites: Sampling technique. The nature of the data will result in a highly biased dataset:
For example, when we have N classes there will be approximately N positive pairs and (N-1)N/2 negative pairs. If we train on all these, the neg/pos ratio grows linearly with sample count N.
Therefore we have to sample negative pairs (which are dominant in number) so that the neg/pos ratio is around an acceptable value, otherwise training will be very hard.
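A minimal sketch of the triplet margin loss, assuming 2-D tensors `a`, `p`, `n` of shape (batch, embed_dim) for the anchor, positive and negative embeddings, with an L2 distance:
```
import torch
import torch.nn.functional as F

def triplet_loss(a, p, n, margin=1.0):
    # pull the positive closer than the negative by at least `margin`
    d_ap = F.pairwise_distance(a, p)
    d_an = F.pairwise_distance(a, n)
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```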
* CTC Loss and Levenshtein Distance
* Problem type: Compute loss between two sequences.
* Levenshtein distance is the validation counterpart -- the edit distance (insertions, deletions, substitutions) between a predicted and a ground truth sequence.
---
### Backprop and chain rule
* Linear (as an example)
---
### Optimizing
* SGD
* "Ministeps" through loss gradients of all samples.
* $x^{(i)}\in R^j$
$\nabla J(\theta)=(\hat{y}^{(i)}-y^{(i)})x_j^{(i)}$
$\theta =\theta - \alpha\nabla J(\theta)$
* Convergence is not guaranteed for every convex function (see the note below)
```
I. Optimize f(x) = x^2
f'(x) = 2x, f''(x) = 2
Gradient descent x <- x - a*f'(x) converges iff a < 2 / f''(x) = 1,
so the curvature bounds the usable learning rate.
II. Optimize g(x) = x^4
g'(x) = 4x^3, g''(x) = 12x^2
The curvature is unbounded, so no single learning rate guarantees
convergence from every starting point, even though g(x) is convex.
The generalized 2nd derivative (multivariate case) is the Hessian.
```
* Batch GD
Disadvantage: the per-sample loss gradients might get averaged out, since we are computing the batch loss (and we may eventually get stuck in a local minimum).
* Momentum
* Advantage: Memorizes and adds a portion of the gradient from the last step, therefore even if we get stuck in a local minimum, we can still make a move.
* Tackles noisy gradients (since we are using minibatches) and speeds up training.
* $V = \eta V-\alpha\nabla J(\theta)$
$\theta =\theta + V$
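A minimal NumPy sketch of the update above, with `theta`, `grad`, `velocity` as arrays of the same shape (`lr`, `eta` are illustrative values):
```
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.01, eta=0.9):
    # keep a fraction of the previous step, add the new gradient step
    velocity = eta * velocity - lr * grad
    theta = theta + velocity
    return theta, velocity
```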
* Adam
* Learning rate adjustment
---
### Common problems
* Vanishing gradient
* Behavior: Observe / visualize the gradients of first layers.
* Cause: When network is too deep, gradients become very small through N backward passes, so training becomes very ineffective.
* Remedy: Add skip connections (residual blocks / ResNets), use activations that only saturate in one direction (ReLU etc.), reduce model depth, replace vanilla RNNs with LSTMs.
* Exploding gradient
* Behavior: Gradients increase exponentially. Performance is poor.
* Cause: There might be big values in inputs, which generate large outputs -> large backprop gradients -> large changes in weights.
* Remedy: Reduce learning rate, reduce model size, normalize inputs, add regularization terms, clip gradients and weights, use careful weight initialization (though often the best fix is to change the architecture).
* Overfitting
* Behavior: Training error keeps decreasing but validation error does not.
* Cause: The model fits to the training set too much and loses the ability of generalizing with unseen data.
* Remedy: Decrease network size, add dropout ("deactivate nodes"), add augmentation, add regularization terms, early stopping (when validation accuracy saturates).
* Underfitting
* Behavior: Both training and validation performances are poor.
* Cause: Model is not trained enough, or architecture is incapable of capturing the information.
* Remedy: Increase model architecture size, train for more epochs.
* Biased dataset
* Behavior: Model easily "saturates" and stops improving. Inference results are not quite what we expect.
* Cause: If using a simple loss function, e.g. cross entropy loss, the predictions stick to 0 because it gets a "fair" accuracy (if half the labels are 0, then predicting all zeros get a 50% accuracy), and the backprop loss hardly trains the model weights to predict anything rather than 0.
* Remedy: Redesign loss function (e.g. weighted loss, or add additional loss function), use sampling techniques on dataset (e.g. triplet loss samples a positive pair and a negative pair for each training instance, though the latter dominates in number.)
* Lack of data
* Behavior: Model overfits easily (assuming we still do a train-val split) and generalizes extremely poorly on unseen samples. Performance may differ a lot between runs with identical settings.
* Cause: Not enough data to train the model.
* Remedy: Decrease model size / simplify the architecture, add more data, apply data augmentation.
* Data normalization
Equally weigh all features, by `x = (x - mean) / std`, or rescale to a certain range.
Stabilizes training & prevents specific features from being weighed more heavily in the loss function.
* Setting all weights to 0
All neurons will follow the same gradients (or no gradients at all) and behave identically, so it's better to have them start at different values.
* Embedding - Linear Equivalence
Embedding: Selects an entire row of its weight matrix at the desired index.
Linear on one-hot: Only the desired index is 1 while the others are 0, so this operation selects one column of the (out_c x in_c) weight matrix. Bias must be false here.
```
# in_c : total classes
# out_c: size of desired feature vector
x = torch.LongTensor(np.random.randint(...))        # class indices in [0, in_c)
x_one_hot = F.one_hot(x, num_classes=in_c).float()
linear = nn.Linear(in_c, out_c, bias=False)
embedding = nn.Embedding(num_embeddings=in_c, embedding_dim=out_c)
embedding.weight.data.copy_(linear.weight.data.T)   # tie the weights
# linear(x_one_hot) == embedding(x)
```
---
### Other questions
* Relation between learning rate and batch size
Some optimizers are invariant of batch size, e.g. Minibatch SGD
Depends on loss reduction (mean, sum, etc.)
* Batch size too large: Generalizes poorly (since the loss is a single scalar summing/averaging the losses of each sample, the effect of each sample might get "averaged out")
* Batch size too small: Inefficient to do a backward pass once per sample. Sensitive to extreme samples.
* Learning rate too large: Loss drops fast at the beginning but saturates early too (the step is too big to close a small loss "gap").
* Learning rate too small: Training is slow and easily converges to a local optimum.
* K-Fold cross validation
For evaluating models when the dataset is very small, or how model generalizes between different subsets.
The idea is to train and evaluate the model multiple "folds" on different train/val splits of the dataset, to observe the average performance of the model architecture (since it's easy to have biases when data size is small, i.e. picking a "hard" or "easy" subset which affects the score a lot).
The dataset is split into K "folds", i.e. subsets of identical size.
In fold 1, subset 1 is used as the validation set while the remaining is used for training.
For fold 2, the model's parameters are reset, and the same process is repeated on subset 2, and so on.
Finally, the stats of all subsets will be collected and viewed as the final validation performance (ability of generalizing) of the model "architecture".
* Xavier initialization
The idea is to initialize the weights such that the variance of the activations are the same across all layers, in order to prevent gradient from exploding or vanishing.
Biases are usually initialized to zero (still trainable, just starting at zero).
* Log-Sum-Exp trick
Objective: Numerical stability when computing $\log\sum_i e^{x_i}$ or softmax -- subtract the max before exponentiating.
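A minimal NumPy sketch of the trick:
```
import numpy as np

def logsumexp(x):
    # exp overflows for large x; shift by the max first:
    # log(sum(exp(x))) = m + log(sum(exp(x - m)))
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def stable_softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()
```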
* Multimodal machine learning
### Common practices
* Pretraining
Train an image backbone on a large image dataset, so that we have a deep network generalized to images, and the extracted feature (maps) could be applied to further tasks (classification, detection, etc.)
* Caching
When using a pretrained backbone, it is much more efficient to store the extracted feature vectors on disk, so that we don't have to run every sample through all the deep layers at every iteration.
A tradeoff between storage and speed.
* Knowledge distillation
---
## Metrics
### Binary Classification (T / F)
* Precision: `TP / (TP + FP)`, i.e. fraction of true positives of all predicted positives.
* Recall: `TP / (TP + FN)`, i.e. fraction of true positives found, of all positives within data.
* F1: Harmonic mean of precision and recall `2 * (P * R) / (P + R)`. Ignores true negatives; balances false positives and false negatives (see the sketch after this list).
* ROC Curve: Plot true positive rate (y-axis, i.e. recall) against false positive rate (x-axis, `FP / (FP + TN)`).
* No skill: Diagonal line.
* Perfect classifier: Passes through the top-left corner (TPR = 1, FPR = 0).
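A minimal sketch computing these from raw counts, assuming binary 0/1 labels in `y_true` and `y_pred`:
```
def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```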
### Multi-class Classification
* Classification accuracy (for models trained with cross entropy loss)
```
y_hat = model(x)               # (batch_size, num_classes) logits
y = labels                     # (batch_size,)
preds = argmax(y_hat, dim=1)   # argmax along the class dimension
correct += (preds == y).sum()
```
### Regression
* L1 or L2 norm.
### Detection

* IoU (Intersection over Union): Fraction of the intersection area over the union area of 2 bounding boxes (i.e. prediction and ground truth); see the IoU sketch after this list.
* Precision and recall in the detection case: A detection is counted as a true positive if its IoU with a ground-truth box is above a certain threshold (and the class matches), otherwise it is a false positive.
* Precision-recall curve: Plot precision against recall as the confidence threshold is varied (a separate curve can be generated for each IoU threshold).
* AP (Average precision): The area below the P-R curve.
* mAP: Take the mean AP scores across all classes.
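A minimal IoU sketch, assuming axis-aligned boxes given as `(x1, y1, x2, y2)`:
```
def iou(box_a, box_b):
    # intersection rectangle
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```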
### Tracking
* ID switches: Tracklet ids switches after certain occasion, e.g. occlusion.
* Broken tracklets: Tracklet identified as another one after certain occasion.
* False tracklets: False positives, i.e. nonexistent objects.
* Missed objects: False negatives.
* Lost tracks: Losing track of object in scene.
* Localization error: Either a state estimation or detection problem.
---
## Deep Neural Network Models (Vision)
### ResNets
The name resnet comes from "residual".
Residual connections are designed to tackle the vanishing gradient problem, by adding "skip connections" between layers & providing another path for backpropagation through the depth.
A layer block with skip connections is called a "residual block".
---
## Deep Neural Network Models (Object Detection)
### RCNN Series

* RCNN
* Apply searching algorithm to generate "Proposals".
* Run a CNN over each proposal.
* 2 branches for the output -> bbox regression (localization) and SVM (one for each class).
* Fast RCNN
* Apply searching algorithm to generate proposals / RoIs.
* Run the CNN over the entire image only once. Output is a feature map.
* Get the RoI projections on output feature map.
* RoI pooling layer: resize into uniform size -> FC layer.
* Classifier & bbox regressor.
* Faster RCNN
* Selective search is slow & no learning at all -> automate it with neural networks, which also has the benefit of learning.
* RPN (Region Proposal Network): Takes in the feature map & predict region proposals. Sliding window with anchors.
* RoI pooling to reshape the RoIs and feed into FC layers.
* Anchors: 3 different scales and 3 aspect ratios (1:1, 2:1, 1:2)
* Losses: RPN classification (binary), RPN localization, Fast RCNN classification, Fast RCNN localization.
### SPP Net (Spatial Pyramid Pooling)

* Proposal projections of Fast RCNN
* Apply SPP layer on each proposal
* Each proposal is pooled to have (1, 4, 16) values respectively (grid). They are flattened, concatenated and passed into FC layers.
* Input size / scale invariant
* FC, regression and classification
### YOLO
Does detection and classification in one shot.
End-to-end trainable and does not require additional proposal methods (e.g. RoI).
* The input image is divided into an `S * S` grid (in the original case `S=7`, because the backbone output is a set of feature maps of shape `7 * 7`). Each grid cell is "responsible" for detecting an object if the object's center falls in that cell.
* Each grid cell produces `B` bounding box proposals, each represented with a 5-dimensional vector `[x, y, w, h, conf]`. `conf` is the objectness score / confidence.
* Each grid cell also predicts a set of class probabilities for all `C` classes.
* In the original case, by selecting `B=2, C=20` we have the output tensor's dimensions:
`S * S * (5B + C) = 7 * 7 * 30`.
* The output is then decoded and rendered into bounding boxes. Finally a non-maximum suppression (NMS) is applied to remove redundant boxes.
The network consists of 2 parts:
* Backbone: DarkNet, pretrained on large image dataset.
* YOLO layer: A linear classifier. It is actually a linear layer, which takes in the flattened feature maps and produces the output tensor (reshaped). In YOLOv3 there are 3 YOLO layers, for detecting objects at different resolutions (hence performs better on small objects).
Box decoding:
For the output tensor from each grid cell, the box outputs are normalized to between 0 and 1 -- relative to that cell's coordinates -- so we have to rescale them to global image coordinates.
NMS: considers the IoU score & objectness confidence of each box and removes redundant ones (see the sketch after this list).
* Filter out the boxes with objectness scores below certain threshold.
* If two bounding boxes have the same class prediction & an IoU score above certain threshold, remove the one with lower confidence score.
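A minimal NMS sketch for boxes of a single class, assuming each box is `(x1, y1, x2, y2, score)` and reusing the `iou` helper sketched under Metrics; the thresholds are illustrative:
```
def nms(boxes, iou_threshold=0.5, score_threshold=0.25):
    # drop low-confidence boxes, then greedily keep the highest-scoring ones
    boxes = [b for b in boxes if b[4] >= score_threshold]
    boxes.sort(key=lambda b: b[4], reverse=True)
    kept = []
    for box in boxes:
        # keep this box only if it does not overlap too much with a kept box
        if all(iou(box[:4], k[:4]) < iou_threshold for k in kept):
            kept.append(box)
    return kept
```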
Loss functions:
* Localization loss: MSE between predicted and ground-truth box parameters, computed separately for `x, y` and `w, h`.
* Objectness loss: Penalizes difference between predicted and actual objectness score (1) and non-objectness score (0), for each grid box responsible for each object.
* Classification loss: Cross entropy loss.
---
## Deep Neural Network Models (Segmentation)
### Comparison
* Semantic segmentation: Person, Person, Person, Person
* Instance segmentation: Person0, Person1, Person2, Person3
* Panoptic segmentation: Person0, Person1, Person2, Person3, Ceiling, Floor, Wall
### Mask RCNN
* Backbone: FPN (outperforms single CNNs because it is more scale robust). A region proposal network on top outputs proposed regions.
* ROI Align layer: Replacement for ROI pooling.
* MRCNN: Generates pixel-wise mappings of labels (and bounding boxes). Similar to Fast RCNN but with the third output branch which generates segmentation masks.
---
## Deep Neural Network Models (GANs)
* Components
* Generator: Generates predictions to "confuse" the discriminator.
* Discriminator: Tries to distinguish between ground truth and synthetic images.
* Form an "hourglass" form (since the output is often upsampled to the original scale).
* Autoencoders: To learn a compressed form / representation of the input.
---
## Deep Neural Network Models (Pose Estimation)
### Convolutional Pose Machines
### OpenPose
---
## Deep Neural Network Models (NLP)
### Seq2seq
Transforms one sequence into another; both can be of arbitrary length.
* Encoder: Input sequence -> context vector.
* Decoder: Context vector -> transformed output.
### Self-Attention

Relates different positions of a sequence to compute a representation of that same sequence.
We have 3 sets of projection weights: query, key and value.
To compute the output attention score, we do the following to the input (`@` here is the bmm operator):
* Pass the input through the `key_projection` module, to obtain the keys `K`.
* Pass the input through the `value_projection` module, to obtain the values `V`.
* Output from LSTM / decoder can be used as the query `Q`.
* Energy: `E = Q @ K^T`.
* Mask: set the energies at padded positions to `-1e9`, so they vanish after the softmax.
* Attention score: `A = softmax(E)` (after masking; usually scaled by `1/sqrt(d_k)`).
* Output context: `y = A @ V`.
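A minimal single-head sketch in PyTorch, assuming `q` of shape `(b, m, d)`, `k` and `v` of shape `(b, n, d)`, and an optional boolean `pad_mask` of shape `(b, n)` that is True at padded positions:
```
import math
import torch

def scaled_dot_product_attention(q, k, v, pad_mask=None):
    d = q.size(-1)
    energy = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(d)   # (b, m, n)
    if pad_mask is not None:
        # mask out padded key positions before the softmax
        energy = energy.masked_fill(pad_mask.unsqueeze(1), -1e9)
    attn = torch.softmax(energy, dim=-1)                      # (b, m, n)
    return torch.bmm(attn, v)                                 # (b, m, d)
```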
### Multihead Attention
The idea is to divide the features into N groups (by depth), and learn individual sets of `{Q1, K1, V1}, ..., {Qn, Kn, Vn}` weights, in order to capture context of different feature maps.
The per-head contexts are either summed or (more commonly) concatenated and projected back to the model dimension.
### Transformer

"Attention is all you need"
Only built on self-attention mechanism without the use of recurrent units.
$Attention (Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
$Q \in R^{m \times d_k};\; K \in R^{n \times d_k};\; V \in R^{n \times d_v}$
i.e. attended context is the weighted sum of the `values`, while weights assigned to each `value` is determined by the dot product of the `query` with all `keys`.
* Encoder
Encoder generates attention-based representation of the input.
Stack of 6 identical layers, each has a multi-head self-attention layer & position-wise linear layer.
Skip connections for multi-head attention and linear layers.
* Decoder
Retrieves information from encoded representation.
Stack of 6 identical layers, each with 2 multi-head attention layers (self-attention over the decoder input, then encoder-decoder attention) & 1 position-wise linear layer. All with skip connections.
1st attention is "masked" to prevent attending to subsequent positions (don't want to look into future of target sequence when predicting current position).
### BERT

BERT = Bidirectional Encoder Representations from Transformers.
Built on top of the transformer encoder.
15% of the tokens in each sequence are selected (most of them replaced with the `[MASK]` token), and the model attempts to predict the original values of the masked words (given context on both sides) -> learns word context based on its surrounding words.
To predict the output words:
* Linear classifier on top of the encoder output.
* Multiply output vectors by the embedding matrix -> transform into vocab dimension.
* Calculate probability of each vocab word with softmax.
---
## Graph Neural Network Models
Takes a graph as the network input. (PyTorch-Geometric)
Graph components:
* Nodes: Initialized with some feature vector.
* Edges: Connections between nodes. May have individual weights or not.
* "Neighborhood": A target node, its connecting edges and neighboring nodes.
Output (general case): Predict the label for each node.
### General GNN (Graph Convolution)
For each iteration, the feature vector of each target node is updated with the following process (the nodes will eventually encode neighborhood information through iterations):
* Self features: Pass the feature vector through a linear set of weights.
* Edge features: Pass each (edge feature @ connecting node feature) through another set of linear weights, where @ is some combination operator (e.g. element-wise product).
* Aggregator: We usually use an "aggregator" such as sum() or max(), so that the result has a fixed size regardless of the neighborhood size.
$x_i' = W_nx_i+W_e\sum_{j \in N(i)}e_{ij}x_j$
Finally, when we finish the update and aggregation through many iterations, we have a feature vector for each node. We do the exact same thing as general classification tasks: linear classifier and softmax.
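A minimal NumPy sketch of one update following the formula above, assuming node features `X` of shape (n_nodes, d_in), an adjacency list `neighbors`, scalar edge weights `E[i][j]`, and weight matrices `W_n`, `W_e` of shape (d_out, d_in):
```
import numpy as np

def graph_conv_step(X, neighbors, E, W_n, W_e):
    X_new = np.zeros((X.shape[0], W_n.shape[0]))
    for i in range(X.shape[0]):
        # self features
        out = W_n @ X[i]
        # aggregate (sum) the edge-weighted neighbor features
        for j in neighbors[i]:
            out += W_e @ (E[i][j] * X[j])
        X_new[i] = out
    return X_new
```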
### GraphSage
GraphSage discards individual weights for each edge, but learns a set of generalized edge weights (which will be multiplied with the neighbor node weights) instead.
Greatly reduces memory usage and training time, but only applicable to cases in which we only focus on nodes and edge weights are not important.
$x_i' = W_nx_i+W_e\sum_{j \in N(i)}x_j$
### Graph Attention Networks
Adds an additional set of learnable attention features to the graph operator. Multihead is also supported here.
* $\alpha (i, i)$: attention weight of node $i$ itself.
* $\alpha (i, j)$: attention weight of "edge" connecting nodes $i, j$. No actual edge weights but represents this with the attention weight (how strong is the connection between nodes $i, j$ ?).
$x_i' = \alpha_{ii}W_nx_i+W_e\sum_{j \in N(i)}\alpha_{ij}x_j$