
Week 8 Notes

Transformer Networks 🤖

Purpose of encoder: compress the low-entropy input into a higher-entropy, lower-dimensional representation (the encoding)

Purpose of decoder: make sense of that representation in order to produce the output

Positional encoding:

  • Using sin & cos:
    • so that the model can extrapolate to sequence lengths longer than those seen during training (see the sketch after this list)
    • “We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions,” (from the Attention paper)
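
A minimal sketch of the sinusoidal encoding described in the paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name and array shapes below are my own choices, not the lecture's code.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model / 2)
    angles = pos / np.power(10000, (2 * i) / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

# The encoding is simply added to the token embeddings.
print(positional_encoding(max_len=50, d_model=512).shape)  # (50, 512)
```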

Encoder-Decoder Attention:

  • The decoder hidden states are used as the queries, while the encoder outputs provide the keys and values

Multi-Head attention:

  • Self-attention: the queries, keys and values (Q, K, V) are all derived from the same sequence
  • Each sub-layer is wrapped in a residual connection and normalisation between its input and its output, so the sub-layer must retain the dimensionality of its input
  • Multi-headed-ness: every head performs the same attention computation but with its own weights, which lets each head attend to different parts/representations of the input (see the sketch after this list)
  • Why self-attention? It is how each position encodes the history up to that point - similar to the hidden states in an RNN
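
A rough numpy sketch of scaled dot-product self-attention with multiple heads; the weight initialisation and function names are illustrative only, not the lecture's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # how much each query attends to each key
    return softmax(scores, axis=-1) @ V   # weighted sum of the values

def multi_head_self_attention(X, h, rng):
    """Self-attention: Q, K and V all come from the same sequence X of shape (n, d_model)."""
    n, d_model = X.shape
    d_head = d_model // h                 # keep h * d_head == d_model
    heads = []
    for _ in range(h):
        # every head does the same computation with its own (here: random) weights
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.normal(size=(h * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ Wo   # project back to (n, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))                     # 5 tokens, d_model = 64
print(multi_head_self_attention(X, h=8, rng=rng).shape)  # (5, 64)
```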

Why residual connections?

  • The result of multi-head attention is (h * d_h)-dimensional after concatenation, so d_h is normally d_model / h so that the addition with the d_model-dimensional input is possible
  • The residual connection then “puts the result back” into the original sequence by adding the sub-layer output to its input
  • LayerNorm is applied afterwards to keep the values from exploding (see the sketch after this list)
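
A sketch of the “Add & Norm” wrapper the bullets above describe; the learned scale/shift parameters of LayerNorm are omitted, and the function names are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalise each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalisation.
    Only works if sublayer_out has the same shape as x (hence d_h = d_model / h)."""
    return layer_norm(x + sublayer_out)

x = np.random.default_rng(0).normal(size=(5, 64))
print(add_and_norm(x, 0.1 * x).shape)  # (5, 64)
```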

Feed-Forward Network

  • Needed to “combine” the different inputs at each position - the non-linearity allows more complex interactions between the attention vectors (see the sketch below)
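
A sketch of the position-wise feed-forward network, FFN(x) = max(0, x W1 + b1) W2 + b2; the weights here are random placeholders and d_ff = 4 * d_model follows the paper's default.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # the same two-layer MLP is applied independently at every position
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU(x W1 + b1)
    return hidden @ W2 + b2                 # project back to d_model dimensions

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256
x = rng.normal(size=(5, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (5, 64)
```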

Decoder Layers:

  • Differs from the encoder only in its use of masking, which restricts attention to scores from the previous words in the sequence. The mask only really matters during training, when the whole target sequence is processed in parallel (see the mask sketch after this list).
  • Masking is only applied to self-attention sub-layer
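
A small sketch of the causal (look-ahead) mask used in the decoder's self-attention: entries set to -inf are zeroed out by the softmax, so position i can only attend to positions ≤ i. The helper name is mine.

```python
import numpy as np

def causal_mask(n):
    """Mask with -inf above the diagonal: position i may only attend to positions <= i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

# The mask is added to the attention scores before the softmax:
#   softmax(Q @ K.T / sqrt(d_k) + causal_mask(n)) @ V
print(causal_mask(4))
```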

CNNs

Why CNN?

  • Faster than RNNs, since the convolutions at different positions can be computed in parallel
  • CNNs are good at extracting local signals/features

Convolution

  • For NLP: using 1-d convolution
  • Image recognition: 2-d convolution

Architecture in the 2014 paper: n words × k-dimensional vectors -> convolutional filter(s) -> max-over-time pooling

Generating a feature map

  • Sentence with n words
  • window of length h
  • Result is a feature map c ∈ ℝ^(n−h+1) (see the sketch after this list)
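
A sketch of how a single filter produces the feature map, using the n and h notation above; the tanh non-linearity and random weights are placeholder assumptions.

```python
import numpy as np

def feature_map(X, w, b):
    """X: (n, k) sentence matrix, w: (h, k) filter -> feature map c in R^(n-h+1)."""
    n, _ = X.shape
    h = w.shape[0]
    c = np.empty(n - h + 1)
    for i in range(n - h + 1):
        window = X[i:i + h]                     # h consecutive word vectors
        c[i] = np.tanh(np.sum(window * w) + b)  # one feature per window position
    return c

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 50))          # n = 7 words, k = 50-dim embeddings
w, b = rng.normal(size=(3, 50)), 0.0  # window length h = 3
print(feature_map(X, w, b).shape)     # (5,) = n - h + 1
```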

Why do we take the maximum value (in pooling)?

  • Reduce dimensionality
  • Helps to capture the strongest signal (shown in the short sketch below)
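
Continuing the feature-map example: max-over-time pooling keeps only the largest value of each feature map, giving one number per filter regardless of sentence length.

```python
import numpy as np

c = np.array([0.2, -0.7, 0.9, 0.1, -0.3])  # example feature map with n - h + 1 values
c_hat = c.max()                            # keep only the strongest signal
print(c_hat)                               # 0.9
```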

Multi-channel approach

  • Used to prevent overfitting
  • How to use: keep two copies (“channels”) of the word embeddings, one static and one fine-tuned during training; each filter is applied to both channels and the results are added (see the sketch after this list)
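
A minimal sketch of the two-channel setup, assuming the 2014 paper's recipe of one static and one fine-tuned embedding channel; the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, h = 7, 50, 3
X_static = rng.normal(size=(n, k))  # channel 1: embeddings kept fixed during training
X_tuned = X_static.copy()           # channel 2: embeddings fine-tuned during training
w, b = rng.normal(size=(h, k)), 0.0

# each filter is applied to both channels and the two results are added
c = np.empty(n - h + 1)
for i in range(n - h + 1):
    c[i] = np.tanh(np.sum(X_static[i:i + h] * w) + np.sum(X_tuned[i:i + h] * w) + b)
print(c.shape)  # (5,)
```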

Character-level CNN

Advantages: The model is robust against typos and misspellings, and can be used for strings like URLs

Disadvantages: Longer to train

Quantization

  • Encoding each input character as a one-hot vector over a fixed alphabet (see the sketch below)
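
A sketch of character quantization: each character of a fixed alphabet becomes a one-hot vector, and characters outside the alphabet become all-zero vectors. The toy alphabet and sequence length below are assumptions.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._/:"   # toy alphabet (assumption)
CHAR_TO_IDX = {ch: i for i, ch in enumerate(ALPHABET)}

def quantize(text, max_len=32):
    """Return a (max_len, |alphabet|) matrix of one-hot character vectors."""
    out = np.zeros((max_len, len(ALPHABET)))
    for pos, ch in enumerate(text[:max_len]):
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:   # unknown characters stay all-zero
            out[pos, idx] = 1.0
    return out

print(quantize("http://example.com").shape)  # (32, 41)
```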

Character-level CNNs tend to perform better on larger datasets.

"No Free Lunch Theorem"

URL Example: Sum pooling is used to “accumulate” the signal, rather than max-pooling

Papers Referenced:

  • Vaswani et al. (2017), “Attention Is All You Need”
  • Kim (2014), “Convolutional Neural Networks for Sentence Classification”
  • Zhang, Zhao & LeCun (2015), “Character-level Convolutional Networks for Text Classification”