# Question 1
## Part (a)
A
## Part (b)
E
## Part (c)
A
## Part (d)
B
## Part (e)
B (not sure)
## Part (f)
A
## Part (g)
A
## Part (h)
E
## Part (i)
Don't know
## Part (j)
Don't know
# Question 2
1. False
2. True (not sure)
3. False (not sure)
4. False
5. False
6. False
7. False
8. True (not sure)
9. False
10. False (not sure)
# Question 3
## Part (a)
1. Dropout (`nn.Dropout`)
2. Batch normalization (`nn.BatchNorm1d`)
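A minimal sketch of how these two layers might be used (the layer sizes here are made up for illustration):
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),  # normalizes each feature over the mini-batch
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zeroes activations, during training only
    nn.Linear(128, 10),
)
model.train()  # dropout and batch norm use training behaviour
model.eval()   # dropout off; batch norm uses running statistics
```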
## Part (b)
(Not sure what was used in the course.)
1. ReLU (`nn.ReLU`)
2. Sigmoid (`nn.Sigmoid`)
3. LeakyReLU (`nn.LeakyReLU`)
4. Tanh (`nn.Tanh`)
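For reference, a quick way to compare these on the same inputs (illustrative only):
```python
import torch
import torch.nn as nn

x = torch.linspace(-2.0, 2.0, steps=5)
for act in (nn.ReLU(), nn.Sigmoid(), nn.LeakyReLU(), nn.Tanh()):
    print(act, act(x))
```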
## Part (c)
(Not sure what was taught in the course.)
Deep learning is machine learning that uses deep (multi-layer) neural networks.
(From Google: "In practical terms, deep learning is just a subset of machine learning. In fact, deep learning technically is machine learning and functions in a similar way (hence why the terms are sometimes loosely interchanged). However, its capabilities are different.")
## Part (d)
1000
## Part (e)
144
## Part (f)
Each of the 32 filters has 3 × 3 × 16 weights plus 1 bias: (3 * 3 * 16 + 1) * 32 = 4640. (Not sure.)
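This matches a quick PyTorch check (a sketch, assuming the layer in question is a 3 × 3 convolution from 16 input channels to 32 output channels):
```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)
# weights: 32 * 16 * 3 * 3 = 4608, biases: 32, total: 4640
print(sum(p.numel() for p in conv.parameters()))  # 4640
```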
# Question 4
```python
>>> x1 = torch.randn(10, 16)
>>> x1.shape
torch.Size([10, 16])
>>> x2 = torch.randn(20, 3, 16, 16) # NCHW
>>> x2.shape
torch.Size([20, 3, 16, 16])
>>> layer = nn.Linear(in_features=16, out_features=32)
>>> layer(x1).shape
torch.Size([10, 32])
>>> layer(x2).shape
torch.Size([20, 3, 16, 32])
>>> conv = nn.Conv2d(in_channels=3, out_channels=7, kernel_size=5, padding=0)
>>> conv(x2).shape
torch.Size([20, 7, 12, 12])
>>> conv2 = nn.Conv2d(in_channels=3, out_channels=7, kernel_size=5, padding=2)
>>> conv2(x2).shape
torch.Size([20, 7, 16, 16])
>>> pool = nn.MaxPool2d(kernel_size=2, stride=2)
>>> pool(x2).shape
torch.Size([20, 3, 8, 8])
>>> convt = nn.ConvTranspose2d(in_channels=3, out_channels=1, kernel_size=5, stride=2, padding=2)
>>> convt(x2).shape
torch.Size([20, 1, 31, 31])
```
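These spatial sizes follow the standard output-size formulas (with dilation 1 and output padding 0):
$$H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2p - k}{s} \right\rfloor + 1 \quad \text{(Conv2d, MaxPool2d)}$$
$$H_{\text{out}} = (H_{\text{in}} - 1)\,s - 2p + k \quad \text{(ConvTranspose2d)}$$
For example, the last line above: $(16 - 1) \cdot 2 - 2 \cdot 2 + 5 = 31$.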
# Question 5
## Part (a)
(Not sure what was taught in the course.)
```python
# images = mini-batch of real data, with batch size 64
noise = torch.randn(64, 100)  # batch size 64, noise size 100
fake = generator(noise).detach()  # detach: don't backprop into the generator here
inputs = torch.cat([images, fake])
outputs = discriminator(inputs)
# Label convention: real = 0, fake = 1 (must match Part (b))
labels = torch.cat([torch.zeros(images.shape[0]),  # labels for real data
                    torch.ones(fake.shape[0])])    # labels for fake data
d_loss = criterion(outputs, labels)
d_loss.backward()
```
## Part (b)
(Not sure what was taught in the course.)
```python
noise = torch.randn(64, 100)  # batch size 64, noise size 100
# No detach here: gradients must flow through the discriminator into the generator
outputs = discriminator(generator(noise))
labels = torch.zeros(noise.shape[0])  # "real" label (0 under Part (a)'s convention)
g_loss = criterion(outputs, labels)
g_loss.backward()
```
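In a full training loop, these two updates alternate each mini-batch. A skeleton (the optimizers and learning rates here are assumptions; `generator`, `discriminator`, `criterion`, and `data_loader` are taken as given):
```python
import torch

d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
g_optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4)

for images in data_loader:
    # Discriminator step (Part (a))
    d_optimizer.zero_grad()
    noise = torch.randn(64, 100)
    fake = generator(noise).detach()
    inputs = torch.cat([images, fake])
    labels = torch.cat([torch.zeros(images.shape[0]), torch.ones(fake.shape[0])])
    d_loss = criterion(discriminator(inputs), labels)
    d_loss.backward()
    d_optimizer.step()

    # Generator step (Part (b))
    g_optimizer.zero_grad()
    noise = torch.randn(64, 100)
    g_loss = criterion(discriminator(generator(noise)), torch.zeros(64))
    g_loss.backward()
    g_optimizer.step()
```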
## Part (c)
(Not sure)
It will be difficult for the generator to learn anything, because the gradient it receives from the discriminator will vanish: when the discriminator is very confident, small changes in the generator's output barely change the discriminator's output, so the generator gets almost no gradient signal.
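In equation form (a sketch, assuming the discriminator ends in a sigmoid $D = \sigma(a)$ and the generator minimizes the original minimax loss $\log(1 - D(G(z)))$):
$$\frac{\partial}{\partial a} \log\big(1 - \sigma(a)\big) = -\sigma(a) \longrightarrow 0 \quad \text{as } a \to -\infty$$
so the more confidently the discriminator rejects a fake sample, the smaller the gradient flowing back into the generator.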
## Part (d)
Mode collapse is when the generator only produces samples with limited variety, covering a few modes of the data distribution and ignoring the rest.
## Part (e)
Autoencoders do not suffer from mode collapse, because the reconstruction loss penalizes them for failing to reconstruct any training sample, so every mode of the data must be represented.
## Part (f)
(Not sure what was taught in the course.)
A targeted adversarial attack optimizes the input so that the model classifies it as a specific class other than the correct one:
$$\arg\min_{x} L(f(x), \text{target class})$$
A non-targeted adversarial attack optimizes the input so that the model classifies it as anything other than the correct class:
$$\arg\max_{x} L(f(x), \text{correct class})$$
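As a concrete instance of the second objective, the fast gradient sign method (FGSM) takes a single step that increases the loss on the correct label; `model`, `x`, and `y` here are hypothetical:
```python
import torch
import torch.nn.functional as F

def fgsm_untargeted(model, x, y, eps=0.03):
    # One gradient-ascent step on the loss w.r.t. the input,
    # keeping the perturbation small via its sign and eps.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```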
## Part (g)
A black-box adversarial attack is one in which the model internals (such as weights and architecture) are unknown to the attacker, who can only query the model's outputs.
A white-box adversarial attack is one in which the model internals (such as weights and architecture) are known to the attacker.
# Question 6
## Part (a)
```python
torch.Size([4, 10])
```
## Part (b)
The output represents 4 time steps (first dimension), where each row is a vector of 10 logits (second dimension) over the vocabulary. `out[i, j]` is the logit (unnormalized score) for the token at the `i`-th time step being the `j`-th token in the vocabulary; applying a softmax over the second dimension turns each row into a probability distribution.
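For instance, to turn a row of logits into a distribution and sample from it (a sketch; `out` is the `[4, 10]` output above):
```python
import torch

probs = torch.softmax(out, dim=1)                         # each row sums to 1
next_token = torch.multinomial(probs[-1], num_samples=1)  # sample the last step
```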
## Part (c)
(Not sure)
This allows the model to generate more coherent tweets, since it is always trained to generate an entire tweet from start to end; the previous method might only learn to generate a substring of a tweet.
## Part (d)
Pseudo-code (`sample` draws a token from the softmax of the output logits):
```python
token = '<START OF TWEET>'
tokens = [token]
hidden = initial_hidden
while token != '<END OF TWEET>':
    output, hidden = model(token, hidden)
    token = sample(output)
    tokens.append(token)
```
## Part (e)
K, R, F, C, E, M
# Question 7
## Part (a)
In reinforcement learning, the **environment** provides the current observation and a scalar **reward** at each time step. A deterministic, policy-based agent takes one **state** as input and chooses one **action** at each time step. A **stochastic** agent incorporates some randomness into its choice, and strikes a better balance between **exploration** and **exploitation**. An **actor-critic** RL agent contains both a policy and a **value function**, but not a model (a model predicts the behaviour of the **environment**). The original AlphaGo was an example of such an RL agent.
## Part (b)
(Not sure what was taught in the course; check your slides for the answer.)
$$G_t = \sum_{s=t}^{T} \gamma^{s - t} R(S_s)$$
where:
- $T$ is the final time step
- $S_s$ is the state at time $s$
- $R(S_s)$ is the reward obtained at state $S_s$
- $\gamma$ is the discount factor
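As a sanity check, a direct translation of this formula (for $t = 0$) into Python:
```python
def discounted_return(rewards, gamma):
    # G_0 = sum over s of gamma**s * R(S_s)
    return sum(gamma ** s * r for s, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81 * 2 = 2.62
```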
## Part (c)
0 and 1.
## Part (d)
The reward hypothesis is that all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).
## Part (e)
(I support the hypothesis, so I don't have counterarguments.)
# Question 8
## Part (a)
1. There is a ReLU activation at the output of the decoder.
2. The `forward` function applies the `decoder` before the `encoder`.
3. The encoder takes 1 input channel, but the decoder outputs 3 channels, so the reconstruction does not match the input.
## Part (b)
I don't know
## Part (c)
Yes.
## Part (d)
I don't know. An autoencoder is not usually considered a generative model (a variational autoencoder is, but that is a different model). However, the question asks about generating samples from an autoencoder, which leaves me very confused.
# Question 9
I don't know the ethics and fairness material (sorry).