---
tags: hw3, conceptual
---
# HW3 Conceptual: CNNs
:::info
Conceptual questions and Programming assignment due **Wednesday, March 12th, 2025 at 10:00 PM EST**
:::
Answer the following questions, showing your work where necessary. Please explain your answers and work.
:::warning
Do **NOT** include your name anywhere within this submission. Points will be deducted if you do so.
:::

*Muffin or chihuahua? Or a chihuahua named "Muffin"? The age-old question no CNN can answer.*
## Conceptual Questions
### 1. Consider the following three $23 \times 23$ images of the digit 3.

- a. Which neural net is more fit to identify the digit in each image: a convolutional neural net or a multilayer perceptron (a neural network with multiple fully-connected layers and nonlinear layers)? Explain your reasoning. (2-3 sentences)
- b. Will a convolutional layer with standard max-pooling (e.g., $2 \times 2$ pooling) produce the same or different outputs for all of the images? Why/why not? How does this relate to translational invariance/equivariance? (hint: remember that the image is $23 \times 23$) (3-4 sentences)
- c. Let’s say you built a convolutional neural network to classify these images with two layers: a convolution layer and a fully connected (linear) layer. What are their roles in the network, respectively? A minimal code sketch of such a network is shown below. (2-3 sentences)
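Below is a minimal sketch (assuming TensorFlow/Keras) of the kind of two-layer network described in part c, with the standard $2 \times 2$ max-pooling from part b included for reference. The specific sizes here (8 filters, 10 output classes) are illustrative choices, not part of the problem.

```python
import tensorflow as tf

# Illustrative only: 8 filters and 10 output classes are arbitrary choices.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(23, 23, 1)),                             # one-channel 23x23 image
    tf.keras.layers.Conv2D(8, kernel_size=3, activation="relu"),   # convolution layer (part c)
    tf.keras.layers.MaxPool2D(pool_size=2),                        # standard 2x2 max-pooling (part b)
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),                                     # fully connected (linear) layer (part c)
])
model.summary()   # prints the output shape of each layer
```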
### 2. Weight Sharing Convolutions
Convolutional layers in CNNs repeatedly apply a kernel (or filter) to an input and produce an output. Standard convolutional neural networks leverage "weight sharing" convolutions, meaning a single convolutional layer uses the same kernel (and therefore the same weights) across the entire input. This problem will ask you to compare standard convolutions with "non-weight-sharing" convolutional layers (where a different kernel is used for each spatial location in the input) and fully connected layers.
If a standard convolutional layer produces an output matrix that is $5\times 5$, that means the kernel was applied 25 times across the input. A non-weight-sharing convolutional layer would have 25 different kernels, one for each time a kernel is applied to the input.
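To make the distinction concrete, here is a minimal NumPy sketch on a hypothetical $5 \times 5$ input with a $3 \times 3$ kernel, stride 1, and no padding (deliberately smaller than the base case below). The standard layer reuses one kernel at every output position, while the non-weight-sharing layer uses a fresh kernel at each position.

```python
import numpy as np

x = np.random.randn(5, 5)                     # toy 5x5 single-channel input

# Standard (weight-sharing) convolution: one 3x3 kernel reused at every position.
shared_kernel = np.random.randn(3, 3)

# Non-weight-sharing layer: a different 3x3 kernel for each of the 3x3 output positions.
local_kernels = np.random.randn(3, 3, 3, 3)   # indexed by (out_row, out_col, k_row, k_col)

shared_out = np.zeros((3, 3))
local_out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        patch = x[i:i + 3, j:j + 3]
        shared_out[i, j] = np.sum(patch * shared_kernel)       # same weights every time
        local_out[i, j] = np.sum(patch * local_kernels[i, j])  # new weights at each position

print(shared_kernel.size, local_kernels.size)  # 9 vs. 81 learnable weights (ignoring biases)
```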
#### Parameters
Use the following parameters for the "base case":
1. $28 \times 28$ image with one channel (e.g., MNIST image)
2. $3 \times 3$ filter(s) with 1 channel in output
3. Stride of 1
4. Padding = "SAME"
These are the hyperparameter settings of the convolutional layer unless otherwise specified by the problem.
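If you would like to sanity-check your hand calculations, a sketch like the following (assuming TensorFlow) lets you inspect the base-case layer's output and weight shapes directly.

```python
import tensorflow as tf

# Base case: 28x28 one-channel input, a single 3x3 filter, stride 1, "SAME" padding.
x = tf.zeros([1, 28, 28, 1])   # (batch, height, width, channels)
layer = tf.keras.layers.Conv2D(filters=1, kernel_size=3, strides=1, padding="same")
y = layer(x)

print(y.shape)              # output shape
print(layer.kernel.shape)   # kernel weights (there is also a bias term)
```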
#### Questions
a. For a standard convolutional layer (given the hyperparameters above):
* What will be the output size?
* How much padding will be added to the input image?
* If we were to add a linear layer after this convolution, we would need to flatten the output into a vector. How many elements would be in this vector?
b. How many learnable weights are in the standard convolutional layer described above?
c. How many learnable weights would be used in a non-weight-sharing convolution?
d. If instead of using filters, we used a fully-connected layer to go from our input to output, how many weights would be used (keeping the output size the same as in part a)?
e. How many multiplication operations occur during the forward pass of each of these layers (i.e., standard convolution, non-weight-sharing, and fully-connected)?
f. What is the size of the Jacobian $\partial z/\partial W$, where $z$ is the intermediate output of the layer and $W$ represents the weights? Report an answer for each of the three layers.
g. What would happen to these numbers (your answers to parts e and f) if we doubled the dimensions of the input (i.e., a $56 \times 56$ image)? What if we doubled them again (i.e., $112 \times 112$)? What is the pattern relating input size to the number of operations and the size of the Jacobians? Express your answer using Big O notation.
h. How much faster is a convolutional layer than a non-weight-sharing layer and a fully connected layer? For the forward pass, you can compare the total number of multiplications. For the backward pass, you can simply compare the size of the Jacobians as the input size (and output size) increases.
:::spoiler But can't much of this happen in parallel?
Yes, but for this comparison you can assume that GPU acceleration benefits convolutions just as much as it benefits fully-connected layers.
:::
i. Why might the speed of a layer be an advantage for Deep Learning models (other than the fact that we can make faster predictions)?
### 3. The following questions refer to CNNs in different dimensions.
- a. So far in this class, we’ve only explored 2D CNNs for image recognition and classification. However, 1D CNNs are also popular in many fields, with the network convolving linearly in only one direction. Give a scenario where a 1D CNN could be useful, and explain how the CNN can extract relevant features in a 1D setting. We’re looking for specific examples! (3-4 sentences)
- https://www.tensorflow.org/api_docs/python/tf/nn/conv1d
- b. Suppose you want your computer to read Twitter (now X) data. Explain how you could leverage 1D CNNs to classify different emotions from input tweets. How would you train your model? What would your CNN kernel convolve over? How would you account for variable tweet lengths? A sketch of the 1D convolution API is shown below. (3-4 sentences)
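For reference, here is a minimal sketch of the 1D convolution mechanics from the linked API. All sizes (32 tweets per batch, 50 tokens each, 64-dimensional embeddings, 128 filters, a kernel spanning 5 tokens) are hypothetical choices for illustration.

```python
import tensorflow as tf

# Hypothetical batch: 32 tweets, each represented as 50 token embeddings of dimension 64.
tweets = tf.random.normal([32, 50, 64])   # (batch, sequence length, embedding dim)

# The kernel spans 5 consecutive tokens (and all 64 embedding channels) and slides
# along the token axis only.
conv1d = tf.keras.layers.Conv1D(filters=128, kernel_size=5, activation="relu")
features = conv1d(tweets)
print(features.shape)                     # (32, 46, 128): 50 - 5 + 1 positions with "valid" padding
```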
### 4. (Optional) Have feedback for this assignment? Found something confusing? We’d love to hear from you!
## Extra Questions
These were questions for 2470 students when 2470 and 1470 were co-taught. You may find them helpful and informative to do, but they are not worth any extra credit.
### 1. Given an image $I$ and convolutional kernel $K$, prove (for the discrete case) that convolution is equivariant under translation. It’s fine to do this just for 1D convolution.
:::info
**Hint:** Refer to the "Are CNNs Translation Invariant?" slide from the lecture.
:::
### 2. Suppose you have a CNN that begins by taking an input image of size $28 \times 28 \times 3$ and passing through a convolution layer that convolves the image using 3 filters of dimensions $2 \times 2 \times 3$ with valid padding.
- a. How many learnable parameters does this convolution layer have?
- b. Suppose that you instead decided to use a fully connected layer to replicate the behavior of this convolutional layer. How many parameters would that fully connected layer have?
- c. Read about [cutout](https://arxiv.org/pdf/1708.04552.pdf). A minimal code sketch of the augmentation is included after this question.
- i. What is cutout? Why is it useful?
- ii. What are some similar methods? What makes them similar?
- iii. What were the cutout sizes for CIFAR-10 and CIFAR-100? How did the researchers decide on their cutout size? Why do you think the cutout size differed for CIFAR-10 vs CIFAR-100?
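For intuition while reading, here is a minimal NumPy sketch of the augmentation; `mask_size=8` is an arbitrary illustrative default, not one of the paper's reported values.

```python
import numpy as np

def cutout(image: np.ndarray, mask_size: int = 8) -> np.ndarray:
    """Zero out a randomly placed mask_size x mask_size square of `image` (H, W, C)."""
    h, w = image.shape[:2]
    out = image.copy()
    # The square's center is chosen uniformly over the image, so the square may
    # extend past the border (in which case less of the image is actually cut out).
    cy, cx = np.random.randint(h), np.random.randint(w)
    y0, y1 = max(cy - mask_size // 2, 0), min(cy + mask_size // 2, h)
    x0, x1 = max(cx - mask_size // 2, 0), min(cx + mask_size // 2, w)
    out[y0:y1, x0:x1, :] = 0.0
    return out

augmented = cutout(np.random.rand(32, 32, 3))   # e.g., a CIFAR-sized image
```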
<style>
.alert {
color: inherit
}
.markdown-body {
font-family: Inter
}
h3 {
font-size: 1.1em !important;
font-weight: 500 !important;
}
</style>
<!--## Ethical Implications
In August of 2021, Apple introduced new features that scan iPhones and iCloud for images of child abuse. The model behind the image detection, neuralMatch, was trained using 200,000 images from the National Center for Missing & Exploited Children. Human reviewers would check any positive detections of child abuse imagery and alert law enforcement if confirmed.
Please listen to [this 10-minute podcast](https://open.spotify.com/episode/7ihBfIGI9hyJxLzIrRQjk9?si=p25LINH2Se6qpEwdleStMw&dl_branch=1) to learn about the contexts in which neuralMatch is being deployed, and skim [Apple’s technical summary](https://www.apple.com/child-safety/pdf/CSAM_Detection_Technical_Summary.pdf) of CSAM detection.
### 1. According to Apple’s technical summary, how are CNNs used in this specific application?
:::info
Hint: review the System Overview and Technology Overview: NeuralHash sections. (3-5 sentences of your own words)
:::
### 2. Drawing on the podcast and paper, discuss one technology-driven (implemented using technology/software) and one human-driven method (manually implemented using humans) Apple is using to protect user’s privacy while identifying known CSAM images. (4-6 sentences)
### 3. As discussed in the podcast episode, Apple must balance its long-standing commitment to user privacy with increasing external pressures to act on broader sociotechnological issues like child safety. Do you think Apple should or should not deploy this set of features? What implementation measures or external factors (legal, technical, political, etc.) would cause you to change your mind? Please clearly state your position and be specific in your reasoning. (3-5 sentences)
-->