ZFNet/DeconvNet: Summary and Implementation

This post is divided into 2 sections: Summary and Implementation.

We are going to review in depth the paper Visualizing and Understanding Convolutional Networks, which introduces the ZFNet and DeconvNet architectures.

The implementation uses PyTorch as its framework. To see the full implementation,
please refer to this repository.

Also, if you want to read other "Summary and Implementation" posts, feel free to
check them out on my blog.

I) Summary

DISCLAIMER:

We will use the weights/biases of a pretrained model from this GitHub repository.
We are only going to implement the 1st version of ZFNet.
We will remove Local Response Normalization.

The paper Visualizing and Understanding Convolutional Networks introduces the notion of a DeconvNet, which enables us to visualize what each layer responds to.
By visualizing each layer, we gain insight into what the model is learning and can adjust the architecture to improve it.
That's how ZFNet was created: a version of AlexNet tuned on the basis of the visualization results.

ZFNet architecture:

5 Convolutional layers.
3 Fully connected layers.
3 Overlapping Max pooling layers.
ReLU as the activation function for the hidden layers.
Softmax as the activation function for the output layer.
About 60 million trainable parameters.
Cross-entropy as the cost function.
Mini-batch gradient descent with Momentum optimizer.
Local Response Normalization (Removing it seems to give better results)
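
The parameter figure in the list above can be checked by hand. The sketch below is only an illustration: it assumes the layer sizes used in the Implementation section of this post (1st version of ZFNet, no Local Response Normalization), and the helper names `conv_params` and `fc_params` are made up for this example.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution (hypothetical helper)."""
    return c_in * c_out * k * k + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (hypothetical helper)."""
    return n_in * n_out + n_out


# Layer sizes taken from the architecture in section II of this post.
counts = {
    "conv1": conv_params(3, 96, 7),
    "conv2": conv_params(96, 256, 5),
    "conv3": conv_params(256, 384, 3),
    "conv4": conv_params(384, 384, 3),
    "conv5": conv_params(384, 256, 3),
    "fc6": fc_params(9216, 4096),
    "fc7": fc_params(4096, 4096),
    "fc8": fc_params(4096, 1000),
}

total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")
```

Under these assumptions the total comes out at roughly 62 million, and the first fully connected layer alone accounts for more than half of it.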

ZFNet differences:

1st version:
- Conv1 filters: Change from (11x11 stride 4) to (7x7 stride 2).
  - By using ZFNet, top-5 validation error rate is 16.5%.
  - By using AlexNet, top-5 validation error rate is 18.2%.
2nd version:
- Conv3 filters: Use of 512 filters instead of 384.
- Conv4 filters: Use of 1024 filters instead of 384.
- Conv5 filters: Use of 512 filters instead of 256.
  - By using ZFNet, top-5 validation error rate is 16.0%.
  - By using AlexNet, top-5 validation error rate is 18.2%.

To visualize a layer, we project its activations back to pixel space, which gives an approximate reconstruction of the part of the picture that caused them.
To do so, we first feed an image through the main convnet so that it can record the location of the local max in each pooling region (these locations are called switches).

The switches are then used in the unpooling layers to put each value back at the position it came from, mapping the activations back toward input space.
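
To make the role of the switches concrete, here is a minimal pure-Python sketch of max pooling that records the position of each maximum, followed by the matching unpooling step. It is a simplification, not the code used in this post: it assumes non-overlapping 2x2 regions on a single 4x4 feature map (ZFNet uses overlapping 3x3 regions with stride 2), and the function names are made up for this example.

```python
def max_pool_with_switches(x, k=2):
    """Non-overlapping k x k max pooling that also records where each max was.

    Assumes the height and width of x are multiples of k.
    """
    rows, cols = len(x), len(x[0])
    pooled, switches = [], []
    for r in range(0, rows, k):
        pooled_row, switch_row = [], []
        for c in range(0, cols, k):
            # All (row, col) positions inside the current pooling region.
            window = [(i, j) for i in range(r, r + k) for j in range(c, c + k)]
            best = max(window, key=lambda pos: x[pos[0]][pos[1]])
            pooled_row.append(x[best[0]][best[1]])
            switch_row.append(best)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, shape):
    """Put every pooled value back at its recorded position, zeros elsewhere."""
    rows, cols = shape
    out = [[0] * cols for _ in range(rows)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (i, j) in zip(pooled_row, switch_row):
            out[i][j] = value
    return out


x = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 7, 2],
     [6, 2, 3, 1]]

pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches, (4, 4))

print(pooled)
for row in restored:
    print(row)
```

The restored map is mostly zeros: unpooling only recovers the maxima, which is why the reconstruction is approximate.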

II) Implementation

1) Architecture build

from collections import OrderedDict

import torch
import torch.nn as nn


class ZFNet(nn.Module):
    
    def __init__(self):
        super(ZFNet, self).__init__()
        
        # CONV PART.
        self.features = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1)),
            ('act1', nn.ReLU()),
            ('pool1', nn.MaxPool2d(kernel_size=3, stride=2, padding=1, return_indices=True)),
            ('conv2', nn.Conv2d(96, 256, kernel_size=5, stride=2, padding=0)),
            ('act2', nn.ReLU()),
            ('pool2', nn.MaxPool2d(kernel_size=3, stride=2, padding=1, return_indices=True)),
            ('conv3', nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1)),
            ('act3', nn.ReLU()),
            ('conv4', nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1)),
            ('act4', nn.ReLU()),
            ('conv5', nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1)),
            ('act5', nn.ReLU()),
            ('pool5', nn.MaxPool2d(kernel_size=3, stride=2, padding=0, return_indices=True))
        ]))
    
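        # Filled in by forward(): the output of every layer in self.features,
        # and the max locations (switches) returned by each pooling layer.
        self.feature_outputs = [0]*len(self.features)
        self.switch_indices = dict()
        self.sizes = dict()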


        self.classifier = nn.Sequential(OrderedDict([
            ('fc6', nn.Linear(9216, 4096)),
            ('act6', nn.ReLU()),
            ('fc7', nn.Linear(4096, 4096)),
            ('act7', nn.ReLU()),
            ('fc8', nn.Linear(4096, 1000))
        ]))
    
        # DECONV PART.
        self.deconv_pool5 = nn.MaxUnpool2d(kernel_size=3,
                                           stride=2,
                                           padding=0)
        self.deconv_act5 = nn.ReLU()
        self.deconv_conv5 = nn.ConvTranspose2d(256,
                                               384,
                                               kernel_size=3,
                                               stride=1,
                                               padding=1,
                                               bias=False)
        
        self.deconv_act4 = nn.ReLU()
        self.deconv_conv4 = nn.ConvTranspose2d(384,
                                               384,
                                               kernel_size=3,
                                               stride=1,
                                               padding=1,
                                               bias=False)
        
        self.deconv_act3 = nn.ReLU()
        self.deconv_conv3 = nn.ConvTranspose2d(384,
                                               256,
                                               kernel_size=3,
                                               stride=1,
                                               padding=1,
                                               bias=False)
        
        self.deconv_pool2 = nn.MaxUnpool2d(kernel_size=3,
                                           stride=2,
                                           padding=1)
        self.deconv_act2 = nn.ReLU()
        self.deconv_conv2 = nn.ConvTranspose2d(256,
                                               96,
                                               kernel_size=5,
                                               stride=2,
                                               padding=0,
                                               bias=False)
        
        self.deconv_pool1 = nn.MaxUnpool2d(kernel_size=3,
                                           stride=2,
                                           padding=1)
        self.deconv_act1 = nn.ReLU()
        self.deconv_conv1 = nn.ConvTranspose2d(96,
                                               3,
                                               kernel_size=7,
                                               stride=2,
                                               padding=1,
                                               bias=False)
        
    def forward(self, x):
        
        for i, layer in enumerate(self.features):
            if isinstance(layer, nn.MaxPool2d):
                x, indices = layer(x)
                self.feature_outputs[i] = x
                self.switch_indices[i] = indices
            else:
                x = layer(x)
                self.feature_outputs[i] = x
            
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x
    
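    def forward_deconv(self, x, layer):
        # `layer` is the number of deconv stages to apply, starting from pool5:
        # 1 stops after deconv_conv5, 5 goes all the way back to pixel space.
        # forward() must be called first so that the switches are recorded.
        if layer < 1 or layer > 5:
            raise ValueError("ZFNet -> forward_deconv(): layer value should be in [1, 5]")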
        
        x = self.deconv_pool5(x,
                              self.switch_indices[12],
                              output_size=self.feature_outputs[-2].shape[-2:])
        x = self.deconv_act5(x)
        x = self.deconv_conv5(x)
        
        if layer == 1:
            return x
        
        x = self.deconv_act4(x)
        x = self.deconv_conv4(x)
        
        if layer == 2:
            return x
        
        x = self.deconv_act3(x)
        x = self.deconv_conv3(x)
        
        if layer == 3:
            return x
        
        x = self.deconv_pool2(x,
                              self.switch_indices[5],
                              output_size=self.feature_outputs[4].shape[-2:])
        x = self.deconv_act2(x)
        x = self.deconv_conv2(x)
     
        if layer == 4:
            return x
        
        x = self.deconv_pool1(x,
                              self.switch_indices[2],
                              output_size=self.feature_outputs[1].shape[-2:])
        x = self.deconv_act1(x)
        x = self.deconv_conv1(x)
        
        if layer == 5:
            return x
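
The 9216 input features of fc6 follow from the spatial sizes produced by the convolution and pooling layers above. The sketch below traces them with the usual output-size formula. It assumes a 224x224 input, which this post does not state explicitly, and `out_size` is a helper name made up for this example.

```python
def out_size(size, kernel, stride, padding):
    """Spatial output size of a conv or max-pool layer (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding), copied from the `features` block of the class.
layers = [
    ("conv1", 7, 2, 1),
    ("pool1", 3, 2, 1),
    ("conv2", 5, 2, 0),
    ("pool2", 3, 2, 1),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]

size = 224  # assumed input resolution
sizes = {}
for name, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    sizes[name] = size
    print(f"{name}: {size}x{size}")

# conv5 has 256 output channels, so this is the length of the flattened vector.
flattened = 256 * size * size
print("features fed to fc6:", flattened)
```

With that input size, pool5 produces 6x6 maps and 256 * 6 * 6 = 9216, which matches the first argument of `nn.Linear(9216, 4096)`.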

2) Evaluating

import cv2
import matplotlib.pyplot as plt
import numpy as np
import torch

# model, test_loader, class_names and custom_dataset are defined in the full implementation.
fig2 = plt.figure(figsize=(30,10))

model.eval()
with torch.no_grad():
    
    for i, image in enumerate(test_loader):
        # Class probabilities for a batch holding a single image.
        probs = torch.softmax(model(image), dim=-1)
        
        probability, class_idx = torch.max(probs, 1)
        class_name = class_names[class_idx.item()]
         
        fig2.add_subplot(1,4,i+1)
        plt.imshow(cv2.cvtColor(custom_dataset.imgsToDisplay[i], cv2.COLOR_BGR2RGB))
        plt.title("Class: " + class_name + ", probability: %.4f" % probability.item(), fontsize=13)
        plt.axis('off')

        plt.text(0, 240, 'Top-5 predictions:')
        x, y = 10, 260
        
        # Five most probable classes, highest first.
        for idx in np.argsort(probs.numpy())[0][-5:][::-1]:
            s = '- {}, probability: {:.4f}'.format(class_names[idx], probs[0, idx])
            plt.text(x, y, s=s, fontsize=10)
            y += 20
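
The evaluation loop above relies on two small operations: softmax turns the raw scores of fc8 into probabilities, and the five largest probabilities are then listed. Here is a dependency-free sketch of both steps. The scores are made-up values over 7 classes, not output of the model, and the function names are made up for this example.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k=5):
    """Return (class index, probability) pairs, most probable first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.0, -2.0]  # made-up scores
probs = softmax(logits)
best = top_k(probs, k=5)

for idx, p in best:
    print(f"class {idx}: {p:.4f}")
```

Softmax preserves the ordering of the scores, so the top-5 classes are the same whether they are picked before or after it; the probabilities are only needed for display.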

3) Visualization of each layer

fig2 = plt.figure(figsize=(60,60))

model.eval()
count = 0
with torch.no_grad():
    for i, image in enumerate(test_loader):
        # The forward pass records the feature maps and pooling switches
        # that forward_deconv() needs.
        model(image)
        for j in range(1,6):
            count += 1
            # One row per image, one column per number of deconv stages applied.
            ax = fig2.add_subplot(4,5, count)
            ax.set_title("Layer {}".format(j), fontsize= 30)
            plt.axis('off')
            # Channel index 2 (the third channel) of the projected output.
            plt.imshow(model.forward_deconv(model.feature_outputs[12], j).detach().numpy()[0, 2, :])
