---
tags: machine-learning
---
# EfficientNet: Summary and Implementation
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/efficientnet/0.png?token=AMAXSKPEEO6OMLU6FV7JV2K6WMIHS">
</div>
>This post is divided into 2 sections: Summary and Implementation.
>
>We are going to have an in-depth review of the [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks][paper] paper, which introduces the EfficientNet architecture.
>
> The implementation uses Keras as its framework. To see the full implementation,
> please refer to this [repository].
>
> Also, if you want to read other "Summary and Implementation", feel free to
> check them at my [blog](https://ferdinandmom.engineer/deep-learning/).
# I) Summary
- The paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/pdf/1905.11946.pdf) introduces a new principled method to scale up ConvNets.
- ==To get better accuracy, a ConvNet needs a careful balance between depth, width, and resolution. However, the process of scaling up ConvNets has never been well understood==.
- The most common way has been to scale up ConvNets by their depth or width.
- Another, less common, way is to scale up models by image resolution. So far, only one dimension has been scaled at a time.
- **Depth**: Deeper ConvNets capture more complex features and generalize well. However, they are more difficult to train due to the vanishing gradient problem. Although techniques such as skip connections and batch normalization alleviate the training problem, the accuracy gain diminishes for very deep networks.
- **Width**: Wider networks tend to capture more fine-grained features and are easier to train. However, the accuracy of such networks tends to saturate quickly.
- **Resolution**: With higher-resolution input images, ConvNets can potentially capture more fine-grained patterns. However, for very high resolutions, the accuracy gain diminishes.
![](https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/efficientnet/1.png?token=AMAXSKPSCCM5XCEYI7RBMHS6WMIHU)
---
- We can then think about scaling multiple dimensions at a time. It is possible to scale two or three dimensions arbitrarily, but this requires tedious manual tuning and often yields sub-optimal accuracy and efficiency.
- In this paper, the authors address the following question:
==**"Is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency?"**==
- Their empirical study shows that it is **critical to balance all dimensions of a network** (width/depth/resolution) at the same time.
- Such balance can be achieved by **scaling** each of them by a **constant ratio**.
- This method is called the **"compound scaling method"**, which **uniformly scales the network width, depth, and resolution with a set of fixed scaling coefficients**.
- The intuition comes from the following fact:
    - If the input image is bigger (higher resolution), there are more complex features and fine-grained patterns to capture. To capture more complex features, the network needs a larger receptive field, which is achieved by adding more layers (depth). To capture more fine-grained patterns, the network needs more channels (width).
- The following figure validates this intuition.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/efficientnet/2.png?token=AMAXSKPQZIC6SB7OYKECD6K6WMIHW">
</div>
---
- The "compound scaling method" uses a compound coefficient $\phi$ to uniformly scales network width, depth and resolution in a principled way:
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/efficientnet/3.png?token=AMAXSKN4D75BA6WGDW4RUXK6WMIHW">
</div>
where $\alpha$, $\beta$, $\gamma$ are constants that can be determined by a small grid search.
- Since the total FLOPS scale roughly with $d$ · $w^2$ · $r^2$, the paper constrains $\alpha$ · $\beta^2$ · $\gamma^2$ ≈ 2 so that, for any new $\phi$, the total FLOPS increase by approximately $2^\phi$ (a quick numeric check is sketched after the figure below).
- To use this method, a baseline model is needed, here called "EfficientNet-B0". It was designed using neural architecture search (NAS).
- We then apply the compound scaling method to scale it up in two steps:
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/efficientnet/4.png?token=AMAXSKNQVDSDZDXGZ7BGWKK6WMIHY">
</div>
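As a concrete illustration, here is a minimal numeric sketch of the constraint, using the constants the paper reports from its grid search on the baseline ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$). The coefficient names anticipate the arguments of the implementation in section II.
```python
# Constants reported in the paper for the EfficientNet-B0 baseline.
alpha, beta, gamma = 1.2, 1.1, 1.15

# The constraint alpha * beta^2 * gamma^2 ~ 2 means the FLOPS roughly double
# for each unit increase of the compound coefficient phi.
print(alpha * beta**2 * gamma**2)   # ~1.92, close to 2

# For a given phi, each dimension is scaled by its own constant raised to phi.
phi = 1                             # e.g. EfficientNet-B1
d_coef = alpha ** phi               # depth multiplier
w_coef = beta ** phi                # width multiplier
r_coef = gamma ** phi               # resolution multiplier
print(d_coef, w_coef, r_coef)       # 1.2 1.1 1.15
```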
---
Here is the EfficientNet-B0 architecture:
![](https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/efficientnet/5.png?token=AMAXSKMPEJLLS6GXOSOEWES6WMIH2)
---
![](https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/efficientnet/6.png?token=AMAXSKJ6ZPSDRH4GNGRIZSC6WMIH2)
---
![](https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/efficientnet/7.png?token=AMAXSKMHUTBYCU3GEKVHII26WMIH2)
# II) Implementation
### 1) Architecture build
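The builder below relies on two small helpers, `scaled_channels` and `scaled_repeats`, to apply the width and depth coefficients; their exact definitions are in the [repository]. A minimal sketch, following the rounding rules used by the reference EfficientNet implementation (channels rounded to a multiple of 8, repeats rounded up), could look like this:
```python
import math

def scaled_channels(channels, w_coef, divisor=8):
    # Scale the channel count by the width coefficient and round it to the
    # nearest multiple of `divisor`, never going more than 10% below the
    # unrounded value.
    channels *= w_coef
    new_channels = max(divisor, int(channels + divisor / 2) // divisor * divisor)
    if new_channels < 0.9 * channels:
        new_channels += divisor
    return int(new_channels)

def scaled_repeats(repeats, d_coef):
    # Scale the number of block repeats by the depth coefficient,
    # always rounding up so no stage disappears.
    return int(math.ceil(d_coef * repeats))
```
With these in place, the stage-by-stage builder follows the EfficientNet-B0 table directly: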
```python
# ConvBlock, MBConvBlock and DENSE_KERNEL_INITIALIZER are defined in the
# accompanying repository. Standard Keras imports (swap in the
# tensorflow.keras equivalents if you use tf.keras):
from keras.layers import Input, GlobalAveragePooling2D, Dropout, Dense
from keras.models import Model

def EfficientNet_B0(channels,
                    expansion_coefs,
                    repeats,
                    strides,
                    kernel_sizes,
                    d_coef,
                    w_coef,
                    r_coef,
                    dropout_rate,
                    include_top,
                    se_ratio=0.25,
                    classes=1000):
    # Stage 1: stem, a plain 3x3 convolution with stride 2 (224x224 -> 112x112).
    inputs = Input(shape=(224, 224, 3))
    stage1 = ConvBlock(inputs,
                       filters=32,
                       kernel_size=3,
                       stride=2)
    # Stages 2-8: MBConv blocks. Channel counts are scaled by the width
    # coefficient w_coef, the number of repeats by the depth coefficient d_coef.
    stage2 = MBConvBlock(stage1,
                         scaled_channels(channels[0], w_coef),
                         scaled_channels(channels[1], w_coef),
                         kernel_sizes[0],
                         expansion_coefs[0],
                         se_ratio,
                         strides[0],
                         scaled_repeats(repeats[0], d_coef),
                         dropout_rate=dropout_rate)
    stage3 = MBConvBlock(stage2,
                         scaled_channels(channels[1], w_coef),
                         scaled_channels(channels[2], w_coef),
                         kernel_sizes[1],
                         expansion_coefs[1],
                         se_ratio,
                         strides[1],
                         scaled_repeats(repeats[1], d_coef),
                         dropout_rate=dropout_rate)
    stage4 = MBConvBlock(stage3,
                         scaled_channels(channels[2], w_coef),
                         scaled_channels(channels[3], w_coef),
                         kernel_sizes[2],
                         expansion_coefs[2],
                         se_ratio,
                         strides[2],
                         scaled_repeats(repeats[2], d_coef),
                         dropout_rate=dropout_rate)
    stage5 = MBConvBlock(stage4,
                         scaled_channels(channels[3], w_coef),
                         scaled_channels(channels[4], w_coef),
                         kernel_sizes[3],
                         expansion_coefs[3],
                         se_ratio,
                         strides[3],
                         scaled_repeats(repeats[3], d_coef),
                         dropout_rate=dropout_rate)
    stage6 = MBConvBlock(stage5,
                         scaled_channels(channels[4], w_coef),
                         scaled_channels(channels[5], w_coef),
                         kernel_sizes[4],
                         expansion_coefs[4],
                         se_ratio,
                         strides[4],
                         scaled_repeats(repeats[4], d_coef),
                         dropout_rate=dropout_rate)
    stage7 = MBConvBlock(stage6,
                         scaled_channels(channels[5], w_coef),
                         scaled_channels(channels[6], w_coef),
                         kernel_sizes[5],
                         expansion_coefs[5],
                         se_ratio,
                         strides[5],
                         scaled_repeats(repeats[5], d_coef),
                         dropout_rate=dropout_rate)
    stage8 = MBConvBlock(stage7,
                         scaled_channels(channels[6], w_coef),
                         scaled_channels(channels[7], w_coef),
                         kernel_sizes[6],
                         expansion_coefs[6],
                         se_ratio,
                         strides[6],
                         scaled_repeats(repeats[6], d_coef),
                         dropout_rate=dropout_rate)
    # Stage 9: 1x1 convolution head.
    stage9 = ConvBlock(stage8,
                       filters=scaled_channels(channels[8], w_coef),
                       kernel_size=1,
                       padding='same')
    # Classification head: global pooling, dropout, and a softmax classifier.
    if include_top:
        stage9 = GlobalAveragePooling2D()(stage9)
        if dropout_rate > 0:
            stage9 = Dropout(dropout_rate)(stage9)
        stage9 = Dense(classes,
                       activation='softmax',
                       kernel_initializer=DENSE_KERNEL_INITIALIZER)(stage9)
    model = Model(inputs, stage9)
    return model
```
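As a usage sketch, the stage parameters can be read straight off the EfficientNet-B0 table above. The exact values below (and the 0.2 dropout rate) are illustrative assumptions rather than the repository's own configuration, with all three scaling coefficients set to 1.0 for the B0 baseline:
```python
# Per-stage settings for stages 2-8, plus the stem (32) and head (1280) widths.
channels        = [32, 16, 24, 40, 80, 112, 192, 320, 1280]
expansion_coefs = [1, 6, 6, 6, 6, 6, 6]    # MBConv1 for stage 2, MBConv6 elsewhere
repeats         = [1, 2, 2, 3, 3, 4, 1]    # number of blocks per stage
strides         = [1, 2, 2, 2, 1, 2, 1]    # stride of the first block of each stage
kernel_sizes    = [3, 3, 5, 3, 5, 5, 3]

model = EfficientNet_B0(channels, expansion_coefs, repeats, strides, kernel_sizes,
                        d_coef=1.0, w_coef=1.0, r_coef=1.0,
                        dropout_rate=0.2, include_top=True)
model.summary()
```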
### 2) Evaluating
```python
import numpy as np
from keras.applications.imagenet_utils import decode_predictions

# `conv_base` is the EfficientNet model (with pretrained ImageNet weights)
# built earlier, `image` is an RGB image loaded as a numpy array, and
# `center_crop_and_resize` is a preprocessing helper defined in the repository.

# ImageNet mean and std.
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Preprocess the input: crop/resize to the model's input size, then normalize.
image_size = conv_base.input_shape[1]
x = center_crop_and_resize(image, image_size=image_size)
x /= 255
x = (x - mean) / std
x = np.expand_dims(x, 0)

# Predict and decode the top ImageNet classes.
y = conv_base.predict(x)
decode_predictions(y)
```
![panda]
```
[[('n02510455', 'giant_panda', 0.7587868),
('n02134084', 'ice_bear', 0.008354766),
('n02132136', 'brown_bear', 0.0072072325),
('n02509815', 'lesser_panda', 0.0041302308),
('n02120079', 'Arctic_fox', 0.004021083)]]
```
[paper]:https://arxiv.org/pdf/1905.11946.pdf
[repository]: https://github.com/3outeille/Research-Paper-Summary/tree/master/src/architecture/efficientnet/tensorflow_2
[panda]: https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/efficientnet/8.png?token=AMAXSKK366AA2P4J65ECMES6WMI4G