This post is divided into two sections: Summary and Implementation.
We are going to have an in-depth review of the paper EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, which introduces the EfficientNet architecture.
The implementation uses Keras as the framework. To see the full implementation, please refer to this repository. Also, if you want to read other "Summary and Implementation" posts, feel free to check them out at my blog.
Depth: Deeper ConvNets capture more complex features and generalize better. However, they are more difficult to train due to the vanishing gradient problem. Although techniques such as skip connections and batch normalization alleviate the training problem, the accuracy gain diminishes for very deep networks.
Width: Wider networks tend to capture more fine-grained features and are easier to train. However, the accuracy of such networks tends to saturate quickly.
Resolution: With higher resolution input images, ConvNets can potentially capture more fine-grained patterns. However, for very high resolutions, the accuracy gain diminishes.
We can then think about scaling multiple dimensions at once. It is possible to scale two or three dimensions arbitrarily, but this requires tedious manual tuning and often yields sub-optimal accuracy and efficiency.
In this paper, the authors address the following question:
"Is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency?"
Their empirical study shows that it is critical to balance all network dimensions (width/depth/resolution) at the same time.
Such balance can be achieved by scaling each of them by a constant ratio.
This method is called the "compound scaling method": it uniformly scales the network width, depth and resolution with a set of fixed scaling coefficients.
The intuition comes from the following fact:
If the input image is bigger (resolution), then it contains more complex features and fine-grained patterns. To capture more complex features, the network needs a bigger receptive field, which is achieved by adding more layers (depth). To capture more fine-grained patterns, the network needs more channels (width).
The compound scaling method formalizes this intuition. It uses a compound coefficient $\phi$ to uniformly scale depth, width and resolution:

$$\text{depth: } d = \alpha^{\phi}, \qquad \text{width: } w = \beta^{\phi}, \qquad \text{resolution: } r = \gamma^{\phi}$$
$$\text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \qquad \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1$$

where α, β and γ are constants determined by a small grid search on the baseline network, and φ is a user-specified coefficient that controls how many more resources are available for model scaling. Since the FLOPS of a ConvNet grow roughly with $d \cdot w^{2} \cdot r^{2}$, the constraint $\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$ means that the total FLOPS increase by approximately $2^{\phi}$.
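As a concrete illustration (this is not code from the repository), the grid-search values reported in the paper for the B0 baseline are α = 1.2, β = 1.1, γ = 1.15, and the multipliers for a given φ can be computed as below; compound_scaling is a hypothetical helper used only for this example.

def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    # Depth, width and resolution multipliers for a given compound coefficient phi.
    d = alpha ** phi   # depth multiplier (number of layers)
    w = beta ** phi    # width multiplier (number of channels)
    r = gamma ** phi   # resolution multiplier (input image size)
    return d, w, r

# phi = 1 roughly doubles the FLOPS of the baseline, since alpha * beta**2 * gamma**2 ≈ 2.
print(compound_scaling(1))  # (1.2, 1.1, 1.15)
print(compound_scaling(2))  # (1.44, ~1.21, ~1.32)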
Here is the EfficientNet-B0 architecture:
from keras.layers import Input, GlobalAveragePooling2D, Dropout, Dense
from keras.models import Model

# ConvBlock, MBConvBlock, scaled_channels, scaled_repeats and
# DENSE_KERNEL_INITIALIZER are helpers defined in the repository.
def EfficientNet_B0(channels,
                    expansion_coefs,
                    repeats,
                    strides,
                    kernel_sizes,
                    d_coef,
                    w_coef,
                    r_coef,
                    dropout_rate,
                    include_top,
                    se_ratio=0.25,
                    classes=1000):
    # Stage 1: stem convolution (the input resolution is fixed to B0's 224x224 here).
    inputs = Input(shape=(224, 224, 3))
    stage1 = ConvBlock(inputs,
                       filters=32,
                       kernel_size=3,
                       stride=2)
    # Stages 2-8: MBConv blocks. Channels are scaled by the width coefficient,
    # the number of repeats by the depth coefficient.
    stage2 = MBConvBlock(stage1,
                         scaled_channels(channels[0], w_coef),
                         scaled_channels(channels[1], w_coef),
                         kernel_sizes[0],
                         expansion_coefs[0],
                         se_ratio,
                         strides[0],
                         scaled_repeats(repeats[0], d_coef),
                         dropout_rate=dropout_rate)
    stage3 = MBConvBlock(stage2,
                         scaled_channels(channels[1], w_coef),
                         scaled_channels(channels[2], w_coef),
                         kernel_sizes[1],
                         expansion_coefs[1],
                         se_ratio,
                         strides[1],
                         scaled_repeats(repeats[1], d_coef),
                         dropout_rate=dropout_rate)
    stage4 = MBConvBlock(stage3,
                         scaled_channels(channels[2], w_coef),
                         scaled_channels(channels[3], w_coef),
                         kernel_sizes[2],
                         expansion_coefs[2],
                         se_ratio,
                         strides[2],
                         scaled_repeats(repeats[2], d_coef),
                         dropout_rate=dropout_rate)
    stage5 = MBConvBlock(stage4,
                         scaled_channels(channels[3], w_coef),
                         scaled_channels(channels[4], w_coef),
                         kernel_sizes[3],
                         expansion_coefs[3],
                         se_ratio,
                         strides[3],
                         scaled_repeats(repeats[3], d_coef),
                         dropout_rate=dropout_rate)
    stage6 = MBConvBlock(stage5,
                         scaled_channels(channels[4], w_coef),
                         scaled_channels(channels[5], w_coef),
                         kernel_sizes[4],
                         expansion_coefs[4],
                         se_ratio,
                         strides[4],
                         scaled_repeats(repeats[4], d_coef),
                         dropout_rate=dropout_rate)
    stage7 = MBConvBlock(stage6,
                         scaled_channels(channels[5], w_coef),
                         scaled_channels(channels[6], w_coef),
                         kernel_sizes[5],
                         expansion_coefs[5],
                         se_ratio,
                         strides[5],
                         scaled_repeats(repeats[5], d_coef),
                         dropout_rate=dropout_rate)
    stage8 = MBConvBlock(stage7,
                         scaled_channels(channels[6], w_coef),
                         scaled_channels(channels[7], w_coef),
                         kernel_sizes[6],
                         expansion_coefs[6],
                         se_ratio,
                         strides[6],
                         scaled_repeats(repeats[6], d_coef),
                         dropout_rate=dropout_rate)
    # Stage 9: head (1x1 convolution, then pooling, dropout and classifier).
    stage9 = ConvBlock(stage8,
                       filters=scaled_channels(channels[8], w_coef),
                       kernel_size=1,
                       padding='same')
    if include_top:
        stage9 = GlobalAveragePooling2D()(stage9)
        if dropout_rate > 0:
            stage9 = Dropout(dropout_rate)(stage9)
        stage9 = Dense(classes,
                       activation='softmax',
                       kernel_initializer=DENSE_KERNEL_INITIALIZER)(stage9)
    model = Model(inputs, stage9)
    return model
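The scaled_channels and scaled_repeats helpers come from the repository. A minimal sketch of what they might look like, assuming the rounding behaviour of the reference EfficientNet implementation (channels rounded to a multiple of 8, repeats rounded up), is the following; the actual code in the repository may differ.

import math

def scaled_channels(filters, w_coef, divisor=8):
    # Scale the channel count by the width coefficient and round to a multiple of `divisor`.
    filters = filters * w_coef
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    # Ensure rounding down does not reduce the channel count by more than 10%.
    if new_filters < 0.9 * filters:
        new_filters += divisor
    return int(new_filters)

def scaled_repeats(repeats, d_coef):
    # Scale the number of block repeats by the depth coefficient and round up.
    return int(math.ceil(d_coef * repeats))

Given the function signature above, a plausible way to build the baseline model from the B0 stage configuration reported in the paper (32-channel stem, MBConv output channels from 16 to 320, 1280-channel head) would be the call below; the exact argument values expected by the repository may differ.

model = EfficientNet_B0(channels=[32, 16, 24, 40, 80, 112, 192, 320, 1280],
                        expansion_coefs=[1, 6, 6, 6, 6, 6, 6],
                        repeats=[1, 2, 2, 3, 3, 4, 1],
                        strides=[1, 2, 2, 2, 1, 2, 1],
                        kernel_sizes=[3, 3, 5, 3, 5, 5, 3],
                        d_coef=1.0,
                        w_coef=1.0,
                        r_coef=1.0,
                        dropout_rate=0.2,
                        include_top=True)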
import numpy as np
from keras.applications.imagenet_utils import decode_predictions

# center_crop_and_resize crops the image to a square and resizes it to image_size;
# conv_base is the EfficientNet model built above and `image` is the input image array.

# ImageNet mean and std.
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Preprocess the input: crop/resize to the model resolution, scale to [0, 1],
# normalize with the ImageNet statistics and add a batch dimension.
image_size = conv_base.input_shape[1]
x = center_crop_and_resize(image, image_size=image_size)
x = x / 255.
x = (x - mean) / std
x = np.expand_dims(x, 0)

# Predict and decode the top-5 ImageNet classes.
y = conv_base.predict(x)
decode_predictions(y)
[[('n02510455', 'giant_panda', 0.7587868),
('n02134084', 'ice_bear', 0.008354766),
('n02132136', 'brown_bear', 0.0072072325),
('n02509815', 'lesser_panda', 0.0041302308),
('n02120079', 'Arctic_fox', 0.004021083)]]