
EfficientNet: Summary and Implementation


This post is divided into 2 sections: Summary and Implementation.

We are going to have an in-depth review of the paper EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, which introduces the EfficientNet architecture.

The implementation uses Keras as its framework. For the full implementation,
please refer to this repository.

Also, if you want to read other "Summary and Implementation" posts, feel free to
check them out on my blog.

I) Summary

  • The paper EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks introduces a new, principled method to scale up ConvNets.
  • To get better accuracy, a ConvNet needs a careful balance between depth, width, and resolution. However, the process of scaling up ConvNets has never been well understood.
  • The most common way was to scale ConvNets up by their depth or width.
  • Another, less common way was to scale models up by image resolution. In all cases, only one dimension was scaled at a time:
    • Depth: Deeper ConvNets capture more complex features and generalize well. However, they are more difficult to train due to vanishing gradients. Although techniques such as skip connections and batch normalization alleviate the training problem, the accuracy gain diminishes for very deep networks.

    • Width: Wider networks tend to capture more fine-grained features and are easier to train. However, accuracy for such networks tends to saturate quickly.

    • Resolution: With higher-resolution input images, ConvNets can potentially capture more fine-grained patterns. However, for very high resolutions, the accuracy gain diminishes.

[Figure: ImageNet accuracy saturates when scaling up width, depth, or resolution alone.]


  • We can then think about scaling multiple dimensions at a time. It is possible to scale two or three dimensions arbitrarily, but this requires tedious manual tuning and often yields sub-optimal accuracy and efficiency.

  • In this paper, the authors address the following question:
    "Is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency?"

  • Their empirical study shows that it is critical to balance all dimensions of the network (width/depth/resolution) at the same time.

  • Such balance can be achieved by scaling each of them by a constant ratio.

  • This method is called the "compound scaling method": it uniformly scales the network width, depth, and resolution with a set of fixed scaling coefficients.

  • The intuition comes from the following fact:

    • If the input image is bigger (resolution), then there are more complex features and fine-grained patterns to capture. To capture more complex features, the network needs a bigger receptive field, which is achieved by adding more layers (depth). To capture more fine-grained patterns, the network needs more channels (width).

    • The following figure validates this intuition.

[Figure: scaling network width for different baseline depths and resolutions; deeper and higher-resolution baselines reach better accuracy for the same FLOPS.]

  • The "compound scaling method" uses a compound coefficient
    ϕ
    to uniformly scales network width, depth and resolution in a principled way:
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

where

α,
β
,
γ
are constants that can be determined by a small grid search.

  • In this paper, the authors add the constraint α · β² · γ² ≈ 2, such that for any new ϕ, the total FLOPS will increase by approximately 2^ϕ.
  • To use this method, a baseline model is needed; here it is called "EfficientNet-B0". It was designed by neural architecture search (NAS).
  • We then apply the compound scaling method to scale it up in 2 steps (a short numeric sketch follows this list):
    • Step 1: Fix ϕ = 1 and run a small grid search for α, β, γ under the constraint α · β² · γ² ≈ 2. For EfficientNet-B0, the best values found are α = 1.2, β = 1.1, γ = 1.15.
    • Step 2: Fix α, β, γ as constants and scale up the baseline network with different values of ϕ, obtaining EfficientNet-B1 to B7.
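To make the scaling rule concrete, here is a small illustrative Python snippet (not from the repository): it computes the depth, width, and resolution multipliers for a given ϕ, using the α, β, γ values reported in the paper, and checks that FLOPS grow by roughly 2^ϕ.

# Compound scaling: derive depth/width/resolution multipliers from phi.
# alpha, beta, gamma are the grid-search results reported for EfficientNet-B0.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scaling(phi):
    d = alpha ** phi   # depth multiplier (more layers)
    w = beta ** phi    # width multiplier (more channels)
    r = gamma ** phi   # resolution multiplier (bigger input images)
    return d, w, r

# FLOPS scale roughly with d * w^2 * r^2, i.e. by
# (alpha * beta^2 * gamma^2)^phi ≈ 2^phi.
for phi in range(4):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: d={d:.2f}, w={w:.2f}, r={r:.2f}, "
          f"FLOPS x{d * w**2 * r**2:.2f} (~{2**phi})")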

Here is the EfficientNet-B0 architecture:

| Stage | Operator | Resolution | #Channels | #Layers |
|-------|----------|------------|-----------|---------|
| 1 | Conv3x3 | 224×224 | 32 | 1 |
| 2 | MBConv1, k3x3 | 112×112 | 16 | 1 |
| 3 | MBConv6, k3x3 | 112×112 | 24 | 2 |
| 4 | MBConv6, k5x5 | 56×56 | 40 | 2 |
| 5 | MBConv6, k3x3 | 28×28 | 80 | 3 |
| 6 | MBConv6, k5x5 | 14×14 | 112 | 3 |
| 7 | MBConv6, k5x5 | 14×14 | 192 | 4 |
| 8 | MBConv6, k3x3 | 7×7 | 320 | 1 |
| 9 | Conv1x1 & Pooling & FC | 7×7 | 1280 | 1 |


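The MBConv operator listed in the table above is the mobile inverted bottleneck block from MobileNetV2, extended with squeeze-and-excitation (the number after "MBConv" is the expansion factor). As a rough illustration, here is a minimal single-block Keras sketch; the repository's MBConvBlock additionally handles block repeats and dropout, and the layer choices below are my own simplification.

from keras import layers

def mbconv(x, out_channels, kernel_size, expansion, se_ratio, stride):
    in_channels = int(x.shape[-1])
    expanded = in_channels * expansion
    shortcut = x

    # 1) Expansion phase: 1x1 conv widens the channels.
    if expansion != 1:
        x = layers.Conv2D(expanded, 1, padding='same', use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('swish')(x)  # swish, as in the paper

    # 2) Depthwise convolution.
    x = layers.DepthwiseConv2D(kernel_size, strides=stride,
                               padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('swish')(x)

    # 3) Squeeze-and-excitation: channel-wise attention.
    se_channels = max(1, int(in_channels * se_ratio))
    se = layers.GlobalAveragePooling2D()(x)
    se = layers.Reshape((1, 1, expanded))(se)
    se = layers.Conv2D(se_channels, 1, activation='swish', padding='same')(se)
    se = layers.Conv2D(expanded, 1, activation='sigmoid', padding='same')(se)
    x = layers.Multiply()([x, se])

    # 4) Projection phase: 1x1 conv back down, no activation (linear bottleneck).
    x = layers.Conv2D(out_channels, 1, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)

    # Residual connection when the block preserves the shape.
    if stride == 1 and in_channels == out_channels:
        x = layers.Add()([x, shortcut])
    return x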

II) Implementation

1) Architecture build

from keras.layers import Input, GlobalAveragePooling2D, Dropout, Dense
from keras.models import Model

# ConvBlock, MBConvBlock, scaled_channels, scaled_repeats and
# DENSE_KERNEL_INITIALIZER are defined elsewhere in the repository.
def EfficientNet_B0(channels,
                    expansion_coefs,
                    repeats,
                    strides,
                    kernel_sizes,
                    d_coef,
                    w_coef,
                    r_coef,
                    dropout_rate,
                    include_top,
                    se_ratio=0.25,
                    classes=1000):

    # Note: the input resolution is fixed to 224x224 here; r_coef would
    # scale it for the larger variants (B1 to B7).
    inputs = Input(shape=(224, 224, 3))

    # Stage 1: stem convolution.
    stage1 = ConvBlock(inputs,
                       filters=32,
                       kernel_size=3,
                       stride=2)

    # Stages 2 to 8: MBConv blocks; channel counts are scaled by w_coef
    # and the number of repeats by d_coef.
    stage2 = MBConvBlock(stage1,
                         scaled_channels(channels[0], w_coef),
                         scaled_channels(channels[1], w_coef),
                         kernel_sizes[0],
                         expansion_coefs[0],
                         se_ratio,
                         strides[0],
                         scaled_repeats(repeats[0], d_coef),
                         dropout_rate=dropout_rate)
    
    stage3 = MBConvBlock(stage2, 
                         scaled_channels(channels[1], w_coef),
                         scaled_channels(channels[2], w_coef),
                         kernel_sizes[1],
                         expansion_coefs[1],
                         se_ratio,
                         strides[1],
                         scaled_repeats(repeats[1], d_coef),
                         dropout_rate=dropout_rate)
    
    stage4 = MBConvBlock(stage3, 
                         scaled_channels(channels[2], w_coef),
                         scaled_channels(channels[3], w_coef),
                         kernel_sizes[2],
                         expansion_coefs[2],
                         se_ratio,
                         strides[2],
                         scaled_repeats(repeats[2], d_coef),
                         dropout_rate=dropout_rate)
    
    stage5 = MBConvBlock(stage4, 
                         scaled_channels(channels[3], w_coef),
                         scaled_channels(channels[4], w_coef),
                         kernel_sizes[3],
                         expansion_coefs[3],
                         se_ratio,
                         strides[3],
                         scaled_repeats(repeats[3], d_coef),
                         dropout_rate=dropout_rate)

    stage6 = MBConvBlock(stage5, 
                         scaled_channels(channels[4], w_coef),
                         scaled_channels(channels[5], w_coef),
                         kernel_sizes[4],
                         expansion_coefs[4],
                         se_ratio,
                         strides[4],
                         scaled_repeats(repeats[4], d_coef),
                         dropout_rate=dropout_rate)
    
    stage7 = MBConvBlock(stage6, 
                         scaled_channels(channels[5], w_coef),
                         scaled_channels(channels[6], w_coef),
                         kernel_sizes[5],
                         expansion_coefs[5],
                         se_ratio,
                         strides[5],
                         scaled_repeats(repeats[5], d_coef),
                         dropout_rate=dropout_rate)
    
    stage8 = MBConvBlock(stage7, 
                         scaled_channels(channels[6], w_coef),
                         scaled_channels(channels[7], w_coef),
                         kernel_sizes[6],
                         expansion_coefs[6],
                         se_ratio,
                         strides[6],
                         scaled_repeats(repeats[6], d_coef),
                         dropout_rate=dropout_rate)
       
    # Stage 9: head (1x1 convolution).
    stage9 = ConvBlock(stage8,
                       filters=scaled_channels(channels[8], w_coef),
                       kernel_size=1,
                       padding='same')
    
    if include_top:
        # Classification top: global pooling, dropout, softmax classifier.
        stage9 = GlobalAveragePooling2D()(stage9)

        if dropout_rate > 0:
            stage9 = Dropout(dropout_rate)(stage9)

        stage9 = Dense(classes, 
                       activation='softmax',
                       kernel_initializer=DENSE_KERNEL_INITIALIZER)(stage9)

    model = Model(inputs, stage9)

    return model
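
The two scaling helpers used above are defined in the repository. Here is a plausible sketch of what they could look like, following the rounding scheme of the official EfficientNet implementation (channel counts rounded to a multiple of 8, repeats rounded up); the repository's versions may differ in detail.

import math

def scaled_channels(channels, w_coef, divisor=8):
    # Scale the channel count by the width coefficient, then round to the
    # nearest multiple of `divisor` without dropping below 90% of the target.
    channels *= w_coef
    new_channels = max(divisor, int(channels + divisor / 2) // divisor * divisor)
    if new_channels < 0.9 * channels:
        new_channels += divisor
    return int(new_channels)

def scaled_repeats(repeats, d_coef):
    # Scale the number of block repeats by the depth coefficient, rounding up.
    return int(math.ceil(d_coef * repeats))

EfficientNet-B0 itself can then be built by passing in the per-stage parameters; the lists below are my reading of the architecture table above, with all three coefficients set to 1.0 (ϕ = 0). The result is used as conv_base in the evaluation below.

conv_base = EfficientNet_B0(channels=[32, 16, 24, 40, 80, 112, 192, 320, 1280],
                            expansion_coefs=[1, 6, 6, 6, 6, 6, 6],
                            repeats=[1, 2, 2, 3, 3, 4, 1],
                            strides=[1, 2, 2, 2, 1, 2, 1],
                            kernel_sizes=[3, 3, 5, 3, 5, 5, 3],
                            d_coef=1.0,
                            w_coef=1.0,
                            r_coef=1.0,
                            dropout_rate=0.2,
                            include_top=True)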

2) Evaluating

import numpy as np
from keras.applications.imagenet_utils import decode_predictions
# center_crop_and_resize is assumed to come from the repository's utilities;
# an equivalent is provided by the qubvel/efficientnet package.
from efficientnet.preprocessing import center_crop_and_resize

# ImageNet mean and std.
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Preprocess the input: `image` is an RGB array, `conv_base` the model built above.
image_size = conv_base.input_shape[1]
x = center_crop_and_resize(image, image_size=image_size)
x = x / 255.
x = (x - mean) / std
x = np.expand_dims(x, 0)

# Predict and decode the top-5 ImageNet classes.
y = conv_base.predict(x)
decode_predictions(y)

[Input image: a giant panda]

[[('n02510455', 'giant_panda', 0.7587868),
  ('n02134084', 'ice_bear', 0.008354766),
  ('n02132136', 'brown_bear', 0.0072072325),
  ('n02509815', 'lesser_panda', 0.0041302308),
  ('n02120079', 'Arctic_fox', 0.004021083)]]