EfficientNetV2: EfficientNet Strikes Again
EfficientNetV2: Smaller Models and Faster Training
1. Introduction
The paper introduces EfficientNetV2, an improved family of EfficientNets that trains faster while achieving higher accuracy and better parameter efficiency. Alongside the new architecture, the authors use a progressive learning approach with adaptive regularization to train it. In this article, we discuss the research paper in detail.
2. Architecture Design
The paper argues that the large image sizes used by EfficientNets cause significant memory usage, forcing smaller batch sizes and thereby increasing training time. To avoid this issue, they restrict the maximum image size to 480. They also identify depthwise convolutions as a training bottleneck in EfficientNets: depthwise convolutions are slow in the early layers, though effective in the later ones. To address this, they use Fused-MBConv blocks, which replace the 1x1 expansion convolution and the depthwise 3x3 convolution with a single regular 3x3 convolution. Since the best placement of the two block types is not obvious, they use neural architecture search to obtain an optimal architecture containing a mixture of MBConv and Fused-MBConv blocks.
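To make the block difference concrete, here is a minimal PyTorch-style sketch of the two block types (PyTorch is our assumption; the squeeze-and-excitation module and stochastic depth used in the actual blocks are omitted for brevity):

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Inverted residual block: 1x1 expansion -> depthwise 3x3 -> 1x1 projection."""
    def __init__(self, in_ch, out_ch, expansion=4, stride=1):
        super().__init__()
        mid = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),                        # 1x1 expansion
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),   # depthwise 3x3
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False),                       # 1x1 projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

class FusedMBConv(nn.Module):
    """Fused block: the 1x1 expansion and depthwise 3x3 become one regular 3x3 conv."""
    def __init__(self, in_ch, out_ch, expansion=4, stride=1):
        super().__init__()
        mid = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, stride, 1, bias=False),             # single regular 3x3 conv
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False),                       # 1x1 projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

if __name__ == "__main__":
    x = torch.randn(1, 24, 64, 64)
    print(MBConv(24, 24)(x).shape, FusedMBConv(24, 48, stride=2)(x).shape)
```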
Scaling up every stage equally, as the original EfficientNet does, leads to sub-optimal results. The paper instead uses a non-uniform scaling strategy that gradually adds more layers to the later stages: later stages capture more complex structures and thus benefit from extra layers, while adding layers there contributes less runtime overhead. The NAS aims to jointly optimize accuracy, parameter efficiency, and training efficiency, using EfficientNet-B4 as the backbone. Per stage, it searches over the convolutional block type {MBConv, Fused-MBConv}, the number of layers, the kernel size {3x3, 5x5}, and the expansion ratio {1, 4, 6}. This search space is much smaller than the one used for the original EfficientNet, which makes it feasible to run NAS on a larger baseline such as EfficientNet-B4. The resulting EfficientNetV2 models use a mixture of Fused-MBConv and MBConv, with Fused-MBConv in the early layers. EfficientNetV2 prefers smaller expansion ratios and smaller 3x3 kernels, adding more layers to compensate for the reduced receptive field that results from the smaller kernel size.
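To illustrate how compact this per-stage search space is, the sketch below simply samples random candidate configurations from the choices listed above. It is only an illustration: the actual search is reward-guided rather than random, and the layer-count range here is our assumption.

```python
import random

# Per-stage choices described above; the layer-count range is illustrative.
SEARCH_SPACE = {
    "block_type": ["MBConv", "Fused-MBConv"],
    "num_layers": [1, 2, 3, 4, 5, 6],
    "kernel_size": [3, 5],
    "expansion_ratio": [1, 4, 6],
}

def sample_stage():
    """Sample one candidate configuration for a single network stage."""
    return {name: random.choice(choices) for name, choices in SEARCH_SPACE.items()}

def sample_candidate(num_stages=6):
    """Sample a full candidate network as a list of per-stage configurations."""
    return [sample_stage() for _ in range(num_stages)]

if __name__ == "__main__":
    for i, stage in enumerate(sample_candidate(), start=1):
        print(f"stage {i}: {stage}")
```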
3. Progressive Learning
In progressive learning, the model is trained in a few phases, each with a different image size. However, previous uses of progressive learning caused a drop in accuracy, which the paper attributes to unbalanced regularization: larger models and larger image sizes call for stronger regularization to combat overfitting, while smaller image sizes imply smaller effective capacity and thus need weaker regularization. Training is divided into M stages: for each stage 1 ≤ i ≤ M, the model is trained with image size S_i and regularization magnitude Φ_i = {φ_i^k}, where k indexes the type of regularization. The last stage M uses the target image size S_e and regularization Φ_e; the values for the intermediate stages are obtained by linear interpolation.
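A minimal sketch of such a training schedule, assuming simple linear interpolation of both the image size and the regularization magnitudes between the first and last stage (the concrete numbers below are illustrative, not the paper's settings):

```python
def progressive_schedule(num_stages, size_start, size_end, reg_start, reg_end):
    """Linearly interpolate image size and regularization magnitudes over training stages.

    reg_start / reg_end map each regularization type (dropout, RandAugment, mixup, ...)
    to its magnitude in the first / last stage.
    """
    schedule = []
    for i in range(num_stages):
        t = i / (num_stages - 1) if num_stages > 1 else 1.0
        image_size = int(round(size_start + t * (size_end - size_start)))
        regs = {k: reg_start[k] + t * (reg_end[k] - reg_start[k]) for k in reg_start}
        schedule.append({"image_size": image_size, **regs})
    return schedule

# Illustrative values only, not the paper's exact settings.
for stage in progressive_schedule(
        num_stages=4,
        size_start=128, size_end=300,
        reg_start={"dropout": 0.10, "randaugment": 5, "mixup": 0.0},
        reg_end={"dropout": 0.30, "randaugment": 15, "mixup": 0.2}):
    print(stage)
```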
They use dropout, which reduces co-adaptation by randomly dropping channels, and adjust the dropout rate γ accordingly. RandAugment is used with an adjustable magnitude. Mixup, a cross-image data augmentation technique, is also used: given two images with labels (x_i, y_i) and (x_j, y_j), it combines them with mixup ratio λ as x̃_i = λ·x_j + (1 − λ)·x_i and ỹ_i = λ·y_j + (1 − λ)·y_i.
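For reference, here is a short PyTorch-style sketch of mixup on a batch, following the formula above. It assumes the labels are one-hot or soft label vectors so they can be blended the same way as the images; the Beta-distribution parameter alpha is illustrative.

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    """Apply mixup to a batch: blend each example with a shuffled partner.

    x: image tensor of shape (B, C, H, W)
    y: label tensor of shape (B, num_classes), one-hot or soft labels
    """
    lam = float(np.random.beta(alpha, alpha))   # mixup ratio λ
    perm = torch.randperm(x.size(0))            # pick a partner j for every example i
    x_mixed = lam * x[perm] + (1 - lam) * x     # x̃_i = λ·x_j + (1 − λ)·x_i
    y_mixed = lam * y[perm] + (1 - lam) * y     # ỹ_i = λ·y_j + (1 − λ)·y_i
    return x_mixed, y_mixed
```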
4. Conclusion
The family of models introduced in this paper significantly outperforms EfficientNets while training with smaller image sizes. Along with EfficientNetV2-S, three larger models, EfficientNetV2-M/L/XL, were introduced by scaling up EfficientNetV2-S.
5. References
[1] Mingxing Tan and Quoc V. Le. EfficientNetV2: Smaller Models and Faster Training. ICML, 2021.