ResNets, or residual networks, introduced the concept of the residual connection. This can be understood by looking at a small residual network of three stages. The striking difference between ResNets and earlier architectures is the skip connection: a shortcut connection that skips one or more layers. The shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers. Identity shortcut connections add neither extra parameters nor computational complexity, and the entire network can still be trained end-to-end by SGD with backpropagation.

[Figure: 34-layer-deep ResNet architecture vs. earlier architectures]

Performance advantages of ResNets

We now show that the performance advantages of ensemble learning are present in residual networks. To do so, we unroll the residual architecture, using a simple 3-block network in which each layer consists of a residual module $f_i$ and a skip connection bypassing $f_i$. Since layers in residual networks can comprise multiple convolutional layers, we refer to them as residual blocks. With $y_{i-1}$ as its input, the output of the $i$-th block is recursively defined as

$$y_i = f_i(y_{i-1}) + y_{i-1},$$

where $f_i(x)$ is some sequence of convolutions, batch normalization, and Rectified Linear Units (ReLU) as nonlinearities. In the figure above we have three blocks. Each $f_i(x)$ is defined by

$$f_i(x) = W_i^{(1)} * \mathrm{ReLU}(B(W_i^{(2)} * \mathrm{ReLU}(B(x)))),$$

where $W_i^{(1)}$ and $W_i^{(2)}$ are weight matrices, $*$ denotes convolution, $B(x)$ is batch normalization, and $\mathrm{ReLU}(x) \equiv \max(x, 0)$. Other formulations are typically composed of the same operations but may differ in their order. This paper analyses the unrolled network, and from that analysis we conclude three main advantages of the architecture:
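The recursion $y_i = f_i(y_{i-1}) + y_{i-1}$ can be sketched in a few lines of NumPy. Here `residual_block` is an illustrative name, and a single weight matrix with ReLU stands in for the full conv/BN/ReLU stack of $f_i$; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(x, 0), applied elementwise
    return np.maximum(x, 0.0)

def residual_block(x, W):
    # y_i = f_i(y_{i-1}) + y_{i-1}, with a toy f_i(x) = ReLU(W x)
    # standing in for the conv/BN/ReLU stack of a real block.
    # The identity shortcut (`+ x`) adds no parameters.
    return relu(W @ x) + x

rng = np.random.default_rng(0)
dim = 4
# one weight matrix per block; three blocks, as in the figure
weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]

y = rng.standard_normal(dim)   # y_0: input to the first block
for W in weights:
    y = residual_block(y, W)
print(y.shape)  # (4,)
```

Note that if $f_i$ outputs all zeros, the block reduces exactly to the identity map — the property the unrolled-view analysis below relies on.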
  1. We see many diverse paths for the gradient as it flows from the output $\hat y$ to the trainable parameter tensors of each layer.
  2. We see elements of ensemble learning in the formation of the hypothesis $\hat y$.
  3. We can eliminate layers (blocks) from the architecture without having to redimension the network, allowing us to trade performance for latency during inference.
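The third point follows directly from the recursion: dropping a block just replaces $y_i = f_i(y_{i-1}) + y_{i-1}$ with $y_i = y_{i-1}$, so input and output dimensions are unaffected. A minimal sketch, again using a toy $f_i$ and hypothetical names (`forward`, `keep`) rather than any real ResNet API:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights, keep=None):
    # Run the residual net, optionally skipping blocks: a dropped block
    # reduces to the identity, y_i = y_{i-1}, so no reshaping is needed.
    keep = range(len(weights)) if keep is None else keep
    y = x
    for i, W in enumerate(weights):
        if i in keep:
            y = relu(W @ y) + y  # toy f_i: ReLU(W y), plus identity shortcut
    return y

rng = np.random.default_rng(1)
dim = 4
weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
x = rng.standard_normal(dim)

full = forward(x, weights)                 # all three blocks
pruned = forward(x, weights, keep={0, 2})  # middle block removed at inference
print(full.shape, pruned.shape)            # same dimensionality either way
```

In a plain feed-forward network this surgery would break the shape contract between consecutive layers; here the identity shortcut makes it a no-op structurally.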
ResNets are a de facto choice as featurization/backbone networks and are also extensively used in real-time applications. Key references: (Zagoruyko & Komodakis, 2016; He et al., 2015; Szegedy et al., 2016; Veit et al., 2016; Xie et al., 2016)

References

  • He, K., Zhang, X., Ren, S., Sun, J. (2015). Deep Residual Learning for Image Recognition.
  • Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A. (2016). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning.
  • Veit, A., Wilber, M., Belongie, S. (2016). Residual Networks Behave Like Ensembles of Relatively Shallow Networks.
  • Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K. (2016). Aggregated Residual Transformations for Deep Neural Networks.
  • Zagoruyko, S., Komodakis, N. (2016). Wide Residual Networks.