ResNets, or residual networks, introduced the concept of the residual connection. This can be understood by looking at a small residual network of three stages. The striking difference between ResNets and earlier architectures is the skip connection: a shortcut connection that skips one or more layers. The shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers. Identity shortcut connections add neither extra parameters nor computational complexity, and the entire network can still be trained end-to-end by SGD with backpropagation.

[Figure: 34-layer-deep ResNet architecture vs. earlier architectures]

Performance advantages of ResNets

We now show that the performance advantages of ensemble learning are present in residual networks. To do so, we unroll the residual architecture, using a simple 3-block network in which each layer consists of a residual module $f_i$ and a skip connection bypassing $f_i$. Since layers in residual networks can comprise multiple convolutional layers, we refer to them as residual blocks. With $y_{i-1}$ as its input, the output of the $i$-th block is recursively defined as

$$y_i = f_i(y_{i-1}) + y_{i-1},$$

where $f_i(x)$ is some sequence of convolutions, batch normalization, and Rectified Linear Units (ReLU) as nonlinearities. In the figure above we have three blocks. Each $f_i(x)$ is defined by

$$f_i(x) = W_i^{(1)} * \mathrm{ReLU}(B(W_i^{(2)} * \mathrm{ReLU}(B(x)))),$$

where $W_i^{(1)}$ and $W_i^{(2)}$ are weight matrices, $*$ denotes convolution, $B(x)$ is batch normalization, and $\mathrm{ReLU}(x) \equiv \max(x, 0)$. Other formulations are typically composed of the same operations but may differ in their order. This paper analyses the unrolled network, and from that analysis we conclude three main advantages of the architecture:
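The recursion $y_i = f_i(y_{i-1}) + y_{i-1}$ can be sketched in a few lines of NumPy. Here `residual_block` is an illustrative name, and a single weight matrix with ReLU stands in for the full conv/BN/ReLU stack of $f_i$; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(x, 0), applied elementwise
    return np.maximum(x, 0.0)

def residual_block(x, W):
    # y_i = f_i(y_{i-1}) + y_{i-1}, with a toy f_i(x) = ReLU(W x)
    # standing in for the conv/BN/ReLU stack of a real block.
    # The identity shortcut (`+ x`) adds no parameters.
    return relu(W @ x) + x

rng = np.random.default_rng(0)
dim = 4
# one weight matrix per block; three blocks, as in the figure
weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]

y = rng.standard_normal(dim)   # y_0: input to the first block
for W in weights:
    y = residual_block(y, W)
print(y.shape)  # (4,)
```

Note that if $f_i$ outputs all zeros, the block reduces exactly to the identity map — the property the unrolled-view analysis below relies on.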
  1. We see many diverse paths for the gradient as it flows from the output $\hat y$ to the trainable parameter tensors of each layer.
  2. We see elements of ensemble learning in the formation of the hypothesis $\hat y$.
  3. We can eliminate layers (blocks) from the architecture without having to redimension the network, allowing us to trade performance for latency during inference.
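The third point follows directly from the recursion: dropping a block just replaces $y_i = f_i(y_{i-1}) + y_{i-1}$ with $y_i = y_{i-1}$, so input and output dimensions are unaffected. A minimal sketch, again using a toy $f_i$ and hypothetical names (`forward`, `keep`) rather than any real ResNet API:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights, keep=None):
    # Run the residual net, optionally skipping blocks: a dropped block
    # reduces to the identity, y_i = y_{i-1}, so no reshaping is needed.
    keep = range(len(weights)) if keep is None else keep
    y = x
    for i, W in enumerate(weights):
        if i in keep:
            y = relu(W @ y) + y  # toy f_i: ReLU(W y), plus identity shortcut
    return y

rng = np.random.default_rng(1)
dim = 4
weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
x = rng.standard_normal(dim)

full = forward(x, weights)                 # all three blocks
pruned = forward(x, weights, keep={0, 2})  # middle block removed at inference
print(full.shape, pruned.shape)            # same dimensionality either way
```

In a plain feed-forward network this surgery would break the shape contract between consecutive layers; here the identity shortcut makes it a no-op structurally.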
ResNets are a de facto choice as featurization/backbone networks and are also extensively used in real-time applications. Key references: (Zagoruyko & Komodakis, 2016; He et al., 2015; Szegedy et al., 2016; Veit et al., 2016; Xie et al., 2016)

References

  • He, K., Zhang, X., Ren, S., Sun, J. (2015). Deep Residual Learning for Image Recognition.
  • Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A. (2016). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning.
  • Veit, A., Wilber, M., Belongie, S. (2016). Residual Networks Behave Like Ensembles of Relatively Shallow Networks.
  • Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K. (2016). Aggregated Residual Transformations for Deep Neural Networks.
  • Zagoruyko, S., Komodakis, N. (2016). Wide Residual Networks.