In the figure below we plot top-5 classification error as a function of depth for CNN architectures. Notice the large jump in depth, and the corresponding drop in error, that came with the introduction of the ResNet architecture.


The ResNet Architecture


Performance advantages of ResNets
Hinton showed that using dropout, i.e. dropping out individual neurons during training, leads to a network that is equivalent to averaging over an ensemble of exponentially many networks. We now show that similar ensemble-learning performance advantages are present in residual networks. To do so, we use a simple 3-block network where each layer consists of a residual module $f_i$ and a skip connection bypassing $f_i$. Since layers in residual networks can comprise multiple convolutional layers, we refer to them as residual blocks. With $y_0$ as its input, the output of the $i$-th block is recursively defined as

$$y_i = f_i(y_{i-1}) + y_{i-1},$$

where $f_i$ is some sequence of convolutions, batch normalization, and Rectified Linear Units (ReLU) as nonlinearities. In the figure above we have three blocks. Each $f_i$ is defined by

$$f_i(x) \equiv W_i' \cdot \sigma\big(B\big(W_i \cdot \sigma(B(x))\big)\big),$$

where $W_i$ and $W_i'$ are weight matrices, $\cdot$ denotes convolution, $B(x)$ is batch normalization and $\sigma(x) \equiv \max(x, 0)$. Other formulations are typically composed of the same operations, but may differ in their order. This paper analyses the unrolled network, and from that analysis we conclude on three main advantages of the architecture (two code sketches illustrating them follow the list below):

- We see many diverse paths for the gradient as it flows from the loss to the trainable parameter tensors of each layer.
- We see elements of ensemble learning in the formation of the final hypothesis $y_3$: unrolling the recursion shows it is a sum over $2^3 = 8$ distinct paths through the network.
- We can eliminate blocks from the architecture without having to redimension the network, allowing us to trade performance for latency during inference (see the second sketch below).
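To make the recursion concrete, here is a minimal sketch of one such residual block and a 3-block stack in PyTorch. The channel count, kernel size, and class/function names are illustrative assumptions, not taken from the text above:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block: y_i = f_i(y_{i-1}) + y_{i-1}, with
    f_i(x) = W' * relu(BN(W * relu(BN(x)))) as in the formula above.
    Channel count and kernel size are illustrative choices."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def f(self, x):
        # The residual branch f_i: BN -> ReLU -> conv -> BN -> ReLU -> conv.
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out

    def forward(self, y_prev):
        # The skip connection adds the block's input back onto the residual branch.
        return self.f(y_prev) + y_prev

# A simple 3-block network as in the figure: y_3 = block_3(block_2(block_1(y_0))).
blocks = nn.Sequential(ResidualBlock(64), ResidualBlock(64), ResidualBlock(64))
y0 = torch.randn(1, 64, 32, 32)  # dummy input
y3 = blocks(y0)
```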

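The unrolled view and the block-deletion property can be sketched on top of that toy network. The helper below is a hypothetical illustration, assuming the `blocks`, `y0`, and `y3` defined in the previous sketch:

```python
# Unrolling the recursion for three blocks gives a sum over 2^3 = 8 paths,
# each of which either enters or skips every block:
#   y_3 = y_2 + f_3(y_2)
#       = y_1 + f_2(y_1) + f_3(y_1 + f_2(y_1))
#       = y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0)) + f_3(y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0)))
# Gradients flow back along this same collection of paths, and the hypothesis
# behaves like an ensemble over them.

def forward_with_dropped_block(blocks, y, drop_index):
    """Run the network with one residual block removed at inference time.
    Because every block preserves the tensor shape via its skip connection,
    dropping block i simply reduces to y_i = y_{i-1}; no re-dimensioning needed."""
    for i, block in enumerate(blocks):
        if i == drop_index:
            continue  # skip f_i entirely; only the identity path remains
        y = block(y)
    return y

y3_pruned = forward_with_dropped_block(blocks, y0, drop_index=1)
assert y3_pruned.shape == y3.shape  # same output dimensions, lower latency
```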
