Residual neural network

Adding more layers to a deep network often makes it worse, not better: beyond a certain depth, even training accuracy degrades. This is called the degradation problem. Residual blocks address it with skip connections (also called shortcut connections), which bypass a few stacked layers and add the block's input directly to the output of those layers. The shortcut is an identity mapping, so the block usually computes y = F(x) + x, where x is the input and F(x) is the output of the stacked layers. The layers therefore only need to learn the residual F(x), the difference between the desired output and the input.

There are several types of residual block. A basic block stacks two layers. A bottleneck block stacks three layers for efficiency: a 1x1 convolution that reduces the number of channels, a 3x3 convolution, and a 1x1 convolution that restores them. A pre-activation block applies normalization and the activation function before each weight layer rather than after it.

Residual blocks are used in computer vision tasks such as image classification, and they are also standard in modern Transformer models such as GPT-2 and BERT. The idea has precedents in earlier neural network research and in biology. Skip connections also help mitigate the vanishing gradient problem, because they give gradients a direct path around the stacked layers. Because a block can represent the identity mapping simply by driving F(x) toward zero, adding more blocks should not make performance worse. This allows training extremely deep networks, hundreds or even thousands of layers deep.
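The computation "output equals stacked layers plus input" can be illustrated with a minimal sketch in plain Python. It assumes a basic block of two fully connected layers with a ReLU between them, and it omits normalization and biases for brevity. The names relu, linear, and residual_block are illustrative only and do not come from any library.

```python
# Minimal sketch of a basic residual block on plain Python lists.
# Assumptions: two linear layers, ReLU in between, no bias, no normalization.
# All function names are hypothetical, chosen for this illustration.

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Compute F(x) + x, where F is the two stacked layers."""
    fx = linear(w2, relu(linear(w1, x)))   # stacked layers F(x)
    return [f + a for f, a in zip(fx, x)]  # skip connection adds the input

x = [1.0, -2.0, 3.0]

# With all weights at zero, F(x) is zero and the block is exactly the identity.
zeros = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(x, zeros, zeros)
print(identity_out)  # [1.0, -2.0, 3.0]

# With non-zero weights, the block outputs the input plus a learned correction.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
corrected_out = residual_block(x, eye, half)
print(corrected_out)  # [1.5, -2.0, 4.5]
```

The zero-weight case shows why learning the identity is easy for a residual block: the stacked layers only need to output zero, rather than reproduce their input through several nonlinear layers.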