
Resources
Key references: (Tsai et al., 2019; Ramachandran et al., 2017; Chen et al., 2020; Dosovitskiy et al., 2020; Jetley et al., 2018)

References
- Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
- Jetley, S., Lord, N., Lee, N., & Torr, P. (2018). Learn To Pay Attention.
- Ramachandran, P., Zoph, B., & Le, Q. (2017). Searching for Activation Functions.
- Tsai, Y., Bai, S., Yamada, M., Morency, L., & Salakhutdinov, R. (2019). Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel.

