
- During encoding, the output of a bidirectional LSTM encoder provides a contextual representation of each input word. Let the encoder hidden vectors be denoted $h_1, h_2, \ldots, h_n$, where $n$ is the length of the input sentence.
- During decoding, we compute the RNN decoder hidden states using a recursive relationship, $s_t = f(s_{t-1}, y_{t-1})$, where $y_{t-1}$ is the previously generated output token (see the sketch below).

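As a concrete illustration of the two bullets above, here is a minimal PyTorch sketch. The dimensions, variable names, and the use of an `LSTMCell` to stand in for $f$ are assumptions made for illustration, not taken from the original text: a bidirectional LSTM encoder produces one contextual vector $h_j$ per source token, and a decoder cell implements one step of the recursion $s_t = f(s_{t-1}, y_{t-1})$.

```python
import torch
import torch.nn as nn

# Toy dimensions chosen purely for illustration.
vocab, emb_dim, enc_dim = 1000, 32, 64
embed = nn.Embedding(vocab, emb_dim)
encoder = nn.LSTM(emb_dim, enc_dim, batch_first=True, bidirectional=True)

src = torch.randint(0, vocab, (1, 7))        # one source sentence, n = 7 tokens
h, _ = encoder(embed(src))                   # h: (1, 7, 2*enc_dim); h[:, j] is h_j

# One step of the decoder recursion s_t = f(s_{t-1}, y_{t-1}):
# an LSTMCell stands in for f, fed the embedding of the previous output token.
decoder = nn.LSTMCell(emb_dim, 2 * enc_dim)
y_prev = torch.randint(0, vocab, (1,))       # previously generated token y_{t-1}
s_prev = (torch.zeros(1, 2 * enc_dim), torch.zeros(1, 2 * enc_dim))
s_t, c_mem = decoder(embed(y_prev), s_prev)  # s_t: (1, 2*enc_dim)
```
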
- For each hidden state $h_j$ from the source sentence (the key), where $j$ is the sequence index of the encoder hidden state, we compute a score $e_{t,j} = \mathrm{score}(s_t, h_j)$ against the current decoder state $s_t$ (the query).
- The score values are normalized with a softmax layer to produce the attention weight vector $\alpha_t$, where $\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k} \exp(e_{t,k})}$. All the weights for a given decoder time step sum to 1.
- The context vector $c_t = \sum_{j} \alpha_{t,j} h_j$ is then the attention-weighted average of the hidden state vectors (values) from the source sentence, as sketched below.
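
Continuing the sketch, the scoring, softmax normalization, and weighted averaging can be written as follows. A simple dot-product score is assumed here since the text does not name a particular score function, and the encoder states and decoder state are random placeholders.

```python
import torch
import torch.nn.functional as F

n, d = 7, 128
h = torch.randn(1, n, d)     # encoder hidden states h_1..h_n (keys and values)
s_t = torch.randn(1, d)      # current decoder hidden state (query)

# Score each encoder state against the decoder state (dot product, for illustration).
e_t = torch.einsum('bnd,bd->bn', h, s_t)       # e_{t,j}, shape (1, n)

# Softmax normalization: the weights for this decoder step sum to 1.
alpha_t = F.softmax(e_t, dim=-1)               # attention weights, shape (1, n)
assert torch.allclose(alpha_t.sum(dim=-1), torch.ones(1))

# Context vector: attention-weighted average of the encoder states (values).
c_t = torch.einsum('bn,bnd->bd', alpha_t, h)   # shape (1, d)
```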



