Attention
May 10, 2024
Attention
Instead of analyzing data sequentially as in an RNN or its variants, most modern approaches analyze sequential information as a whole. This strategy helps mitigate or resolve many of the common issues faced by RNNs. The key concept behind Attention is that for each encoder state $h_j$ there is a corresponding key vector $k_j$, and for each decoder state $s_i$ a query vector $q_i$. These vectors interact through a score function to generate a context vector, which is then passed to the decoding states.
- Compute Key Vectors: For each of the encoder states $h_j$, compute the key vector $k_j = W_K h_j$; likewise, each decoder state $s_i$ yields a query vector $q_i = W_Q s_i$.
- Compute Attention Scores: Score each encoder state against the current decoder state. Usually the score is calculated as $e_{ij} = \frac{q_i^\top k_j}{\sqrt{d}}$, the dot product between the query vector and the key vector scaled by the square root of the dimension $d$ of the source hidden state.
- Soft-max the Attention Scores: Apply the softmax function to the attention scores to normalize them.
- Compute Weighted States: Multiply each encoder hidden state by its soft-maxed attention score to get the weighted states.
- Compute Context Vector: Sum the weighted states to get the context vector.
In a basic attention framework the goal is to learn $W_K$ and $W_Q$.
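Below is a minimal NumPy sketch of the procedure above, assuming the scaled dot-product score; the function name `attention_context`, the toy dimensions, and the random weights are illustrative, not part of any particular library.

```python
import numpy as np

def attention_context(H, s, W_K, W_Q):
    """Context vector for one decoder state.

    H:   (T, d) encoder hidden states h_1..h_T
    s:   (d,)   decoder hidden state s_i
    W_K, W_Q: the learned projection matrices
    """
    keys = H @ W_K.T                              # key vector k_j = W_K h_j per encoder state
    query = W_Q @ s                               # query vector q_i = W_Q s_i
    scores = keys @ query / np.sqrt(H.shape[1])   # scaled dot-product attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over encoder positions
    return weights @ H                            # sum of weighted states = context vector

# toy usage with random states and weights
rng = np.random.default_rng(0)
d = 8
H = rng.normal(size=(6, d))                       # six encoder hidden states
s = rng.normal(size=d)                            # one decoder hidden state
W_K, W_Q = rng.normal(size=(d, d)), rng.normal(size=(d, d))
print(attention_context(H, s, W_K, W_Q).shape)    # (8,)
```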
Transformer
In a Transformer Network, the architecture completely discards the recurrent components. Instead, the encoder features self-attention layers, where each input vector at the encoder stage computes an attention score relative to all other encoder inputs. This is based on the premise that every word in a sentence may depend on all other words in the sentence. Notably, in a basic attention framework, the primary goal is to learn the matrices $W_K$ and $W_Q$, but a Transformer also learns a matrix $W_V$, where $v_j = W_V x_j$ is the value vector.
- Compute Key, Query, and Value Vectors: For each encoder input $x_j$, compute $k_j = W_K x_j$, $q_j = W_Q x_j$, and $v_j = W_V x_j$.
- Compute Attention Scores: Score every position against every other position. Usually the score is calculated as $e_{ij} = \frac{q_i^\top k_j}{\sqrt{d}}$, the dot product between the query vector and the key vector scaled by the square root of the dimension $d$.
- Soft-max the Attention Scores: Apply the softmax function to the attention scores to normalize them.
- Compute Weighted Value Vectors: Multiply each value vector by its soft-maxed attention score to get the weighted value vectors.
- Compute Output: Sum the weighted value vectors to get the output.
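The same steps for a single self-attention layer, written as a NumPy sketch; this is a bare single-head version with randomly initialized matrices, omitting multi-head attention, output projections, and positional encodings.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # query, key, value vectors per position
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every position scored against every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # sum of weighted value vectors per position

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(4, d))                          # a "sentence" of four token vectors
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)        # (4, 16): one output vector per input
```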
Layer Stacking in Transformer
This process can be repeated multiple times, allowing the self-attention layers to stack atop one another. In this configuration, the output computed in one self-attention layer becomes the input for the next, facilitating complex transformations and deeper understanding.
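A sketch of the stacking idea, assuming the single-head self-attention above: because each layer maps a (T, d) input to a (T, d) output, layers compose by feeding one layer's output into the next. In practice each layer also includes residual connections, layer normalization, and feed-forward sublayers, which are omitted here.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    # single-head scaled dot-product self-attention, as in the sketch above
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(5, d))                               # five token vectors
layers = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(4)]
for W_Q, W_K, W_V in layers:                              # each layer has its own matrices
    X = self_attention(X, W_Q, W_K, W_V)                  # output of one layer feeds the next
print(X.shape)                                            # (5, 16) at every depth
```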
Tricks
- Positional Encoding
- Residual Connections
- Layer Normalization
- Masked Attention: Use a lower triangular matrix to "mask" the attention scores, i.e., to prevent positions from attending to subsequent positions.
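A sketch of masked attention, assuming the same single-head setup as above: a lower triangular boolean matrix marks the allowed pairs (position $i$ attending to $j \le i$), and all other scores are set to negative infinity before the softmax so they receive zero weight.

```python
import numpy as np

def masked_self_attention(X, W_Q, W_K, W_V):
    """Self-attention where position i attends only to positions j <= i."""
    T = X.shape[0]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.tril(np.ones((T, T), dtype=bool))       # lower triangular: allowed pairs
    scores = np.where(mask, scores, -np.inf)          # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(masked_self_attention(X, W_Q, W_K, W_V).shape)  # (5, 8); row i depends only on tokens 0..i
```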