Recurrent Neural Network
May 09, 2024
Recurrent Neural Network
If the incoming data is sequential, then it has the structure $x_1, x_2, \ldots, x_T$. Suppose the task is to classify the part of speech (e.g., noun, verb, adjective, etc.) of each word in an incoming sentence. Then an appropriate (and naive) approach is to use a Recurrent Neural Network (RNN). An RNN has the following components:
- Matrix $W_{xh}$ that maps the input $x_t$ to a latent space $h_t$.
- Matrix $W_{hy}$ that maps the latent vector $h_t$ to the output $\hat{y}_t$.
- Matrix $W_{hh}$ that maps the latent vector $h_{t-1}$ in the previous state to the current latent vector $h_t$.
Note that $W_{xh}$, $W_{hh}$, and $W_{hy}$ are all the same matrices across all states. Therefore, for a given state $t$,

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t), \qquad \hat{y}_t = \operatorname{softmax}(W_{hy} h_t).$$
This is definitely not the best thing to do (because the words in a sequence are not independent), but the structure can be thought of as independent soft-max functions over the states.
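As a concrete illustration, here is a minimal NumPy sketch of this forward pass. The matrix names ($W_{xh}$, $W_{hh}$, $W_{hy}$), the $\tanh$ nonlinearity, and the toy dimensions are choices of this sketch, not details fixed by the text above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_hat_t = softmax(W_hy h_t)."""
    h, y_hats = h0, []
    for x in xs:                              # one step per word in the sentence
        h = np.tanh(W_hh @ h + W_xh @ x)      # same matrices at every state
        y_hats.append(softmax(W_hy @ h))      # per-state distribution over tags
    return y_hats

# toy dimensions: 5-dim word vectors, 8-dim latent state, 4 part-of-speech tags
rng = np.random.default_rng(0)
d_x, d_h, d_y, T = 5, 8, 4, 6
W_xh = rng.normal(scale=0.1, size=(d_h, d_x))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hy = rng.normal(scale=0.1, size=(d_y, d_h))
xs = [rng.normal(size=d_x) for _ in range(T)]
y_hats = rnn_forward(xs, W_xh, W_hh, W_hy, np.zeros(d_h))
print(y_hats[0])                              # tag probabilities for the first word
```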
Consequently, minimizing the NLL through SGD (with gradients computed by backpropagation through time) is the method to train the RNN. Note that plain SGD won't work well here, so in practice Adam and/or gradient clipping is used.
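As a rough sketch of one such training step, the snippet below uses PyTorch's `nn.RNN` plus a linear output layer, cross-entropy (the NLL over the per-state soft-max), Adam, and `clip_grad_norm_` for gradient clipping; the toy dimensions and learning rate are arbitrary.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=5, hidden_size=8, batch_first=True)   # W_xh, W_hh
out = nn.Linear(8, 4)                                         # W_hy
loss_fn = nn.CrossEntropyLoss()                               # NLL over the tags
params = list(rnn.parameters()) + list(out.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(1, 6, 5)             # (batch, T, input dim): one toy sentence
y = torch.randint(0, 4, (1, 6))      # gold tag at each state

h_seq, _ = rnn(x)                    # latent states for all 6 positions
logits = out(h_seq)                  # (1, 6, 4) scores over the 4 tags
loss = loss_fn(logits.reshape(-1, 4), y.reshape(-1))

optimizer.zero_grad()
loss.backward()                      # backpropagation through time
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # clip exploding gradients
optimizer.step()                     # Adam update
```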
RNNs can have more than one layer (Deep RNN), and the network can also be bi-directional (Bi-Directional RNN); just as in English sentences, it makes sense that the latter part of the sequence dictates what a word in the earlier part means. However, even with a Bi-Directional RNN, the network must remain acyclic.
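For reference, a deep, bi-directional RNN can be sketched with PyTorch's built-in flags (the dimensions are arbitrary):

```python
import torch
import torch.nn as nn

# Two layers (Deep RNN) plus a backward pass over the sequence (Bi-Directional RNN).
birnn = nn.RNN(input_size=5, hidden_size=8, num_layers=2,
               bidirectional=True, batch_first=True)
h_seq, _ = birnn(torch.randn(1, 6, 5))
print(h_seq.shape)   # (1, 6, 16): forward and backward latent states concatenated
```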
However, there are major drawbacks with this approach:
- Memory issue: backpropagation through time requires storing many intermediate calculations, and due to the sequential structure of an RNN the training cannot easily be distributed across systems.
- Vanishing or exploding gradients: the recurrence implies that $\frac{\partial h_T}{\partial h_1}$ will converge to $0$ if the largest singular value of $W_{hh}$ is less than $1$ (and diverge to $\infty$ if it is greater than $1$); see the toy sketch after this list.
- Binding sequence length: input data and output data must have the same length.
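The toy sketch below illustrates the second bullet: the signal backpropagated through time picks up roughly one factor of $W_{hh}$ per state, so it behaves like powers of $W_{hh}$. The $\tanh$ derivative factors, which can only shrink the signal further, are ignored here, and the orthogonal matrix is just a convenient way to pin the largest singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))      # orthogonal: all singular values are 1
for scale, label in [(0.9, "vanishing"), (1.1, "exploding")]:
    W_hh = scale * Q                              # largest singular value = scale
    prod = np.eye(8)
    for _ in range(50):                           # 50 states of backprop through time
        prod = W_hh @ prod
    print(label, np.linalg.norm(prod, 2))         # spectral norm of the product
```

With the largest singular value at $0.9$ the norm of the product is about $0.9^{50} \approx 0.005$; at $1.1$ it is about $1.1^{50} \approx 117$.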
Sequence-to-Sequence
Sequence-to-Sequence (seq2seq) resolves the binding sequence length problem of an RNN. It has an encoding stage and a decoding stage. At the encoding stage there are latent states $h_1, \ldots, h_T$ (computed with the same recurrence as above), and at the decoding stage the output for state $s$ is returned by

$$\hat{y}_s = \operatorname{softmax}(W_{hy} h'_s), \qquad h'_s = \tanh(W'_{hh} h'_{s-1} + W'_{xh} \hat{y}_{s-1}), \qquad h'_0 = h_T.$$
Note that at the encoding stage there is no output matrix $W_{hy}$. Thus, in a seq2seq, similarly to an RNN, the training involves minimizing the NLL

$$-\sum_{s=1}^{S} \log \hat{y}_s[y_s],$$

where $\hat{y}_s[y_s]$ denotes the predicted probability of the correct output token $y_s$.
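A minimal NumPy sketch of the encode-then-decode loop, with greedy decoding, is below. The split into separate encoder/decoder matrices, the feedback of the previous output through an embedding table `E`, the `start_token` convention, and the fixed `max_len` are assumptions of this sketch rather than details fixed above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def seq2seq_greedy(xs, params, start_token, max_len):
    """Encode the whole input, then decode greedily for up to max_len states."""
    W_xh, W_hh, Wd_xh, Wd_hh, W_hy, E = params
    # encoding stage: only latent states, no output matrix
    h = np.zeros(W_hh.shape[0])
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
    # decoding stage: start from the last encoder state, feed back previous output
    y_prev, outputs = start_token, []
    for _ in range(max_len):
        h = np.tanh(Wd_hh @ h + Wd_xh @ E[y_prev])
        y_prev = int(np.argmax(softmax(W_hy @ h)))   # greedy choice of next token
        outputs.append(y_prev)
    return outputs

rng = np.random.default_rng(0)
d_x, d_h, vocab = 5, 8, 10
params = (rng.normal(scale=0.1, size=(d_h, d_x)),    # W_xh  (encoder input)
          rng.normal(scale=0.1, size=(d_h, d_h)),    # W_hh  (encoder recurrence)
          rng.normal(scale=0.1, size=(d_h, d_x)),    # Wd_xh (decoder input)
          rng.normal(scale=0.1, size=(d_h, d_h)),    # Wd_hh (decoder recurrence)
          rng.normal(scale=0.1, size=(vocab, d_h)),  # W_hy  (decoder output only)
          rng.normal(scale=0.1, size=(vocab, d_x)))  # E     (output-token embeddings)
xs = [rng.normal(size=d_x) for _ in range(4)]        # input of length 4 ...
print(seq2seq_greedy(xs, params, start_token=0, max_len=7))   # ... output of length 7
```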
However, there are still lingering issues:
- This loss function attempts to get every single word of the output sentence correct. Therefore, if the output is phrased slightly differently but the meaning or context is the same, the model is still penalized word by word. That is, the model optimizes per-word correctness and can fail at getting the entire sentence right as a whole.
- Memory issue
- Vanishing or exploding gradients
Note that seq2seq can be adapted to be multi-modal. That is, the model can take a video as input and output text (e.g., a caption), vice versa, or handle any other input-output pair where the data structures are different.
LSTM
Long Short-Term Memory (LSTM) mitigates (but doesn't completely resolve) the vanishing or exploding gradient problem. An LSTM has memory cells so that, at certain states, it can load previously stored information.
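A minimal NumPy sketch of one LSTM state update is below; the single stacked gate matrix `W`, the zero biases, and the toy dimensions are conventions of this sketch. The key point is the additive memory-cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$: the forget gate decides how much previous information to keep, which is what lets information (and gradients) flow across many states.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM state: W maps [h_prev, x] to the four gate pre-activations."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0*d:1*d])          # forget gate: how much of the old cell to keep
    i = sigmoid(z[1*d:2*d])          # input gate: how much new information to write
    o = sigmoid(z[2*d:3*d])          # output gate: how much of the cell to expose
    g = np.tanh(z[3*d:4*d])          # candidate cell content
    c = f * c_prev + i * g           # additive memory-cell update
    h = o * np.tanh(c)               # new latent state
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 5, 8
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_x))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_x) for _ in range(6)]:   # run over a toy sequence
    h, c = lstm_step(x, h, c, W, b)
print(h[:3])
```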