Recurrent Neural Network

May 09, 2024

Recurrent Neural Network

If the incoming data $x$ is sequential, then it has the structure $x_1, \dots, x_T$. Suppose the task is to classify the part of speech (e.g., noun, verb, adjective, etc.) of each word in an incoming sentence. Then an appropriate (and naive) approach is to use a Recurrent Neural Network (RNN). An RNN has the following components.

  • Matrix $W$ that maps $x_t$ to the latent vector $z_t$.
  • Matrix $V$ that maps the latent vector $z_t$ to the output $y_t$.
  • Matrix $U$ that maps the latent vector of the previous state, $z_{t-1}$, to the current latent vector $z_t$.

Note that $W, V, U$ are the same matrices across all states. Therefore, for a given state $t$,

$$\hat{y}_t = Vh(Wx_t + Uz_{t-1})$$

where $h$ is an elementwise nonlinear activation (e.g., $\tanh$) and the latent state is $z_t = h(Wx_t + Uz_{t-1})$.
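
As a concrete illustration, here is a minimal NumPy sketch of this recurrence, assuming $h = \tanh$ and purely illustrative dimensions (input dimension 4, latent dimension 8, 3 output classes):

```python
import numpy as np

def rnn_step(x_t, z_prev, W, U, V, h=np.tanh):
    """One RNN state update: z_t = h(W x_t + U z_{t-1}), y_hat_t = V z_t."""
    z_t = h(W @ x_t + U @ z_prev)
    y_hat_t = V @ z_t
    return z_t, y_hat_t

# Illustrative shapes: input dim 4, latent dim 8, output dim 3 (e.g., 3 POS tags).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))   # maps x_t (4,) into the latent space (8,)
U = rng.normal(size=(8, 8))   # maps z_{t-1} (8,) to the current latent vector (8,)
V = rng.normal(size=(3, 8))   # maps z_t (8,) to the output (3,)
z = np.zeros(8)               # initial latent state z_0

for x_t in rng.normal(size=(5, 4)):      # a sequence x_1, ..., x_5
    z, y_hat = rnn_step(x_t, z, W, U, V)  # W, U, V are shared across all states
```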

This is definitely not the best modeling assumption (words in a sequence are not independent), but the structure can be thought of as independent softmax functions over states.

$$P(y_{1:T}\mid x_{1:T}, W, V, U) = \prod_{t=1}^T P(y_t \mid x_t, W, V, U)$$

Consequently, minimizing the NLL through SGD (backpropagation through time) would be the method to train the RNN. Note that plain SGD often won't work well, so in practice Adam and/or gradient clipping are used.

$$\hat{W},\hat{V},\hat{U} \in \underset{W,V,U}{\arg \min} \left\{-\sum_{i=1}^n\sum_{t=1}^{T_i} \log{P(y_{it}\mid x_{it}, W, V, U)}\right\}$$
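
For concreteness, here is a sketch of the per-sequence term in this objective, assuming $P(y_t \mid x_t)$ is given by a softmax over $Vz_t$ (consistent with the softmax-per-state structure mentioned above):

```python
import numpy as np

def softmax(a):
    a = a - a.max()          # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def sequence_nll(xs, ys, W, U, V, h=np.tanh):
    """Negative log-likelihood of one sequence: sum of -log P(y_t | x_t) over states.
    Assumes P(y_t | x_t) = softmax(V z_t)[y_t] with z_t = h(W x_t + U z_{t-1})."""
    z = np.zeros(U.shape[0])
    nll = 0.0
    for x_t, y_t in zip(xs, ys):   # ys holds integer class labels
        z = h(W @ x_t + U @ z)
        p = softmax(V @ z)
        nll -= np.log(p[y_t])
    return nll
```

In practice, the gradient of this sum over all $n$ sequences would be computed by backpropagation through time and minimized with Adam and/or gradient clipping, as noted above.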

RNNs can have more than one layer (Deep RNN), and the network can also be bi-directional (Bi-Directional RNN); just as in English sentences, it often makes sense for the latter part of the sequence to dictate what a word in the earlier part means. However, even with a Bi-Directional RNN the unrolled network must be acyclic. A sketch of the bi-directional idea follows.
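
This is a minimal sketch of a bi-directional pass, assuming separate forward and backward weight matrices and concatenation of the two latent vectors at each state (one common choice, not prescribed above):

```python
import numpy as np

def bidirectional_latents(xs, W_f, U_f, W_b, U_b, h=np.tanh):
    """Run the recurrence left-to-right and right-to-left, then concatenate
    the two latent vectors at each state."""
    T, d = len(xs), U_f.shape[0]
    fwd, bwd = np.zeros((T, d)), np.zeros((T, d))
    z = np.zeros(d)
    for t in range(T):                 # forward pass over the sequence
        z = h(W_f @ xs[t] + U_f @ z)
        fwd[t] = z
    z = np.zeros(d)
    for t in reversed(range(T)):       # backward pass over the sequence
        z = h(W_b @ xs[t] + U_b @ z)
        bwd[t] = z
    return np.concatenate([fwd, bwd], axis=1)   # shape (T, 2d)
```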

However, there are major drawbacks with this approach:

  • Memory issue: backpropagation through time requires storing many intermediate calculations, and due to the sequential structure of the RNN, training cannot easily be distributed across systems.
  • Vanishing or exploding gradients: ignoring the inputs and the nonlinearity, $z_t = U^t z_0$, which implies that $z_t$ will converge to $0$ if the largest singular value of $U$ is less than $1$ (and diverge to $\infty$ if it is greater than $1$); see the sketch after this list.
  • Binding sequence length: the input sequence and the output sequence must have the same length.
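
The singular-value argument can be checked numerically. The sketch below builds $U$ as a scaled orthogonal matrix so that every singular value equals the chosen scale (for a general $U$, the largest singular value only bounds the growth):

```python
import numpy as np

# Illustrative check of the z_t = U^t z_0 argument (inputs and nonlinearity ignored).
rng = np.random.default_rng(0)
z0 = rng.normal(size=8)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix

for scale in (0.9, 1.1):                        # largest singular value < 1 vs > 1
    U = scale * Q                               # all singular values of U equal `scale`
    z = z0.copy()
    for _ in range(100):                        # z_100 = U^100 z_0
        z = U @ z
    print(scale, np.linalg.norm(z))             # ~ scale**100 * ||z_0||: vanishes or explodes
```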

Sequence-to-Sequence

Sequence-to-Sequence (seq2seq) resolves the binding-sequence-length problem of an RNN. It has an encoding stage and a decoding stage. At the encoding stage there are $t$ latent states $z_1, \dots, z_t$, and at the decoding stage the output for state $\tau$ is returned by

$$\hat{y}_\tau = V_dh(W_dx_{t+\tau} + U_dz_{t+\tau-1})$$

Note that at the encoding stage there is no output matrix $V$. Thus, in a seq2seq model, similar to an RNN, training involves minimizing the NLL

$$\hat{W}_e, \hat{U}_e, \hat{W}_d,\hat{V}_d,\hat{U}_d \in \underset{W_e,U_e,W_d,V_d,U_d}{\arg \min} \left\{-\sum_{i=1}^n\sum_{t=1}^{T_i} \log{P(y_{it}\mid x_{it}, W_e,U_e,W_d,V_d,U_d)}\right\}$$
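
Following the two recurrences above, here is a minimal encoder-decoder sketch (again with $h = \tanh$ and illustrative shapes; the decoder consumes the inputs $x_{t+\tau}$ exactly as in the formula above, though practical decoders often feed back their own previous prediction instead):

```python
import numpy as np

def seq2seq(enc_xs, dec_xs, W_e, U_e, W_d, U_d, V_d, h=np.tanh):
    """Encode enc_xs into a final latent state, then decode: the encoder has
    no output matrix, and only the decoder applies V_d to produce outputs."""
    z = np.zeros(U_e.shape[0])
    for x in enc_xs:                    # encoding stage: latent states z_1, ..., z_t
        z = h(W_e @ x + U_e @ z)
    outputs = []
    for x in dec_xs:                    # decoding stage: states t+1, ..., t+tau
        z = h(W_d @ x + U_d @ z)
        outputs.append(V_d @ z)
    return np.array(outputs)            # the output length need not match len(enc_xs)
```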

However, there are still lingering issues:

  • This loss function attempts to get every single word of the sentence correct. Therefore, if a sentence is phrased slightly differently but the meaning or context is the same, the model may still score it as wrong. That is, the model optimizes word-by-word correctness and fails at getting the sentence right as a whole.
  • Memory issue
  • Vanishing or exploding gradients

Note that seq2seq can be adapted to be multi-modal. That is, the model can take a video as input and output text (e.g., a caption), or vice versa, or handle any other input-output pair where the data structures differ.

LSTM

Long Short-Term Memory (LSTM) mitigates (but doesn't completely resolve) the vanishing or exploding gradient problem. An LSTM has memory cells such that, at certain states, it can carry forward and load previous information.
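
A minimal sketch of one LSTM state update, using the standard forget/input/output gating; the gate names and weight matrices below follow the usual formulation rather than anything defined above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, z_prev, c_prev, Wf, Uf, Wi, Ui, Wo, Uo, Wc, Uc):
    """One LSTM state update with a memory cell c_t (weights are illustrative)."""
    f = sigmoid(Wf @ x_t + Uf @ z_prev)          # forget gate: how much of c_{t-1} to keep
    i = sigmoid(Wi @ x_t + Ui @ z_prev)          # input gate: how much new content to write
    o = sigmoid(Wo @ x_t + Uo @ z_prev)          # output gate: how much of the cell to expose
    c_tilde = np.tanh(Wc @ x_t + Uc @ z_prev)    # candidate memory content
    c_t = f * c_prev + i * c_tilde               # memory cell carries earlier information forward
    z_t = o * np.tanh(c_t)                       # latent state read out of the cell
    return z_t, c_t
```

Because the memory cell $c_t$ is updated additively ($f \odot c_{t-1} + i \odot \tilde{c}_t$), information and gradients can pass through many states without being repeatedly multiplied by a single recurrence matrix, which is why the gradient problem is mitigated rather than eliminated.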