Vanishing Gradients and Fancy RNNs Notes

CS224n: Natural Language Processing with Deep Learning Lecture 7 Notes

  • Core idea
    • Apply the same weights W repeatedly
  • Advantages
    • Can process any length input
    • Computation for step t can (in theory) use information from many steps back
    • Model size doesn’t increase for longer input
    • Same weights applied on every timestep, so there is symmetry in how inputs are processed.
  • Disadvantages
    • Recurrent computation is slow
    • In practice, difficult to access information from many steps back
  • Training
    • Get a big corpus of text, which is a sequence of words x^(1), ..., x^(T)
    • Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t
    • The loss on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) (a one-hot vector for x^(t+1))
    • Average this to get overall loss for entire training set
    • However: computing the loss and gradients across the entire corpus x^(1), ..., x^(T) is too expensive!
    • In practice, consider x^(1), ..., x^(T) as a sentence (or a document)
    • Recall: Stochastic Gradient Descent allows us to compute loss and gradients for small chunk of data, and update.
    • Compute loss for a sentence (actually a batch of sentences), compute gradients and update weights. Repeat.
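    • Written out (restating the lecture’s formulas in the notation above), the cross-entropy loss on step t and the overall objective are:
```latex
J^{(t)}(\theta)
  \;=\; \mathrm{CE}\!\left(\boldsymbol{y}^{(t)}, \hat{\boldsymbol{y}}^{(t)}\right)
  \;=\; -\sum_{w \in V} y^{(t)}_{w} \log \hat{y}^{(t)}_{w}
  \;=\; -\log \hat{y}^{(t)}_{x_{t+1}},
\qquad
J(\theta) \;=\; \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)
```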
  • Vanishing (and exploding) gradient problem
    • Why is vanishing gradient a problem?
      • Gradient signal from far away is lost because it’s much smaller than the gradient signal from close by. So model weights are updated only with respect to near effects, not long-term effects.
      • Another explanation
        • Gradient can be viewed as a measure of the effect of the past on the future
      • If the gradient becomes vanishingly small over longer distances (step t to step t+n), then we can’t tell whether
        • There’s no dependency between step t and t+n in the data
        • We have wrong parameters to capture the true dependency between t and t+n
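      • Where the vanishing comes from (a sketch of the lecture’s argument, for the simplified case of a linear/identity activation so that ∂h^(t)/∂h^(t−1) = W_h): backpropagating from step i to step j multiplies by W_h once per step, so if the eigenvalues of W_h are smaller than 1 the gradient shrinks exponentially with the distance i − j
```latex
\frac{\partial J^{(i)}(\theta)}{\partial \boldsymbol{h}^{(j)}}
\;=\;
\frac{\partial J^{(i)}(\theta)}{\partial \boldsymbol{h}^{(i)}}
\prod_{j < t \le i} \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}}
\;=\;
\frac{\partial J^{(i)}(\theta)}{\partial \boldsymbol{h}^{(i)}}\; \boldsymbol{W}_h^{\,(i-j)}
```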
    • Why is exploding gradient a problem?
      • If the gradient becomes too big, then the SGD update step becomes too big
      • This can cause bad updates
        • we take too large a step and reach a bad parameter configuration (with large loss)
      • In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)
      • solution
        • Gradient clipping
          • if the norm of the gradient is greater than some threshold, scale it down before applying SGD update
        • Intuition
          • take a step in the same direction, but a smaller step
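        • A minimal sketch of norm-based clipping in PyTorch (not from the lecture notes; the threshold of 5.0 and the helper name `clip_gradients` are illustrative):
```python
import torch

# Minimal sketch of norm-based gradient clipping (threshold value is illustrative).
# Call this after loss.backward() and before optimizer.step().
def clip_gradients(parameters, threshold=5.0):
    grads = [p.grad for p in parameters if p.grad is not None]
    # Global L2 norm over all parameter gradients
    total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    if total_norm > threshold:
        # Same direction, smaller step: scale every gradient down uniformly
        scale = threshold / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)

# Equivalent built-in helper:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```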
    • How to fix the vanishing gradient problem?
      • The main problem is that it’s too difficult for the RNN to learn to preserve information over many timesteps
      • Two new types of RNN:
        • LSTM
          • A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem.
          • On step t, there is a hidden state h^(t) and a cell state c^(t)
            • Both are vectors of length n
            • The cell stores long-term information
            • The LSTM can erase, write, and read information from the cell
          • The selection of which information is erased/written/read is controlled by three corresponding gates
            • The gates are also vectors of length n
            • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in-between
            • The gates are `dynamic`
              • their value is computed based on the current context
          • Define
            • We have a sequence of inputs x^(t), and we will compute a sequence of hidden states h^(t) and cell states c^(t).
            • On timestep t:
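            • The standard LSTM equations (as on the lecture slides), with forget gate f, input gate i, output gate o, new cell content (written \tilde{c} below), sigmoid σ, and element-wise product ∘:
```latex
\begin{aligned}
\boldsymbol{f}^{(t)} &= \sigma\!\left(\boldsymbol{W}_f \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_f \boldsymbol{x}^{(t)} + \boldsymbol{b}_f\right) \\
\boldsymbol{i}^{(t)} &= \sigma\!\left(\boldsymbol{W}_i \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_i \boldsymbol{x}^{(t)} + \boldsymbol{b}_i\right) \\
\boldsymbol{o}^{(t)} &= \sigma\!\left(\boldsymbol{W}_o \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_o \boldsymbol{x}^{(t)} + \boldsymbol{b}_o\right) \\
\tilde{\boldsymbol{c}}^{(t)} &= \tanh\!\left(\boldsymbol{W}_c \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_c \boldsymbol{x}^{(t)} + \boldsymbol{b}_c\right) \\
\boldsymbol{c}^{(t)} &= \boldsymbol{f}^{(t)} \circ \boldsymbol{c}^{(t-1)} + \boldsymbol{i}^{(t)} \circ \tilde{\boldsymbol{c}}^{(t)} \\
\boldsymbol{h}^{(t)} &= \boldsymbol{o}^{(t)} \circ \tanh\!\left(\boldsymbol{c}^{(t)}\right)
\end{aligned}
```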
          • How does LSTM solve vanishing gradients?
            • The LSTM architecture makes it easier for the RNN to preserve information over many timesteps
              • e.g. if the forget gate is set to 1 (remember everything) on every timestep, then the info in the cell is preserved indefinitely
              • By contrast, it’s harder for a vanilla RNN to learn a recurrent weight matrix W_h that preserves info in the hidden state
            • LSTM doesn’t guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies
          • Real-world success
            • In 2013-2015, LSTMs started achieving state-of-the-art results
              • Successful tasks include: handwriting recognition, speech recognition, machine translation, parsing, image captioning
              • LSTM became the dominant approach
            • Now (2019), other approaches (e.g. Transformers) have become more dominant for certain tasks
              • For example in WMT (a MT conference + competition)
                • In WMT 2016, the summary report contains “RNN” 44 times
                • In WMT 2018, the report contains “RNN” 9 times and “Transformer” 63 times
        • GRU
          • Proposed by Cho et al. in 2014 as a simpler alternative to the LSTM
          • Define
            • On each timestep t we have an input x^(t) and a hidden state h^(t) (no cell state)
              • Update gate
                • controls what parts of hidden state are updated vs preserved
              • Reset gate
                • controls what parts of previous hidden state are used to compute new content
              • New hidden state content
                • reset gate selects useful parts of prev hidden state. Use this and current input to compute new hidden content
              • Hidden state
                • update gate simultaneously controls what is kept from previous hidden state, and what is updated to new hidden state content
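            • Putting these together, the standard GRU equations (as on the lecture slides), with update gate u, reset gate r, new hidden state content (written \tilde{h} below), sigmoid σ, and element-wise product ∘:
```latex
\begin{aligned}
\boldsymbol{u}^{(t)} &= \sigma\!\left(\boldsymbol{W}_u \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_u \boldsymbol{x}^{(t)} + \boldsymbol{b}_u\right) \\
\boldsymbol{r}^{(t)} &= \sigma\!\left(\boldsymbol{W}_r \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_r \boldsymbol{x}^{(t)} + \boldsymbol{b}_r\right) \\
\tilde{\boldsymbol{h}}^{(t)} &= \tanh\!\left(\boldsymbol{W}_h \left(\boldsymbol{r}^{(t)} \circ \boldsymbol{h}^{(t-1)}\right) + \boldsymbol{U}_h \boldsymbol{x}^{(t)} + \boldsymbol{b}_h\right) \\
\boldsymbol{h}^{(t)} &= \left(1 - \boldsymbol{u}^{(t)}\right) \circ \boldsymbol{h}^{(t-1)} + \boldsymbol{u}^{(t)} \circ \tilde{\boldsymbol{h}}^{(t)}
\end{aligned}
```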
          • How does this solve vanishing gradient?
            • Like LSTM, GRU makes it easier to retain info long-term (e.g. by setting update gate to 0)
      • LSTM vs GRU
        • Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely-used
        • The biggest difference is that GRU is quicker to compute and has fewer parameters
        • There is no conclusive evidence that one consistently performs better than the other
        • LSTM is a good default choice (especially if your data has particularly long dependencies, or you have lots of training data)
        • Rule of thumb
          • start with LSTM, but switch to GRU if you want something more efficient
    • Other fixes for vanishing (or exploding) gradient
      • Gradient clipping (addresses exploding gradients)
      • Skip connections (e.g. residual or dense connections), which let gradients flow more directly through deep networks
    • Effect on RNN-LM
      • Due to vanishing gradient, RNN-LMs are better at learning from `sequential recency` than `syntactic recency`, so they make syntactic agreement errors (e.g. predicting “are” rather than “is” after “The writer of the books ___”) more often than we’d like [Linzen et al., 2016]
    • Is vanishing/exploding gradient just a RNN problem?
      • No! It can be a problem for all neural architectures (including `feed-forward` and `convolutional`), especially deep ones.
        • Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it backpropagates
        • Thus lower layers are learnt very slowly (hard to train)
        • Solution: many deep feed-forward/convolutional architectures add more direct connections that let the gradient flow, e.g. residual connections (ResNet), dense connections (DenseNet), and highway connections (HighwayNet); see the sketch after this list
        • Conclusion
          • Though vanishing/exploding gradients are a general problem, `RNNs are particularly unstable` due to the repeated multiplication by the `same` weight matrix [Bengio et al, 1994]
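      • A minimal PyTorch sketch of a residual (skip) connection, one of the direct-connection fixes mentioned above (the block structure and sizes are illustrative):
```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A feed-forward block with a skip connection: output = x + F(x).

    The identity path lets gradients flow directly to lower layers,
    mitigating vanishing gradients in deep networks.
    """
    def __init__(self, dim=256):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)  # skip connection around the transformation F

# Usage: stacking many such blocks stays trainable thanks to the skip paths.
deep_net = nn.Sequential(*[ResidualBlock(256) for _ in range(16)])
x = torch.randn(8, 256)
y = deep_net(x)  # shape (8, 256)
```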
  • More fancy RNN variants
    • Bidirectional RNNs
      • motivation
        • Task
          • Sentiment Classification (e.g. of the sentence “the movie was terribly exciting!”)
        • In a left-to-right RNN, these contextual representations (the hidden states) only contain information about the left context (e.g. the representation of “terribly” sees “the movie was” but not “exciting”, which reveals that “terribly” is used positively here)
          • What about right context?
        • Notation
          • RNN_{FW}
            • This is a general notation to mean “compute one forward step of the RNN” – it could be a vanilla, LSTM or GRU computation
          • RNN_{BW}
            • This is a general notation to mean “compute one backward step of the RNN” – it could be a vanilla, LSTM or GRU computation
          • h^(t)
            • We regard this as “the hidden state” of a bidirectional RNN. This is what we pass on to the next parts of the network.
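          • Putting the notation together (as on the slides): run a forward RNN and a backward RNN over the input and concatenate their hidden states
```latex
\begin{aligned}
\overrightarrow{\boldsymbol{h}}^{(t)} &= \mathrm{RNN_{FW}}\!\left(\overrightarrow{\boldsymbol{h}}^{(t-1)}, \boldsymbol{x}^{(t)}\right) \\
\overleftarrow{\boldsymbol{h}}^{(t)} &= \mathrm{RNN_{BW}}\!\left(\overleftarrow{\boldsymbol{h}}^{(t+1)}, \boldsymbol{x}^{(t)}\right) \\
\boldsymbol{h}^{(t)} &= \left[\overrightarrow{\boldsymbol{h}}^{(t)};\ \overleftarrow{\boldsymbol{h}}^{(t)}\right]
\end{aligned}
```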
      • Note
        • bidirectional RNNs are only applicable if you have access to the `entire input sequence`
          • They are not applicable to Language Modeling, because in LM you only have left context available
        • If you do have the entire input sequence (e.g. any kind of encoding task), `bidirectionality is powerful` (you should use it by default)
        • For example
          • BERT (Bidirectional Encoder Representations from Transformers) is a powerful pretrained contextual representation system `built on bidirectionality`
    • Multi-layer RNNs aka stacked RNNs
      • RNNs are already “deep” on one dimension (they unroll over many timesteps)
        • We can also make them “deep” in another dimension by `applying multiple RNNs` – this is a multi-layer RNN (a code sketch follows at the end of this section)
      • This allows the network to compute `more complex representations`
        • The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features
      • In practice
        • High-performing RNNs are often multi-layer (but aren’t as deep as convolutional or feed-forward networks)
        • For example
          • In a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4 layers is best for the encoder RNN and 4 layers is best for the decoder RNN
            • However, skip-connections/dense-connections are needed to train deeper RNNs (e.g. 8 layers)
          • Transformer-based networks (e.g. BERT) can be up to 24 layers
            • You will learn about Transformers later; they have a lot of skipping-like connections
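      • A minimal PyTorch sketch of a stacked (multi-layer) bidirectional LSTM encoder, as referenced above (all sizes are illustrative):
```python
import torch
import torch.nn as nn

# Minimal sketch (sizes are illustrative): a 2-layer bidirectional LSTM encoder.
# num_layers stacks RNNs "vertically"; bidirectional=True runs a forward and a
# backward RNN and concatenates their hidden states.
encoder = nn.LSTM(
    input_size=300,    # e.g. word embedding dimension
    hidden_size=512,   # per-direction hidden state size
    num_layers=2,      # stacked RNN layers
    bidirectional=True,
    batch_first=True,
)

embeddings = torch.randn(8, 20, 300)   # (batch, seq_len, embed_dim)
outputs, (h_n, c_n) = encoder(embeddings)
print(outputs.shape)  # (8, 20, 1024): forward and backward states concatenated
print(h_n.shape)      # (4, 8, 512): num_layers * num_directions = 4
```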
  • In summary
    • Lots of new information today! What are the practical takeaways?
      • LSTMs are powerful but GRUs are faster
      • Clip your gradients
      • Use bidirectionality when possible
      • Multi-layer RNNs are powerful, but you might need skip/dense-connections if it’s deep

Reference: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture07-fancy-rnn.pdf