Vanishing Gradients and Fancy RNNs Notes

CS224n: Natural Language Processing with Deep Learning Lecture 7 Notes

  • Core idea
    • Apply the same weights W repeatedly
  • Advantages
    • Can process any length input
    • Computation for step t can (in theory) use information from many steps back
    • Model size doesn’t increase for longer input
    • Same weights applied on every timestep, so there is symmetry in how inputs are processed.
  • Disadvantages
    • Recurrent computation is slow
    • In practice, difficult to access information from many steps back
  • Training
    • Get a big corpus of text, which is a sequence of words x^(1), ..., x^(T)
    • Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t
    • The loss on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) (a one-hot vector for x^(t+1))
    • Average this to get overall loss for entire training set
    • However: computing the loss and gradients across the entire corpus x^(1), ..., x^(T) is too expensive!
    • In practice, consider x^(1), ..., x^(T) as a sentence (or a document)
    • Recall: Stochastic Gradient Descent allows us to compute loss and gradients for small chunk of data, and update.
    • Compute loss for a sentence (actually a batch of sentences), compute gradients and update weights. Repeat.
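    • Written out (restating the lecture’s formulas in the notation above), the cross-entropy loss on step t and the overall objective are:
```latex
J^{(t)}(\theta)
  \;=\; \mathrm{CE}\!\left(\boldsymbol{y}^{(t)}, \hat{\boldsymbol{y}}^{(t)}\right)
  \;=\; -\sum_{w \in V} y^{(t)}_{w} \log \hat{y}^{(t)}_{w}
  \;=\; -\log \hat{y}^{(t)}_{x_{t+1}},
\qquad
J(\theta) \;=\; \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)
```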
  • Vanishing (and exploding) gradient problem
    • Why is vanishing gradient a problem?
      • Gradient signal from far away is lost because it’s much smaller than the gradient signal from close by. So model weights are updated only with respect to near effects, not long-term effects.
      • Another explanation
        • Gradient can be viewed as a measure of the effect of the past on the future
      • If the gradient becomes vanishingly small over longer distances (step t to step t+n), then we can’t tell whether
        • There’s no dependency between step t and t+n in the data
        • We have wrong parameters to capture the true dependency between t and t+n
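      • Where the vanishing comes from (a sketch of the lecture’s argument, for the simplified case of a linear/identity activation so that ∂h^(t)/∂h^(t−1) = W_h): backpropagating from step i to step j multiplies by W_h once per step, so if the eigenvalues of W_h are smaller than 1 the gradient shrinks exponentially with the distance i − j
```latex
\frac{\partial J^{(i)}(\theta)}{\partial \boldsymbol{h}^{(j)}}
\;=\;
\frac{\partial J^{(i)}(\theta)}{\partial \boldsymbol{h}^{(i)}}
\prod_{j < t \le i} \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}}
\;=\;
\frac{\partial J^{(i)}(\theta)}{\partial \boldsymbol{h}^{(i)}}\; \boldsymbol{W}_h^{\,(i-j)}
```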
    • Why is exploding gradient a problem?
      • If the gradient becomes too big, then the SGD update step becomes too big
      • This can cause bad updates
        • we take too large a step and reach a bad parameter configuration (with large loss)
      • In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)
      • solution
        • Gradient clipping
          • if the norm of the gradient is greater than some threshold, scale it down before applying SGD update
        • Intuition
          • take a step in the same direction, but a smaller step
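        • A minimal sketch of norm-based clipping in PyTorch (not from the lecture notes; the threshold of 5.0 and the helper name `clip_gradients` are illustrative):
```python
import torch

# Minimal sketch of norm-based gradient clipping (threshold value is illustrative).
# Call this after loss.backward() and before optimizer.step().
def clip_gradients(parameters, threshold=5.0):
    grads = [p.grad for p in parameters if p.grad is not None]
    # Global L2 norm over all parameter gradients
    total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    if total_norm > threshold:
        # Same direction, smaller step: scale every gradient down uniformly
        scale = threshold / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)

# Equivalent built-in helper:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```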
    • How to fix the vanishing gradient problem?
      • The main problem is that it’s too difficult for the RNN to learn to preserve information over many timesteps
      • Two new types of RNN:
        • LSTM
          • A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem.
          • On step t, there is a hidden state h^(t) and a cell state c^(t)
            • Both are vectors of length n
            • The cell stores long-term information
            • The LSTM can erase, write, and read information from the cell
          • The selection of which information is erased/written/read is controlled by three corresponding gates
            • The gates are also vectors of length n
            • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in-between
            • The gates are `dynamic`
              • their value is computed based on the current context
          • Define
            • We have a sequence of inputs x^(t), and we will compute a sequence of hidden states h^(t) and cell states c^(t).
            • On timestep t:
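            • The standard LSTM equations (as on the lecture slides), with forget gate f, input gate i, output gate o, new cell content (written \tilde{c} below), sigmoid σ, and element-wise product ∘:
```latex
\begin{aligned}
\boldsymbol{f}^{(t)} &= \sigma\!\left(\boldsymbol{W}_f \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_f \boldsymbol{x}^{(t)} + \boldsymbol{b}_f\right) \\
\boldsymbol{i}^{(t)} &= \sigma\!\left(\boldsymbol{W}_i \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_i \boldsymbol{x}^{(t)} + \boldsymbol{b}_i\right) \\
\boldsymbol{o}^{(t)} &= \sigma\!\left(\boldsymbol{W}_o \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_o \boldsymbol{x}^{(t)} + \boldsymbol{b}_o\right) \\
\tilde{\boldsymbol{c}}^{(t)} &= \tanh\!\left(\boldsymbol{W}_c \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_c \boldsymbol{x}^{(t)} + \boldsymbol{b}_c\right) \\
\boldsymbol{c}^{(t)} &= \boldsymbol{f}^{(t)} \circ \boldsymbol{c}^{(t-1)} + \boldsymbol{i}^{(t)} \circ \tilde{\boldsymbol{c}}^{(t)} \\
\boldsymbol{h}^{(t)} &= \boldsymbol{o}^{(t)} \circ \tanh\!\left(\boldsymbol{c}^{(t)}\right)
\end{aligned}
```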
          • How does LSTM solve vanishing gradients?
            • The LSTM architecture makes it easier for the RNN to preserve information over many timesteps
              • e.g. if the forget gate is set to 1 (remember everything) on every timestep, then the info in the cell is preserved indefinitely
              • By contrast, it’s harder for a vanilla RNN to learn a recurrent weight matrix W_h that preserves info in the hidden state
            • LSTM doesn’t guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies
          • Real-world success
            • In 2013-2015, LSTMs started achieving state-of-the-art results
              • Successful tasks include: handwriting recognition, speech recognition, machine translation, parsing, image captioning
              • LSTM became the dominant approach
            • Now (2019), other approaches (e.g. Transformers) have become more dominant for certain tasks
              • For example in WMT (a MT conference + competition)
                • In WMT 2016, the summary report contains “RNN” 44 times
                • In WMT 2018, the report contains “RNN” 9 times and “Transformer” 63 times
        • GRU
          • Proposed by Cho et al. in 2014 as a simpler alternative to the LSTM
          • Define
            • On each timestep t we have an input x^(t) and a hidden state h^(t) (no cell state)
              • Update gate
                • controls what parts of hidden state are updated vs preserved
              • Reset gate
                • controls what parts of previous hidden state are used to compute new content
              • New hidden state content
                • reset gate selects useful parts of prev hidden state. Use this and current input to compute new hidden content
              • Hidden state
                • update gate simultaneously controls what is kept from previous hidden state, and what is updated to new hidden state content
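            • Putting these together, the standard GRU equations (as on the lecture slides), with update gate u, reset gate r, new hidden state content (written \tilde{h} below), sigmoid σ, and element-wise product ∘:
```latex
\begin{aligned}
\boldsymbol{u}^{(t)} &= \sigma\!\left(\boldsymbol{W}_u \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_u \boldsymbol{x}^{(t)} + \boldsymbol{b}_u\right) \\
\boldsymbol{r}^{(t)} &= \sigma\!\left(\boldsymbol{W}_r \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_r \boldsymbol{x}^{(t)} + \boldsymbol{b}_r\right) \\
\tilde{\boldsymbol{h}}^{(t)} &= \tanh\!\left(\boldsymbol{W}_h \left(\boldsymbol{r}^{(t)} \circ \boldsymbol{h}^{(t-1)}\right) + \boldsymbol{U}_h \boldsymbol{x}^{(t)} + \boldsymbol{b}_h\right) \\
\boldsymbol{h}^{(t)} &= \left(1 - \boldsymbol{u}^{(t)}\right) \circ \boldsymbol{h}^{(t-1)} + \boldsymbol{u}^{(t)} \circ \tilde{\boldsymbol{h}}^{(t)}
\end{aligned}
```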
          • How does this solve vanishing gradient?
            • Like LSTM, GRU makes it easier to retain info long-term (e.g. by setting update gate to 0)
      • LSTM vs GRU
        • Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely-used
        • The biggest difference is that GRU is quicker to compute and has fewer parameters
        • There is no conclusive evidence that one consistently performs better than the other
        • LSTM is a good default choice (especially if your data has particularly long dependencies, or you have lots of training data)
        • Rule of thumb
          • start with LSTM, but switch to GRU if you want something more efficient
    • Other fixes for vanishing (or exploding) gradient
      • Gradient clipping (addresses exploding gradients)
      • Skip connections (e.g. residual or dense connections), which let gradients flow more directly through deep networks
    • Effect on RNN-LM
      • Due to vanishing gradient, RNN-LMs are better at learning from `sequential recency` than `syntactic recency`, so they make syntactic agreement errors (e.g. predicting “are” rather than “is” after “The writer of the books ___”) more often than we’d like [Linzen et al., 2016]
    • Is vanishing/exploding gradient just a RNN problem?
      • No! It can be a problem for all neural architectures (including `feed-forward` and `convolutional`), especially deep ones.
        • Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it backpropagates
        • Thus lower layers are learnt very slowly (hard to train)
        • Solution: many deep feed-forward/convolutional architectures add more direct connections that let the gradient flow, e.g. residual connections (ResNet), dense connections (DenseNet), and highway connections (HighwayNet); see the sketch after this list
        • Conclusion
          • Though vanishing/exploding gradients are a general problem, `RNNs are particularly unstable` due to the repeated multiplication by the `same` weight matrix [Bengio et al, 1994]
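      • A minimal PyTorch sketch of a residual (skip) connection, one of the direct-connection fixes mentioned above (the block structure and sizes are illustrative):
```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A feed-forward block with a skip connection: output = x + F(x).

    The identity path lets gradients flow directly to lower layers,
    mitigating vanishing gradients in deep networks.
    """
    def __init__(self, dim=256):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)  # skip connection around the transformation F

# Usage: stacking many such blocks stays trainable thanks to the skip paths.
deep_net = nn.Sequential(*[ResidualBlock(256) for _ in range(16)])
x = torch.randn(8, 256)
y = deep_net(x)  # shape (8, 256)
```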
  • More fancy RNN variants
    • Bidirectional RNNs
      • motivation
        • Task
          • Sentiment Classification (e.g. of the sentence “the movie was terribly exciting!”)
        • In a left-to-right RNN, these contextual representations (the hidden states) only contain information about the left context (e.g. the representation of “terribly” sees “the movie was” but not “exciting”, which reveals that “terribly” is used positively here)
          • What about right context?
        • Notation
          • RNN_{FW}
            • This is a general notation to mean “compute one forward step of the RNN” – it could be a vanilla, LSTM or GRU computation
          • RNN_{BW}
            • This is a general notation to mean “compute one backward step of the RNN” – it could be a vanilla, LSTM or GRU computation
          • h^(t)
            • We regard this as “the hidden state” of a bidirectional RNN. This is what we pass on to the next parts of the network.
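          • Putting the notation together (as on the slides): run a forward RNN and a backward RNN over the input and concatenate their hidden states
```latex
\begin{aligned}
\overrightarrow{\boldsymbol{h}}^{(t)} &= \mathrm{RNN_{FW}}\!\left(\overrightarrow{\boldsymbol{h}}^{(t-1)}, \boldsymbol{x}^{(t)}\right) \\
\overleftarrow{\boldsymbol{h}}^{(t)} &= \mathrm{RNN_{BW}}\!\left(\overleftarrow{\boldsymbol{h}}^{(t+1)}, \boldsymbol{x}^{(t)}\right) \\
\boldsymbol{h}^{(t)} &= \left[\overrightarrow{\boldsymbol{h}}^{(t)};\ \overleftarrow{\boldsymbol{h}}^{(t)}\right]
\end{aligned}
```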
      • Note
        • bidirectional RNNs are only applicable if you have access to the `entire input sequence`
          • They are not applicable to Language Modeling, because in LM you only have left context available
        • If you do have the entire input sequence (e.g. any kind of encoding task), `bidirectionality is powerful` (you should use it by default)
        • For example
          • BERT (Bidirectional Encoder Representations from Transformers) is a powerful pretrained contextual representation system `built on bidirectionality`
    • Multi-layer RNNs aka stacked RNNs
      • RNNs are already “deep” on one dimension (they unroll over many timesteps)
        • We can also make them “deep” in another dimension by `applying multiple RNNs` – this is a multi-layer RNN (a code sketch follows at the end of this section)
      • This allows the network to compute `more complex representations`
        • The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features
      • In practice
        • High-performing RNNs are often multi-layer (but aren’t as deep as convolutional or feed-forward networks)
        • For example
          • In a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4 layers is best for the encoder RNN and 4 layers is best for the decoder RNN
            • However, skip-connections/dense-connections are needed to train deeper RNNs (e.g. 8 layers)
          • Transformer-based networks (e.g. BERT) can be up to 24 layers
            • You will learn about Transformers later; they have a lot of skipping-like connections
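      • A minimal PyTorch sketch of a stacked (multi-layer) bidirectional LSTM encoder, as referenced above (all sizes are illustrative):
```python
import torch
import torch.nn as nn

# Minimal sketch (sizes are illustrative): a 2-layer bidirectional LSTM encoder.
# num_layers stacks RNNs "vertically"; bidirectional=True runs a forward and a
# backward RNN and concatenates their hidden states.
encoder = nn.LSTM(
    input_size=300,    # e.g. word embedding dimension
    hidden_size=512,   # per-direction hidden state size
    num_layers=2,      # stacked RNN layers
    bidirectional=True,
    batch_first=True,
)

embeddings = torch.randn(8, 20, 300)   # (batch, seq_len, embed_dim)
outputs, (h_n, c_n) = encoder(embeddings)
print(outputs.shape)  # (8, 20, 1024): forward and backward states concatenated
print(h_n.shape)      # (4, 8, 512): num_layers * num_directions = 4
```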
  • In summary
    • Lots of new information today! What are the practical takeaways?
      • LSTMs are powerful but GRUs are faster
      • Clip your gradients
      • Use bidirectionality when possible
      • Multi-layer RNNs are powerful, but you might need skip/dense-connections if it’s deep

Reference: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture07-fancy-rnn.pdf