CS224n: Natural Language Processing with Deep Learning Lecture 7 Notes
- Core idea
- Apply the same weights W repeatedly
- Advantages
- Can process any length input
- Computation for step t can (in theory) use information from many steps back
- Model size doesn’t increase for longer input
- Same weights applied on every timestep, so there is symmetry in how inputs are processed.
- Disadvantages
- Recurrent computation is slow
- In practice, difficult to access information from many steps back
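For reference, a sketch of the vanilla-RNN recurrence these notes build on (one common formulation; W_h, W_x, U and the biases are parameters reused at every timestep, and x^(t) is the input at step t, e.g. a word embedding):

```latex
% One step of a vanilla RNN (as used in the RNN language model below)
\begin{align*}
h^{(t)} &= \sigma\big(W_h\, h^{(t-1)} + W_x\, x^{(t)} + b_1\big)
  && \text{same } W_h, W_x \text{ at every timestep}\\
\hat{y}^{(t)} &= \mathrm{softmax}\big(U\, h^{(t)} + b_2\big)
  && \text{output distribution over the vocabulary}
\end{align*}
```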
- Training
- Get a big corpus of text, which is a sequence of words x^(1), ..., x^(T)
- Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t
- The loss on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) (the one-hot vector for x^(t+1))
- Average this over all timesteps to get the overall loss for the entire training set
- However: computing the loss and gradients across the entire corpus x^(1), ..., x^(T) is too expensive!
- In practice, treat x^(1), ..., x^(T) as a sentence (or a document)
- Recall: Stochastic Gradient Descent allows us to compute loss and gradients for small chunk of data, and update.
- Compute loss for a sentence (actually a batch of sentences), compute gradients and update weights. Repeat.
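A minimal PyTorch sketch of this training loop; the model, dimensions, learning rate, and the random dummy batches are illustrative stand-ins, not anything specified in the lecture:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal RNN language model: embed -> vanilla RNN -> project to vocabulary."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                      # x: (batch, seq_len) word indices
        states, _ = self.rnn(self.embed(x))    # states: (batch, seq_len, hidden_dim)
        return self.proj(states)               # logits over the vocab at every step

vocab_size = 10_000
model = RNNLM(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()              # per-step cross-entropy vs. the true next word

# Dummy random "sentences" standing in for a real corpus: inputs are x^(1..T),
# targets are the same sequence shifted by one position (the true next words).
batches = [(torch.randint(0, vocab_size, (32, 20)),
            torch.randint(0, vocab_size, (32, 20))) for _ in range(10)]

for inputs, targets in batches:                # SGD over batches of sentences
    logits = model(inputs)                     # (batch, seq_len, vocab)
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                            # gradients for this batch only
    optimizer.step()                           # update weights; repeat
```

Each iteration computes the averaged per-step cross-entropy for one batch of sentences and takes one SGD step, as described in the bullets above.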
- Vanishing (and exploding) gradient problem
- Why is vanishing gradient a problem?
- Gradient signal from far away is lost because it's much smaller than the gradient signal from close by. So model weights are updated only with respect to near effects, not long-term effects.
- Another explanation
- Gradient can be viewed as a measure of the effect of the past on the future
- If the gradient becomes vanishingly small over longer distances (step t to step t+n), then we can’t tell whether
- There’s no dependency between step t and t+n in the data
- We have wrong parameters to capture the true dependency between t and t+n
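A sketch of the standard argument (from Bengio et al., 1994, and the lecture slides) for why the gradient vanishes or explodes, using the vanilla-RNN recurrence sketched near the top of these notes:

```latex
% Jacobian of one recurrence step h^{(t)} = \sigma(W_h h^{(t-1)} + W_x x^{(t)} + b_1):
\begin{align*}
\frac{\partial h^{(t)}}{\partial h^{(t-1)}}
  &= \mathrm{diag}\big(\sigma'(W_h\, h^{(t-1)} + W_x\, x^{(t)} + b_1)\big)\, W_h \\
% Gradient of the loss at step t+n w.r.t. the hidden state n steps earlier:
\frac{\partial J^{(t+n)}}{\partial h^{(t)}}
  &= \frac{\partial J^{(t+n)}}{\partial h^{(t+n)}}\;
     \prod_{k=1}^{n} \frac{\partial h^{(t+k)}}{\partial h^{(t+k-1)}}
\end{align*}
% The product contains n copies of W_h. Informally: if W_h is "small" (largest singular
% value well below 1) the product shrinks exponentially in n (vanishing gradient);
% if W_h is "large" it can grow exponentially (exploding gradient).
```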
- Why is exploding gradient a problem?
- If the gradient becomes too big, then the SGD update step becomes too big
- This can cause bad updates
- we take too large a step and reach a bad parameter configuration (with large loss)
- In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)
- Solution
- Gradient clipping
- if the norm of the gradient is greater than some threshold, scale it down before applying SGD update
- Intuition
- take a step in the same direction, but a smaller step
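A minimal sketch of clipping by global gradient norm, reusing the names (`model`, `criterion`, `optimizer`, `batches`, `vocab_size`) from the hypothetical training loop above; the threshold is illustrative:

```python
import torch

max_norm = 5.0                                   # illustrative clipping threshold

for inputs, targets in batches:
    loss = criterion(model(inputs).reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()

    # Clip by global L2 norm; the built-in helper does the same thing:
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    total_norm = torch.sqrt(sum((g.detach() ** 2).sum() for g in grads))
    if total_norm > max_norm:
        for g in grads:
            g.mul_(max_norm / total_norm)        # same direction, smaller step

    optimizer.step()                             # apply the (possibly clipped) SGD update
```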
- Motivation for new architectures
- It’s too difficult for the RNN to learn to preserve information over many timesteps
- Two new types of RNN:
- LSTM
- A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem.
- On step t, there is a hidden state h(t) and a cell state c(t)
- Both are vectors of length n
- The cell stores long-term information
- The LSTM can erase, write and read information from the cell
- The selection of which information is erased/written/read is controlled by three corresponding gates
- The gates are also vectors of length n
- On each timestep, each element of the gates can be open (1), closed (0), or somewhere in-between
- The gates are `dynamic`
- their value is computed based on the current context
- Define
- We have a sequence of inputs x(t) , and we will compute a sequence of hidden states h(t) and cell states c(t).
- On timestep t, the gates and the new states are computed as follows (standard equations sketched below):
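The standard LSTM equations (a sketch; σ is the logistic sigmoid, ⊙ is elementwise product, and the W, U, b matrices/vectors are learned parameters):

```latex
% Standard LSTM update for one timestep
\begin{align*}
f^{(t)} &= \sigma\big(W_f\, h^{(t-1)} + U_f\, x^{(t)} + b_f\big) && \text{forget gate}\\
i^{(t)} &= \sigma\big(W_i\, h^{(t-1)} + U_i\, x^{(t)} + b_i\big) && \text{input gate}\\
o^{(t)} &= \sigma\big(W_o\, h^{(t-1)} + U_o\, x^{(t)} + b_o\big) && \text{output gate}\\
\tilde{c}^{(t)} &= \tanh\big(W_c\, h^{(t-1)} + U_c\, x^{(t)} + b_c\big) && \text{new cell content}\\
c^{(t)} &= f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)} && \text{cell: erase (forget) some old content, write some new}\\
h^{(t)} &= o^{(t)} \odot \tanh\big(c^{(t)}\big) && \text{hidden state: read some content from the cell}
\end{align*}
```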
- How does LSTM solve vanishing gradients?
- The LSTM architecture makes it easier for the RNN to preserve information over many timesteps
- e.g. if the forget gate is set to remember everything on every timestep, then the info in the cell is preserved indefinitely
- By contrast, it's harder for a vanilla RNN to learn a recurrent weight matrix W_h that preserves info in the hidden state
- LSTM doesn’t guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies
- Real-world success
- In 2013-2015, LSTMs started achieving state-of-the-art results
- Successful tasks include: handwriting recognition, speech recognition, machine translation, parsing, image captioning
- LSTM became the dominant approach
- Now (2019), other approaches (e.g. Transformers) have become more dominant for certain tasks
- For example in WMT (an MT conference + competition)
- In WMT 2016, the summary report contains “RNN” 44 times
- In WMT 2018, the report contains “RNN” 9 times and “Transformer” 63 times
- GRU
- Proposed by Cho et al. in 2014 as a simpler alternative to the LSTM
- Define
- On each timestep t we have an input x(t) and a hidden state h(t) (no cell state)
- Update gate
- controls what parts of hidden state are updated vs preserved
- Reset gate
- controls what parts of previous hidden state are used to compute new content
- New hidden state content
- the reset gate selects useful parts of the previous hidden state; use this and the current input to compute the new hidden content
- Hidden state
- the update gate simultaneously controls what is kept from the previous hidden state and what is updated to the new hidden state content
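The standard GRU equations (same notational conventions as the LSTM sketch above; u is the update gate, r the reset gate):

```latex
% Standard GRU update for one timestep
\begin{align*}
u^{(t)} &= \sigma\big(W_u\, h^{(t-1)} + U_u\, x^{(t)} + b_u\big) && \text{update gate}\\
r^{(t)} &= \sigma\big(W_r\, h^{(t-1)} + U_r\, x^{(t)} + b_r\big) && \text{reset gate}\\
\tilde{h}^{(t)} &= \tanh\big(W_h\,(r^{(t)} \odot h^{(t-1)}) + U_h\, x^{(t)} + b_h\big) && \text{new hidden content}\\
h^{(t)} &= (1 - u^{(t)}) \odot h^{(t-1)} + u^{(t)} \odot \tilde{h}^{(t)} && \text{hidden state}
\end{align*}
```

With u^(t) = 0, the previous hidden state is copied through unchanged, which is what "setting the update gate to 0" refers to below.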
- How does this solve vanishing gradient?
- Like LSTM, GRU makes it easier to retain info long-term (e.g. by setting update gate to 0)
- LSTM vs GRU
- Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely-used
- The biggest difference is that GRU is quicker to compute and has fewer parameters
- There is no conclusive evidence that one consistently performs better than the other
- LSTM is a good default choice (especially if your data has particularly long dependencies, or you have lots of training data)
- Rule of thumb
- start with LSTM, but switch to GRU if you want something more efficient
- Other fixes for vanishing (or exploding) gradient
- Gradient clipping
- Skip connections
- Effect on RNN-LM
- Due to vanishing gradient, RNN-LMs are better at learning from `sequential recency` than `syntactic recency`, so they make errors such as long-distance subject-verb agreement mistakes more often than we'd like [Linzen et al., 2016]
- Is vanishing/exploding gradient just a RNN problem?
- No! It can be a problem for all neural architectures (including `feed-forward` and `convolutional`), especially deep ones.
- Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it backpropagates
- Thus lower layers are learnt very slowly (hard to train)
- Solution: add more direct connections (e.g. skip/residual connections), which let the gradient flow through the network more directly
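A minimal sketch of that idea for a deep feed-forward net, using a hypothetical ResNet-style residual block (illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes x + f(x): the identity path gives the gradient a direct route down."""
    def __init__(self, dim=256):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)            # skip connection

# A deep stack of such blocks is much easier to train than the same stack without skips.
deep_net = nn.Sequential(*[ResidualBlock(256) for _ in range(16)])
out = deep_net(torch.randn(8, 256))
```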
- Conclusion
- Though vanishing/exploding gradients are a general problem, `RNNs are particularly unstable` due to the repeated multiplication by the `same` weight matrix [Bengio et al., 1994]
- More fancy RNN variants
- Bidirectional RNNs
- Motivation
- Example task: sentiment classification, where the RNN hidden state at each position is used as a contextual representation of that word
- These contextual representations only contain information about the left context
- What about the right context?
- Notation
- RNN_{FW}
- This is a general notation to mean “compute one forward step of the RNN” – it could be a vanilla, LSTM or GRU computation
- RNN_{BW}
- This is a general notation to mean “compute one backward step of the RNN” – it could be a vanilla, LSTM or GRU computation
- h(t)
- We regard this as “the hidden state” of a bidirectional RNN. This is what we pass on to the next parts of the network.
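Putting the notation together (a sketch; the arrows distinguish the two directions and [·; ·] denotes concatenation):

```latex
% Forward RNN reads the sequence left-to-right, backward RNN right-to-left:
\begin{align*}
\overrightarrow{h}^{(t)} &= \mathrm{RNN}_{\mathrm{FW}}\big(\overrightarrow{h}^{(t-1)},\, x^{(t)}\big) && \text{forward RNN}\\
\overleftarrow{h}^{(t)}  &= \mathrm{RNN}_{\mathrm{BW}}\big(\overleftarrow{h}^{(t+1)},\, x^{(t)}\big) && \text{backward RNN}\\
h^{(t)} &= \big[\overrightarrow{h}^{(t)};\, \overleftarrow{h}^{(t)}\big] && \text{concatenation: the hidden state passed on to the rest of the network}
\end{align*}
```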
- Note
- bidirectional RNNs are only applicable if you have access to the `entire input sequence`
- They are not applicable to Language Modeling, because in LM you only have left context available
- If you do have entire input sequence (e.g. any kind of encoding), `bidirectionality is powerful` (you should use it by default)
- For example
- BERT (Bidirectional Encoder Representations from Transformers) is a powerful pretrained contextual representation system `built on bidirectionality`
- Multi-layer RNNs aka stacked RNNs
- RNNs are already “deep” on one dimension (they unroll over many timesteps)
- We can also make them “deep” in another dimension by `applying multiple RNNs` – this is a multi-layer RNN (see the PyTorch sketch at the end of this subsection)
- This allows the network to compute `more complex representations`
- The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features
- In practice
- High-performing RNNs are often multi-layer (but aren’t as deep as convolutional or feed-forward networks)
- For example
- In a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4 layers is best for the encoder RNN and 4 layers is best for the decoder RNN
- However, skip-connections/dense-connections are needed to train deeper RNNs (e.g. 8 layers)
- Transformer-based networks (e.g. BERT) can be up to 24 layers
- You will learn about Transformers later; they have a lot of skipping-like connections
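A minimal PyTorch sketch of a stacked (and bidirectional) LSTM encoder, as referenced above; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

# A 2-layer bidirectional LSTM encoder
encoder = nn.LSTM(
    input_size=128,        # e.g. word-embedding size
    hidden_size=256,       # hidden size per direction
    num_layers=2,          # stacked RNN: layer 2 reads layer 1's hidden states
    bidirectional=True,    # a forward and a backward RNN at every layer
    batch_first=True,
)

x = torch.randn(8, 20, 128)    # (batch, seq_len, embed_dim) dummy input
outputs, (h_n, c_n) = encoder(x)
print(outputs.shape)           # torch.Size([8, 20, 512]): forward and backward states concatenated
print(h_n.shape)               # torch.Size([4, 8, 256]): num_layers * num_directions final states
```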
- In summary
- Lots of new information today! What are the practical takeaways?
- LSTMs are powerful but GRUs are faster
- Clip your gradients
- Use bidirectionality when possible
- Multi-layer RNNs are powerful, but you might need skip/dense-connections if it’s deep
Reference: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture07-fancy-rnn.pdf