2024-07-20

Seq2seq

This post covers Seq2Seq Encoder-Decoder Architecture. Seq2Seq architecture is heavily used in areas such as Machine Translation, Natural Language Processing, Chat Bots, etc. This post heavily referenced Josh Starmer’s Sequence-to-Sequence (seq2seq) Encoder-Decoder Neural Networks, Clearly Explained!!! video.

Related Issues
Intuition
Building Encoders
Building Decoders
Training

Issue of different length: In machine translation process, there is a problem of unmatching number of input vectors and desired output vectors.
In other words, in most of the cases, sequences of one type of thing would not be in 1-to-1 relationship with sequences of another type of thing.

Intuition

encoder-decoder-architecture Encoder-Decoder Architecture Figure¹

I first learn the representation of the English sentence.
The entire representation is in the output of the blue RNN cells.
Then pass that representation vector into decoder architecture - white cells.
Decoder parts starts with encoder’s input and the EOS signal.

As you can recognize, that one output vector of encoder architecture is doing a lot of heavy lifting. That one vector is trying to represent the entirely of the English sentence. Currently, it’s not a long sentence, but imagine that sentence length grows. This could be a challenging problem.

Building Encoders

Materials we need

One-hot vectors of words
Word2Vec embedding layers (~ 200K tokens with 1K embedding dimensions)
LSTM units (~ 4 layers with 1K LSTM cells per layer)
- LSTM units can be doubled or stacked.
- Doubled LSTM cells have their own, separate sets of weights and biases.
- Stacked LSTM cells are additional layers. The output values from the unrolled LSTM units in the first layer(the short-term memories, or hidden states) are used as the inputs to the unrolled LSTM units in the second layer.

Architecture of Encoders

encoder-decoder-in-depth Encoder-Decoder Architecture Detailed Figure²

In essence, the Encoder encodes the input sentence into a collection of long and short term memories, which is called Context Vector.

Building Decoders

Decodrs are named decoders because they decodes input context vector. Then, how do they manage this task?

Materials we need

New set of LSTMs - these LSTMs have their own and separate weights and biases, different from the ones in the Encoder.
Embedding layers that provide the input to the LSTM cells in the first layer. However, embedding layers here contains target language sets. (~ 80K tokens)
Fully connected layer - additionally transforms weights and biases values from the LSTM layers.
- Number of inputs: matches input from LSTM layers. (~ 1K inputs <=> 1K LSTM cells in the 4th layer)
- Number of outputs: matches input from embedding layers. (~ 80K outputs <=> size of the output vocabulary)
- Additionally adjusted weights and biases in between inputs and outputs.
Softmax function
- Input: Output from FC layer.
- Output: Target word representation.

When to stop?

We unroll LSTM until the next predicted token from Softmax function is (end of senstence).

Working order of Decoders

Context vectors are used to initialize the LSTMs in the Decoder.
Another input to the decoder’s LSTMs comes from the word embedding layer that starts with .
But subsequently uses whatever word was predicted by the previously unrolled LSTM layer
Decoders keep predicts words until it predicts the token, or hits some maximum output length.

Training Encoder-Decoder

Teacher Forcing

While training, instead of using predicted token for subsequently unrolled LSTMs, we use the known and correct token, no matter of the model’s current prediction.
Also the model stops predicting even if it doesn’t hit by its predction, rather it stops where **the known phase ends**.

Figure from this article ↩
Figure from this article ↩

Seq2seq

Table of contents

Related Issues

Intuition

Building Encoders

Materials we need

Architecture of Encoders

Building Decoders

Materials we need

When to stop?

Working order of Decoders

Training Encoder-Decoder

Teacher Forcing