# Encoder Decoder Models

# Seq2Seq Models

# Applications

- Google Translate
- Auto Reply
- Auto Suggestion

Seminal paper: *Sequence to Sequence Learning with Neural Networks* by Ilya Sutskever et al.

Machine Translation ==> Seq to Seq Models

# Image Captioning

# Mathematically
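The objective can be written out explicitly (a standard seq2seq formulation, with the usual token-by-token factorization):

```latex
\hat{y} = \arg\max_{y} P(y \mid x)
        = \arg\max_{y} \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x)
```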

The statement above means: find the output sequence y that has the highest probability given the input sequence x. This is a very complex probability distribution and there is no closed-form solution for it.

In the (xi, yi) training pairs, xi can be a sentence in one language and yi the corresponding sentence in another language.

For machine translation the training objective is to minimize cross-entropy, which is equivalent to maximizing the log-likelihood of the target sequence.
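Since the two objectives are equivalent, here is a quick numeric check (toy probabilities over a tiny vocabulary; function names are illustrative):

```python
import math

def cross_entropy(probs, target_ids):
    """Per-sequence cross-entropy: -sum of log P(correct token)."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids))

def log_likelihood(probs, target_ids):
    """Log-likelihood of the target sequence under the model."""
    return sum(math.log(p[t]) for p, t in zip(probs, target_ids))

# Toy model outputs: distributions over a 3-word vocabulary at two time steps.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
target = [0, 1]  # the correct token ids at each step

# Minimizing one is exactly maximizing the other (they differ only in sign).
assert math.isclose(cross_entropy(probs, target), -log_likelihood(probs, target))
```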

# Practical Application

The Seq2Seq model was used by Google Translate as its core machine-translation algorithm for a while.

In Gmail it was used for auto-reply.

**Image-Text captioning**

Open Google Chrome and go to image search. Type in any description and hit enter; images matching the description will be displayed.

Google is good at text search; by using encoder-decoder models it has effectively turned image search into a text search.

**Encoder Decoder Block Diagram**

**Cho et al.**

The main change in this paper is that the context vector is passed as an input to every decoder LSTM cell.

A standard LSTM cell takes one fresh input and the hidden state from the previous time step.

In this case we have three inputs:

- Fresh input
- Input from the previous time step
- Context vector

Hence a modified LSTM cell was developed that can take three inputs.
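A minimal sketch of such a three-input cell, assuming the context vector is simply concatenated with the fresh input and the previous hidden state before the gate computations (dimensions and weights here are toy values, not taken from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_3in(x_t, h_prev, c_prev, context, W, b):
    """One LSTM step that also consumes a fixed context vector.

    All four gates read the concatenation [x_t, h_prev, context]
    instead of the usual [x_t, h_prev].
    """
    z = np.concatenate([x_t, h_prev, context])
    H = h_prev.size
    gates = W @ z + b                 # shape (4H,)
    i = sigmoid(gates[0:H])           # input gate
    f = sigmoid(gates[H:2*H])         # forget gate
    o = sigmoid(gates[2*H:3*H])       # output gate
    g = np.tanh(gates[3*H:4*H])       # candidate cell state
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Toy dimensions: input 4, hidden 3, context 5 (random weights).
rng = np.random.default_rng(0)
D, H, C = 4, 3, 5
W = rng.standard_normal((4 * H, D + H + C))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_cell_3in(rng.standard_normal(D), h, c, rng.standard_normal(C), W, b)
```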

The problem with this approach was that the modified LSTM cell was not as well optimized as the standard one, leading to lower adoption in industry.

This architecture did not produce substantially better results than Ilya Sutskever's approach.

**Image ==> Caption: Karpathy et al.**

The **encoder** is a CNN and the **decoder** is an LSTM/GRU.

The final non-linear layer of the CNN, i.e. the softmax (or tanh) classification layer, is removed. It is fine to drop this layer since we are not doing any classification.

The output of the last remaining CNN layer is the context vector, which is passed as input to the decoder: it is fed to the first LSTM cell as the input at time step **t - 1**.

The final vector is the essence of the image: an image context vector. It encapsulates all of the information in the image as learnt by the CNN model.
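A toy sketch of the idea (the "CNN" here is just a flatten-and-project stand-in, not a real backbone; names and shapes are illustrative):

```python
import numpy as np

def cnn_encoder(image, W_feat):
    """Stand-in for a CNN backbone: flatten the image and project it to
    a feature vector. A real encoder would stack conv layers; the key
    point is that no softmax/classifier layer follows -- the raw feature
    vector IS the context vector handed to the decoder."""
    return np.tanh(W_feat @ image.flatten())

rng = np.random.default_rng(1)
image = rng.random((8, 8, 3))             # toy 8x8 RGB image
W_feat = rng.standard_normal((16, 8 * 8 * 3))
context = cnn_encoder(image, W_feat)      # the "essence of the image"
```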

Any CNN model can be used; the more complex the CNN, the better the results. But this means using more compute power and more training data.

The sequence begins with a predefined START token (or all zeros) and ends with a predefined EOS token. We stop consuming input when we see EOS, and we stop generating output when the model emits EOS.
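The START/EOS convention can be sketched as a greedy decoding loop (the token ids and the toy step function here are illustrative, not from any particular implementation):

```python
import numpy as np

START, EOS = 0, 1   # special token ids (a convention assumed here)

def greedy_decode(step_fn, context, max_len=20):
    """Generate tokens one at a time: start from the START token, feed
    each predicted token back in, and stop when EOS is produced."""
    tokens, prev, state = [], START, context
    for _ in range(max_len):
        probs, state = step_fn(prev, state)
        prev = int(np.argmax(probs))      # greedy choice at each step
        if prev == EOS:
            break
        tokens.append(prev)
    return tokens

# Toy step function over a 5-token vocabulary: emits 3, then 4, then EOS.
script = iter([[0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 1, 0, 0, 0]])
def toy_step(prev_token, state):
    return next(script), state

print(greedy_decode(toy_step, context=None))  # [3, 4]
```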

# References

2. **Sequence to Sequence Learning with Neural Networks**

Ilya Sutskever, Oriol Vinyals, Quoc V. Le

https://arxiv.org/abs/1409.3215

3. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio

https://arxiv.org/abs/1406.1078

4. Deep Visual-Semantic Alignments for Generating Image Descriptions

Andrej Karpathy, Li Fei-Fei

https://arxiv.org/abs/1412.2306