Transformer Models: A Beginner's Guide
Transformers are a model architecture in Machine Learning used for processing and generating sequential data, such as text.
To put it simply:
“A transformer is a type of artificial intelligence model that learns to understand and generate human-like text by analyzing patterns in large amounts of text data.”
Did you know? Transformer models power Natural Language Processing (NLP), letting computers understand language like never before.
Revolutionizing NLP: How Transformers are Used for Various Tasks
Transformers have revolutionized natural language processing (NLP) and have been extended to other areas like computer vision and audio processing. Transformer models are used to solve all kinds of NLP tasks. Some functionalities that are available in the pipeline() function of the Transformers library in Python are:
- feature-extraction (get the vector representation of a text)
- fill-mask (fill in the blanks of a given text)
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification (allows you to specify which labels to use for the classification, so you don't have to rely on the labels of the pretrained model)
Check out the implementation of these pipelines here.
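As a taste, here is a minimal sketch of two of these pipelines using the Hugging Face transformers library (the printed outputs are illustrative; exact scores depend on the default checkpoint that gets downloaded):

```python
from transformers import pipeline

# Sentiment analysis: returns a label and a confidence score
classifier = pipeline("sentiment-analysis")
print(classifier("I love learning about Transformers!"))
# Illustrative output: [{'label': 'POSITIVE', 'score': 0.999...}]

# Zero-shot classification: you choose the candidate labels yourself
zero_shot = pipeline("zero-shot-classification")
print(zero_shot(
    "This tutorial explains the Transformer architecture.",
    candidate_labels=["education", "politics", "sports"],
))
```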
Transformers: A Powerful Architecture for Language Models
Transformer models such as GPT, BERT, BART, and T5 are trained as language models using vast quantities of raw text. They utilize self-supervised learning, where the training objectives are derived automatically from the input data, eliminating the need for human-labeled datasets.
Above, I’ve outlined some applications of Transformer Models in text processing. Curious about how these models work? Let’s break it down in a simple way for you.
Inside Transformer Models: Understanding Their Workings
Before diving into the Transformer architecture, let's first understand some terminology:
Transfer Learning
In transfer learning, you leverage knowledge (features, weights, etc.) from previously trained models to train newer models, which even helps when you have less data for the new task!
Pretraining is the act of training a model from scratch: the weights are randomly initialized, and training starts without any prior knowledge. It requires a very large corpus of data, and training can take up to several weeks.
Fine-tuning, on the other hand, is the training done after a model has been pretrained. The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of the knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task).
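The difference between these two starting points is easy to see in code. Here is a minimal sketch with the transformers library, using bert-base-uncased purely as an example checkpoint:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

# Pretraining starts from scratch: build the architecture from a config,
# so the weights are randomly initialized and carry no prior knowledge.
config = AutoConfig.from_pretrained("bert-base-uncased")
scratch_model = AutoModelForSequenceClassification.from_config(config)

# Fine-tuning starts from pretrained weights: the model already has a
# statistical understanding of the language before it sees your data.
pretrained_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
```

Both models have the same architecture; only the second one starts with the knowledge acquired during pretraining, which is why it needs far less data to fine-tune.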
Attention Layer
A key feature of Transformer models is that they are built with special layers called attention layers. These layers tell the model to pay specific attention to certain words in the sentence it was given (and more or less ignore the others) when building the representation of each word.
Attention layers are used in both the encoder and decoder.
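To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the computation at the heart of these layers (the toy vectors below are made up purely for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each output row is a weighted mix of the rows of V; the weights
    # say how strongly each word attends to every other word.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how similar each query is to each key
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Toy example: a "sentence" of 3 words, each a 4-dimensional vector
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x))  # self-attention: Q = K = V
```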
Transformer Basic Architecture
Without going into complexities, here's how you can understand the Transformer architecture. If we consider a Transformer for language translation as a simple black box, it would take a sentence in one language, English for instance, as input and output its translation in another language, Spanish for instance.
If we dive a little deeper, we observe that this black box (the Transformer) is composed of two main parts:
- The encoder takes in our input and outputs a matrix representation of it; in our example, the English sentence "How are you?"
- The decoder takes in that encoded representation and iteratively generates an output. In our example, the translated sentence “¿Cómo estás?”
The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence. The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated.
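Sticking with our English-to-Spanish example, here is a minimal sketch using the translation pipeline; Helsinki-NLP/opus-mt-en-es is just one publicly available checkpoint for this language pair:

```python
from transformers import pipeline

# The pipeline runs the full encoder-decoder loop for us:
# encode the English sentence, then decode Spanish token by token.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
print(translator("How are you?"))
# Illustrative output: [{'translation_text': '¿Cómo estás?'}]
```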
Do all Transformer models follow this architecture?
The answer to this question is NO. Not all Transformer models follow this exact architecture. The basic Transformer architecture, as described, consists of an encoder and a decoder designed for sequence-to-sequence tasks like translation. However, there are several variants of the Transformer architecture tailored for different NLP tasks.
Categories
Broadly, Transformers can be grouped into three categories:
- BERT-like (Encoder-only models)
- GPT-like (Decoder-only models)
- BART/T5-like (Encoder-decoder models)
1. Encoder models
Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.
Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.
BERT has an encoder-only architecture: the input is text and the output is a sequence of embeddings.
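Here is a minimal sketch of extracting those embeddings, again using bert-base-uncased as an example checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("How are you?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One embedding vector per token: (batch, num_tokens, hidden_size)
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 6, 768])
```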
2. Decoder models
Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.
The pretraining of decoder models usually revolves around predicting the next word in the sentence. These models are best suited for tasks involving text generation.
GPT-2 has a decoder-only architecture: the input is text and the output is the next word (token), which is then appended to the input.
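Here is a minimal sketch of that generation loop using the gpt2 checkpoint (the generated text will vary from run to run):

```python
from transformers import pipeline

# GPT-2 predicts one token at a time; each prediction is appended
# to the input and fed back in until the text is long enough.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_new_tokens=20))
```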
3. Encoder-decoder/ Sequence-to-sequence models
Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.
Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
T5 has an encoder-decoder (full Transformer) architecture: the input is text and the output is the next word (token), which is then appended to the decoder input.
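T5 frames every task as text-to-text by prepending a task prefix to the input. A minimal sketch with the t5-small checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The encoder reads the whole prefixed input; the decoder then
# generates the output one token at a time.
inputs = tokenizer("translate English to German: How are you?",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Illustrative output: "Wie geht es Ihnen?"
```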
Limitations
Pretrained models are powerful yet inherently limited. They are trained on extensive datasets that may contain a mix of high-quality and problematic content from the internet. Consequently, there’s a risk of the model producing biased content, including sexist, racist, or homophobic material. Merely fine-tuning the model with specific data does not eradicate these biases.
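One way to observe this yourself, following an example popularized by the Hugging Face course, is to probe a fill-mask pipeline and compare its completions when only the subject changes (the exact predictions depend on the checkpoint):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Compare the top predicted occupations for two otherwise identical prompts
for sentence in ["This man works as a [MASK].",
                 "This woman works as a [MASK]."]:
    predictions = unmasker(sentence)
    print(sentence, "->", [p["token_str"] for p in predictions])
```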
Helping Material:
This reading is for beginners only. If you want to learn more about Transformers, check out the following links:
[1] How Transformers Work: A Detailed Exploration of Transformer Architecture | DataCamp
[2] [1706.03762] Attention Is All You Need (arxiv.org)
[3] The Transformer Model. A Step by Step Breakdown of the… | by Kheirie Elhariri | Towards Data Science
[4] Transformer Models 101: Getting Started — Part 1 | by Nandini Bansal | Towards Data Science
[1] and [2] are highly recommended.