
T5

Introduction

  • T5 stands for "Text-to-Text Transfer Transformer". It was released by Google in 2020. As the name suggests, it's a transformer-based encoder-decoder model used for text generation. For more details about other text generation models, refer to the Text Generation chapter.
  • In contrast to other famous Transformer-based models like BERT or GPT, which are made up of either the encoder or the decoder part of the Transformer, the T5 paper showcases that the complete encoder-decoder architecture performs better than a decoder-only model. Apart from this, the paper also curated and released the Colossal Clean Crawled Corpus (C4) - a huge crawled and cleaned dataset for pre-training language models using self-supervised learning. (raffel2020exploring)

Transformer architecture. Left part is the encoder, right part is the decoder. T5's architecture is very similar to this one. (vaswani2017attention raffel2020exploring)

  • Due to this text-to-text nature, for training or finetuning the model requires a pair of input and output sequences/text. Some example tasks that can be performed using T5 are shown below,

T5 text-to-text framework examples. See: Google Blog

  • Based on the original T5, several variations have been explored, such as the following (refer to T5 @ HuggingFace; a short loading sketch follows the note below),

    • T5v1.1: T5v1.1 is an improved version of T5 with some architectural tweaks, and is pre-trained on C4 only without mixing in the supervised tasks.
    • mT5: mT5 is a multilingual T5 model. It is pre-trained on the mC4 corpus, which includes 101 languages.
    • byT5: byT5 is a T5 model pre-trained on byte sequences rather than SentencePiece subword token sequences.
    • Long-T5: For use cases where we need to process longer inputs. (Refer)

Note

As mentioned above, for multilingual purposes refer to mT5, which was trained on >100 languages. That said, the original T5 was also trained on translation tasks from English to German, French and Romanian, so the output can sometimes contain tokens from these languages!
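
  • Here is a minimal loading sketch for these variants (not from the original references); the checkpoint names below are the publicly available small/base sizes on the HuggingFace Hub, and other sizes exist too.

# a minimal sketch: loading the T5 variants from the HuggingFace Hub
# (checkpoint names are the commonly used small/base sizes)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoints = {
    "t5v1.1": "google/t5-v1_1-base",         # improved T5, pre-trained on C4 only
    "mt5": "google/mt5-small",               # multilingual T5 (mC4, 101 languages)
    "byt5": "google/byt5-small",             # byte-level T5, no SentencePiece subwords
    "long-t5": "google/long-t5-local-base",  # T5 variant for longer inputs
}

name = checkpoints["mt5"]
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)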

Paper details

Transformer Architecture

  • To compare different architectures suitable for language models, the T5 authors considered three varieties,
    • Encoder-Decoder: A standard encoder-decoder architecture uses fully visible masking in the encoder and the encoder-decoder attention, with causal masking (attention mask) in the decoder. Masking is done to make sure the output at a position doesn't attend to future output for prediction.
    • Language model: A language model consists of a single Transformer layer stack and is fed the concatenation of the input and target, using a causal mask throughout. As usual with LMs, the output only attends to the past input or output.
    • Prefix LM: Adding a prefix to a language model corresponds to allowing fully-visible masking over a portion of the input. It is very similar to the LM, except that the output can attend to the portion of the input containing the prefix, which could contain task-specific information like translate English to German:.

Schematics of the Transformer architecture variants considered by T5 paper. (raffel2020exploring)

  • The T5 authors found the encoder-decoder architecture to perform better than the other variants. A toy sketch of the three masking patterns is shown below.
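
  • To make the masking difference concrete, here is a toy sketch (not from the paper's codebase) that builds the three attention-mask patterns for a sequence of length 6 whose first 3 positions act as the input/prefix.

# toy sketch of the three attention-mask patterns discussed above
import torch

seq_len, prefix_len = 6, 3

# fully-visible mask: every position attends to every other position
# (used in the encoder and in the encoder-decoder cross attention)
fully_visible = torch.ones(seq_len, seq_len, dtype=torch.bool)

# causal mask: position i attends only to positions <= i
# (used in the decoder and in a plain language model)
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()

# prefix-LM mask: causal mask, but the prefix portion is fully visible
prefix_lm = causal.clone()
prefix_lm[:, :prefix_len] = True

print(fully_visible.int())
print(causal.int())
print(prefix_lm.int())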

Pre-training Strategy

  • T5 is trained with a multi-task learning methodology, where the idea is to combine multiple tasks while pre-training the model. These tasks are further grouped into two categories based on how they are trained,
    • Unsupervised training:
      • This includes training on the C4 dataset using self-supervised, language-modeling style objectives with maximum likelihood.
      • For unsupervised tasks like MLM, T5 has 100 special tokens <extra_id_0> to <extra_id_99> which can be used to format the input and output text. For the sentence "My name is Mohit Mayank", if we want to mask "name is", the input becomes My <extra_id_0> Mohit Mayank and the required output will be <extra_id_0> name is <extra_id_1>.
    • Supervised training:
      • This includes several NLP tasks like question-answering, summarization, classification, etc. The model is trained on curated training data in a supervised fashion, but all these tasks are transformed into the text-in, text-out format suitable for encoder-decoder models.
      • The data for supervised training is created separately for the input and output text. The input looks like {task_prefix}: {input_sequences}</s>, and the output text looks like <pad> {output_sequence}</s>. One example could be translate English to German: The house is wonderful.</s> for the input and <pad> Das Haus ist wunderbar.</s> for the output. This is passed to the model, the loss is computed on the output part and then backpropagated to decrease the loss. A minimal sketch of both formats is shown after this list.
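
  • Below is a minimal sketch of how both formats above map onto the HuggingFace API, using the t5-small checkpoint; the sentences are just the examples from the text, and the </s> and decoder-side <pad> tokens are added internally by the tokenizer and model.

# minimal sketch: the two training formats with the HuggingFace API
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# 1. unsupervised (span corruption) format with sentinel tokens
input_ids = tokenizer("My <extra_id_0> Mohit Mayank", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> name is <extra_id_1>", return_tensors="pt").input_ids
loss = model(input_ids=input_ids, labels=labels).loss

# 2. supervised (task prefix) format
input_ids = tokenizer("translate English to German: The house is wonderful.",
                      return_tensors="pt").input_ids
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
# passing labels makes the model compute the cross-entropy loss on the output part
loss = model(input_ids=input_ids, labels=labels).loss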

Note

T5 authors also released checkpoint models which are only unsupervised trained on the C4 dataset. More details and model is available here.

  • T5 also compared different unsupervised objectives, i.e. different training strategies for the unsupervised training, which could lead to better performance. A visual guide to the search space is shown below.

A flow chart of the exploration of unsupervised objectives by the T5 paper. (raffel2020exploring)

  • To begin with, there were three high-level approaches: (1) Language modeling, where you predict the next word based on the previous words; (2) BERT-style, where you mask certain words and the model predicts those masked words; and (3) Deshuffling, where you shuffle the sentence and the model predicts the unshuffled, correct sentence as output. After experimenting with these, the BERT-style approach gave the best result and hence was selected for the next level of analysis.
  • Next, different corruption strategies were tried. BERT originally proposed a MASK-based approach, but for the language-model setting the T5 authors observed that replace span works better. In replace span, you mask consecutive tokens and the language model predicts the masked spans.

Span corruption used for unsupervised training in T5. (raffel2020exploring)

  • Finally, different corruption rates and corruption span lengths were experimentally selected. A toy sketch of replace-span corruption is shown below.
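
  • Here is a toy, illustrative implementation of replace-span corruption; the helper function is an assumption for illustration only (the paper's actual preprocessing pipeline differs), with defaults of a 15% corruption rate and a mean span length of 3, the settings the paper settled on.

# toy sketch of replace-span corruption (illustrative, not the paper's code)
import random

def corrupt_spans(tokens, corruption_rate=0.15, mean_span_len=3):
    # budget of tokens to corrupt, roughly corruption_rate of the sequence
    n_corrupt = max(1, int(len(tokens) * corruption_rate))
    inputs, targets, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if n_corrupt > 0 and random.random() < corruption_rate:
            # replace a consecutive span with a single sentinel token ...
            span_len = min(mean_span_len, n_corrupt, len(tokens) - i)
            inputs.append(f"<extra_id_{sentinel}>")
            # ... and ask the model to predict that span after the same sentinel
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + span_len])
            i += span_len
            n_corrupt -= span_len
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")
    return " ".join(inputs), " ".join(targets)

src, tgt = corrupt_spans("Thank you for inviting me to your party last week".split())
print(src)
print(tgt)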

Performance score

  • The T5 paper reports the following performance scores on different datasets,

T5 performance score on different datasets. (raffel2020exploring)

Code

T5 Inference

  • Running T5 is super easy using HuggingFace. Let's do it,
# install packages (the slow T5Tokenizer needs sentencepiece)
!pip install transformers sentencepiece

# import
from transformers import T5Tokenizer, T5ForConditionalGeneration

# load the tokenizers and model
tokenizer = T5Tokenizer.from_pretrained("t5-small") # vocab size is 32100.
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# for a phrase get the tokenised input ids
input_ids = tokenizer("translate English to German: I am going to the party.", return_tensors="pt").input_ids
# use the input ids to generate the output
outputs = model.generate(input_ids)
# decode the output token ids to text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
## Output --> 
## Ich werde zur Partei gehen.
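
  • Since the same checkpoint was pre-trained on a mixture of supervised tasks, we can switch tasks just by changing the task prefix. A couple of other prefixes from that mixture are sketched below (summarization and CoLA grammatical acceptability); the exact outputs will vary with the checkpoint size.

# the same checkpoint handles other tasks just by changing the task prefix
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

for prompt in [
    "summarize: The tower is 324 metres tall, about the same height as an "
    "81-storey building, and the tallest structure in Paris.",
    "cola sentence: The course is jumping well.",
]:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))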

T5 finetuning

  • Before we dive into finetuning, here are some tips if you are going to use PyTorch or Keras.
    • We can use a high learning rate for the AdamW optimizer, in the range of 1e-4 to 3e-4. Btw, T5 was originally pre-trained with the AdaFactor optimizer.
    • We can add a task-specific prefix like translate English to German: or summarize: to the input sequence if the task is similar to the ones T5 was originally pre-trained with.
    • We should replace the PAD token ids (0) in the labels with -100 so that they are ignored in the loss computation. Btw, the PAD token is used as the start-of-sequence token for the labels (the text that is to be generated).
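    • A minimal sketch of this label preparation, assuming plain HuggingFace/PyTorch training rather than SimpleT5,

# minimal sketch: ignore PAD positions in the loss computation
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# tokenize the target texts (labels) with padding to the same length
labels = tokenizer(["positive", "negative sentiment"], padding=True,
                   return_tensors="pt").input_ids
# replace PAD token ids (0) with -100 so the loss function ignores them
labels[labels == tokenizer.pad_token_id] = -100
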
  • To keep things simple, we can use SimpleT5, an excellent package that abstracts away a lot of the technicalities. For the dataset, we can go with the Tweet sentiment data, which can be downloaded from here.
  • Some differences from training other text generation models (due to the SimpleT5 package),
    • We don't need a Dataset class, as SimpleT5 works directly on pandas dataframes. Hence we load the data, do some initial pre-processing, split the data and return pandas dataframes. (no need to tokenize or create a Dataset class - isn't this great!?)
    • One more point to note is that we do not need to create prompt formats for this package. This way we can separate out the input tweet and sentiment label into different columns, here source_text and target_text respectively (see how train_data and test_data are created in the code below).
# import
import pandas as pd
from simplet5 import SimpleT5
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Data loading
# -------

# Data load function
def load_sentiment_dataset(random_seed = 1):
    # load the dataset and sample 10k tweets
    file_path="../input/sentiment140/training.1600000.processed.noemoticon.csv"
    df = pd.read_csv(file_path, encoding='ISO-8859-1', header=None)
    df = df[[0, 5]]
    df.columns = ['label', 'text']
    df = df.sample(10000, random_state=random_seed)

    # modify the label  
    map_label = {0:'negative', 4: 'positive'}
    df['label'] = df['label'].apply(lambda x: map_label[x])

    # divide into test and train
    X_train, X_test, y_train, y_test = \
              train_test_split(df['text'].tolist(), df['label'].tolist(),
              shuffle=True, test_size=0.05, random_state=random_seed, stratify=df['label'])

    # transform train and test data into pandas dataframe
    train_data = pd.DataFrame({'source_text': X_train, 'target_text': y_train})    
    test_data = pd.DataFrame({'source_text': X_test, 'target_text': y_test})    

    # return
    return train_data, test_data

# load
train_df, test_df = load_sentiment_dataset()  

# Train
# -------

# load model
model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="t5-base")

# train model
model.train(train_df=train_df,
            eval_df=test_df, 
            source_max_token_len=300, 
            target_max_token_len=200, 
            batch_size=8, 
            max_epochs=2, 
            outputdir = "outputs",
            use_gpu=True
            )

# Test
# -------

# load the best model
last_epoch_model = "..." # put the name here
model.load_model("t5", last_epoch_model, use_gpu=True)

# for each test data perform prediction
predictions = []
for index, row in test_df.iterrows():
    prediction = model.predict(row['source_text'])[0]
    predictions.append(prediction)

# compute the performance
df = test_df.copy()
df['predicted_label'] = predictions
df['original_label'] = df['target_text']
print(f1_score(df['original_label'], df['predicted_label'], average='macro'))

References

[1] Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer - Link

[2] T5 finetuning tips - HF forum