MT5

Overview

The mT5 model was presented in mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua and Colin Raffel.

The abstract from the paper is the following:

The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We describe the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. All of the code and model checkpoints used in this work are publicly available.

Note: mT5 was pre-trained only on mC4, without any supervised training. Therefore, unlike the original T5 model, it has to be fine-tuned before it is usable on a downstream task. Because the pre-training was unsupervised, there is no real advantage to using a task prefix during single-task fine-tuning; if you are doing multi-task fine-tuning, however, you should use a prefix, as sketched below.
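
For multi-task fine-tuning, the prefix is simply prepended to each source text before tokenization. The snippet below is an illustrative sketch only; the prefix strings and example data are made up and are not prescribed by the paper.

>>> # Illustrative: tag each training example with its task so a single
>>> # mT5 model can be fine-tuned on several tasks at once.
>>> examples = [
...     ("summarize: ", "mT5 is a multilingual variant of T5 ...", "mT5 extends T5 to 101 languages."),
...     ("translate English to German: ", "How are you?", "Wie geht es dir?"),
... ]
>>> source_texts = [prefix + source for prefix, source, _ in examples]
>>> target_texts = [target for _, _, target in examples]
>>> source_texts[0]
'summarize: mT5 is a multilingual variant of T5 ...'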

Google has released the following variants:

  • google/mt5-small

  • google/mt5-base

  • google/mt5-large

  • google/mt5-xl

  • google/mt5-xxl

Paper: https://arxiv.org/abs/2010.11934
Official code: https://github.com/google-research/multilingual-t5

MT5Config

class tf_transformers.models.mt5.MT5Config(vocab_size=250112, embedding_size=512, num_hidden_layers=8, num_attention_heads=6, attention_head_size=64, intermediate_size=1024, hidden_act='gelu', intermediate_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=-1, type_vocab_size=-1, initializer_range=0.02, layer_norm_epsilon=1e-06, position_embedding_type='relative', bidirectional=True, positional_buckets=32, decoder_start_token_id=0)[source]

This is the configuration class to store the configuration of an MT5Model. It is used to instantiate an MT5 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the mT5 small (google/mt5-small) architecture.

Configuration objects inherit from TransformerConfig and can be used to control the model outputs. Read the documentation from TransformerConfig for more information.

Parameters
  • vocab_size (int, optional, defaults to 250112) – Vocabulary size of the MT5 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MT5Model or MT5Encoder.

  • embedding_size (int, optional, defaults to 512) – Dimensionality of vocabulary embeddings.

  • num_hidden_layers (int, optional, defaults to 8) – Number of hidden layers in the Transformer encoder.

  • num_attention_heads (int, optional, defaults to 6) – Number of attention heads for each attention layer in the Transformer encoder.

  • attention_head_size (int, optional, defaults to 64) – Size of each attention head. Normally embedding_size // num_attention_heads, but set independently in mT5.

  • intermediate_size (int, optional, defaults to 1024) – The dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

  • hidden_act (str or Callable, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu" and "silu", among others, are supported.

  • hidden_dropout_prob (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • max_position_embeddings (int, optional, defaults to -1) – The maximum sequence length that this model might ever be used with. mT5 uses relative position embeddings, so this defaults to -1 (unused).

  • type_vocab_size (int, optional, defaults to -1) – The vocabulary size of the token_type_ids passed when calling MT5Model or MT5Encoder; mT5 does not use token type embeddings, so this defaults to -1.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • layer_norm_epsilon (float, optional, defaults to 1e-06) – The epsilon used by the layer normalization layers.

  • position_embedding_type (str, optional, defaults to "relative") – Type of position embedding. mT5 uses T5-style relative position biases ("relative"); "absolute" selects standard absolute position embeddings.

  • bidirectional (bool, optional, defaults to True) – Whether relative position biases are bidirectional. The encoder uses bidirectional=True, while the decoder uses bidirectional=False.

  • positional_buckets (int, optional, defaults to 32) – The number of relative position buckets used by each attention layer, as sketched below.
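
These relative position settings follow the T5-style bucketing scheme: each signed query-to-key offset is mapped to one of positional_buckets bucket ids, with nearby offsets kept exact and distant offsets binned logarithmically. The NumPy function below is only an illustrative sketch of that scheme, not this library's internal implementation; the max_distance=128 cut-off is an assumption carried over from the T5 defaults.

>>> import numpy as np
>>> def relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
...     """Map signed relative positions (key_pos - query_pos) to bucket ids, T5-style (sketch)."""
...     buckets = np.zeros_like(relative_position)
...     if bidirectional:
...         # Half the buckets for looking back, half for looking ahead.
...         num_buckets //= 2
...         buckets += (relative_position > 0).astype(np.int64) * num_buckets
...         relative_position = np.abs(relative_position)
...     else:
...         # Decoder-style: only non-positive offsets (past positions) are distinguished.
...         relative_position = -np.minimum(relative_position, 0)
...     max_exact = num_buckets // 2
...     is_small = relative_position < max_exact
...     # Offsets beyond max_exact are binned logarithmically up to max_distance.
...     large = max_exact + (
...         np.log(np.maximum(relative_position, 1) / max_exact)
...         / np.log(max_distance / max_exact)
...         * (num_buckets - max_exact)
...     ).astype(np.int64)
...     large = np.minimum(large, num_buckets - 1)
...     return buckets + np.where(is_small, relative_position, large)

>>> relative_position_bucket(np.arange(-4, 5), bidirectional=True, num_buckets=32)
array([ 4,  3,  2,  1,  0, 17, 18, 19, 20])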

Examples:

>>> from tf_transformers.models import MT5Config, MT5Model
>>> # Initializing a default (mt5-small style) MT5 configuration
>>> configuration = MT5Config()

>>> # Initializing an MT5 configuration with custom values
>>> configuration_new = MT5Config(
...      embedding_size=768,
...      num_attention_heads=12,
...      intermediate_size=3072,
...  )

>>> # Initializing a model from the original configuration
>>> model = MT5Model.from_config(configuration)

>>> # Accessing the model configuration
>>> configuration = model._config_dict # This has more details than the original configuration

MT5Model

MT5Encoder