MT5
Overview
The mT5 model was presented in mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua and Colin Raffel.
The abstract from the paper is the following:
The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We describe the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. All of the code and model checkpoints used in this work are publicly available.
Note: mT5 was only pre-trained on mC4, without any supervised training. Unlike the original T5 model, it therefore has to be fine-tuned before it is usable on a downstream task. Since mT5 was pre-trained in an unsupervised fashion, there is no real advantage to using a task prefix during single-task fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix, as sketched below.
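For multi-task fine-tuning, a common pattern (inherited from T5) is to prepend a short task prefix to each input string before tokenization. The following is a minimal sketch of that idea only; the prefix strings and the build_examples helper are illustrative assumptions, not part of tf_transformers.

# A minimal sketch of task-prefixed inputs for multi-task fine-tuning.
# The prefixes and the helper below are hypothetical, not part of tf_transformers.

def build_examples(task_prefix, inputs, targets):
    """Pair each raw input with its target, prepending the task prefix."""
    return [(f"{task_prefix}: {text}", target) for text, target in zip(inputs, targets)]

summarization = build_examples(
    "summarize",
    ["mT5 is a multilingual variant of T5 that was pre-trained on the mC4 corpus ..."],
    ["mT5 is a multilingual T5 model."],
)
translation = build_examples(
    "translate English to German",
    ["The house is wonderful."],
    ["Das Haus ist wunderbar."],
)

# Mix the tasks into one training stream; the prefix tells the model which task to perform.
training_examples = summarization + translation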
Google has released the following variants:
- google/mt5-small
- google/mt5-base
- google/mt5-large
- google/mt5-xl
- google/mt5-xxl
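A minimal sketch of loading one of these variants, assuming the checkpoint has been converted for tf_transformers and that MT5Model exposes the library's usual from_pretrained helper (only from_config is documented in this section, so both the helper and the checkpoint name "mt5-small" are assumptions):

from tf_transformers.models import MT5Model

# Assumption: "mt5-small" is a checkpoint name understood by a from_pretrained
# helper; only from_config is documented in this section.
model = MT5Model.from_pretrained("mt5-small")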
MT5Config
class tf_transformers.models.mt5.MT5Config(vocab_size=250112, embedding_size=512, num_hidden_layers=8, num_attention_heads=6, attention_head_size=64, intermediate_size=1024, hidden_act='gelu', intermediate_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=-1, type_vocab_size=-1, initializer_range=0.02, layer_norm_epsilon=1e-06, position_embedding_type='relative', bidirectional=True, positional_buckets=32, decoder_start_token_id=0)[source]

This is the configuration class to store the configuration of an MT5Model. It is used to instantiate an MT5 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the mT5 small architecture.

Configuration objects inherit from TransformerConfig and can be used to control the model outputs. Read the documentation of TransformerConfig for more information.

Parameters
- vocab_size (int, optional, defaults to 250112) – Vocabulary size of the MT5 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling MT5Model or MT5Encoder.
- embedding_size (int, optional, defaults to 512) – Dimensionality of the vocabulary embeddings.
- num_hidden_layers (int, optional, defaults to 8) – Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 6) – Number of attention heads for each attention layer in the Transformer encoder.
- attention_head_size (int, optional, defaults to 64) – Size of the attention heads in each layer, normally embedding_size // num_attention_heads.
- intermediate_size (int, optional, defaults to 1024) – Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
- hidden_act (str or Callable, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "silu" and others are supported.
- intermediate_act (str or Callable, optional, defaults to "gelu") – The non-linear activation function (function or string) of the intermediate (feed-forward) layer.
- hidden_dropout_prob (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- attention_probs_dropout_prob (float, optional, defaults to 0.1) – The dropout probability for the attention probabilities.
- max_position_embeddings (int, optional, defaults to -1) – The maximum sequence length that this model might ever be used with. MT5 relies on relative position embeddings, so this defaults to -1 (unused).
- type_vocab_size (int, optional, defaults to -1) – The vocabulary size of the token_type_ids passed when calling MT5Model. Defaults to -1 (unused).
- initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_epsilon (float, optional, defaults to 1e-06) – The epsilon used by the layer normalization layers.
- position_embedding_type (str, optional, defaults to "relative") – Type of position embedding. MT5 uses relative position embeddings ("relative"). For background on relative position representations, refer to Self-Attention with Relative Position Representations (Shaw et al.) and to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
- bidirectional (bool, optional, defaults to True) – Direction of the relative position embeddings: the encoder uses bidirectional=True, while the decoder uses bidirectional=False.
- positional_buckets (int, optional, defaults to 32) – The number of buckets used for relative position embeddings in each attention layer.
- decoder_start_token_id (int, optional, defaults to 0) – The id of the token used as the first input to the decoder.
Examples:
>>> from tf_transformers.models import MT5Config, MT5Model
>>> # Initializing an mT5 small style configuration
>>> configuration = MT5Config()
>>> # Initializing an MT5 configuration with custom settings
>>> configuration_new = MT5Config(
...     embedding_size=768,
...     num_attention_heads=12,
...     intermediate_size=3072,
... )
>>> # Initializing a model from the default configuration
>>> model = MT5Model.from_config(configuration)
>>> # Accessing the model configuration
>>> configuration = model._config_dict  # This has more details than the original configuration
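The defaults above describe the mT5 small geometry. A minimal sketch of building a deliberately tiny configuration for quick tests, using only the constructor arguments and the from_config / _config_dict calls shown above (the specific values are illustrative assumptions):

from tf_transformers.models import MT5Config, MT5Model

# A deliberately tiny configuration for quick tests; values are illustrative.
tiny_config = MT5Config(
    vocab_size=250112,       # keep the mT5 vocabulary size
    embedding_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    attention_head_size=32,  # embedding_size // num_attention_heads
    intermediate_size=256,
    positional_buckets=32,
)

tiny_model = MT5Model.from_config(tiny_config)

# The resolved configuration (with any derived fields) can be inspected on the model.
print(tiny_model._config_dict)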