BART¶
Overview¶
The Bart model was proposed in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
According to the abstract,
Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.
BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
Examples¶
- Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in examples/pytorch/summarization/.
- An example of how to train BartForConditionalGeneration with a Hugging Face datasets object can be found in this forum discussion (a rough preprocessing sketch follows this list).
- Distilled checkpoints are described in this paper.
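As a rough sketch of the second point, the snippet below prepares a Hugging Face datasets object for BartForConditionalGeneration. The dataset name "xsum", its "document"/"summary" columns, and all hyperparameters are illustrative assumptions, not part of the linked example.

# Hedged sketch: tokenize a summarization dataset for BartForConditionalGeneration.
# Dataset name and column names are assumptions; adapt them to your own data.
from datasets import load_dataset
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
raw = load_dataset("xsum", split="train[:1%]")  # small slice, just for illustration

def preprocess(batch):
    # Tokenize the source documents and the target summaries separately,
    # then attach the summary token ids as labels.
    model_inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)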
Implementation Notes¶
- Bart doesn’t use token_type_ids for sequence classification. Use BartTokenizer or encode() to get the proper splitting.
- The forward pass of BartModel will create the decoder_input_ids if they are not passed. This is different from some other modeling APIs. A typical use case of this feature is mask filling.
- Model predictions are intended to be identical to the original implementation when force_bos_token_to_be_generated=True. This only works, however, if the string you pass to fairseq.encode() starts with a space.
- generate() should be used for conditional generation tasks like summarization; see the example in that docstring and the sketch below.
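A minimal sketch of the last point, using the Hugging Face transformers classes mentioned above. The checkpoint name facebook/bart-large-cnn, the sample text, and the generation settings are illustrative assumptions.

# Hedged summarization sketch with generate().
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")  # assumed checkpoint
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "PG&E stated it scheduled the blackouts in response to forecasts for high winds."
inputs = tokenizer([article], max_length=1024, truncation=True, return_tensors="pt")

# decoder_input_ids are created internally, so only the encoder inputs are passed.
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])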
BartConfig¶
- class tf_transformers.models.bart.BartConfig(vocab_size=50265, embedding_size=768, num_hidden_layers=12, num_attention_heads=64, attention_head_size=64, intermediate_size=3072, hidden_act='gelu', intermediate_act='gelu', hidden_dropout_prob=0, attention_probs_dropout_prob=0, max_position_embeddings=1024, type_vocab_size=-1, initializer_range=0.02, layer_norm_epsilon=1e-05, position_embedding_type='absolute', decoder_start_token_id=2)[source]¶

This is the configuration class to store the configuration of a BartModel. It is used to instantiate a BART model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the BART base architecture.

Configuration objects inherit from TransformerConfig and can be used to control the model outputs. Read the documentation of TransformerConfig for more information.

- Parameters
  - vocab_size (int, optional, defaults to 50265) – Vocabulary size of the BART model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BartModel or BartEncoder.
  - embedding_size (int, optional, defaults to 768) – Dimensionality of the vocabulary embeddings.
  - embedding_projection_size (int) – Dimensionality of the encoder layers and the pooler layer. Useful for Bart.
  - num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.
  - num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.
  - attention_head_size (int) – Size of the attention heads in each layer. Normally embedding_size // num_attention_heads (see the sketch after this list).
  - intermediate_size (int, optional, defaults to 3072) – The dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
  - hidden_act (str or Callable, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", and "silu" are supported, among others.
  - hidden_dropout_prob (float, optional, defaults to 0) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  - max_position_embeddings (int, optional, defaults to 1024) – The maximum sequence length that this model might ever be used with. Typically set this to something large (e.g., 512, 1024, or 2048).
  - type_vocab_size (int, optional, defaults to -1) – The vocabulary size of the token_type_ids passed when calling BartModel or TFBartModel.
  - initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  - layer_norm_epsilon (float, optional, defaults to 1e-05) – The epsilon used by the layer normalization layers.
  - classifier_dropout_prob (float, optional, defaults to 0.1) – The dropout ratio for attached classifiers.
  - position_embedding_type (str, optional, defaults to "absolute") – Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
  - num_hidden_groups (int, optional, defaults to 1) – Number of groups for the hidden layers; parameters in the same group are shared.
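As referenced in the attention_head_size entry above, a small sketch of how the head size is typically derived from the embedding size and the number of heads. The chosen values are illustrative; only the constructor signature shown above is assumed.

from tf_transformers.models import BartConfig

embedding_size = 768
num_attention_heads = 12

# attention_head_size is normally embedding_size // num_attention_heads
config = BartConfig(
    embedding_size=embedding_size,
    num_attention_heads=num_attention_heads,
    attention_head_size=embedding_size // num_attention_heads,  # 64
    intermediate_size=4 * embedding_size,  # 3072, the common 4x feed-forward width
)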
Examples:
>>> from tf_transformers.models import BartConfig, BartModel

>>> # Initializing a BART base style configuration
>>> configuration = BartConfig()

>>> # Initializing a BART configuration with custom values
>>> configuration_new = BartConfig(
...     embedding_size=768,
...     num_attention_heads=12,
...     intermediate_size=3072,
... )

>>> # Initializing a model from the configuration
>>> model = BartModel.from_config(configuration)

>>> # Accessing the model configuration
>>> configuration = model._config_dict  # This has more details than the original configuration