OpenAI GPT2

Overview

The OpenAI GPT-2 model was proposed in Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It is a causal (unidirectional) transformer pretrained with a language modeling objective on a very large corpus of ~40 GB of text data.

The abstract from the paper is the following:

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

Tips:

  • GPT-2 is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text.

  • The tf-transformers GPT-2 implementation makes use of efficient caching, making it 80x faster than comparable TF2 implementations (the sketch after these tips illustrates the next-token loop that caching accelerates).
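A minimal sketch of that next-token loop, assuming only NumPy: the random logits are a stand-in for a real forward pass (e.g. model(token_ids)), so every name and number here is purely illustrative. With caching, a real implementation reuses the attention states of all previously processed tokens at each step instead of recomputing them from scratch.

>>> import numpy as np
>>> # Greedy causal decoding in a nutshell: score every vocabulary token,
>>> # append the most likely one, and repeat with the longer sequence.
>>> vocab_size = 8
>>> rng = np.random.default_rng(42)
>>> token_ids = [0]                               # a single prompt token
>>> for _ in range(5):
...     logits = rng.normal(size=vocab_size)      # stand-in for the model's last-token logits
...     token_ids.append(int(np.argmax(logits)))  # greedy: keep the highest-scoring token
>>> print(token_ids)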


GPT2Config

class tf_transformers.models.gpt2.GPT2Config(vocab_size=50257, embedding_size=768, num_hidden_layers=12, num_attention_heads=12, attention_head_size=64, intermediate_size=3072, hidden_act='gelu', intermediate_act='gelu', hidden_dropout_prob=0, attention_probs_dropout_prob=0, max_position_embeddings=1024, type_vocab_size=-1, initializer_range=0.02, layer_norm_epsilon=1e-12, position_embedding_type='absolute')[source]

This is the configuration class to store the configuration of a GPT2Model. It is used to instantiate a GPT2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the GPT2 base architecture.

Configuration objects inherit from TransformerConfig and can be used to control the model outputs. Read the documentation from TransformerConfig for more information.

Parameters
  • vocab_size (int, optional, defaults to 50257) – Vocabulary size of the GPT2 model. Defines the number of different tokens that can be represented by the input_ids passed when calling GPT2Model or GPT2Encoder.

  • embedding_size (int, optional, defaults to 768) – Dimensionality of vocabulary embeddings.

  • num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • attention_head_size (int) – Size of each attention head. Normally embedding_size // num_attention_heads (see the sketch after this parameter list).

  • intermediate_size (int, optional, defaults to 3072) – The dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

  • hidden_act (str or Callable, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu" and "silu", among others, are supported.

  • hidden_dropout_prob (float, optional, defaults to 0) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • max_position_embeddings (int, optional, defaults to 1024) – The maximum sequence length that this model might ever be used with. Typically set this to something large (e.g., 512 or 1024 or 2048).

  • type_vocab_size (int, optional, defaults to -1) – The vocabulary size of the token_type_ids passed when calling GPT2Model or GPT2Encoder.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • layer_norm_epsilon (float, optional, defaults to 1e-12) – The epsilon used by the layer normalization layers.

  • classifier_dropout_prob (float, optional, defaults to 0.1) – The dropout ratio for attached classifiers.

  • position_embedding_type (str, optional, defaults to "absolute") – Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

  • num_hidden_groups (int, optional, defaults to 1) – Number of groups for the hidden layers, parameters in the same group are shared.
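As noted for attention_head_size above, the per-head size normally follows from the embedding size and the number of heads (768 // 12 == 64 for the base architecture). A minimal sketch of that relationship, using only arguments that appear in the signature above:

>>> from tf_transformers.models import GPT2Config
>>> embedding_size, num_attention_heads = 768, 12
>>> # The per-head size is the embedding size split evenly across the heads
>>> head_size = embedding_size // num_attention_heads   # 64
>>> config_base = GPT2Config(
...      embedding_size=embedding_size,
...      num_attention_heads=num_attention_heads,
...      attention_head_size=head_size,
...  )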

Examples:

>>> from tf_transformers.models import GPT2Config, GPT2Model
>>> # Initializing a GPT2 base style configuration
>>> configuration = GPT2Config()

>>> # Initializing a GPT2 configuration with custom arguments
>>> configuration_new = GPT2Config(
...      embedding_size=768,
...      num_attention_heads=12,
...      intermediate_size=3072,
...  )

>>> # Initializing a model from the original configuration
>>> model = GPT2Model.from_config(configuration)

>>> # Accessing the model configuration
>>> configuration = model._config_dict # This has more details than the original configuration
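As a follow-up, the customized configuration above can be fed to from_config in exactly the same way; the sketch below only assumes the from_config and _config_dict accessors shown in the examples, plus the standard-library json module for inspecting the result (default=str guards against values that are not JSON serializable).

>>> # Building a model from the customized configuration defined above
>>> model_new = GPT2Model.from_config(configuration_new)
>>> # The resulting config dict can be printed or saved for later reuse
>>> import json
>>> print(json.dumps(model_new._config_dict, indent=2, default=str))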

GPT2Model

GPT2Encoder