BigBird Roberta Tokenizer

Overview

This page describes how to use BigBirdTokenizerTFText with tensorflow-text. The tokenizer works in sync with tf.data.Dataset, which makes it useful for on-the-fly tokenization (see the tf.data sketch below). Its vocabulary is a SentencePiece model, comparable to a GPT-2 style vocabulary extended with 100 extra reserved IDs.

>>> from tf_transformers.models import BigBirdTokenizerTFText
>>> tokenizer = BigBirdTokenizerTFText.from_pretrained("google/bigbird-roberta-large")
>>> text = ['The following statements are true about sentences in English:',
...         '',
...         'A new sentence begins with a capital letter.']
>>> inputs = {'text': text}
>>> outputs = tokenizer(inputs) # Ragged Tensor Output
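
Because the outputs are produced by TensorFlow ops, the same tokenizer can be mapped over a tf.data.Dataset for on-the-fly tokenization, as mentioned above. A minimal sketch continuing the session (the batch size and the use of AUTOTUNE are illustrative choices, not requirements):

>>> import tensorflow as tf
>>> dataset = tf.data.Dataset.from_tensor_slices({'text': text})
>>> dataset = dataset.batch(2)  # the tokenizer expects a batch of raw strings
>>> dataset = dataset.map(tokenizer, num_parallel_calls=tf.data.AUTOTUNE)
>>> for batch in dataset:
...     print(batch)  # each batch is a dict of RaggedTensor values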

# Dynamic Padding
>>> tokenizer = BigBirdTokenizerTFText.from_pretrained("google/bigbird-roberta-large",
...                                                     dynamic_padding=True)
>>> text = ['The following statements are true about sentences in English:',
...         '',
...         'A new sentence begins with a capital letter.']
>>> inputs = {'text': text}
>>> outputs = tokenizer(inputs) # Dict of tf.Tensor
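
With dynamic_padding=True, each tensor is padded to the longest sequence in the current batch, so the outputs are dense tf.Tensor values rather than RaggedTensor values. One way to inspect this (the exact key names in the output dict depend on the tokenizer configuration, so treat them as an assumption):

>>> for name, tensor in outputs.items():
...     print(name, tensor.shape)  # second dimension equals the longest sequence in this batch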

# Static Padding
>>> tokenizer = BigBirdTokenizerTFText.from_pretrained("google/bigbird-roberta-large",
...                                                     pack_model_inputs=True)
>>> text = ['The following statements are true about sentences in English:',
...         '',
...         'A new sentence begins with a capital letter.']
>>> inputs = {'text': text}
>>> outputs = tokenizer(inputs) # Dict of tf.Tensor
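
With pack_model_inputs=True, outputs are instead padded to a fixed maximum sequence length, so shapes are identical across batches; static shapes like this are what compiled graphs and TPU training typically require. The same inspection applies, except the shapes no longer vary from batch to batch (again, the exact keys in the dict are an assumption):

>>> for name, tensor in outputs.items():
...     print(name, tensor.shape)  # shapes stay constant across batches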

# To Add Special Tokens
>>> tokenizer = BigBirdTokenizerTFText.from_pretrained("google/bigbird-roberta-large",
...                                                     add_special_tokens=True)
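
When add_special_tokens=True, each sequence is wrapped with the vocabulary's special tokens before any padding is applied; for BigBird Roberta checkpoints these are conventionally [CLS] and [SEP], but the exact tokens come from the loaded vocabulary, so treat that as an assumption of this sketch:

>>> outputs = tokenizer({'text': ['A new sentence begins with a capital letter.']})
>>> # the first and last ids of each row are now the special-token ids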

BigBirdRobertaTokenizerTFText

BigBirdRobertaTokenizerLayer