BigBird Roberta Tokenizer
Overview
This page describes how to use BigBirdTokenizerTFText with tensorflow-text.
The tokenizer works in sync with tf.data.Dataset
and is therefore useful for on-the-fly tokenization (see the sketch below).
Its vocabulary is essentially a GPT2-style vocabulary extension in SentencePiece, with 100 extra reserved IDs.
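Because the tokenizer is built on tensorflow-text ops, it can run inside a tf.data pipeline. A minimal sketch of on-the-fly tokenization might look like the following; the toy sentences and the assumption that the tokenizer can be passed directly to Dataset.map are illustrative, not a statement of the official API:

>>> import tensorflow as tf
>>> from tf_transformers.models import BigBirdTokenizerTFText
>>> tokenizer = BigBirdTokenizerTFText.from_pretrained("google/bigbird-roberta-large")
>>> dataset = tf.data.Dataset.from_tensor_slices(
...     {'text': ['First sentence.', 'A second, slightly longer sentence.']})
>>> dataset = dataset.batch(2)        # batches match the {'text': list_of_strings} call signature
>>> dataset = dataset.map(tokenizer)  # tokenization happens on the fly, inside the pipeline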
>>> from tf_transformers.models import BigBirdTokenizerTFText
>>> tokenizer = BigBirdTokenizerTFText.from_pretrained("google/bigbird-roberta-large")
>>> text = ['The following statements are true about sentences in English:',
...         '',
...         'A new sentence begins with a capital letter.']
>>> inputs = {'text': text}
>>> outputs = tokenizer(inputs) # Ragged Tensor Output
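By default the call returns a dict of tf.RaggedTensor, so each example keeps its true length. When a dense tensor is needed, a ragged tensor can be zero-padded with .to_tensor(); the 'input_ids' key name below is an assumption about the output dict:

>>> input_ids = outputs['input_ids']   # tf.RaggedTensor (key name assumed)
>>> dense_ids = input_ids.to_tensor()  # zero-pads each row to the longest example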
>>> # Dynamic padding
>>> tokenizer = BigBirdTokenizerTFText.from_pretrained("google/bigbird-roberta-large",
...                                                    dynamic_padding=True)
>>> text = ['The following statements are true about sentences in English:',
...         '',
...         'A new sentence begins with a capital letter.']
>>> inputs = {'text': text}
>>> outputs = tokenizer(inputs) # Dict of tf.Tensor
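Judging by the flag name and the dense output type, dynamic_padding pads each batch to its own longest sequence, so shapes are regular within a batch but can differ across batches. The shapes can be inspected directly (the exact keys depend on the pretrained config):

>>> {name: tensor.shape for name, tensor in outputs.items()}  # second dim = longest sequence in this batch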
>>> # Static padding
>>> tokenizer = BigBirdTokenizerTFText.from_pretrained("google/bigbird-roberta-large",
...                                                    pack_model_inputs=True)
>>> text = ['The following statements are true about sentences in English:',
...         '',
...         'A new sentence begins with a capital letter.']
>>> inputs = {'text': text}
>>> outputs = tokenizer(inputs) # Dict of tf.Tensor
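With pack_model_inputs=True every batch is presumably padded to one fixed maximum length rather than to the longest sequence in the batch, so all batches share a single static shape. Static shapes are what compiled training setups (for example XLA or TPU execution) generally require, which is the usual reason to prefer this mode over dynamic padding.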
>>> # To add special tokens
>>> tokenizer = BigBirdTokenizerTFText.from_pretrained("google/bigbird-roberta-large",
...                                                    add_special_tokens=True)
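>>> text = ['The following statements are true about sentences in English:',
...         '',
...         'A new sentence begins with a capital letter.']
>>> inputs = {'text': text}
>>> outputs = tokenizer(inputs)  # sequences now include the model's special tokens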