description: Tokenizes a tensor of UTF-8 string tokens into subword pieces.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer
text.FastWordpieceTokenizer(
vocab=None,
suffix_indicator='##',
max_bytes_per_word=100,
token_out_type=dtypes.int64,
unknown_token='[UNK]',
no_pretokenization=False,
support_detokenization=False,
model_buffer=None
)
FastWordpieceTokenizer uses the linear (as opposed to quadratic) WordPiece algorithm (see the paper).
Differences compared to the classic WordpieceTokenizer are as follows (as of 11/2021):
- `unknown_token` cannot be None or empty. That means if a word is too long or cannot be tokenized, FastWordpieceTokenizer always returns `unknown_token`. In contrast, the original WordpieceTokenizer would return the original word if `unknown_token` is empty or None.
- `unknown_token` must be included in the vocabulary.
- When `unknown_token` is returned, in tokenize_with_offsets(), the result end_offset is set to be the length of the original input word. In contrast, when `unknown_token` is returned by the original WordpieceTokenizer, the end_offset is set to be the length of the `unknown_token` string.
- `split_unknown_characters` is not supported.
- `max_chars_per_token` is not used or needed.
- By default the input is assumed to be general text (i.e., sentences), and FastWordpieceTokenizer first splits it on whitespace and punctuation and then applies the Wordpiece tokenization (see the parameter `no_pretokenization`). If the input already contains single words only, please set `no_pretokenization=True` to be consistent with the classic WordpieceTokenizer.
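To make the behavior described above concrete, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece on a single word. It is the simple quadratic variant, not the linear algorithm that FastWordpieceTokenizer implements, and the function name `wordpiece_sketch` is hypothetical. It only illustrates the role of `suffix_indicator`, `max_bytes_per_word`, and the rule that an untokenizable or overlong word maps to `unknown_token`.

```python
def wordpiece_sketch(word, vocab, suffix_indicator="##",
                     unknown_token="[UNK]", max_bytes_per_word=100):
    """Greedy longest-match-first WordPiece for one word (illustrative only)."""
    # Mirrors the documented requirement that unknown_token is in the vocabulary.
    if unknown_token not in vocab:
        raise ValueError("unknown_token must be included in the vocabulary")
    # Overlong words are mapped to the unknown token.
    if len(word.encode("utf-8")) > max_bytes_per_word:
        return [unknown_token]

    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the suffix indicator.
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unknown_token]
        pieces.append(match)
        start = end
    return pieces


vocab = {"they", "##'", "##re", "the", "great", "##est", "[UNK]"}
print(wordpiece_sketch("they're", vocab))
print(wordpiece_sketch("greatest", vocab))
print(wordpiece_sketch("xyz", vocab))
```

The sketch operates on one already-split word, which corresponds to the `no_pretokenization=True` setting.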
detokenize(
input
)
Detokenizes a tensor of int64 or int32 subword ids into sentences.
Tokenizing and then detokenizing an input string returns the original string when the input
string is normalized and the tokenized wordpieces don't contain `<unk>`.
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
... "'", "re", "ok"]
>>> tokenizer = FastWordpieceTokenizer(vocab, support_detokenization=True)
>>> ids = tf.ragged.constant([[0, 1, 2, 3, 4, 5], [9]])
>>> tokenizer.detokenize(ids)
<tf.Tensor: shape=(2,), dtype=string,
... numpy=array([b"they're the greatest", b'ok'], dtype=object)>
>>> ragged_ids = tf.ragged.constant([[[0, 1, 2, 3, 4, 5], [9]], [[4, 5]]])
>>> tokenizer.detokenize(ragged_ids)
<tf.RaggedTensor [[b"they're the greatest", b'ok'], [b'greatest']]>
| Args | |
|---|---|
| `input` | An N-dimensional `Tensor` or `RaggedTensor` of int64 or int32. |
| Returns | |
|---|---|
| A `RaggedTensor` of sentences that has N - 1 dimensions when N > 1. Otherwise, a string tensor. |
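As a rough illustration of what detokenization does conceptually, the following pure-Python sketch maps ids back to wordpieces, glues pieces that start with the suffix indicator onto the preceding word, and joins words with spaces. The function name `detokenize_sketch` is hypothetical, and the library's actual implementation may treat edge cases differently (for example, a sequence that begins with a suffix piece).

```python
def detokenize_sketch(ids, vocab, suffix_indicator="##"):
    """Rebuild a sentence from a flat list of wordpiece ids (illustrative only)."""
    words = []
    for token_id in ids:
        piece = vocab[token_id]
        if piece.startswith(suffix_indicator) and words:
            # A suffix piece continues the current word.
            words[-1] += piece[len(suffix_indicator):]
        else:
            # Any other piece starts a new word.
            words.append(piece)
    return " ".join(words)


vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
         "'", "re", "ok"]
print(detokenize_sketch([0, 1, 2, 3, 4, 5], vocab))
print(detokenize_sketch([9], vocab))
```

On the vocabulary and ids from the example above, the sketch yields the same sentences as the documented output.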
split(
input
)
Alias for Tokenizer.tokenize.
split_with_offsets(
input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize(
input
)
Tokenizes a tensor of UTF-8 string tokens further into subword tokens.
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
... no_pretokenization=True)
>>> tokens = [["they're", "the", "greatest"]]
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
[b'great', b'##est']]]>
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
... "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["they're the greatest", "the greatest"]]
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"'", b're', b'the', b'great', b'##est'],
[b'the', b'great', b'##est']]]>
| Args | |
|---|---|
| `input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
| Returns | |
|---|---|
| A `RaggedTensor` of tokens where `tokens[i, j]` is the j-th token (i.e., wordpiece) for `input[i]` (i.e., the i-th input word). This token is either the actual token string content, or the corresponding integer id, i.e., the index of that token string in the vocabulary. This choice is controlled by the `token_out_type` parameter passed to the initializer method. |
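The second example above relies on pre-tokenization: the sentence is first split on whitespace and punctuation, which is why the vocabulary there needs the standalone entries `"'"` and `"re"`. The sketch below approximates that split with a regular expression. This is an assumption for illustration only: the library's own definition of punctuation is Unicode-aware and differs in detail (for instance, `\w` here treats `_` as a word character), and `pretokenize_sketch` is a hypothetical name.

```python
import re


def pretokenize_sketch(text):
    """Split text into words and single punctuation marks (approximation)."""
    # \w+ matches a run of word characters;
    # [^\w\s] matches one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


print(pretokenize_sketch("they're the greatest"))
```

Each resulting word is then tokenized independently, so `re` is matched as a word-initial piece rather than as the suffix piece `##re`.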
tokenize_with_offsets(
input
)
Tokenizes a tensor of UTF-8 string tokens further into subword tokens.
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
... no_pretokenization=True)
>>> tokens = [["they're", "the", "greatest"]]
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
[b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5], [0], [0, 5]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7], [3], [5, 8]]]>
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
... "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["they're the greatest", "the greatest"]]
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"'", b're', b'the', b'great', b'##est'],
[b'the', b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5, 8, 12, 17], [0, 4, 9]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7, 11, 17, 20], [3, 9, 12]]]>
| Args | |
|---|---|
| `input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
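The offsets in the first example above can be reproduced with a short pure-Python sketch. It assumes offsets are byte positions within the input word, with an inclusive start and an exclusive end, and it applies the rule stated earlier that when `unknown_token` is returned the end offset is the length of the original word. The function name `piece_offsets_sketch` is hypothetical.

```python
def piece_offsets_sketch(word, pieces, suffix_indicator="##",
                         unknown_token="[UNK]"):
    """Compute (starts, ends) byte offsets of wordpieces within one word."""
    # An unknown word spans the whole original word.
    if pieces == [unknown_token]:
        return [0], [len(word.encode("utf-8"))]

    starts, ends, pos = [], [], 0
    for k, piece in enumerate(pieces):
        # Strip the suffix indicator from non-initial pieces before measuring.
        if k > 0 and piece.startswith(suffix_indicator):
            text = piece[len(suffix_indicator):]
        else:
            text = piece
        n = len(text.encode("utf-8"))
        starts.append(pos)
        ends.append(pos + n)
        pos += n
    return starts, ends


print(piece_offsets_sketch("they're", ["they", "##'", "##re"]))
print(piece_offsets_sketch("greatest", ["great", "##est"]))
print(piece_offsets_sketch("xyzzy", ["[UNK]"]))
```

With pre-tokenization enabled, the documented offsets are relative to the whole input string rather than to each word, so a per-word sketch like this one would additionally need the position of each word in the sentence.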
