description: Tokenizes UTF-8 by splitting when there is a change in Unicode script.
Inherits From: `TokenizerWithOffsets`, `Tokenizer`, `SplitterWithOffsets`, `Splitter`
text.UnicodeScriptTokenizer(
keep_whitespace=False
)
By default, this tokenizer drops characters matching the Unicode whitespace
property (pass `keep_whitespace=True` to keep them), so in this respect the
results are similar to the WhitespaceTokenizer. A run of punctuation becomes
its own token (since it is in a different script from the surrounding letters),
and every script change in the input string is the location of a split.
>>> tokenizer = tf_text.UnicodeScriptTokenizer()
>>> tokens = tokenizer.tokenize(["xy.,z de", "fg?h", "abαβ"])
>>> print(tokens.to_list())
[[b'xy', b'.,', b'z', b'de'], [b'fg', b'?', b'h'],
[b'ab', b'\xce\xb1\xce\xb2']]
>>> tokens = tokenizer.tokenize(u"累計7239人")
>>> print(tokens)
tf.Tensor([b'\xe7\xb4\xaf\xe8\xa8\x88' b'7239' b'\xe4\xba\xba'], shape=(3,),
dtype=string)
In the first string, splits occur at both the punctuation and the whitespace, but only the punctuation run is emitted as a token; whitespace is dropped by default. The third string (`abαβ`) contains a script change without any whitespace, which results in a split at that boundary.
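The splitting rule described above can be imitated in plain Python to make the behaviour easier to reason about. The sketch below is only a rough approximation and not the library's implementation: `rough_script` and `rough_script_split` are hypothetical helpers, the script is guessed from the first word of each character's Unicode name instead of ICU's `UScriptCode`, and Python's `str.isspace` stands in for ICU's whitespace definition.

```python
import itertools
import unicodedata


def rough_script(ch):
    """Hypothetical stand-in for an ICU script lookup (not real UScriptCode)."""
    if ch.isalpha():
        # First word of the Unicode name, e.g. 'LATIN', 'GREEK', 'CJK'.
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    # Digits, punctuation and whitespace are lumped together as 'Common'.
    return "Common"


def rough_script_split(text, keep_whitespace=False):
    """Split text wherever the (is-whitespace, script) pair changes."""
    def key(ch):
        return (ch.isspace(), rough_script(ch))

    tokens = []
    for (is_space, _), run in itertools.groupby(text, key=key):
        if is_space and not keep_whitespace:
            continue  # whitespace runs are dropped by default
        tokens.append("".join(run))
    return tokens


print(rough_script_split("xy.,z de"))                        # ['xy', '.,', 'z', 'de']
print(rough_script_split("abαβ"))                            # ['ab', 'αβ']
print(rough_script_split("xy.,z de", keep_whitespace=True))  # ['xy', '.,', 'z', ' ', 'de']
```

Unlike the real tokenizer, this sketch works on Python `str` values and returns `str` tokens, not UTF-8 byte strings in a `RaggedTensor`, and its script guesses will differ from ICU's for many characters.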
| Args | |
|---|---|
| `keep_whitespace` | A boolean that specifies whether to emit whitespace tokens (default `False`). |
split(
input
)
Alias for `Tokenizer.tokenize`.
split_with_offsets(
input
)
Alias for `TokenizerWithOffsets.tokenize_with_offsets`.
tokenize(
input
)
Tokenizes UTF-8 by splitting when there is a change in Unicode script.
The strings are split wherever consecutive characters differ in Unicode script or in whether they are whitespace. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html
ICU-defined whitespace characters are dropped, unless the keep_whitespace
option was specified at construction time.
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. |
| Returns | |
|---|---|
| A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string. |
tokenize_with_offsets(
input
)
Tokenizes UTF-8 by splitting when there is a change in Unicode script.
The strings are split wherever a change in Unicode script is detected between consecutive characters. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html
ICU-defined whitespace characters are dropped, unless the keep_whitespace option was specified at construction time.
>>> tokenizer = tf_text.UnicodeScriptTokenizer()
>>> tokens = tokenizer.tokenize_with_offsets(["xy.,z de", "abαβ"])
>>> print(tokens[0].to_list())
[[b'xy', b'.,', b'z', b'de'], [b'ab', b'\xce\xb1\xce\xb2']]
>>> print(tokens[1].to_list())
[[0, 2, 4, 6], [0, 2]]
>>> print(tokens[2].to_list())
[[2, 4, 5, 8], [2, 6]]
>>> tokens = tokenizer.tokenize_with_offsets(u"累計7239人")
>>> print(tokens[0])
tf.Tensor([b'\xe7\xb4\xaf\xe8\xa8\x88' b'7239' b'\xe4\xba\xba'],
shape=(3,), dtype=string)
>>> print(tokens[1])
tf.Tensor([ 0 6 10], shape=(3,), dtype=int64)
>>> print(tokens[2])
tf.Tensor([ 6 10 13], shape=(3,), dtype=int64)
The start_offsets and end_offsets are byte indices into the original string, not character indices. When calling with multiple string inputs, the offsets are relative to each individual source string.
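Because the offsets count bytes, they apply to the UTF-8 encoding of the input and not to a Python `str`. The following minimal sketch, in plain Python with the offset values copied from the example above, shows one way to recover the tokens from them; `slice_tokens` is a hypothetical helper and not part of the library.

```python
text = "累計7239人"
data = text.encode("utf-8")  # 13 bytes: each CJK character takes 3 bytes

# Offsets as printed by tokenize_with_offsets in the example above.
starts = [0, 6, 10]
ends = [6, 10, 13]


def slice_tokens(data, starts, ends):
    """Cut the byte string at each [start, end) pair."""
    return [data[s:e] for s, e in zip(starts, ends)]


tokens = slice_tokens(data, starts, ends)
print([t.decode("utf-8") for t in tokens])  # ['累計', '7239', '人']

# Applying the same offsets to the str would index characters instead of bytes.
print(text[0:6])  # '累計7239', not the first token
```

Here end offsets are treated as exclusive, which matches the values in the example: each token's end equals the next token's start when no whitespace separates them.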
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. |
