Sunbelt Computer Software

description: Tokenizes UTF-8 by splitting when there is a change in Unicode script.

text.UnicodeScriptTokenizer

Tokenizes UTF-8 by splitting when there is a change in Unicode script.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

text.UnicodeScriptTokenizer(
    keep_whitespace=False
)

By default, this tokenizer drops characters that have the Unicode whitespace property (use the keep_whitespace argument to keep them), so on plain text its output resembles that of the WhitespaceTokenizer. It also splits wherever the Unicode script changes in the input string, so a run of punctuation gets its own token because it is in a different script from the surrounding letters.

Example:

>>> import tensorflow_text as tf_text
>>> tokenizer = tf_text.UnicodeScriptTokenizer()
>>> tokens = tokenizer.tokenize(["xy.,z de", "fg?h", "abαβ"])
>>> print(tokens.to_list())
[[b'xy', b'.,', b'z', b'de'], [b'fg', b'?', b'h'],
 [b'ab', b'\xce\xb1\xce\xb2']]

>>> tokens = tokenizer.tokenize(u"累計7239人")
>>> print(tokens)
tf.Tensor([b'\xe7\xb4\xaf\xe8\xa8\x88' b'7239' b'\xe4\xba\xba'], shape=(3,),
          dtype=string)

In the first string, the tokenizer splits at both the punctuation and the whitespace: the punctuation run is emitted as a token, while the whitespace is dropped (by default). The third string, abαβ, changes script from Latin to Greek without any whitespace in between, so it is split at that boundary.
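To make the splitting rule concrete, here is a minimal pure-Python sketch of the same idea. It is not the library's implementation: the real tokenizer uses ICU UScriptCode values, whereas the hypothetical helper `rough_script` below only approximates a script label from the standard `unicodedata` module (first word of the character name for letters, one shared bucket for digits, punctuation, and symbols). The names `rough_script` and `script_tokenize` are assumptions made for illustration.

```python
import unicodedata


def rough_script(ch):
    """Approximate a script label for one character (NOT the ICU lookup)."""
    if ch.isspace():
        return "WHITESPACE"
    # For letters, use the first word of the Unicode name,
    # e.g. "LATIN", "GREEK", "CJK".
    if unicodedata.category(ch).startswith("L"):
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    # Assumption: digits, punctuation, and symbols share one bucket.
    return "COMMON"


def script_tokenize(text, keep_whitespace=False):
    """Split text wherever the approximate script label changes."""
    tokens = []
    current = ""
    current_script = None
    for ch in text:
        script = rough_script(ch)
        if script != current_script and current:
            # A boundary was reached: emit the finished run unless it is
            # whitespace that the caller asked to drop.
            if keep_whitespace or current_script != "WHITESPACE":
                tokens.append(current)
            current = ""
        current += ch
        current_script = script
    if current and (keep_whitespace or current_script != "WHITESPACE"):
        tokens.append(current)
    return tokens


print(script_tokenize("xy.,z de"))                        # ['xy', '.,', 'z', 'de']
print(script_tokenize("xy.,z de", keep_whitespace=True))  # ['xy', '.,', 'z', ' ', 'de']
print(script_tokenize("abαβ"))                            # ['ab', 'αβ']
```

The sketch returns Python strings rather than UTF-8 byte strings, and it may disagree with the real tokenizer on characters whose ICU script differs from this rough label (combining marks, for example).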

Args
`keep_whitespace`	A boolean that specifies whether to emit whitespace tokens (default `False`).

Methods

`split`

split(
    input
)

Alias for Tokenizer.tokenize.

`split_with_offsets`

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

`tokenize`

tokenize(
    input
)

Tokenizes UTF-8 by splitting when there is a change in Unicode script.

The strings are split wherever consecutive characters differ in Unicode script, or where the text switches between whitespace and non-whitespace. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

ICU-defined whitespace characters are dropped, unless the keep_whitespace option was specified at construction time.

Args
`input`	A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.

Returns
A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string.
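The shape rule can be pictured with plain nested lists. The sketch below is only an illustration of the added dimension, not of the tokenizer itself: `tokenize_nested` is a hypothetical helper, and `str.split` stands in for the real per-string tokenization.

```python
def tokenize_nested(strings, tokenize_one):
    """Apply tokenize_one to every string in an arbitrarily nested list."""
    if isinstance(strings, str):
        return tokenize_one(strings)
    return [tokenize_nested(item, tokenize_one) for item in strings]


# A 2 x 2 batch of strings.
batch = [["a b", "c"], ["d e f", "g h"]]

# str.split is a stand-in tokenizer (an assumption for this sketch).
result = tokenize_nested(batch, str.split)
print(result)
# [[['a', 'b'], ['c']], [['d', 'e', 'f'], ['g', 'h']]]
```

The outer two levels still have sizes 2 and 2, and each string has been replaced by a list whose length depends on how many tokens it produced. That variable-length innermost level is what the ragged dimension represents.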

`tokenize_with_offsets`

tokenize_with_offsets(
    input
)

Tokenizes UTF-8 by splitting when there is a change in Unicode script.

The strings are split wherever a change in Unicode script is detected between consecutive characters. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

ICU-defined whitespace characters are dropped, unless the keep_whitespace option was specified at construction time.

Example:

>>> tokenizer = tf_text.UnicodeScriptTokenizer()
>>> tokens = tokenizer.tokenize_with_offsets(["xy.,z de", "abαβ"])
>>> print(tokens[0].to_list())
[[b'xy', b'.,', b'z', b'de'], [b'ab', b'\xce\xb1\xce\xb2']]
>>> print(tokens[1].to_list())
[[0, 2, 4, 6], [0, 2]]
>>> print(tokens[2].to_list())
[[2, 4, 5, 8], [2, 6]]

>>> tokens = tokenizer.tokenize_with_offsets(u"累計7239人")
>>> print(tokens[0])
tf.Tensor([b'\xe7\xb4\xaf\xe8\xa8\x88' b'7239' b'\xe4\xba\xba'],
    shape=(3,), dtype=string)
>>> print(tokens[1])
tf.Tensor([ 0  6 10], shape=(3,), dtype=int64)
>>> print(tokens[2])
tf.Tensor([ 6 10 13], shape=(3,), dtype=int64)

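As the example shows, the call returns three values: the tokens, the start_offsets, and the end_offsets. The start_offsets and end_offsets are byte indices into the original string, not character indices. When calling with multiple string inputs, the offsets are relative to each individual source string.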
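Because the offsets count bytes, they should be applied to the UTF-8 encoded form of the input rather than to the Python string. The short sketch below reuses the offsets printed in the example above and needs only the standard library; the variable names are chosen for illustration.

```python
text = "累計7239人"
data = text.encode("utf-8")  # offsets index into these bytes

# Offsets copied from the example output above.
start_offsets = [0, 6, 10]
end_offsets = [6, 10, 13]

# Slice the byte string, then decode each slice back to text.
tokens = [data[s:e] for s, e in zip(start_offsets, end_offsets)]
decoded = [t.decode("utf-8") for t in tokens]

print(len(text), len(data))  # 7 13
print(decoded)               # ['累計', '7239', '人']
```

The string has 7 characters but 13 bytes, since each CJK character here takes three bytes in UTF-8. Slicing `text` directly with these offsets would give the wrong substrings.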

Args
`input`	A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.
