Sunbelt Computer Software

description: Base class for detokenizer implementations.

text.Detokenizer

Base class for detokenizer implementations.

text.Detokenizer(
    name=None
)

A `Detokenizer` is a module that combines tokens to form strings. Generally, subclasses of `Detokenizer` will also be subclasses of `Tokenizer`, and the `detokenize` method will be the inverse of the `tokenize` method, i.e., `tokenizer.detokenize(tokenizer.tokenize(s)) == s`.
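The round-trip property can be illustrated without TensorFlow. The following is a minimal sketch that uses plain Python strings and lists in place of tensors; `SpaceTokenizer` is a hypothetical class invented for illustration and is not part of the library.

```python
class SpaceTokenizer:
    """Hypothetical plain-Python stand-in; not a tensorflow_text class."""

    def tokenize(self, s):
        # Split on single spaces so that no spacing information is lost.
        return s.split(" ")

    def detokenize(self, tokens):
        # Join with the same separator, which exactly undoes tokenize.
        return " ".join(tokens)


tokenizer = SpaceTokenizer()
s = "hello  world"  # note the double space
tokens = tokenizer.tokenize(s)
round_trip = tokenizer.detokenize(tokens)

print(tokens)           # ['hello', '', 'world']
print(round_trip == s)  # True
```

Splitting on a fixed separator, rather than on arbitrary whitespace, is what makes the inverse exact here; a tokenizer that discards information (for example, by collapsing repeated spaces) could only be inverted approximately.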

Each Detokenizer subclass must implement a detokenize method, which combines tokens together to form strings. E.g.:

>>> import tensorflow as tf
>>> import tensorflow_text as tf_text
>>> class SimpleDetokenizer(tf_text.Detokenizer):
...   def detokenize(self, input):
...     # Join the tokens along the innermost axis, separated by spaces.
...     return tf.strings.reduce_join(input, axis=-1, separator=" ")
>>> text = tf.ragged.constant([["hello", "world"], ["a", "b", "c"]])
>>> print(SimpleDetokenizer().detokenize(text))
tf.Tensor([b'hello world' b'a b c'], shape=(2,), dtype=string)

Methods

`detokenize`

@abc.abstractmethod
detokenize(
    input
)

Assembles the tokens in the input tensor into a string.

Generally, `detokenize` is the inverse of the `tokenize` method, and can be used to reconstruct a string from a set of tokens. This is especially helpful when the tokens are integer ids, such as indexes into a vocabulary table. In that case, the tokenized encoding is not very human-readable (since it's just a list of integers), so the `detokenize` method can be used to turn it back into something more readable.

Args
`input`	An N-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`.
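To make the integer-id case concrete, here is a minimal sketch in plain Python, with nested lists standing in for a 2-D `RaggedTensor`. The vocabulary `VOCAB` and the helper `detokenize_ids` are assumptions made for illustration only; they are not part of the library.

```python
# Hypothetical vocabulary: the position of each token is its integer id.
VOCAB = ["[UNK]", "hello", "world", "a", "b", "c"]


def detokenize_ids(batch_of_ids, vocab=VOCAB, separator=" "):
    """Turn a batch (2-D list) of id lists into a 1-D list of strings."""
    # Look up each id in the vocabulary, then join each row's tokens.
    return [separator.join(vocab[i] for i in ids) for ids in batch_of_ids]


ids = [[1, 2], [3, 4, 5]]  # rows may have different lengths (ragged)
strings = detokenize_ids(ids)
print(strings)  # ['hello world', 'a b c']
```

In this sketch the innermost dimension is the one that gets joined, so a 2-D batch of ids becomes a 1-D list of strings.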
