text/docs/api_docs/python/text/SentencepieceTokenizer.md at master · tensorflow/text · GitHub
description: Tokenizes a tensor of UTF-8 strings.

text.SentencepieceTokenizer

Tokenizes a tensor of UTF-8 strings.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

text.SentencepieceTokenizer(
    model=None,
    out_type=dtypes.int32,
    nbest_size=0,
    alpha=1.0,
    reverse=False,
    add_bos=False,
    add_eos=False,
    return_nbest=False,
    name=None
)

SentencePiece is an unsupervised text tokenizer and detokenizer, used mainly for neural-network-based text generation systems in which the vocabulary size is fixed before the neural model is trained. It implements subword units (for example, byte-pair encoding (BPE) and the unigram language model) and can be trained directly from raw sentences.

Before using the tokenizer, you need to train a vocabulary and build a model configuration for it. See the SentencePiece repository for the most up-to-date instructions on this process.
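
As a minimal sketch of the step after training, the helper below reads a serialized model from disk and passes the bytes to the constructor. The helper name `load_sentencepiece_tokenizer` and the `model_path` argument are hypothetical; the only library call assumed is the `text.SentencepieceTokenizer` constructor documented here.

```python
def load_sentencepiece_tokenizer(model_path, **kwargs):
    """Build a text.SentencepieceTokenizer from a trained model file.

    `model_path` is a hypothetical path to a SentencePiece model file.
    Extra keyword arguments (e.g. add_bos=True) go to the constructor.
    """
    # Imported lazily so that defining this helper does not itself
    # require TensorFlow Text to be installed.
    import tensorflow_text as text

    # The constructor expects the serialized model proto (bytes), not a path.
    with open(model_path, "rb") as f:
        model_proto = f.read()
    return text.SentencepieceTokenizer(model=model_proto, **kwargs)
```

A call such as `load_sentencepiece_tokenizer("spm.model", add_eos=True)` would then return a tokenizer whose `tokenize` method accepts a list or tensor of strings.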

Args

`model` The sentencepiece model serialized proto.
`out_type` The output type: `tf.int32` or `tf.string` (default: `tf.int32`). With `tf.int32`, each string is encoded directly into a sequence of vocabulary ids.
`nbest_size` A scalar that controls sampling:
* `nbest_size = {0,1}`: No sampling is performed (default).
* `nbest_size > 1`: Samples from the `nbest_size` best results.
* `nbest_size < 0`: Treats `nbest_size` as infinite and samples from all hypotheses (the lattice) using the forward-filtering-and-backward-sampling algorithm.
`alpha` A scalar smoothing parameter: the inverse temperature used to rescale probabilities when sampling.
`reverse` Reverses the tokenized sequence (default: `False`).
`add_bos` Adds a beginning-of-sentence token to the result (default: `False`).
`add_eos` Adds an end-of-sentence token to the result (default: `False`). When `reverse=True`, the beginning- and end-of-sentence tokens are added after reversing.
`return_nbest` If `True`, requires `nbest_size` to be a scalar greater than 1. Returns the `nbest_size` best tokenizations for each sentence instead of a single one. The returned tensor has shape `[batch * nbest, (tokens)]`.
`name` The name argument that is passed to the op function.
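
To make the role of `alpha` concrete, here is a pure-Python sketch of sampling one segmentation from a list of n-best candidates, with each candidate's probability raised to the power `alpha` and renormalized. It does not call the library; the candidate segmentations and their log-probabilities are invented for illustration, and `sample_segmentation` is a hypothetical name.

```python
import math
import random

def sample_segmentation(candidates, alpha=1.0, rng=None):
    """Sample one segmentation with probability proportional to p ** alpha.

    `candidates` is a list of (pieces, log_prob) pairs.
    Returns the sampled pieces and the rescaled probabilities.
    """
    rng = rng or random.Random()
    scaled = [alpha * log_prob for _, log_prob in candidates]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    pieces = rng.choices([p for p, _ in candidates], weights=probs, k=1)[0]
    return pieces, probs

# Hypothetical 3-best list for the word "hello".
candidates = [
    (["▁hello"], -1.0),
    (["▁hel", "lo"], -2.0),
    (["▁he", "llo"], -3.0),
]
pieces, probs = sample_segmentation(candidates, alpha=1.0)
```

In this sketch, `alpha=0` makes every candidate equally likely, while larger values concentrate the probability on the best-scoring segmentation.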

Methods

detokenize

detokenize(
    input, name=None
)

Detokenizes tokens into preprocessed text.

This function accepts tokenized text and reassembles it into sentences.

Args
`input` A `RaggedTensor` or `Tensor` of UTF-8 string tokens with a rank of at least 1.
`name` The name argument that is passed to the op function.
Returns
An (N-1)-dimensional string `Tensor` or `RaggedTensor` of the detokenized text, where N is the rank of `input`.
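
The following sketch shows how `tokenize` and `detokenize` pair up. It assumes `tokenizer` is an already constructed `text.SentencepieceTokenizer` created with `out_type=tf.string`, so that the tokens match the input type documented above; `roundtrip` is a hypothetical helper name.

```python
def roundtrip(tokenizer, sentences):
    """Tokenize `sentences`, then detokenize the result.

    `sentences` is a list of Python strings (rank 1).
    """
    tokens = tokenizer.tokenize(sentences)   # rank 2: [batch, (tokens)]
    restored = tokenizer.detokenize(tokens)  # rank 1: [batch]
    return tokens, restored
```

Because the result is the preprocessed text, `restored` may differ from the original input wherever the model's normalization changed the string.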

id_to_string

id_to_string(
    input, name=None
)

Converts vocabulary ids into tokens.

Args
`input` An arbitrary tensor of int32 representing the token IDs.
`name` The name argument that is passed to the op function.
Returns
A string tensor with the same shape as `input`.

split

split(
    input
)

Alias for Tokenizer.tokenize.

split_with_offsets

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

string_to_id

string_to_id(
    input, name=None
)

Converts tokens into vocabulary ids.

This function is particularly helpful for determining the IDs for any special tokens whose ID could not be determined through normal tokenization.

Args
`input` An arbitrary tensor of string tokens.
`name` The name argument that is passed to the op function.
Returns
An int32 tensor of ids with the same shape as `input`.
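
As a sketch of the special-token use case, the helper below looks up the ids of a few special pieces. It assumes `tokenizer` is an already constructed `text.SentencepieceTokenizer` running in eager mode, and that the model uses SentencePiece's usual default piece names `<unk>`, `<s>` and `</s>`; a model trained with other settings may name them differently. `special_token_ids` is a hypothetical helper name.

```python
def special_token_ids(tokenizer, pieces=("<unk>", "<s>", "</s>")):
    """Map special-token strings to their vocabulary ids.

    The default piece names are an assumption about how the model
    was trained; pass your own names if they differ.
    """
    # string_to_id returns an int32 tensor with the same shape as its input.
    ids = tokenizer.string_to_id(list(pieces))
    # .numpy() is available on eager tensors only.
    return dict(zip(pieces, ids.numpy().tolist()))
```

The reverse lookup, `tokenizer.id_to_string(ids)`, turns such ids back into their piece strings.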

tokenize

tokenize(
    input, name=None
)

Tokenizes a tensor of UTF-8 strings.

Args
`input` A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.
`name` The name argument that is passed to the op function.
Returns
A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string.
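
The shape rule can be illustrated without TensorFlow. The sketch below uses nested Python lists in place of tensors and `str.split` as a stand-in for the real tokenizer, so it shows only how the extra ragged dimension is added, not how SentencePiece actually splits text. `add_token_dimension` is a hypothetical name.

```python
def add_token_dimension(strings, tokenize_one):
    """Apply `tokenize_one` to every string in a nested list.

    The nesting of the input is preserved, and each string is replaced
    by its list of tokens, which adds one innermost ragged dimension.
    """
    if isinstance(strings, str):
        return tokenize_one(strings)
    return [add_token_dimension(s, tokenize_one) for s in strings]

# Input of shape [2, 2]; output of shape [2, 2, (tokens)].
batch = [["a b", "c"], ["d e f", "g"]]
tokens = add_token_dimension(batch, str.split)
```

Each innermost list has its own length, which is why the real method returns a `RaggedTensor` rather than a dense `Tensor`.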

tokenize_with_offsets

tokenize_with_offsets(
    input, name=None
)

Tokenizes a tensor of UTF-8 strings.

This function returns a tuple containing the tokens along with start and end byte offsets that mark where in the original string each token was located.

Args
`input` A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.
`name` The name argument that is passed to the op function.
Returns
A tuple `(tokens, start_offsets, end_offsets)` where:
`tokens` is an N+1-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`.
`start_offsets` is an N+1-dimensional integer `Tensor` or `RaggedTensor` containing the starting indices of each token (byte indices for input strings).
`end_offsets` is an N+1-dimensional integer `Tensor` or `RaggedTensor` containing the exclusive ending indices of each token (byte indices for input strings).
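
Because the offsets are byte indices rather than character indices, they must be applied to the UTF-8 encoded string. The pure-Python sketch below shows the difference on a string with multi-byte characters; the offsets are hand-picked for illustration and are not output from the tokenizer, and `slice_by_byte_offsets` is a hypothetical helper name.

```python
def slice_by_byte_offsets(text, starts, ends):
    """Return the substrings of `text` delimited by UTF-8 byte offsets.

    `ends` are exclusive, matching the documented `end_offsets`.
    """
    data = text.encode("utf-8")
    return [data[s:e].decode("utf-8") for s, e in zip(starts, ends)]

sentence = "naïve café"
# Hypothetical offsets: "ï" and "é" each occupy two bytes in UTF-8.
starts, ends = [0, 7], [6, 12]
spans = slice_by_byte_offsets(sentence, starts, ends)

# Slicing the Python string with the same numbers counts characters
# instead of bytes and picks up the wrong span.
wrong = sentence[0:6]
```

Here `spans` recovers the two words, while `wrong` also includes the trailing space because the character at index 5 is not the byte at index 5.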

vocab_size

vocab_size(
    name=None
)

Returns the vocabulary size.

This is the number of tokens in the SentencePiece vocabulary provided at initialization.

Args
`name` The name argument that is passed to the op function.
Returns
A scalar representing the vocabulary size.