An abstract base class for splitters that return offsets.
Inherits From: Splitter
text.SplitterWithOffsets(
name=None
)
Each `SplitterWithOffsets` subclass must implement the `split_with_offsets`
method, which returns a tuple containing the pieces together with the start and
end offsets at which those pieces occur in the input string. For example:
>>> class CharSplitter(SplitterWithOffsets):
...   def split_with_offsets(self, input):
...     chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
...     lengths = tf.expand_dims(tf.strings.length(input), -1)
...     ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
...     return chars, starts, ends
...   def split(self, input):
...     return self.split_with_offsets(input)[0]
>>> pieces, starts, ends = CharSplitter().split_with_offsets("a😊c")
>>> print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'\xf0\x9f\x98\x8a' b'c'] [0 1 5] [1 5 6]
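
The offsets in the output above appear to be byte positions in the UTF-8 encoding, with each end offset exclusive: the emoji occupies four bytes, so it spans offsets 1 to 5. The following is a minimal plain-Python sketch of that convention, not the library's implementation. The helper `char_split_with_offsets` is a hypothetical name introduced only for illustration, and it works on `bytes` rather than tensors.

```python
# Plain-Python illustration of the offset convention (no TensorFlow involved).
# `char_split_with_offsets` is a hypothetical helper, not part of tensorflow_text.

def char_split_with_offsets(data: bytes):
    """Split UTF-8 bytes into characters, returning byte start/end offsets."""
    pieces, starts, ends = [], [], []
    pos = 0
    for ch in data.decode("utf-8"):
        width = len(ch.encode("utf-8"))  # number of bytes this character uses
        pieces.append(data[pos:pos + width])
        starts.append(pos)               # inclusive start offset
        ends.append(pos + width)         # exclusive end offset
        pos += width
    return pieces, starts, ends


data = "a😊c".encode("utf-8")
pieces, starts, ends = char_split_with_offsets(data)
print(pieces, starts, ends)

# Slicing the original bytes with each (start, end) pair recovers the piece.
recovered = [data[s:e] for s, e in zip(starts, ends)]
print(recovered == pieces)
```

Under this convention, a caller can map any piece back to its exact location in the original input, which is useful for tasks such as highlighting or span alignment.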
@abc.abstractmethod
split(
input
)
Splits the input tensor into pieces.
Generally, the pieces returned by a splitter correspond to substrings of the original string, and can be encoded using either strings or integer ids.
>>> print(tf_text.WhitespaceTokenizer().split("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)
| Args | |
|---|---|
| `input` | An N-dimensional UTF-8 string (or optionally integer) `Tensor` or `RaggedTensor`. |
| Returns | |
|---|---|
| An N+1-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`. For each string from the input tensor, the final, extra dimension contains the pieces that string was split into. |
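
The "N dimensions in, N+1 dimensions out" rule described above can be sketched with nested Python lists standing in for tensors. This is only an analogy under stated assumptions: `split_nested` is a hypothetical helper, and `str.split` stands in for a real splitter.

```python
# Nested lists stand in for tensors; `split_nested` is a hypothetical helper.

def split_nested(value, split_fn):
    """Apply split_fn to every string, adding one innermost level of nesting."""
    if isinstance(value, str):
        return split_fn(value)  # a scalar string becomes a list of pieces
    return [split_nested(item, split_fn) for item in value]


scalar = "small medium large"           # 0 dimensions
batch = ["small medium large", "a bb"]  # 1 dimension

print(split_nested(scalar, str.split))  # 1 dimension
print(split_nested(batch, str.split))   # 2 dimensions
```

The rows of the batched result have different lengths (three pieces and two pieces), which suggests why the return value may be a `RaggedTensor` rather than a dense `Tensor`.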
@abc.abstractmethod
split_with_offsets(
input
)
Splits the input tensor, and returns the resulting pieces with offsets.
>>> splitter = tf_text.WhitespaceTokenizer()
>>> pieces, starts, ends = splitter.split_with_offsets("a bb ccc")
>>> print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
| Args | |
|---|---|
| `input` | An N-dimensional UTF-8 string (or optionally integer) `Tensor` or `RaggedTensor`. |
