text/docs/api_docs/python/text/combine_segments.md at master · tensorflow/text · GitHub
description: Combine one or more input segments for a model's input sequence.

text.combine_segments

Combine one or more input segments for a model's input sequence.

text.combine_segments(
    segments, start_of_sequence_id, end_of_segment_id
)

combine_segments combines the tokens of one or more input segments into a single sequence of token values and generates matching segment ids. combine_segments can follow a Trimmer, which limits segment lengths and emits RaggedTensor outputs, and can be followed by a ModelInputPacker.

See Detailed Experimental Setup in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805.pdf) for more examples of combined segments.

combine_segments first flattens and combines a list of one or more segments (RaggedTensors of n dimensions) together along the 1st axis, then packages any special tokens into a final n dimensional RaggedTensor.

Finally, combine_segments generates another RaggedTensor (with the same rank as the combined RaggedTensor) that contains a distinct int id for each segment.
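The steps described above can be sketched in plain Python on nested lists. This is only an illustration of the behaviour for 2-D inputs, not the library's implementation; the helper name `combine_segments_sketch` is hypothetical, and the ids 101 and 102 are simply the values used in the example on this page.

```python
def combine_segments_sketch(segments, start_of_sequence_id, end_of_segment_id):
    """Pure-Python illustration for 2-D inputs (a batch of token lists per segment)."""
    if not segments:
        raise ValueError("at least one segment is required")
    batch_size = len(segments[0])
    if any(len(seg) != batch_size for seg in segments):
        raise ValueError("all segments must share the same batch size")

    combined, segment_ids = [], []
    for row in range(batch_size):
        # Every row starts with the start-of-sequence id, tagged as segment 0.
        tokens, ids = [start_of_sequence_id], [0]
        for seg_index, seg in enumerate(segments):
            # Append the segment's tokens followed by its end-of-segment marker.
            piece = list(seg[row]) + [end_of_segment_id]
            tokens.extend(piece)
            # The marker gets the same segment id as the tokens it closes.
            ids.extend([seg_index] * len(piece))
        combined.append(tokens)
        segment_ids.append(ids)
    return combined, segment_ids


segment_a = [[1, 2], [3, 4], [5, 6, 7, 8, 9]]
segment_b = [[10, 20], [30, 40, 50, 60], [70, 80]]
combined, ids = combine_segments_sketch([segment_a, segment_b], 101, 102)
print(combined[0])  # [101, 1, 2, 102, 10, 20, 102]
print(ids[0])       # [0, 0, 0, 0, 1, 1, 1]
```

Note that the start-of-sequence id is attributed to segment 0, and each end-of-segment id is attributed to the segment it terminates.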

Example usage:

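import tensorflow as tf
import tensorflow_text as text

segment_a = tf.ragged.constant([[1, 2],
                                [3, 4],
                                [5, 6, 7, 8, 9]])

segment_b = tf.ragged.constant([[10, 20],
                                [30, 40, 50, 60],
                                [70, 80]])
# 101 and 102 are the special-token ids that appear in the expected output.
expected_combined, expected_ids = text.combine_segments(
    [segment_a, segment_b], start_of_sequence_id=101, end_of_segment_id=102)

# segment_a and segment_b have been combined, with special tokens marking
# the start of the sequence and the end of each segment inserted.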
expected_combined=[
 [101, 1, 2, 102, 10, 20, 102],
 [101, 3, 4, 102, 30, 40, 50, 60, 102],
 [101, 5, 6, 7, 8, 9, 102, 70, 80, 102],
]

# ids describing which items belong to which segment.
expected_ids=[
 [0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]]
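The row lengths in the expected output follow directly from the inputs: each combined row holds one start-of-sequence token plus, for every segment, that segment's tokens and one end-of-segment token. The following plain-Python check is a small sketch of that arithmetic using the values above; the helper `expected_row_length` is hypothetical and not part of the library.

```python
segment_a = [[1, 2], [3, 4], [5, 6, 7, 8, 9]]
segment_b = [[10, 20], [30, 40, 50, 60], [70, 80]]
expected_combined = [
    [101, 1, 2, 102, 10, 20, 102],
    [101, 3, 4, 102, 30, 40, 50, 60, 102],
    [101, 5, 6, 7, 8, 9, 102, 70, 80, 102],
]
expected_ids = [
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
]


def expected_row_length(row_segments):
    # One start token, then each segment's tokens plus one end-of-segment token.
    return 1 + sum(len(tokens) + 1 for tokens in row_segments)


for row, (a, b) in enumerate(zip(segment_a, segment_b)):
    assert len(expected_combined[row]) == expected_row_length([a, b])
    assert len(expected_ids[row]) == len(expected_combined[row])
    # Segment 0 covers the start token, the tokens of a, and a's end marker.
    assert expected_ids[row].count(0) == len(a) + 2
    # Segment 1 covers the tokens of b and b's end marker.
    assert expected_ids[row].count(1) == len(b) + 1

row_lengths = [len(r) for r in expected_combined]
print(row_lengths)  # [7, 9, 10]
```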

Args

`segments` A list of `RaggedTensor`s with the tokens of the input segments. All elements must have the same dtype (int32 or int64), the same rank, and the same dimension 0 (the batch size). Slice `segments[i][j, ...]` contains the tokens of the i-th input segment for the j-th example in the batch.
`start_of_sequence_id` A Python int or scalar Tensor containing the id used to denote the start of a sequence (e.g. the `[CLS]` token in BERT terminology).
`end_of_segment_id` A Python int or scalar Tensor containing the id used to denote the end of a segment (e.g. the `[SEP]` token in BERT terminology).

Returns

A tuple of (combined_segments, segment_ids), where:
`combined_segments` A `RaggedTensor` with segments combined and special tokens inserted.
`segment_ids` A `RaggedTensor` with the same shape as `combined_segments`, containing for each item the int id of the segment it belongs to.
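Because both outputs are ragged and share one shape, a downstream packing step can pad them identically to a fixed length. The sketch below shows that idea in plain Python on nested lists; `pad_to_dense` is a hypothetical helper written for this illustration, not the library's packer, and the output names `input_word_ids`, `input_mask`, and `input_type_ids` are assumed BERT-style conventions.

```python
def pad_to_dense(ragged_rows, max_seq_length, pad_value=0):
    """Hypothetical helper: truncate or right-pad every row to max_seq_length."""
    padded, mask = [], []
    for row in ragged_rows:
        kept = list(row[:max_seq_length])
        n_pad = max_seq_length - len(kept)
        padded.append(kept + [pad_value] * n_pad)
        # 1 marks a real token, 0 marks padding.
        mask.append([1] * len(kept) + [0] * n_pad)
    return padded, mask


combined_segments = [[101, 1, 2, 102, 10, 20, 102],
                     [101, 3, 4, 102, 30, 40, 50, 60, 102]]
segment_ids = [[0, 0, 0, 0, 1, 1, 1],
               [0, 0, 0, 0, 1, 1, 1, 1, 1]]

input_word_ids, input_mask = pad_to_dense(combined_segments, max_seq_length=10)
input_type_ids, _ = pad_to_dense(segment_ids, max_seq_length=10)
print(input_word_ids[0])  # [101, 1, 2, 102, 10, 20, 102, 0, 0, 0]
print(input_mask[0])      # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

In this sketch, a row longer than `max_seq_length` would lose its trailing tokens, including the final end-of-segment id, which is one reason to limit segment lengths before combining rather than after.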