text/docs/api_docs/python/text/MaskValuesChooser.md at master · tensorflow/text · GitHub
Skip to content

Latest commit

 

History

History
173 lines (127 loc) · 4.13 KB

File metadata and controls

173 lines (127 loc) · 4.13 KB

description: Assigns values to the items chosen for masking.

text.MaskValuesChooser

View source

Assigns values to the items chosen for masking.

text.MaskValuesChooser(
    vocab_size, mask_token, mask_token_rate=0.8, random_token_rate=0.1
)

MaskValuesChooser encapsulates the logic for deciding the value to assign items that where chosen for masking. The following are the behavior in the default implementation:

For mask_token_rate of the time, replace the item with the [MASK] token:

my dog is hairy -> my dog is [MASK]

For random_token_rate of the time, replace the item with a random word:

my dog is hairy -> my dog is apple

For 1 - mask_token_rate - random_token_rate of the time, keep the item unchanged:

my dog is hairy -> my dog is hairy.

The default behavior is consistent with the methodology specified in Masked LM and Masking Procedure described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805.pdf).

Users may further customize this with behavior through subclassing and overriding get_mask_values().

Args

`vocab_size` size of vocabulary.
`mask_token` The id of the mask token.
`mask_token_rate` (optional) A float between 0 and 1 which indicates how often the `mask_token` is substituted for tokens selected for masking. Default is 0.8, NOTE: `mask_token_rate` + `random_token_rate` <= 1.
`random_token_rate` A float between 0 and 1 which indicates how often a random token is substituted for tokens selected for masking. Default is 0.1. NOTE: `mask_token_rate` + `random_token_rate` <= 1.

Attributes

`mask_token`
`random_token_rate`
`vocab_size`

Methods

get_mask_values

View source

get_mask_values(
    masked_lm_ids
)

Get the values used for masking, random injection or no-op.

Args
`masked_lm_ids` a `RaggedTensor` of n dimensions and dtype int32 or int64 whose values are the ids of items that have been selected for masking.
Returns
a `RaggedTensor` of the same dtype and shape with `masked_lm_ids` whose values contain either the mask token, randomly injected token or original value.