libai.tokenizer

class libai.tokenizer.BertTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, do_chinese_wwm=False, **kwargs)

Construct a BERT tokenizer. Based on WordPiece.

Parameters

vocab_file (str) – Path to a one-wordpiece-per-line vocabulary file.
do_lower_case (bool, optional, defaults to True) – Whether to lowercase the input. Only has an effect when do_basic_tokenize=True.
do_basic_tokenize (bool, optional, defaults to True) – Whether to do basic tokenization before WordPiece tokenization.
never_split (Iterable, optional) – List of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True.
tokenize_chinese_chars (bool, optional, defaults to True) – Whether to tokenize Chinese characters. This should likely be deactivated for Japanese, see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328.
do_chinese_wwm (bool, optional, defaults to False) – Whether to do whole word masking for Chinese. The Chinese sentence is first segmented by a third-party tool. Each substring is then given a ‘##’ prefix, and its index is calculated as id(##A) = id(A) + vocab_size.
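
The ID rule given for do_chinese_wwm can be illustrated with a toy vocabulary. This is a minimal sketch, not LiBai's implementation: the vocabulary, the helper name `wwm_token_id`, and the choice to prefix only the continuation character of a segmented word are assumptions made for illustration.

```python
# Toy vocabulary (hypothetical); a real vocab_file lists one wordpiece per line.
vocab = {"[UNK]": 0, "中": 1, "国": 2, "人": 3}
vocab_size = len(vocab)


def wwm_token_id(token: str) -> int:
    """Look up a token ID, applying id(##A) = id(A) + vocab_size for '##' tokens."""
    if token.startswith("##"):
        return vocab[token[2:]] + vocab_size
    return vocab[token]


# Assume the segmenter grouped "中国" into one word, so "国" is marked "##国".
ids = [wwm_token_id(t) for t in ["中", "##国", "人"]]
```

Because the prefixed IDs are derived by an offset, the vocabulary file itself does not need a separate ‘##’ entry for every Chinese character.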

convert_tokens_to_string(tokens): Converts a sequence of tokens (strings) into a single string.
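
As a rough illustration of what joining WordPiece tokens involves, the sketch below glues ‘##’ continuation pieces back onto the preceding token. It follows the common WordPiece convention; the function name `wordpiece_tokens_to_string` is hypothetical, and it is an assumption that BertTokenizer behaves the same way.

```python
def wordpiece_tokens_to_string(tokens):
    """Join tokens with spaces, then remove the ' ##' that marks a continuation piece."""
    return " ".join(tokens).replace(" ##", "").strip()


text = wordpiece_tokens_to_string(["un", "##believ", "##able", "story"])
```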

get_vocab(): Returns the vocabulary as a dictionary mapping tokens to indices. tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab. Returns: the vocabulary. Return type: Dict[str, int]

save_vocabulary(save_directory, filename_prefix=None): Save the tokenizer vocabulary to a directory or file.
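
The one-wordpiece-per-line format expected by vocab_file makes a save and reload round trip easy to sketch. The example below uses only the standard library; the file name `vocab.txt`, the `prefix-` naming scheme, and the helper names `save_vocab` and `load_vocab` are assumptions for illustration, not LiBai's confirmed behavior.

```python
import os
import tempfile


def save_vocab(vocab, save_directory, filename_prefix=None):
    """Write tokens one per line, ordered by ID, and return the file path."""
    name = (filename_prefix + "-" if filename_prefix else "") + "vocab.txt"
    path = os.path.join(save_directory, name)
    with open(path, "w", encoding="utf-8") as f:
        for token, _ in sorted(vocab.items(), key=lambda kv: kv[1]):
            f.write(token + "\n")
    return path


def load_vocab(path):
    """Read a one-token-per-line file; the line number becomes the token ID."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}


vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "##s": 3}
with tempfile.TemporaryDirectory() as d:
    path = save_vocab(vocab, d, filename_prefix="demo")
    file_name = os.path.basename(path)
    reloaded = load_vocab(path)
```

Writing tokens in ID order is what lets the line number serve as the ID when the file is read back.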

property vocab_size: Size of the base vocabulary (without the added tokens).

Construct a GPT-2 tokenizer. Based on byte-level Byte-Pair-Encoding.

Parameters

vocab_file (str) – Path to the vocabulary file.
merges_file (str) – Path to the merges file.
errors (str, optional, defaults to "replace") – Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
unk_token (str, optional, defaults to <|endoftext|>) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
bos_token (str, optional, defaults to <|endoftext|>) – The beginning of sequence token.
eos_token (str, optional, defaults to <|endoftext|>) – The end of sequence token.
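
The errors parameter is passed through to Python's bytes.decode, so its effect can be seen with the standard library alone. The snippet below is a small illustration on a deliberately truncated UTF-8 sequence; it does not use the tokenizer itself.

```python
# "é" is two bytes in UTF-8 (0xC3 0xA9); keep only the first to get invalid input.
raw = "héllo".encode("utf-8")[:2]  # b"h\xc3"

replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped

try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With "replace", a malformed byte sequence never raises during decoding; it shows up as the replacement character instead.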

convert_tokens_to_string(tokens): Converts a sequence of tokens (strings) into a single string.
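
For a byte-level BPE tokenizer, turning tokens back into text means mapping each token character to a byte and then decoding the bytes as UTF-8. The sketch below reproduces the byte-to-character table used by the original GPT-2 tokenizer; it is an assumption that this tokenizer uses the same table, and the function name `byte_level_tokens_to_string` is hypothetical.

```python
def bytes_to_unicode():
    """Map every byte value 0..255 to a printable Unicode character (GPT-2 scheme)."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points starting at 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}


def byte_level_tokens_to_string(tokens, errors="replace"):
    """Concatenate tokens, map characters back to bytes, and decode as UTF-8."""
    text = "".join(tokens)
    return bytearray(byte_decoder[ch] for ch in text).decode("utf-8", errors=errors)


# "Ġ" stands for the space byte; "Ã©" is the two-byte UTF-8 encoding of "é".
sentence = byte_level_tokens_to_string(["Hello", "Ġworld", "!"])
accented = byte_level_tokens_to_string(["caf", "Ã©"])
```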

get_vocab(): Returns the vocabulary as a dictionary mapping tokens to indices. tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab. Returns: the vocabulary. Return type: Dict[str, int]

save_vocabulary(save_directory, filename_prefix=None): Save the tokenizer vocabulary to a directory. This method does NOT save added tokens or special token mappings. Use save_pretrained() to save the full tokenizer state if you want to reload it with the from_pretrained() class method.

property vocab_size: Size of the base vocabulary (without the added tokens).