libai.tokenizer

class libai.tokenizer.BertTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, do_chinese_wwm=False, **kwargs)

Construct a BERT tokenizer. Based on WordPiece.

Parameters
  • vocab_file (str) – Path to a one-wordpiece-per-line vocabulary file.

  • do_lower_case (bool, optional, defaults to True) – Whether to lowercase the input. Only has an effect when do_basic_tokenize=True.

  • do_basic_tokenize (bool, optional, defaults to True) – Whether to do basic tokenization before wordpiece.

  • never_split (Iterable, optional) – List of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True.

  • tokenize_chinese_chars (bool, optional, defaults to True) – Whether to tokenize Chinese characters. This should likely be deactivated for Japanese, see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328.

  • do_chinese_wwm (bool, optional, defaults to False) – Whether to do whole word masking for Chinese. The Chinese sentence is first segmented by a third-party tool. Each substring is then given a ‘##’ prefix, and its index is calculated as id(##A) = id(A) + vocab_size.
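The index rule for do_chinese_wwm can be illustrated with a minimal sketch. It is not LiBai's implementation: the helper names, the toy vocabulary, and the hard-coded segmentation are hypothetical, and the sketch assumes that only the non-initial characters of a segmented word receive the ‘##’ prefix.

```python
def mark_word_pieces(words):
    """Prefix every non-initial character of a segmented word with '##'.

    Assumption: only continuation characters are prefixed, as in the usual
    whole-word-masking setup.
    """
    pieces = []
    for word in words:
        for i, ch in enumerate(word):
            pieces.append(ch if i == 0 else "##" + ch)
    return pieces


def wwm_token_to_id(token, vocab):
    """Apply id(##A) = id(A) + vocab_size for '##'-prefixed pieces."""
    vocab_size = len(vocab)
    if token.startswith("##"):
        return vocab[token[2:]] + vocab_size
    return vocab[token]


# Toy vocabulary (hypothetical), so vocab_size is 4.
vocab = {"[UNK]": 0, "我": 1, "喜": 2, "欢": 3}

# Stand-in for the output of a third-party word segmenter.
segmented = ["我", "喜欢"]

pieces = mark_word_pieces(segmented)
ids = [wwm_token_to_id(p, vocab) for p in pieces]
print(pieces, ids)
```

Because prefixed pieces are shifted by vocab_size, a model can tell word-initial characters from continuation characters without the vocabulary file having to list every ‘##’ variant.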

convert_tokens_to_string(tokens)

Converts a sequence of tokens (strings) into a single string.
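For WordPiece tokens, this conversion usually amounts to joining the tokens with spaces and merging ‘##’ continuation pieces into the preceding token. The function below is a sketch under that assumption; wordpiece_tokens_to_string is a hypothetical name, not part of libai.tokenizer.

```python
def wordpiece_tokens_to_string(tokens):
    """Join WordPiece tokens, gluing '##' continuations to the previous piece."""
    return " ".join(tokens).replace(" ##", "").strip()


text = wordpiece_tokens_to_string(["un", "##aff", "##able", "!"])
print(text)
```

Punctuation stays separated by a space, because the original spacing is not recoverable from the tokens alone.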

get_vocab()

Returns the vocabulary as a dictionary of token to index. tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns
  The vocabulary.

Return type
  Dict[str, int]

save_vocabulary(save_directory, filename_prefix=None)

Save the tokenizer vocabulary to a directory or file.

property vocab_size

Size of the base vocabulary (without the added tokens).

class libai.tokenizer.GPT2Tokenizer(vocab_file, merges_file, errors='replace', unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', **kwargs)

Construct a GPT-2 tokenizer. Based on byte-level Byte-Pair-Encoding.

Parameters
  • vocab_file (str) – Path to the vocabulary file.

  • merges_file (str) – Path to the merges file.

  • errors (str, optional, defaults to "replace") – Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.

  • unk_token (str, optional, defaults to <|endoftext|>) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • bos_token (str, optional, defaults to <|endoftext|>) – The beginning of sequence token.

  • eos_token (str, optional, defaults to <|endoftext|>) – The end of sequence token.
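The effect of the errors parameter can be seen with Python's built-in bytes.decode alone. The snippet below does not use the tokenizer; it only shows how the standard error handlers treat a byte string that is not valid UTF-8.

```python
# A truncated UTF-8 sequence: 0xC3 starts a two-byte character but nothing follows.
raw = b"caf\xc3"

# "replace" substitutes U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf-8", errors="replace")

# "ignore" silently drops the bad byte.
ignored = raw.decode("utf-8", errors="ignore")

# "strict" raises UnicodeDecodeError instead of returning a string.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced), repr(ignored), strict_failed)
```

With the default "replace", decoding never fails, at the cost of U+FFFD characters appearing wherever the token sequence does not map back to valid UTF-8.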

convert_tokens_to_string(tokens)

Converts a sequence of tokens (strings) into a single string.
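For a byte-level BPE tokenizer, this conversion means mapping each token character back to the byte it stands for and then decoding the bytes as UTF-8. The sketch below assumes the byte-to-character table of the original GPT-2 release. Both function names are illustrative and do not refer to libai.tokenizer internals.

```python
def bytes_to_unicode():
    """Build the byte -> printable character table used by byte-level BPE.

    Printable bytes map to themselves; the remaining bytes are shifted to
    code points starting at 256 so that every byte has a visible stand-in.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def gpt2_tokens_to_string(tokens, errors="replace"):
    """Concatenate tokens, map characters back to bytes, and decode as UTF-8."""
    byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}
    text = "".join(tokens)
    return bytearray(byte_decoder[ch] for ch in text).decode("utf-8", errors=errors)


decoded = gpt2_tokens_to_string(["Hello", "Ġworld"])
print(decoded)
```

In this table the space byte (0x20) is represented by "Ġ", which is why tokens that begin a new word carry that character as a prefix.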

get_vocab()

Returns the vocabulary as a dictionary of token to index. tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns
  The vocabulary.

Return type
  Dict[str, int]

save_vocabulary(save_directory, filename_prefix=None)

Save the tokenizer vocabulary to a directory. This method does NOT save added tokens and special token mappings. Use save_pretrained() to save the full tokenizer state if you want to reload it using the from_pretrained() class method.

property vocab_size

Size of the base vocabulary (without the added tokens).