libai.tokenizer
-
class libai.tokenizer.BertTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, do_chinese_wwm=False, **kwargs)
Construct a BERT tokenizer. Based on WordPiece.
- Parameters
vocab_file (str) – Path to a one-wordpiece-per-line vocabulary file.
do_lower_case (bool, optional, defaults to True) – Whether to lowercase the input. Only has an effect when do_basic_tokenize=True.
do_basic_tokenize (bool, optional, defaults to True) – Whether to do basic tokenization before WordPiece.
never_split (Iterable, optional) – List of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True.
tokenize_chinese_chars (bool, optional, defaults to True) – Whether to tokenize Chinese characters. This should likely be deactivated for Japanese, see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328.
do_chinese_wwm (bool, optional, defaults to False) – Whether to do whole word masking for Chinese. The Chinese sentence is first segmented by a third-party tool. Each substring is given a '##' prefix, and its index is calculated as id(##A) = id(A) + vocab_size.
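The parameters above refer to WordPiece splitting and to the id(##A) = id(A) + vocab_size rule. The following is a minimal, self-contained sketch of both ideas in plain Python. It does not use libai: `load_vocab`, `wordpiece_tokenize`, `wwm_continuation_id`, and the toy vocabulary are hypothetical names introduced only for illustration, and the greedy longest-match-first strategy is the commonly described WordPiece scheme, assumed here rather than taken from the libai source.

```python
def load_vocab(lines):
    """Map each wordpiece to its index, one wordpiece per line (hypothetical helper)."""
    return {line.rstrip("\n"): idx for idx, line in enumerate(lines)}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into wordpieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def wwm_continuation_id(base_id, vocab_size):
    """Apply the documented rule id(##A) = id(A) + vocab_size."""
    return base_id + vocab_size


# Toy vocabulary standing in for the contents of a vocab_file (assumption).
toy_vocab = load_vocab(
    ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "un", "##aff", "##able", "play"]
)

print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
print(wwm_continuation_id(toy_vocab["play"], len(toy_vocab)))
```

With this toy vocabulary, a word that cannot be covered by any sequence of pieces falls back to the unknown token, which mirrors the role of unk_token in the constructor.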
-
convert_tokens_to_string(tokens)
Converts a sequence of tokens (strings) into a single string.
-
get_vocab()
Returns the vocabulary as a dictionary of token to index.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
Returns: The vocabulary.
Return type: Dict[str, int]
-
save_vocabulary(save_directory, filename_prefix=None)
Save the tokenizer vocabulary to a directory or file.
-
property vocab_size
Size of the base vocabulary (without the added tokens).
-
class libai.tokenizer.GPT2Tokenizer(vocab_file, merges_file, errors='replace', unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', **kwargs)
Construct a GPT-2 tokenizer. Based on byte-level Byte-Pair-Encoding.
- Parameters
vocab_file (str) – Path to the vocabulary file.
merges_file (str) – Path to the merges file.
errors (str, optional, defaults to "replace") – Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
unk_token (str, optional, defaults to <|endoftext|>) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
bos_token (str, optional, defaults to <|endoftext|>) – The beginning of sequence token.
eos_token (str, optional, defaults to <|endoftext|>) – The end of sequence token.
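The errors parameter takes the same values as Python's built-in bytes.decode. The sketch below uses only the standard library, not libai, to show how the common handlers treat a byte sequence that is not valid UTF-8, which can happen when a byte-level token sequence ends in the middle of a multi-byte character. The `decode_with` helper and the sample bytes are hypothetical and exist only for this illustration.

```python
# "hello " followed by the first two bytes of a three-byte UTF-8 character,
# so the sequence ends with an incomplete character.
truncated = b"hello \xe4\xbd"


def decode_with(errors):
    """Decode the sample bytes as UTF-8 using the given error handler."""
    return truncated.decode("utf-8", errors=errors)


replaced = decode_with("replace")  # invalid bytes become U+FFFD
ignored = decode_with("ignore")    # invalid bytes are silently dropped

try:
    decode_with("strict")          # invalid bytes raise an exception
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced))
print(repr(ignored))
print(strict_failed)
```

The default of "replace" keeps decoding from raising on malformed input, at the cost of inserting replacement characters into the output string.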
-
convert_tokens_to_string(tokens)
Converts a sequence of tokens (strings) into a single string.
-
get_vocab()
Returns the vocabulary as a dictionary of token to index.
tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
Returns: The vocabulary.
Return type: Dict[str, int]
-
save_vocabulary(save_directory, filename_prefix=None)
Save the tokenizer vocabulary to a directory. This method does NOT save added tokens and special token mappings. Please use save_pretrained() to save the full tokenizer state if you want to reload it using the from_pretrained() class method.
-
property vocab_size
Size of the base vocabulary (without the added tokens).