libai.data¶
libai.data.data_utils module¶
-
class
libai.data.data_utils.BlockIndexedDataset(data_prefix, indexed_dataset, max_seq_length=512)[source]¶ This class builds the sample-mapping index from indexed_dataset to the actual dataset. It extracts sequences of length max_seq_length from each document. When the rest of a document is shorter than max_seq_length, the sequence is completed with text from the next document. It therefore always returns sequences of exactly max_seq_length, although they may contain incomplete sentences. This is used for GPT training, where it reduces padding and improves training efficiency.
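The packing behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not LiBai's implementation: `pack_into_blocks` is a hypothetical helper, documents are assumed to be lists of token ids, and dropping the leftover tokens at the end is an assumption of this sketch.

```python
from typing import List


def pack_into_blocks(documents: List[List[int]], max_seq_length: int) -> List[List[int]]:
    """Concatenate tokenized documents and cut them into fixed-length blocks.

    A block that runs past the end of one document continues with tokens
    from the next, so every returned block has exactly max_seq_length tokens.
    Leftover tokens that cannot fill a final block are dropped (assumption).
    """
    stream = [token for doc in documents for token in doc]
    num_blocks = len(stream) // max_seq_length
    return [
        stream[i * max_seq_length:(i + 1) * max_seq_length]
        for i in range(num_blocks)
    ]


docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
blocks = pack_into_blocks(docs, max_seq_length=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The first block crosses the boundary between the first and second document, which is why no padding is needed.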
-
class
libai.data.data_utils.SentenceIndexedDataset(data_prefix, indexed_dataset, max_seq_length=512, short_seq_prob=0.0, binary_head=False)[source]¶ This class builds the sample-mapping index from indexed_dataset to the actual dataset. It combines as many consecutive sentences from the same document as possible without exceeding max_seq_length. When a sample does not reach the maximum length, it is padded later. All the sentences in a sample are complete. binary_head controls whether one or two sentences are returned, which is used in BERT.
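A minimal sketch of the grouping rule, assuming a document is a list of tokenized sentences. `group_sentences` is a hypothetical helper, not part of LiBai, and the handling of a single over-long sentence is an assumption of this sketch.

```python
from typing import List


def group_sentences(document: List[List[int]], max_seq_length: int) -> List[List[List[int]]]:
    """Greedily group consecutive sentences of one document into samples.

    Each sample holds whole sentences whose total length stays within
    max_seq_length. A single sentence longer than the limit becomes its own
    sample here; truncation is left to a later step (assumption).
    """
    samples, current, current_len = [], [], 0
    for sentence in document:
        # Start a new sample when adding this sentence would exceed the limit.
        if current and current_len + len(sentence) > max_seq_length:
            samples.append(current)
            current, current_len = [], 0
        current.append(sentence)
        current_len += len(sentence)
    if current:
        samples.append(current)
    return samples


document = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(group_sentences(document, max_seq_length=6))
# [[[1, 2, 3], [4, 5]], [[6, 7, 8, 9], [10]]]
```

Both samples hold 5 tokens, so each would receive one position of padding at max_seq_length=6.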
-
libai.data.data_utils.split_ds(ds, split=None, shuffle=False, save_splits=None, load_splits=None)[source]¶ Split a dataset into subsets according to the proportion allocated to each split. If a split is 0%, None is returned for that split. Useful for creating train/val/test splits.
- Parameters
ds (Dataset or array-like) – Data to be split.
split (1D array-like) – Proportions used to split ds. The proportions must not sum to zero (sum(split) != 0).
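The proportional split can be illustrated with a small stand-alone function. This is a sketch under assumptions rather than the code behind split_ds: `split_by_proportion` is a hypothetical name, the rounding rule is chosen for this example, and the save_splits/load_splits options are not modelled.

```python
import random
from typing import List, Optional, Sequence


def split_by_proportion(
    data: Sequence, split: Sequence[float], shuffle: bool = False, seed: int = 0
) -> List[Optional[list]]:
    """Split data into consecutive subsets sized by normalized proportions."""
    total = sum(split)
    if total == 0:
        raise ValueError("split proportions must not sum to 0")
    indices = list(range(len(data)))
    if shuffle:
        random.Random(seed).shuffle(indices)

    subsets, start, cumulative = [], 0, 0.0
    for proportion in split:
        cumulative += proportion / total
        # Rounding the cumulative boundary keeps every sample in some subset.
        end = int(round(cumulative * len(data)))
        # A 0% split yields None instead of an empty subset.
        subsets.append([data[i] for i in indices[start:end]] if proportion > 0 else None)
        start = end
    return subsets


train, valid, test = split_by_proportion(list(range(10)), split=[8, 2, 0])
print(train, valid, test)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9] None
```

Because the proportions are normalized by their sum, [8, 2, 0] and [0.8, 0.2, 0.0] describe the same split.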
libai.data.datasets module¶
-
class
libai.data.datasets.BertDataset(tokenizer, data_prefix, indexed_dataset, max_seq_length=512, mask_lm_prob=0.15, short_seq_prob=0.0, max_preds_per_seq=None, seed=1234, binary_head=True)[source]¶ Dataset containing sentence pairs for BERT training. Each index corresponds to a randomly generated sentence pair.
- Parameters
tokenizer – Tokenizer to use.
data_prefix – Path to the training dataset.
indexed_dataset – Indexed dataset to use.
max_seq_length – Maximum length of the sequence. All values are padded to this length. Defaults to 512.
mask_lm_prob – Probability to mask tokens. Defaults to 0.15.
short_seq_prob – Probability of producing a short sequence. Defaults to 0.0.
max_preds_per_seq – Maximum number of mask tokens in each sentence. Defaults to None.
seed – Seed for random number generator for reproducibility. Defaults to 1234.
binary_head – Specifies whether the underlying dataset generates a pair of blocks along with a sentence_target or not. Setting it to True assumes that the underlying dataset generates a label for the pair of sentences which is surfaced as sentence_target. Defaults to True.
-
create_masked_lm_predictions(tokens, np_rng, max_ngrams=3, do_whole_word_mask=True, favor_longer_ngram=False, geometric_dist=False)[source]¶ Creates the predictions for the masked LM objective. Note: Tokens here are vocab ids and not text tokens.
-
create_tokens_and_token_types(tokens_a, tokens_b)[source]¶ Merges segments A and B, adds [CLS] and [SEP], and builds the token types.
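As an illustration, the standard BERT input layout is [CLS] A [SEP] B [SEP], with token type 0 for the first segment and 1 for the second. The sketch below follows that layout; `merge_segments` and the ids 101 and 102 are placeholders chosen for this example, and the method's actual signature differs because it takes the special-token ids from the tokenizer.

```python
from typing import List, Tuple


def merge_segments(
    tokens_a: List[int], tokens_b: List[int], cls_id: int, sep_id: int
) -> Tuple[List[int], List[int]]:
    """Build [CLS] A [SEP] B [SEP] with segment ids 0 for A and 1 for B."""
    tokens = [cls_id] + tokens_a + [sep_id]
    token_types = [0] * len(tokens)
    if tokens_b:
        tokens += tokens_b + [sep_id]
        token_types += [1] * (len(tokens_b) + 1)
    return tokens, token_types


# 101 and 102 are placeholder ids for [CLS] and [SEP].
tokens, token_types = merge_segments([7, 8], [9], cls_id=101, sep_id=102)
print(tokens)       # [101, 7, 8, 102, 9, 102]
print(token_types)  # [0, 0, 0, 0, 1, 1]
```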
-
mask_token(idx, tokens, np_rng)[source]¶ Helper function that masks the token at position idx in tokens, following section 3.3.1 of https://arxiv.org/pdf/1810.04805.pdf
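The masking rule in the BERT paper replaces a selected token with [MASK] 80% of the time, with a random token 10% of the time, and leaves it unchanged 10% of the time. A minimal sketch of that rule follows. `mask_one_token`, `mask_id` and `vocab_size` are hypothetical names for this example, and Python's `random.Random` stands in for the NumPy generator that the method receives as np_rng.

```python
import random
from typing import List


def mask_one_token(
    idx: int, tokens: List[int], rng: random.Random, mask_id: int, vocab_size: int
) -> int:
    """Corrupt tokens[idx] in place with the 80/10/10 rule; return the original id."""
    label = tokens[idx]
    draw = rng.random()
    if draw < 0.8:
        tokens[idx] = mask_id                    # 80%: replace with [MASK]
    elif draw < 0.9:
        tokens[idx] = rng.randrange(vocab_size)  # 10%: replace with a random id
    # Remaining 10%: keep the token unchanged.
    return label


rng = random.Random(1234)
tokens = [11, 12, 13, 14]
label = mask_one_token(2, tokens, rng, mask_id=103, vocab_size=30000)
print(label)  # 13, the id the model should predict at position 2
```

Only position idx is touched, and the returned id serves as the prediction target for that position.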
-
class
libai.data.datasets.CIFAR100Dataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]¶ CIFAR100 Dataset in LiBai.
- Parameters
root (string) – Root directory of dataset where directory cifar-10-batches-py exists or will be saved to if download is set to True.
train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
dataset_name (str, optional) – Name for the dataset as an identifier, e.g., cifar100.
-
class
libai.data.datasets.CIFAR10Dataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]¶ CIFAR10 Dataset in LiBai.
- Parameters
root (string) – Root directory of dataset where directory cifar-10-batches-py exists or will be saved to if download is set to True.
train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
-
class
libai.data.datasets.GPT2Dataset(tokenizer, data_prefix, indexed_dataset, max_seq_length=512)[source]¶ Dataset containing sentences for GPT2 training.
- Parameters
tokenizer – Tokenizer to use.
data_prefix (str) – Path to the training dataset.
indexed_dataset – Indexed dataset to use.
max_seq_length (int, optional) – Maximum length of the sequence passing into encoder. All values are padded to this length. Defaults to 512.
-
class
libai.data.datasets.ImageNetDataset(root: str, train: bool = True, transform: Optional[Callable] = None, **kwargs)[source]¶ ImageNet 2012 Classification Dataset in LiBai.
- Parameters
root (string) – Root directory of the ImageNet Dataset.
train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
-
class
libai.data.datasets.MNISTDataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]¶ MNIST Dataset in LiBai.
- Parameters
root (string) – Root directory of dataset where MNIST/processed/training.pt and MNIST/processed/test.pt exist.
train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
dataset_name (str, optional) – Name for the dataset as an identifier, e.g., mnist.
-
class
libai.data.datasets.T5Dataset(tokenizer, data_prefix, indexed_dataset, max_seq_length=512, max_seq_length_dec=128, mask_lm_prob=0.15, max_preds_per_seq=None, short_seq_prob=0.0, seed=1234)[source]¶ Dataset containing sentences for T5 training.
- Parameters
tokenizer – Tokenizer to use.
data_prefix (str) – Path to the training dataset.
indexed_dataset – Indexed dataset to use.
max_seq_length (int, optional) – Maximum length of the sequence passing into encoder. All values are padded to this length. Defaults to 512.
max_seq_length_dec (int, optional) – Maximum length of the sequence passing into decoder. All values are padded to this length. Defaults to 128.
mask_lm_prob (float, optional) – Probability to mask tokens. Defaults to 0.15.
max_preds_per_seq (int, optional) – Maximum number of masked tokens in each sentence. Defaults to None.
short_seq_prob (float, optional) – Probability of producing a short sequence. Defaults to 0.0.
seed (int, optional) – Seed for random number generator for reproducibility. Defaults to 1234.
-
create_masked_lm_predictions(tokens, np_rng, max_ngrams=3, do_whole_word_mask=True, favor_longer_ngram=False, geometric_dist=False)[source]¶ Creates the predictions for the masked LM objective. Note: Tokens here are vocab ids and not text tokens.
libai.data.samplers module¶
-
class
libai.data.samplers.CyclicSampler(dataset, micro_batch_size, shuffle=False, consumed_samples=0, data_parallel_rank=0, data_parallel_size=1, seed=0)[source]¶ This sampler supports cyclic sampling and works both with and without data parallelism.
- Parameters
dataset – dataset to be sampled.
micro_batch_size – batch size per model instance. global_batch_size is micro_batch_size times data_parallel_size.
shuffle – whether to shuffle the dataset.
consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).
data_parallel_rank – local rank for data parallelism.
data_parallel_size – the size of data parallelism.
seed – random seed, used for reproducing experiments (default: 0).
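To make the interaction of these parameters concrete, here is a rough sketch of cyclic, data-parallel sampling as a generator over dataset indices. It is not LiBai's sampler: `cyclic_batches` is a hypothetical function, and dropping the incomplete global batch at the end of each pass and reseeding the shuffle with seed + epoch are assumptions of this sketch.

```python
import itertools
import random
from typing import Iterator, List


def cyclic_batches(
    dataset_size: int,
    micro_batch_size: int,
    shuffle: bool = False,
    consumed_samples: int = 0,
    data_parallel_rank: int = 0,
    data_parallel_size: int = 1,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Yield index micro-batches for one rank forever, wrapping around the dataset."""
    global_batch_size = micro_batch_size * data_parallel_size
    if global_batch_size > dataset_size:
        raise ValueError("global batch size exceeds dataset size")
    epoch = consumed_samples // dataset_size
    position = consumed_samples % dataset_size  # resume point inside the epoch
    while True:
        order = list(range(dataset_size))
        if shuffle:
            # Seeding with the epoch gives every rank the same permutation.
            random.Random(seed + epoch).shuffle(order)
        # Drop the incomplete global batch at the end of each pass (assumption).
        while position + global_batch_size <= dataset_size:
            start = position + data_parallel_rank * micro_batch_size
            yield order[start:start + micro_batch_size]
            position += global_batch_size
        epoch, position = epoch + 1, 0


rank1 = cyclic_batches(8, micro_batch_size=2, data_parallel_rank=1, data_parallel_size=2)
print(list(itertools.islice(rank1, 3)))  # [[2, 3], [6, 7], [2, 3]]
```

With two ranks and micro_batch_size=2, each step consumes a global batch of 4 samples. Rank 1 takes the second half of every global batch and starts over once the dataset is exhausted.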
-
class
libai.data.samplers.SingleRoundSampler(dataset, micro_batch_size, shuffle=False, data_parallel_rank=0, data_parallel_size=1, seed=0, drop_last=False)[source]¶ This sampler supports single-round sampling and works both with and without data parallelism.
- Parameters
dataset – dataset to be sampled.
micro_batch_size – batch size per model instance. global_batch_size is micro_batch_size times data_parallel_size.
shuffle – whether to shuffle the dataset.
data_parallel_rank – local rank for data parallelism.
data_parallel_size – the size of data parallelism.
seed – random seed, used for reproducing experiments (default: 0).
drop_last – whether to drop the remaining data (default: False).
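A single pass over the data with drop_last can be sketched as follows. This is an illustration only: `single_round_batches` is a hypothetical function, and giving each rank a contiguous shard of the dataset is an assumption of this sketch that may differ from how LiBai distributes samples.

```python
import random
from typing import Iterator, List


def single_round_batches(
    dataset_size: int,
    micro_batch_size: int,
    shuffle: bool = False,
    data_parallel_rank: int = 0,
    data_parallel_size: int = 1,
    seed: int = 0,
    drop_last: bool = False,
) -> Iterator[List[int]]:
    """Yield one rank's index micro-batches for a single pass over the dataset."""
    order = list(range(dataset_size))
    if shuffle:
        random.Random(seed).shuffle(order)
    # Give each rank a contiguous shard; the last shard may be shorter (assumption).
    shard_size = -(-dataset_size // data_parallel_size)  # ceiling division
    shard = order[data_parallel_rank * shard_size:(data_parallel_rank + 1) * shard_size]
    for start in range(0, len(shard), micro_batch_size):
        batch = shard[start:start + micro_batch_size]
        if drop_last and len(batch) < micro_batch_size:
            break  # skip the final incomplete batch
        yield batch


print(list(single_round_batches(10, 2, data_parallel_size=2)))
# [[0, 1], [2, 3], [4]]
print(list(single_round_batches(10, 2, data_parallel_size=2, drop_last=True)))
# [[0, 1], [2, 3]]
```

Unlike the cyclic case, the generator ends after one pass, which suits evaluation where every sample should be visited once.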
libai.data.build module¶
-
libai.data.build.build_image_test_loader(dataset, test_batch_size, sampler=None, num_workers=4, seed=0, collate_fn=None, **kwargs)[source]¶ Builds the image test dataloader, which is used for the test dataset.
- Returns
It will return test dataloader
test_loader: dataloader for testing
- Parameters
dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]
test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 4).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
-
libai.data.build.build_image_train_loader(dataset, train_batch_size, test_batch_size=None, sampler=None, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>, mixup_func=None, **kwargs)[source]¶ Builds the image train dataloader, which is used for the training dataset.
- Returns
It will return train dataloader, and Nonetype for valid/test dataloader
train_loader: dataloader for training
None: Nonetype
None: Nonetype
- Parameters
dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]
train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).
test_batch_size – unused; set it to None.
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 4).
consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
dataset_mixer – function for concatenating a list of datasets.
mixup_func – function for data augmentation.
-
libai.data.build.build_nlp_test_loader(dataset, test_batch_size, sampler=None, num_workers=4, seed=0, collate_fn=None)[source]¶ Builds the NLP test dataloader, which is used for the test dataset.
- Returns
It will return test dataloader
test_loader: dataloader for testing
- Parameters
dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]
test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 4).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
-
libai.data.build.build_nlp_train_loader(dataset, train_batch_size, test_batch_size=None, sampler=None, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>, **kwargs)[source]¶ Builds the NLP train dataloader, which is used for the training dataset.
- Returns
It will return train dataloader, and Nonetype for valid/test dataloader
train_loader: dataloader for training
None: Nonetype
None: Nonetype
- Parameters
dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]
train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).
test_batch_size – unused; set it to None.
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 4).
consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
dataset_mixer – function for concatenating a list of datasets.
-
libai.data.build.build_nlp_train_val_test_loader(dataset, splits, weights, train_batch_size, test_batch_size, sampler=None, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>)[source]¶ Builds the NLP train/valid/test dataloaders, for datasets that lack a separate valid/test dataset.
- Returns
It will return train/valid/test dataloader
train_loader: dataloader for training
valid_loader: dataloader for validation
test_loader: dataloader for testing
- Parameters
dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]
splits – ratio config for splitting each dataset into train/valid/test. e.g.: [[7, 2, 1], …]
weights – ratio config for concatenating the dataset list (not supported yet). e.g.: [1.0, …]
train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).
test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 4).
consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
dataset_mixer – function for concatenating a list of datasets.
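The splits argument holds one train/valid/test ratio per dataset. A plausible reading, sketched below with plain lists, is that each dataset is split by its own ratio and the matching parts are then concatenated. `split_and_concat` is a hypothetical helper, and both the truncating size computation and list concatenation in place of dataset_mixer are assumptions of this sketch.

```python
from typing import Sequence, Tuple


def split_and_concat(
    datasets: Sequence[Sequence], splits: Sequence[Sequence[float]]
) -> Tuple[list, list, list]:
    """Split each dataset by its own train/valid/test ratio, then concatenate."""
    train, valid, test = [], [], []
    for data, ratio in zip(datasets, splits):
        total = sum(ratio)
        train_end = int(len(data) * ratio[0] / total)
        valid_end = train_end + int(len(data) * ratio[1] / total)
        train += list(data[:train_end])
        valid += list(data[train_end:valid_end])
        test += list(data[valid_end:])  # the test part takes whatever remains
    return train, valid, test


first, second = list(range(10)), list(range(100, 110))
train, valid, test = split_and_concat([first, second], splits=[[7, 2, 1], [8, 1, 1]])
print(len(train), len(valid), len(test))  # 15 3 2
```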