libai.data¶

libai.data.data_utils module¶

class libai.data.data_utils.BlockIndexedDataset(data_prefix, indexed_dataset, max_seq_length=512)[source]¶: This class builds the sample-mapping index from indexed_dataset to the actual dataset. It extracts sequences of length max_seq_length from each document. If a document runs out before the maximum length is reached, the sequence continues with tokens from the next document. It therefore always returns sequences of exactly max_seq_length tokens, although they may contain incomplete sentences. This is used for GPT training, where it reduces padding and improves training efficiency.
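The packing behaviour described above can be illustrated with a minimal sketch. It is not LiBai's implementation: pack_into_blocks is a hypothetical helper, documents are plain lists of token ids, and dropping the incomplete tail is an assumption made for brevity. The real class builds an index over the indexed dataset rather than copying tokens.

```python
from typing import List


def pack_into_blocks(documents: List[List[int]], max_seq_length: int) -> List[List[int]]:
    """Concatenate token ids across documents and cut fixed-length blocks."""
    # Flatten all documents into one token stream, ignoring document boundaries.
    stream = [token for document in documents for token in document]
    # Assumption: an incomplete final block is dropped.
    n_blocks = len(stream) // max_seq_length
    return [stream[i * max_seq_length:(i + 1) * max_seq_length] for i in range(n_blocks)]


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack_into_blocks(docs, max_seq_length=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Every block has exactly max_seq_length tokens, and the first block borrows token 4 from the second document, which is why no padding is needed.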

class libai.data.data_utils.IndexedCachedDataset(path)[source]¶

class libai.data.data_utils.IndexedDataset(path)[source]¶: Loader for IndexedDataset

class libai.data.data_utils.MMapIndexedDataset(path, skip_warmup=False)[source]¶

get(idx, offset=0, length=None)[source]¶

Retrieves a single item from the dataset with the option to only return a portion of the item.

get(idx) is the same as [idx] but get() does not support slicing.
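One plausible reading of the offset and length arguments is sketched below. get_portion is a hypothetical stand-in that works on in-memory lists rather than a memory-mapped file, and the rule that a missing length means "everything from offset to the end of the item" is an assumption.

```python
from typing import List, Optional, Sequence


def get_portion(items: Sequence[Sequence[int]], idx: int, offset: int = 0,
                length: Optional[int] = None) -> List[int]:
    """Return items[idx], or only `length` elements of it starting at `offset`."""
    item = items[idx]
    if length is None:
        # Assumption: default to the remainder of the item.
        length = len(item) - offset
    return list(item[offset:offset + length])


items = [[10, 11, 12, 13], [20, 21]]
print(get_portion(items, 0))                      # [10, 11, 12, 13]
print(get_portion(items, 0, offset=1, length=2))  # [11, 12]
```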

class libai.data.data_utils.SentenceIndexedDataset(data_prefix, indexed_dataset, max_seq_length=512, short_seq_prob=0.0, binary_head=False)[source]¶: This class builds the sample-mapping index from indexed_dataset to the actual dataset. It combines as many consecutive sentences from the same document as possible without exceeding max_seq_length. Samples shorter than the maximum length are padded later. All sentences in a sample are complete. binary_head controls whether one or two sentences are returned, as used in BERT.
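The grouping rule can be sketched as a greedy pass over one document. This is an illustration only: group_sentences is a hypothetical helper, sentences are lists of token ids, and short_seq_prob and binary_head are ignored.

```python
from typing import List


def group_sentences(document: List[List[int]], max_seq_length: int) -> List[List[List[int]]]:
    """Greedily group consecutive whole sentences up to max_seq_length tokens."""
    samples: List[List[List[int]]] = []
    current: List[List[int]] = []
    current_len = 0
    for sentence in document:
        # Close the current sample if this sentence would overflow it.
        if current and current_len + len(sentence) > max_seq_length:
            samples.append(current)
            current, current_len = [], 0
        # Note: a single sentence longer than max_seq_length still forms its own
        # sample here and would need truncation downstream.
        current.append(sentence)
        current_len += len(sentence)
    if current:
        samples.append(current)
    return samples


doc = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(group_sentences(doc, max_seq_length=5))
# [[[1, 2, 3], [4, 5]], [[6, 7, 8, 9], [10]]]
```

Unlike block packing, no sentence is ever split, so samples can be shorter than max_seq_length and need padding.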

class libai.data.data_utils.SplitDataset(dataset, split_inds)[source]¶

libai.data.data_utils.split_ds(ds, split=None, shuffle=False, save_splits=None, load_splits=None)[source]¶

Splits a dataset into subsets according to the proportion allocated to each split. If a split's proportion is 0, None is returned for that split. Useful for creating train/val/test splits.

Parameters

ds (Dataset or array-like) – Data to be split.
split (1D array-like) – Proportions into which to split ds. The proportions must not sum to 0.
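A minimal sketch of proportional splitting follows. It works on indices rather than on a Dataset; split_indices and its seed argument are hypothetical, and the rounding rule is an assumption. The save_splits and load_splits options are not modelled.

```python
import random
from typing import List, Optional, Sequence


def split_indices(n: int, split: Sequence[float], shuffle: bool = False,
                  seed: int = 0) -> List[Optional[List[int]]]:
    """Partition range(n) proportionally; a proportion of 0 yields None."""
    total = sum(split)
    if total == 0:
        raise ValueError("the proportions in `split` must not sum to 0")
    indices = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(indices)  # seed is an assumed extra argument
    parts: List[Optional[List[int]]] = []
    start, cumulative = 0, 0.0
    for proportion in split:
        if proportion == 0:
            parts.append(None)
            continue
        # Use cumulative boundaries so that the parts always cover all n items.
        cumulative += proportion
        end = round(n * cumulative / total)
        parts.append(indices[start:end])
        start = end
    return parts


train, val, test = split_indices(10, [7, 2, 1])
print(len(train), len(val), len(test))  # 7 2 1
```

Because only the ratios matter, [7, 2, 1] and [0.7, 0.2, 0.1] give the same partition in this sketch.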

libai.data.datasets module¶

class libai.data.datasets.BertDataset(tokenizer, data_prefix, indexed_dataset, max_seq_length=512, mask_lm_prob=0.15, short_seq_prob=0.0, max_preds_per_seq=None, seed=1234, binary_head=True)[source]¶

Dataset containing sentence pairs for BERT training. Each index corresponds to a randomly generated sentence pair.

Parameters

tokenizer – Tokenizer to use.
data_prefix – Path to the training dataset.
indexed_dataset – Indexed dataset to use.
max_seq_length – Maximum length of the sequence. All values are padded to this length. Defaults to 512.
mask_lm_prob – Probability to mask tokens. Defaults to 0.15.
short_seq_prob – Probability of producing a short sequence. Defaults to 0.0.
max_preds_per_seq – Maximum number of masked tokens in each sentence. Defaults to None.
seed – Seed for random number generator for reproducibility. Defaults to 1234.
binary_head – Specifies whether the underlying dataset generates a pair of blocks along with a sentence_target or not. Setting it to True assumes that the underlying dataset generates a label for the pair of sentences which is surfaced as sentence_target. Defaults to True.

create_masked_lm_predictions(tokens, np_rng, max_ngrams=3, do_whole_word_mask=True, favor_longer_ngram=False, geometric_dist=False)[source]¶: Creates the predictions for the masked LM objective. Note: Tokens here are vocab ids and not text tokens.

create_tokens_and_token_types(tokens_a, tokens_b)[source]¶: merge segments A and B, add [CLS] and [SEP] and build token types.
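The layout produced by merging two segments can be sketched as follows. merge_segments is a hypothetical helper, and the cls_id and sep_id values are placeholders; the real ids come from the tokenizer.

```python
from typing import List, Tuple


def merge_segments(tokens_a: List[int], tokens_b: List[int],
                   cls_id: int, sep_id: int) -> Tuple[List[int], List[int]]:
    """Build [CLS] A [SEP] B [SEP] and the matching token-type ids."""
    tokens = [cls_id] + tokens_a + [sep_id]
    token_types = [0] * len(tokens)  # segment A, including [CLS] and its [SEP]
    if tokens_b:
        tokens += tokens_b + [sep_id]
        token_types += [1] * (len(tokens_b) + 1)  # segment B and its [SEP]
    return tokens, token_types


tokens, token_types = merge_segments([5, 6], [7], cls_id=101, sep_id=102)
print(tokens)       # [101, 5, 6, 102, 7, 102]
print(token_types)  # [0, 0, 0, 0, 1, 1]
```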

mask_token(idx, tokens, np_rng)[source]¶: helper function to mask idx token from tokens according to section 3.3.1 of https://arxiv.org/pdf/1810.04805.pdf
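The masking rule from the BERT paper replaces a selected token with [MASK] 80% of the time, with a random token 10% of the time, and leaves it unchanged 10% of the time. The sketch below illustrates that rule only; mask_one_token, mask_id, and vocab_size are hypothetical names, and Python's random.Random stands in for the NumPy generator that the method receives.

```python
import random
from typing import List


def mask_one_token(idx: int, tokens: List[int], rng: random.Random,
                   mask_id: int, vocab_size: int) -> int:
    """Corrupt tokens[idx] in place and return the original id as the label."""
    label = tokens[idx]
    p = rng.random()
    if p < 0.8:
        tokens[idx] = mask_id                    # 80%: replace with [MASK]
    elif p < 0.9:
        tokens[idx] = rng.randrange(vocab_size)  # 10%: replace with a random id
    # Remaining 10%: keep the original token.
    return label


tokens = [5, 6, 7]
label = mask_one_token(1, tokens, random.Random(0), mask_id=103, vocab_size=1000)
print(label)  # 6
```

The returned label is what the masked LM objective has to predict at that position, whichever of the three branches was taken.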

pad_and_convert_to_tensor(tokens, token_types, masked_positions, masked_labels)[source]¶: pad sequences and convert them to tensor

truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, np_rng)[source]¶: truncate sequence pair to a maximum sequence length
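A common way to truncate a sentence pair, used in the original BERT data pipeline, is to repeatedly drop one token from the longer segment, choosing at random between its front and its back. The sketch below assumes that strategy; truncate_pair is a hypothetical helper, and random.Random stands in for the NumPy generator.

```python
import random
from typing import List


def truncate_pair(tokens_a: List[int], tokens_b: List[int],
                  max_num_tokens: int, rng: random.Random) -> bool:
    """Shrink the pair in place to max_num_tokens; return True if truncated."""
    if max_num_tokens < 0:
        raise ValueError("max_num_tokens must be non-negative")
    truncated = False
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        # Always shorten the longer segment so both stay balanced.
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        if rng.random() < 0.5:
            del longer[0]  # drop from the front
        else:
            longer.pop()   # drop from the back
        truncated = True
    return truncated


a, b = [1, 2, 3, 4, 5], [6, 7]
print(truncate_pair(a, b, max_num_tokens=4, rng=random.Random(0)))  # True
print(len(a), len(b))  # 2 2
```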

class libai.data.datasets.CIFAR100Dataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]¶

CIFAR100 Dataset in LiBai.

Parameters

root (string) – Root directory of dataset where directory cifar-100-python exists or will be saved to if download is set to True.
train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
dataset_name (str, optional) – Name for the dataset as an identifier, e.g., cifar100.

class libai.data.datasets.CIFAR10Dataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]¶

CIFAR10 Dataset in LiBai.

Parameters

root (string) – Root directory of dataset where directory cifar-10-batches-py exists or will be saved to if download is set to True.
train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.

class libai.data.datasets.GPT2Dataset(tokenizer, data_prefix, indexed_dataset, max_seq_length=512)[source]¶

Dataset containing sentences for GPT2 training.

Parameters

tokenizer – Tokenizer to use.
data_prefix (str) – Path to the training dataset.
indexed_dataset – Indexed dataset to use.
max_seq_length (int, optional) – Maximum length of the sequence passed to the model. All values are padded to this length. Defaults to 512.

class libai.data.datasets.ImageNetDataset(root: str, train: bool = True, transform: Optional[Callable] = None, **kwargs)[source]¶

ImageNet 2012 Classification Dataset in LiBai.

Parameters

root (string) – Root directory of the ImageNet Dataset.
train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.

class libai.data.datasets.MNISTDataset(root: str, train: bool = True, transform: Optional[Callable] = None, download: bool = False, **kwargs)[source]¶

MNIST Dataset in LiBai.

Parameters

root (string) – Root directory of dataset where MNIST/processed/training.pt and MNIST/processed/test.pt exist.
train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
dataset_name (str, optional) – Name for the dataset as an identifier, e.g., mnist.

class libai.data.datasets.T5Dataset(tokenizer, data_prefix, indexed_dataset, max_seq_length=512, max_seq_length_dec=128, mask_lm_prob=0.15, max_preds_per_seq=None, short_seq_prob=0.0, seed=1234)[source]¶

Dataset containing sentences for T5 training.

Parameters

tokenizer – Tokenizer to use.
data_prefix (str) – Path to the training dataset.
indexed_dataset – Indexed dataset to use.
max_seq_length (int, optional) – Maximum length of the sequence passed to the encoder. All values are padded to this length. Defaults to 512.
max_seq_length_dec (int, optional) – Maximum length of the sequence passed to the decoder. All values are padded to this length. Defaults to 128.
mask_lm_prob (float, optional) – Probability to mask tokens. Defaults to 0.15.
max_preds_per_seq (int, optional) – Maximum number of masked tokens in each sentence. Defaults to None.
short_seq_prob (float, optional) – Probability of producing a short sequence. Defaults to 0.0.
seed (int, optional) – Seed for random number generator for reproducibility. Defaults to 1234.

create_masked_lm_predictions(tokens, np_rng, max_ngrams=3, do_whole_word_mask=True, favor_longer_ngram=False, geometric_dist=False)[source]¶: Creates the predictions for the masked LM objective. Note: Tokens here are vocab ids and not text tokens.

make_attention_mask(source_block, target_block)[source]¶: Returns a 2-dimensional (2-D) attention mask.

Parameters

source_block – 1-D array.
target_block – 1-D array.
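One way to build a 2-D mask from two 1-D blocks is sketched below. It rests on assumptions the documentation does not state: padding is marked by id 0, rows follow source_block, columns follow target_block, and an entry is 1 only when both positions hold real tokens. pairwise_padding_mask is a hypothetical name, and plain lists stand in for arrays.

```python
from typing import List, Sequence


def pairwise_padding_mask(source_block: Sequence[int], target_block: Sequence[int],
                          pad_id: int = 0) -> List[List[int]]:
    """Return a len(source) x len(target) mask that is 1 where neither side is padding."""
    return [
        [1 if s != pad_id and t != pad_id else 0 for t in target_block]
        for s in source_block
    ]


mask = pairwise_padding_mask([5, 6, 0], [7, 0])
print(mask)  # [[1, 0], [1, 0], [0, 0]]
```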

pad_and_convert_to_numpy(tokens, masked_spans)[source]¶: pad sequences and convert them to numpy array

libai.data.samplers module¶

class libai.data.samplers.CyclicSampler(dataset, micro_batch_size, shuffle=False, consumed_samples=0, data_parallel_rank=0, data_parallel_size=1, seed=0)[source]¶

This sampler supports cyclic sampling and works both with and without data parallelism.

Parameters

dataset – dataset to be sampled.
micro_batch_size – batch size per model instance; global_batch_size is micro_batch_size times data_parallel_size.
shuffle – whether to shuffle the dataset.
consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).
data_parallel_rank – local rank for data parallelism.
data_parallel_size – the size of data parallelism.
seed – random seed, used for reproducing experiments (default: 0).

set_consumed_samples(consumed_samples)[source]¶: Sets consumed_samples so that training can resume from the corresponding iteration.

set_epoch(epoch)[source]¶: Sets the epoch; used for restoring training status.
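The interplay of cyclic iteration, consumed_samples, and data parallelism can be sketched with a generator. This is an illustration under assumptions, not LiBai's code: cyclic_batches is a hypothetical function, each global batch is assumed to be cut into contiguous per-rank slices, and reshuffling with seed + epoch is an assumed convention.

```python
import itertools
import random
from typing import Iterator, List


def cyclic_batches(dataset_size: int, micro_batch_size: int, shuffle: bool = False,
                   consumed_samples: int = 0, data_parallel_rank: int = 0,
                   data_parallel_size: int = 1, seed: int = 0) -> Iterator[List[int]]:
    """Yield this rank's micro-batches of indices forever, wrapping around the dataset."""
    global_batch_size = micro_batch_size * data_parallel_size

    def index_stream() -> Iterator[int]:
        # Resume inside the epoch that consumed_samples points into.
        epoch = consumed_samples // dataset_size
        start = consumed_samples % dataset_size
        while True:
            order = list(range(dataset_size))
            if shuffle:
                random.Random(seed + epoch).shuffle(order)
            yield from order[start:]
            epoch, start = epoch + 1, 0

    stream = index_stream()
    lo = data_parallel_rank * micro_batch_size
    while True:
        global_batch = list(itertools.islice(stream, global_batch_size))
        yield global_batch[lo:lo + micro_batch_size]


rank0 = cyclic_batches(5, 2, data_parallel_rank=0, data_parallel_size=2)
print(list(itertools.islice(rank0, 3)))  # [[0, 1], [4, 0], [3, 4]]
```

Starting a second generator with consumed_samples=4 yields [4, 0] first, which is the batch rank 0 would have seen next after one global batch of four samples.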

class libai.data.samplers.SingleRoundSampler(dataset, micro_batch_size, shuffle=False, data_parallel_rank=0, data_parallel_size=1, seed=0, drop_last=False)[source]¶

This sampler supports single-round sampling and works both with and without data parallelism.

Parameters

dataset – dataset to be sampled.
micro_batch_size – batch size per model instance; global_batch_size is micro_batch_size times data_parallel_size.
shuffle – whether to shuffle the dataset.
data_parallel_rank – local rank for data parallelism.
data_parallel_size – the size of data parallelism.
seed – random seed, used for reproducing experiments (default: 0).
drop_last – whether to drop the remaining data (default: False).
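A single pass over the data with optional drop_last can be sketched as follows. The sharding scheme, one contiguous shard per rank, is an assumption, single_round_batches is a hypothetical function, and shuffling is left out for brevity.

```python
import math
from typing import List


def single_round_batches(dataset_size: int, micro_batch_size: int,
                         data_parallel_rank: int = 0, data_parallel_size: int = 1,
                         drop_last: bool = False) -> List[List[int]]:
    """Return this rank's micro-batches for exactly one pass over the dataset."""
    # Assumption: each rank owns one contiguous shard of the index range.
    shard_size = math.ceil(dataset_size / data_parallel_size)
    start = data_parallel_rank * shard_size
    end = min(start + shard_size, dataset_size)
    indices = list(range(start, end))
    batches = [indices[i:i + micro_batch_size]
               for i in range(0, len(indices), micro_batch_size)]
    # With drop_last, discard a trailing batch that is not full.
    if drop_last and batches and len(batches[-1]) < micro_batch_size:
        batches.pop()
    return batches


print(single_round_batches(10, 3))                  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
print(single_round_batches(10, 3, drop_last=True))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```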

libai.data.build module¶

libai.data.build.build_image_test_loader(dataset, test_batch_size, sampler=None, num_workers=4, seed=0, collate_fn=None, **kwargs)[source]¶

Builds the image test dataloader, used for the test dataset.

Returns

The test dataloader.

test_loader: dataloader for testing

Parameters

dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]
test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

libai.data.build.build_image_train_loader(dataset, train_batch_size, test_batch_size=None, sampler=None, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>, mixup_func=None, **kwargs)[source]¶

Builds the image train dataloader, used for the training dataset.

Returns

The train dataloader, followed by None in place of the validation and test dataloaders.

train_loader: dataloader for training

None: placeholder for the validation dataloader

None: placeholder for the test dataloader

Parameters

dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]
train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).
test_batch_size – unused; set it to None.
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).
consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
dataset_mixer – function for concatenating a list of datasets.
mixup_func – function for data augmentation.

libai.data.build.build_nlp_test_loader(dataset, test_batch_size, sampler=None, num_workers=4, seed=0, collate_fn=None)[source]¶

Builds the NLP test dataloader, used for the test dataset.

Returns

The test dataloader.

test_loader: dataloader for testing

Parameters

dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]
test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

libai.data.build.build_nlp_train_loader(dataset, train_batch_size, test_batch_size=None, sampler=None, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>, **kwargs)[source]¶

Builds the NLP train dataloader, used for the training dataset.

Returns

The train dataloader, followed by None in place of the validation and test dataloaders.

train_loader: dataloader for training

None: placeholder for the validation dataloader

None: placeholder for the test dataloader

Parameters

dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]
train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).
test_batch_size – unused; set it to None.
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).
consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
dataset_mixer – function for concatenating a list of datasets.

libai.data.build.build_nlp_train_val_test_loader(dataset, splits, weights, train_batch_size, test_batch_size, sampler=None, num_workers=4, consumed_samples=0, seed=0, collate_fn=None, dataset_mixer=<class 'oneflow.utils.data.dataset.ConcatDataset'>)[source]¶

Builds the NLP train, validation, and test dataloaders, used for datasets that lack separate validation/test sets.

Returns

The train, validation, and test dataloaders.

train_loader: dataloader for training

valid_loader: dataloader for validation

test_loader: dataloader for testing

Parameters

dataset – dataset from which to load the data. e.g.: dataset or [dataset1, dataset2, …]
splits – ratios for splitting each dataset into train/valid/test, e.g.: [[7, 2, 1], …]
weights – ratios for concatenating the dataset list (not supported yet), e.g.: [1.0, …]
train_batch_size – how many samples per batch to load in training (micro-batch-size per GPU).
test_batch_size – how many samples per batch to load in testing (micro-batch-size per GPU).
sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 4).
consumed_samples – the number of samples already consumed by training, used for resuming training (default: 0).
seed – random seed, used for reproducing experiments (default: 0).
collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
dataset_mixer – function for concatenating a list of datasets.