MRT Dataset Documentation¶
Datasets loaded by MRT are abstracted as a
Python base class named Dataset.
The Dataset class defines several interface functions that
concrete derived dataset classes must implement.
See the following sections for details.
Note
Usage: The dataset class is mainly invoked via the
main2.py program. Developers may also add
custom derived dataset classes and invoke them through the unified API.
The MRT dataset module implements some common datasets, covering image classification, detection and NLP models.
Package Info¶
module exports: DS_REG, Dataset
DS_REG
DS_REG contains all the implemented and registered
datasets, stored as pairs of dataset name and concrete
class.
The supported datasets (implemented and registered into DS_REG)
are listed below:
register_dataset is a convenient function for decorating
a concrete dataset class so that it is automatically registered into
the exported variable DS_REG.
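The name-to-class registry described above can be sketched in plain Python. This is a minimal illustration of the decorator pattern only; the bodies of DS_REG and register_dataset below are hypothetical stand-ins, and the real MRT implementation may differ (for example in how it handles duplicate names).

```python
# Hypothetical stand-ins for illustration; not MRT's actual implementation.
DS_REG = {}


def register_dataset(name):
    """Return a class decorator that stores the class in DS_REG under `name`."""
    def _wrapper(cls):
        if name in DS_REG:
            # Assumption: duplicate names are rejected.
            raise NameError("dataset %r is already registered" % name)
        DS_REG[name] = cls
        return cls  # the class itself is returned unchanged
    return _wrapper


@register_dataset("toy")
class ToyDataset:
    """Placeholder class used only to demonstrate registration."""


# Look up the concrete class by its registered name.
dataset_cls = DS_REG["toy"]
```

Because the decorator returns the class unchanged, registration does not affect how the class is used directly.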
Abstract Dataset¶
- class mrt.dataset.Dataset(input_shape, root='/home/docs/.mxnet/datasets')¶
Base dataset class, with pre-defined interface.
The dataset directory is located under the
root directory, which contains a directory named after the dataset. A custom dataset should either pass its location through the root parameter, or implement a derived class with its own data iterator, metrics and validate function.
- Notice:
Our default imagenet dataset is organized in a
record binary format, which can improve the throughput of image reads. A third-party custom image dataset can be preprocessed with the im2rec procedure to transform the images into the record format. The transformation script is located at
docs/mrt/im2rec.py. For more details, refer to the script's help documentation (print the usage with the -h option).
- Parameters
input_shape (Tuple or List) – The input shape requested by the user. A specific dataset generally checks the validity of the input shape, such as the number of channels for images. Example: the input shape of imagenet looks like (N, C, H, W), where C must equal 3, H must equal W, and N is the batch size the user wants. A different H (W) requires the dataset loader to resize the images.
root (os.path.Path or path string) – The location where the dataset is stored, defined by the variable
MRT_DATASET_ROOT in conf.py or set to a custom directory.
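The kind of input-shape check described for imagenet can be sketched as a standalone function. The name check_image_shape and its return value are hypothetical and only illustrate the constraints stated above (four dimensions, C equal to 3, H equal to W); MRT performs its own checks inside each dataset class.

```python
def check_image_shape(input_shape):
    """Validate an (N, C, H, W) shape and return (batch_size, input_size).

    Hypothetical helper for illustration; not part of the MRT API.
    """
    if len(input_shape) != 4:
        raise ValueError("expected (N, C, H, W), got %r" % (input_shape,))
    n, c, h, w = input_shape
    if c != 3:
        raise ValueError("channel number must be 3, got %d" % c)
    if h != w:
        raise ValueError("height %d must equal width %d" % (h, w))
    return n, h


# Both a tuple and a list are accepted, as for the input_shape parameter.
batch_size, input_size = check_image_shape([16, 3, 224, 224])
```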
Custom Dataset Implementation (derive from this class):
1. Register the dataset name into DS_REG, which can be accessed through the
dataset package API. The related function is the register_dataset function.
2. Override the abstract methods defined in the base dataset class:
- _load_data(self) [Required]:
Load data from disk and save the required data loader into the member data.
- iter_func(self) [Optional]:
Return a function that returns the tuple (data, label) on each invocation, based on the member data loaded by _load_data.
This function is optional because the base class implements a naive version that works when the member data is a Python generator-compatible type, i.e. one that supports iter(data). Otherwise, override the function as needed.
- metrics(self) [Required]:
Return the metrics object for the dataset, such as some auxiliary variables.
- validate(self, metrics, predict, label) [Required]:
Calculate the accuracy of the model inference and return it as a formatted string.
Examples
>>> from mxnet import ndarray as nd
>>> @register_dataset("my_dataset")
... class MyDataset(Dataset):
...     def _load_data(self):
...         B = self.ishape[0]
...         def _data_loader():
...             for i in range(1000):
...                 yield nd.array([i + c for c in range(B)])
...         self.data = _data_loader()
...
...     # use the default `iter_func` defined in base class
...
...     def metrics(self):
...         # keys must match the ones updated in `validate`
...         return {"acc": 0, "total": 0}
...     def validate(self, metrics, predict, label):
...         for idx in range(predict.shape[0]):
...             res_label = predict[idx].asnumpy().argmax()
...             data_label = label[idx].asnumpy()
...             if res_label == data_label:
...                 metrics["acc"] += 1
...             metrics["total"] += 1
...         acc = 1. * metrics["acc"] / metrics["total"]
...         return "{:6.2%}".format(acc)
>>>
>>> # usage
>>> md_cls = DS_REG["my_dataset"]
>>> ds = md_cls([8]) # batch size is 8
>>> data_iter_func = ds.iter_func()
>>> data_iter_func() # get the batch data
NDArray<[0, 1, 2, 3, 4, 5, 6, 7] @ctx(cpu)>
- _load_data()¶
Load data from disk.
Save the data loader into the member data, like:
self.data = data_loader
And validate the input shape if necessary:
N, C, H, W = self.ishape
assert C == 3 and H == W
- iter_func()¶
Returns the (data, label) iterator function.
Get the iterator of self.data and fetch each batch manually with the next function. Call it like this:
data_iter_func = dataset.iter_func()
data, label = data_iter_func()
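The default behaviour described here (wrap iter(data) and call next on each invocation) can be sketched without MXNet. The names make_iter_func and toy_loader are hypothetical, and plain lists stand in for batch arrays; this is an illustration of the closure pattern, not the base class source.

```python
def make_iter_func(data):
    """Return a function that yields the next batch of `data` on each call."""
    data_iter = iter(data)  # works for any generator-compatible object

    def _next_batch():
        # Raises StopIteration once the data is exhausted.
        return next(data_iter)

    return _next_batch


def toy_loader(batch_size, num_batches):
    """Hypothetical generator producing (data, label) pairs as plain lists."""
    for i in range(num_batches):
        data = [i * batch_size + c for c in range(batch_size)]
        label = [x % 2 for x in data]
        yield data, label


data_iter_func = make_iter_func(toy_loader(batch_size=4, num_batches=2))
data, label = data_iter_func()  # first batch
```

Each call advances the same underlying iterator, so the caller decides how many batches to consume.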
Common Datasets¶
- class mrt.dataset.COCODataset(input_shape, root='/home/docs/.mxnet/datasets')¶
- _load_data()¶
Customized _load_data method introduction.
The COCO dataset only supports the NCHW layout, and the number of channels must be 3, i.e. (batch_size, 3, input_size, input_size).
The validation dataset is created from the MS COCO Detection Dataset and uses SSDDefaultValTransform as the data preprocessing function.
- metrics()¶
Customized metrics method introduction.
COCODetectionMetric is used, which is the detection metric for the COCO bbox task.
- validate(metrics, predict, label)¶
Customized validate method introduction.
The image height must be equal to the image width.
The model output is [id, score, bounding_box], where bounding_box is of layout (x1, y1, x2, y2).
The accuracy is computed as follows:
map_name, mean_ap = metrics.get()
acc = {k: v for k, v in zip(map_name, mean_ap)}
acc = float(acc['~~~~ MeanAP @ IoU=[0.50, 0.95] ~~~~\n']) / 100
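To make the lookup above concrete, here is a small sketch that runs without GluonCV. The lists map_name and mean_ap are invented stand-ins for the pair returned by metrics.get(); the assumption that the values are percentage strings is inferred from the float(...) / 100 conversion and is not confirmed by this page.

```python
# Summary key copied from the snippet above.
SUMMARY_KEY = '~~~~ MeanAP @ IoU=[0.50, 0.95] ~~~~\n'

# Invented example values standing in for `metrics.get()`.
map_name = ['person', 'bicycle', SUMMARY_KEY]
mean_ap = ['50.0', '30.0', '25.0']


def summary_accuracy(names, values, key=SUMMARY_KEY):
    """Pair names with values and convert the summary entry to a fraction."""
    acc = dict(zip(names, values))
    return float(acc[key]) / 100


acc = summary_accuracy(map_name, mean_ap)
```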
- class mrt.dataset.VOCDataset(input_shape, root='/home/docs/.mxnet/datasets')¶
- _load_data()¶
Customized _load_data method introduction.
The VOC dataset only supports the NCHW layout, and the number of channels must be 3, i.e. (batch_size, 3, input_size, input_size).
The validation dataset is created from the Pascal VOC detection Dataset and uses YOLO3DefaultValTransform as the data preprocessing function.
- metrics()¶
Customized metric method introduction.
VOC07MApMetric is used, which is the mean average precision metric for the PASCAL VOC 07 dataset.
- validate(metrics, predict, label)¶
Customized validate method introduction.
The image height must be equal to the image width.
The model output is [id, score, bounding_box], where bounding_box is of layout (x1, y1, x2, y2).
The accuracy is computed as follows:
map_name, mean_ap = metrics.get()
acc = {k: v for k, v in zip(map_name, mean_ap)}['mAP']
- class mrt.dataset.VisionDataset(input_shape, root='/home/docs/.mxnet/datasets')¶
- metrics()¶
Customized metric method introduction.
Computes accuracy classification score and top k predictions accuracy.
- validate(metrics, predict, label)¶
Customized validate method introduction.
The model output includes scores for 1000 classes.
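As an illustration of the two scores mentioned for this class (accuracy and top-k accuracy), the following sketch computes both in plain Python. The function names and the three-class toy scores are hypothetical; the actual class relies on its own metric objects.

```python
def topk_hit(scores, label, k):
    """Return True if `label` is among the k highest-scoring class indices."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]


def topk_accuracy(batch_scores, labels, k=1):
    """Fraction of samples whose label is within the top-k predictions."""
    hits = sum(topk_hit(s, l, k) for s, l in zip(batch_scores, labels))
    return hits / len(labels)


# Toy scores for 3 samples over 3 classes (a real model would output 1000).
batch_scores = [
    [0.1, 0.7, 0.2],  # ranking: 1, 2, 0
    [0.5, 0.3, 0.2],  # ranking: 0, 1, 2
    [0.2, 0.3, 0.5],  # ranking: 2, 1, 0
]
labels = [1, 1, 0]

top1 = topk_accuracy(batch_scores, labels, k=1)
top2 = topk_accuracy(batch_scores, labels, k=2)
```

With these toy values only the first sample is a top-1 hit, while the second sample becomes a hit once k is raised to 2.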
- class mrt.dataset.ImageNetDataset(input_shape, root='/home/docs/.mxnet/datasets')¶
- _load_data()¶
Customized _load_data method introduction.
The ImageNet dataset only supports the NCHW layout, and the number of channels must be 3, i.e. (batch_size, 3, input_size, input_size). The image height must be equal to the image width.
The data preprocessing includes:
\[\text{crop\_ratio} = 0.875\]
\[\text{resize} = \lceil H / \text{crop\_ratio} \rceil\]
\[\text{mean\_rgb} = [123.68, 116.779, 103.939]\]
\[\text{std\_rgb} = [58.393, 57.12, 57.375]\]
Use ImageRecordIter to iterate over the image record io files.
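The preprocessing constants above can be worked through numerically. In this sketch the helper names are hypothetical, and the per-channel normalization (value - mean) / std is an assumption about how the mean and standard deviation are applied; the resize formula is taken directly from the definitions above.

```python
import math

CROP_RATIO = 0.875
MEAN_RGB = [123.68, 116.779, 103.939]
STD_RGB = [58.393, 57.12, 57.375]


def resize_for(input_size, crop_ratio=CROP_RATIO):
    """Size to resize to before cropping: ceil(H / crop_ratio)."""
    return int(math.ceil(input_size / crop_ratio))


def normalize_pixel(rgb):
    """Assumed normalization: subtract the mean and divide by the std."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN_RGB, STD_RGB)]


resize = resize_for(224)  # 224 / 0.875 = 256
```

For example, an input size of 224 leads to a resize of 256 before the crop, and an input size of 299 leads to 342.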
- class mrt.dataset.Cifar10Dataset(input_shape, root='/home/docs/.mxnet/datasets')¶
- _load_data()¶
Customized _load_data method introduction.
Cifar10Dataset only supports the NCHW layout, and the number of channels must be 3, i.e. (batch_size, 3, 32, 32). The image height and width must be equal to 32.
The data preprocessing includes:
\[\text{mean} = [0.4914, 0.4822, 0.4465]\]
\[\text{std} = [0.2023, 0.1994, 0.2010]\]
- class mrt.dataset.QuickDrawDataset(input_shape, is_train=False, **kwargs)¶
- _load_data()¶
Customized _load_data method introduction.
QuickDrawDataset only supports the NCHW layout, the number of channels must be 3, and the image height and width must be equal to 28, i.e. (batch_size, 3, 28, 28).
- class mrt.dataset.MnistDataset(input_shape, root='/home/docs/.mxnet/datasets')¶
- _load_data()¶
Customized _load_data method introduction.
The MXNet gluon package will auto-download the mnist dataset.
MnistDataset only supports the NCHW layout, the number of channels must be 1, and the image height and width must be equal to 28, i.e. (batch_size, 1, 28, 28).
- class mrt.dataset.TrecDataset(input_shape, is_train=False, **kwargs)¶
- _load_data()¶
Customized _load_data method introduction.
The MxNet gluon package will auto-download the mnist dataset.
TrecDataset only supports the (I, N) layout.
- validate(metrics, predict, label)¶
Customized validate method introduction.
The model output is the score for 6 classes. The accuracy is computed as follows:
acc = 1. * metrics["acc"] / metrics["total"]
Customize Dataset¶
One may want to add an additional dataset to the MRT framework.
There are two situations: the extra dataset format is
compatible with an existing dataset, such as another imagenet
dataset with the same MXNet Record Binary file format,
or the dataset has a unique format that requires customizing
the data loading logic.
Compatible Format¶
For a dataset that is compatible with an existing dataset, one can
simply reuse the corresponding dataset class and change
the dataset root directory, since the abstract dataset
supplies the extra root parameter to replace the default
MRT dataset location.
For example:
ds = dataset.DS_REG['imagenet'](
(16, 3, 614, 614), # the dataset input shape
root="your/dataset/path", # specify the new dataset path
)
Custom Format¶
You need to implement your own dataset class after importing the MRT dataset package. We suggest reviewing the section Abstract Dataset for the dataset interface.
Generally, one should derive from the Dataset class and
implement the abstract functions _load_data, metrics and validate,
plus iter_func when the default version does not fit.
Here is some example code:
from mxnet import ndarray as nd
from mrt.dataset import Dataset, DS_REG, register_dataset

@register_dataset("my_dataset")
class MyDataset(Dataset):
    def _load_data(self):
        B = self.ishape[0]
        def _data_loader():
            for i in range(1000):
                yield nd.array([i + c for c in range(B)])
        self.data = _data_loader()

    # use the default `iter_func` defined in base class

    def metrics(self):
        # keys must match the ones updated in `validate`
        return {"acc": 0, "total": 0}

    def validate(self, metrics, predict, label):
        for idx in range(predict.shape[0]):
            res_label = predict[idx].asnumpy().argmax()
            data_label = label[idx].asnumpy()
            if res_label == data_label:
                metrics["acc"] += 1
            metrics["total"] += 1
        acc = 1. * metrics["acc"] / metrics["total"]
        return "{:6.2%}".format(acc)

# usage
md_cls = DS_REG["my_dataset"]
ds = md_cls([8]) # batch size is 8
data_iter_func = ds.iter_func()
data_iter_func() # get the batch data

# output
# NDArray<[0, 1, 2, 3, 4, 5, 6, 7] @ctx(cpu)>
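The example above fetches a single batch. How metrics and validate cooperate across several batches can be sketched in plain Python as follows. Everything here is a hypothetical stand-in: lists replace NDArrays, the predictions are invented instead of coming from a model, and the loop is only one possible way to drive the interface.

```python
def new_metrics():
    """Counterpart of `metrics()`: a fresh accumulator."""
    return {"acc": 0, "total": 0}


def validate(metrics, predict, label):
    """Update the accumulator with one batch and return the running accuracy."""
    for scores, target in zip(predict, label):
        res_label = max(range(len(scores)), key=lambda i: scores[i])  # argmax
        if res_label == target:
            metrics["acc"] += 1
        metrics["total"] += 1
    return "{:6.2%}".format(metrics["acc"] / metrics["total"])


# Invented (predict, label) batches; a real run would use model outputs.
batches = [
    ([[0.9, 0.1], [0.2, 0.8]], [0, 1]),  # both predictions correct
    ([[0.6, 0.4], [0.7, 0.3]], [1, 0]),  # only the second one correct
]

metrics = new_metrics()
for predict, label in batches:
    report = validate(metrics, predict, label)
```

Because the same metrics object is passed to every call, the returned string always reflects the accuracy over all batches seen so far.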