Register Datasets

The first step in integrating your training code with 3LC is to register your datasets. There are several ways to do this, depending on the format of your data and the framework you are integrating with.

Importer Tables

The most direct way to register a dataset is to create a Table object manually, using one of the “importer” table types. 3LC currently supports loading CSV, Parquet, and COCO-format datasets.

import tlc

# Assuming data.csv is in the same directory as this notebook
csv_table = tlc.TableFromCsv(input_url="./data.csv")

# Assuming data.parquet is in the same directory as this notebook
parquet_table = tlc.TableFromParquet(input_url="./data.parquet")

# Assuming annotations.json and images/ are in the same directory as this notebook
coco_table = tlc.TableFromCoco(input_url="./annotations.json", image_folder_url="./images")

The above code creates a Table object for each of the three datasets. Table objects are lazy: they do not load data into memory until it is needed. The tables created above have not yet been written to a persistent location, so they are not yet available in the 3LC Dashboard.
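To make the idea of laziness concrete, here is a minimal sketch in plain Python. It is not how 3LC implements Table; the class LazyCsvRows and the helper demo are hypothetical names introduced only for illustration. The object stores the file location when it is constructed and reads the file the first time a row or the length is requested.

```python
import csv
import tempfile
from pathlib import Path


class LazyCsvRows:
    """Hypothetical illustration of lazy loading; not part of the tlc package."""

    def __init__(self, path):
        self.path = Path(path)  # only the location is stored here
        self._rows = None       # nothing has been read yet

    @property
    def loaded(self):
        """True once the file contents have been read into memory."""
        return self._rows is not None

    def _ensure_loaded(self):
        # Read the file on first access and cache the parsed rows.
        if self._rows is None:
            with self.path.open(newline="") as f:
                self._rows = list(csv.DictReader(f))

    def __len__(self):
        self._ensure_loaded()
        return len(self._rows)

    def __getitem__(self, index):
        self._ensure_loaded()
        return self._rows[index]


def demo():
    """Write a tiny CSV file, then show that reading is deferred."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "data.csv"
        path.write_text("id,label\n0,cat\n1,dog\n")
        rows = LazyCsvRows(path)
        before = rows.loaded   # no file access has happened yet
        first = rows[0]        # first access triggers the read
        after = rows.loaded
        return before, first, after, len(rows)
```

Calling demo() constructs the object without touching the file contents, and only the first row access triggers parsing. The same pattern explains why creating a table is cheap even for a large dataset.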

To make the tables available in the 3LC Dashboard, write them to a persistent location by setting the url property of each Table object and calling write_to_url().

import tlc

root_location = tlc.Table.default_write_location()

csv_table.url = root_location / "csv_table.json"
csv_table.write_to_url()

parquet_table.url = root_location / "parquet_table.json"
parquet_table.write_to_url()

coco_table.url = root_location / "coco_table.json"
coco_table.write_to_url()

The tables have now been written to the default write location for Tables, and will be available in the 3LC Dashboard.

Note: Table.default_write_location() returns the default write location for tables. Unless configured otherwise, this is the value of the config variable TABLE_ROOT_URL (see Configuration).

We use this location because it is guaranteed to be indexed by the Object Service.

From PyTorch Dataset

To register a PyTorch Dataset as a Table, call Table.from_torch_dataset().

Under the hood, this creates a TableFromTorchDataset, which is a subclass of Table.

from torchvision.datasets import CIFAR10
import tlc

dataset = CIFAR10(root="./data", download=True)

tlc_dataset = tlc.Table.from_torch_dataset(
    dataset=dataset,
    dataset_name="cifar10-train",
)
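CIFAR10 is a map-style PyTorch dataset, meaning it implements __len__ and __getitem__. A custom dataset follows the same protocol. The sketch below is a minimal illustration of that protocol only: SquaresDataset is a hypothetical name, and it is written as a plain class so that it runs without PyTorch installed. In real code it would subclass torch.utils.data.Dataset. That such a dataset can then be passed to Table.from_torch_dataset in the same way as CIFAR10 is an assumption, not something this page confirms.

```python
class SquaresDataset:
    """Hypothetical map-style dataset: item i is the pair (i, i * i).

    In real code this would subclass torch.utils.data.Dataset; the base
    class is omitted here so the sketch runs without PyTorch installed.
    """

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of samples in the dataset.
        return self.size

    def __getitem__(self, index):
        # Return one sample; raise IndexError when out of range so that
        # plain iteration over the dataset terminates correctly.
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index, index * index


dataset = SquaresDataset(4)
samples = [dataset[i] for i in range(len(dataset))]
```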

YOLO Format

For details on how to register datasets in the YOLO format, see the YOLOv5 integration documentation.

COCO Format with Detectron2

When integrating with the detectron2 framework, the tlc Python package provides a drop-in replacement for detectron2's register_coco_instances function. See the Detectron2 integration documentation for more details.

Hugging Face Datasets

To use datasets from the Hugging Face datasets library, the tlc Python package provides a drop-in replacement for the library's load_dataset function. See the Hugging Face integration documentation for more details.
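A drop-in replacement, as the term is used in the two sections above, is a function that accepts the same arguments and returns the same result as the original, while doing some extra work on the side. The sketch below shows the general pattern in plain Python. It does not reproduce the tlc integrations: load_dataset here is a hypothetical stand-in for a framework loader, and REGISTRY and registering are hypothetical names.

```python
import functools

REGISTRY = {}  # hypothetical stand-in for wherever registrations are kept


def load_dataset(name, split="train"):
    """Hypothetical stand-in for a framework's loader function."""
    return [f"{name}/{split}/{i}" for i in range(3)]


def registering(loader):
    """Wrap `loader` so each call also records what was loaded."""

    @functools.wraps(loader)
    def wrapper(*args, **kwargs):
        data = loader(*args, **kwargs)  # same call, same result
        key = (args, tuple(sorted(kwargs.items())))
        REGISTRY[key] = len(data)       # the extra side effect
        return data

    return wrapper


# Same signature and return value as load_dataset, plus registration.
load_dataset_with_registration = registering(load_dataset)
```

Because the wrapper forwards its arguments unchanged and returns the loader's result, existing call sites only need to swap the function they import.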