Register Datasets#

The first step in integrating your training code with 3LC is to register your datasets. There are several ways to do this, depending on the format of your data and the framework you are integrating with.

Importer Tables#

The most direct way to register a dataset is to create a Table object manually, using one of the “importer” table types. 3LC currently supports loading CSV files, Parquet files, COCO-format datasets, Pandas DataFrames, and Python dictionaries.

import tlc

## Assuming data.csv is in the same directory as this notebook
csv_table = tlc.Table.from_csv("./data.csv", table_name="my-csv-table")

## Assuming data.parquet is in the same directory as this notebook
parquet_table = tlc.Table.from_parquet("./data.parquet", table_name="my-parquet-table")

## Assuming annotations.json and images/ are in the same directory as this notebook
coco_table = tlc.Table.from_coco(
    annotations_file="./annotations.json",
    image_folder="./images",
    table_name="my-coco-table",
)

## Assuming df is a pandas DataFrame
df_table = tlc.Table.from_pandas(df, table_name="my-pandas-table")

## Assuming data is a dictionary
dict_table = tlc.Table.from_dict(data, table_name="my-dict-table")

The above code creates Table objects for each of the input types. All the Table.from_* methods provide a set of common parameters for controlling the destination URL, schema and sample-view information, the behavior if the table already exists, configuration of default columns, and more.

See 3LC Project Structure for more information on how to control the URL of the created Table object.
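The importer snippet above assumes that inputs such as data.csv and the data dictionary already exist. The following is a minimal sketch of how such inputs might be prepared using only the Python standard library. The rows, column names, and helper functions are hypothetical, and the column-oriented layout of the dictionary (column name mapped to a list of values) is an assumption rather than something stated on this page. The tlc calls are left as comments so that the sketch runs without 3LC installed.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows; the column names and values are illustrative only.
rows = [
    {"id": 0, "label": "cat", "score": 0.91},
    {"id": 1, "label": "dog", "score": 0.78},
    {"id": 2, "label": "cat", "score": 0.64},
]


def write_csv(rows, path):
    """Write a list of row dictionaries to a CSV file with a header line."""
    with Path(path).open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def rows_to_columns(rows):
    """Turn row dictionaries into a column-oriented dictionary of lists."""
    return {key: [row[key] for row in rows] for key in rows[0]}


# Column-oriented dictionary (assumed layout): one list of values per column.
data = rows_to_columns(rows)

# Write the same rows to a CSV file in a temporary directory.
csv_path = Path(tempfile.mkdtemp()) / "data.csv"
write_csv(rows, csv_path)

# With 3LC installed, these inputs could then be passed to the importers:
# csv_table = tlc.Table.from_csv(str(csv_path), table_name="my-csv-table")
# dict_table = tlc.Table.from_dict(data, table_name="my-dict-table")
```

Keeping the data preparation separate from the registration call makes it easy to inspect the inputs before a Table is created from them.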

From PyTorch Dataset#

To register a PyTorch Dataset as a Table, call Table.from_torch_dataset.

Under the hood, this creates a TableFromTorchDataset, which is a subclass of Table.

import tlc
from torchvision.datasets import CIFAR10

dataset = CIFAR10(root="./data", download=True)

tlc_dataset = tlc.Table.from_torch_dataset(
    dataset=dataset,
    dataset_name="cifar10-train",
)
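The CIFAR10 example uses a ready-made torchvision dataset. For custom data, the object passed in would presumably be a map-style dataset, that is, one implementing __len__ and __getitem__. The sketch below defines a hypothetical SquaresDataset for illustration. Whether 3LC can infer a schema for such samples without further arguments is not covered on this page, so the registration call is shown only as a comment that mirrors the call above. The import fallback exists solely so that the sketch runs without PyTorch installed.

```python
try:
    from torch.utils.data import Dataset
except ImportError:  # fallback so the sketch runs without PyTorch installed
    Dataset = object


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset yielding (x, x squared) pairs."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of samples in the dataset.
        return self.size

    def __getitem__(self, index):
        # Raise IndexError for out-of-range indices, as sequences do.
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index, index * index


dataset = SquaresDataset(5)
samples = [dataset[i] for i in range(len(dataset))]

# With 3LC installed, registration would mirror the CIFAR10 call above:
# table = tlc.Table.from_torch_dataset(
#     dataset=dataset,
#     dataset_name="squares-train",
# )
```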

YOLO Format#

For details on how to register datasets in the YOLO format, see the YOLOv5 or YOLOv8 integration documentation.

COCO Format with Detectron2#

When integrating with the detectron2 framework, the 3LC Python package provides a drop-in replacement for detectron2's register_coco_instances function. See the detectron2 integration documentation for more details.

Hugging Face Datasets#

To use datasets from Hugging Face 🤗 Datasets, tlc provides an alternative to the datasets.load_dataset function: Table.from_hugging_face. See the Hugging Face integration documentation for more details.