Register Datasets#

The first step in integrating your training code with 3LC is to register your datasets. This can be accomplished in a number of ways, depending on the format of your data and the framework you are integrating with.

Importer Tables#

The most direct way to register a dataset is to manually create a Table object using one of the “importer” table types. We currently support loading CSV, Parquet, COCO-format, Pandas, and Python dictionary data.

import tlc

# Assuming data.csv is in the same directory as this notebook
csv_table = tlc.Table.from_csv("./data.csv", table_name="my-csv-table")

# Assuming data.parquet is in the same directory as this notebook
parquet_table = tlc.Table.from_parquet("./data.parquet", table_name="my-parquet-table")

# Assuming annotations.json and images/ are in the same directory as this notebook
coco_table = tlc.Table.from_coco(
    annotations_file="./annotations.json",
    image_folder="./images",
    table_name="my-coco-table",
)

# Assuming df is a pandas DataFrame
df_table = tlc.Table.from_pandas(df, table_name="my-pandas-table")

# Assuming data is a dictionary
dict_table = tlc.Table.from_dict(data, table_name="my-dict-table")

The above code creates a Table object for each of the supported input types. All Table.from_* methods share a set of common parameters for controlling the destination URL, schema and sample-view information, the behavior if the table already exists, the configuration of default columns, and more.

See 3LC Project Structure for more information on how to control the URL of the created Table object.
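
As an illustration of how these common parameters fit together, the sketch below registers a CSV file with explicit project, dataset, and table names plus a description and a default weight column. The parameter names are taken from the examples elsewhere on this page; consult the API reference for the full set and their defaults.

import tlc

# Sketch only: project_name, dataset_name and table_name together control
# where the Table is written (see 3LC Project Structure).
csv_table = tlc.Table.from_csv(
    "./data.csv",
    project_name="my-project",
    dataset_name="my-dataset",
    table_name="my-csv-table",
    description="Tabular data imported from a CSV file",
    add_weight_column=True,  # include the default editable weight column
)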

From PyTorch Dataset#

To register a PyTorch Dataset as a Table, call Table.from_torch_dataset.

Under the hood, this creates a TableFromTorchDataset, which is a subclass of Table.

import tlc
from torchvision.datasets import CIFAR10

dataset = CIFAR10(root="./data", download=True)

tlc_dataset = tlc.Table.from_torch_dataset(
    dataset=dataset,
    dataset_name="cifar10-train",
)
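
The returned Table can then be used much like the dataset it wraps. The snippet below is a minimal usage sketch, assuming the Table exposes a map-style interface (length and integer indexing):

print(len(tlc_dataset))   # number of samples
sample = tlc_dataset[0]   # first sample; layout mirrors the wrapped dataset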

From Folders of Images#

The Table.from_image_folder helper function is designed to create a structured table from a folder containing images or subfolders containing images. This function is particularly useful when working with datasets where images are organized in subfolders, each representing a different category or label. It extends the functionality of torchvision.datasets.ImageFolder by allowing not only labelled subfolders, but arbitrary folder hierarchies of (unlabelled) images.

The standard arguments for providing names and descriptions, adding a weight column, adding extra columns, and so on are also available.

Examples#

# Folder structure:
# /cats-and-dogs/
#     ├── cats/
#     │   ├── cat1.jpg
#     │   ├── cat2.jpg
#     │   └── ...
#     └── dogs/
#         ├── dog1.jpg
#         ├── dog2.jpg
#         └── ...

# Example usage for the `cats-and-dogs` folder structure
table = tlc.Table.from_image_folder(
    root="/cats-and-dogs",
    image_column_name="my_image",           # Override the name of the column containing images
    table_name="cats_and_dogs_table",
    dataset_name="cats_and_dogs_dataset",
    project_name="animal_classification",
    description="A dataset of cats and dogs images organized by category."
)

# The resulting table will have the following columns:
# | my_image | label | weight |
# Folder structure:
# /data/
#     ├── subfolderA/
#     │   ├── diagram1.png
#     │   ├── diagram2.png
#     │   ├── notes1.txt
#     │   ├── notes2.txt
#     │   └── ...
#     └── subfolderB/
#         ├── image1.png
#         ├── image2.png
#         ├── readme.txt
#         └── ...

# Example usage for recursively scanning the `data` folder structure
table = tlc.Table.from_image_folder(
    root="/data",
    include_label_column=False,
    extensions=("png",),        # ignore files with extensions other than .png
    table_name="data_images_table",
    dataset_name="project_data_images",
    project_name="data_analysis_project",
    add_weight_column=False,
    description="A dataset of images."
)

# The resulting table will have the following columns:
# | image |

YOLO Format#

For details on how to register datasets in the YOLO format, see the YOLOv5 or YOLOv8 integration documentation.

COCO Format with Detectron2#

When integrating with the detectron2 framework, the tlc Python package provides a drop-in replacement for detectron2's register_coco_instances function. See the detectron2 integration documentation for more details.
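
The sketch below illustrates the intended drop-in usage; the import path shown is an assumption, and the call mirrors detectron2's register_coco_instances signature (dataset name, metadata dict, annotations file, image folder). Check the integration documentation for the exact module to import from.

# Hypothetical import path; see the detectron2 integration docs for the actual one.
from tlc.integration.detectron2 import register_coco_instances

# Same call signature as detectron2's own register_coco_instances
register_coco_instances(
    "my-coco-train",
    {},
    "./annotations.json",
    "./images",
)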

Hugging Face Datasets#

To use datasets from Hugging Face 🤗 Datasets, tlc provides an alternative to the datasets.load_dataset function: Table.from_hugging_face. See the Hugging Face integration documentation for more details.
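
As a rough sketch, the call below follows the datasets.load_dataset convention of passing a dataset path and split; the exact keyword arguments accepted by Table.from_hugging_face are described in the Hugging Face integration documentation.

import tlc

# Sketch only: argument names follow the datasets.load_dataset convention.
hf_table = tlc.Table.from_hugging_face(
    "cifar10",
    split="train",
    table_name="cifar10-train",
)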