Register Datasets¶
The first step in integrating your training code with 3LC is to register your datasets. This can be accomplished in a number of ways, depending on the format of your data and the framework you are integrating with.
Importer Tables¶
The most direct way to register a dataset is to create a Table object manually, using one of the “importer” table types.
3LC currently supports loading CSV files, Parquet files, COCO-format datasets, pandas DataFrames, and Python dictionaries.
import tlc
# Assuming data.csv is in the same directory as this notebook
csv_table = tlc.Table.from_csv("./data.csv", table_name="my-csv-table")
# Assuming data.parquet is in the same directory as this notebook
parquet_table = tlc.Table.from_parquet("./data.parquet", table_name="my-parquet-table")
# Assuming annotations.json and images/ are in the same directory as this notebook
coco_table = tlc.Table.from_coco(
    annotations_file="./annotations.json",
    image_folder="./images",
    table_name="my-coco-table",
)
# Assuming df is a pandas DataFrame
df_table = tlc.Table.from_pandas(df, table_name="my-pandas-table")
# Assuming data is a dictionary
dict_table = tlc.Table.from_dict(data, table_name="my-dict-table")
The code above creates a Table object for each input type. All the Table.from_* methods share a set of common parameters that control the destination URL, the schema and sample-view information, the behavior if the table already exists, the configuration of default columns, and more.
See 3LC Project Structure for more information on how to control the URL of the created Table object.
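To make the inputs above concrete, the following sketch builds a small CSV file and a column-oriented dictionary using only the standard library, then wraps the registration calls in a helper. The sample rows, the temporary file location, and the register_examples helper are hypothetical. The sketch also assumes that Table.from_dict accepts a mapping from column names to equal-length lists, which should be checked against the API reference.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample data: three rows with a text column and a label column.
rows = [
    {"text": "good", "label": 1},
    {"text": "bad", "label": 0},
    {"text": "great", "label": 1},
]

# Write the rows to a CSV file in a temporary directory.
csv_path = Path(tempfile.mkdtemp()) / "data.csv"
with csv_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Build a column-oriented dictionary from the same rows
# (assumed input shape for Table.from_dict).
data = {key: [row[key] for row in rows] for key in ("text", "label")}


def register_examples():
    """Register both inputs as tables. Requires the tlc package."""
    import tlc  # imported lazily so the data preparation runs without 3LC

    csv_table = tlc.Table.from_csv(str(csv_path), table_name="my-csv-table")
    dict_table = tlc.Table.from_dict(data, table_name="my-dict-table")
    return csv_table, dict_table
```

Calling register_examples() in an environment where tlc is installed and configured should produce two tables holding the same three rows.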
From PyTorch Dataset¶
To register a PyTorch Dataset as a Table, call Table.from_torch_dataset.
Under the hood, this creates a TableFromTorchDataset, which is a subclass of Table.
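As a minimal sketch, the snippet below defines a small map-style dataset and passes it to Table.from_torch_dataset. The SquaresDataset class, the table name, and the register helper are hypothetical. Depending on the sample type, the schema and sample-view information may also need to be supplied through the common parameters described earlier. The fallback to object is only there so the sketch can be read and run without PyTorch installed.

```python
try:
    from torch.utils.data import Dataset
except ImportError:  # fallback so the sketch runs without PyTorch
    Dataset = object


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset yielding (x, x * x) pairs."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index, index * index


dataset = SquaresDataset(5)


def register(torch_dataset):
    """Register the dataset as a Table. Requires the tlc package."""
    import tlc  # imported lazily so the dataset can be tested on its own

    return tlc.Table.from_torch_dataset(torch_dataset, table_name="my-torch-table")
```

Any dataset that implements __len__ and __getitem__ in this way can be handed to register in the same manner.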
From Folders of Images¶
The Table.from_image_folder helper function creates a structured table from a folder containing images, or from subfolders containing images. It is particularly useful for datasets where images are organized in subfolders, each representing a different category or label. It extends the functionality of torchvision.datasets.ImageFolder by supporting not only labelled subfolders, but also arbitrary folder hierarchies of (unlabelled) images.
The standard arguments for setting names and descriptions, adding a weight column, adding extra columns, and so on are also available.
Examples¶
# Folder structure:
# /cats-and-dogs/
# ├── cats/
# │ ├── cat1.jpg
# │ ├── cat2.jpg
# │ └── ...
# └── dogs/
# ├── dog1.jpg
# ├── dog2.jpg
# └── ...
# Example usage for the `cats-and-dogs` folder structure
table = tlc.Table.from_image_folder(
    root="/cats-and-dogs",
    image_column_name="my_image",  # Override the name of the column containing images
    table_name="cats_and_dogs_table",
    dataset_name="cats_and_dogs_dataset",
    project_name="animal_classification",
    description="A dataset of cats and dogs images organized by category.",
)
# The resulting table will have the following columns:
# | my_image | label | weight |
# Folder structure:
# /data/
# ├── subfolderA/
# │ ├── diagram1.png
# │ ├── diagram2.png
# │ ├── notes1.txt
# │ ├── notes2.txt
# │ └── ...
# └── subfolderB/
# ├── image1.png
# ├── image2.png
# ├── readme.txt
# └── ...
# Example usage for recursively scanning the `data` folder structure
table = tlc.Table.from_image_folder(
    root="/data",
    include_label_column=False,
    extensions=("png",),  # Ignore files with extensions other than .png
    table_name="data_images_table",
    dataset_name="project_data_images",
    project_name="data_analysis_project",
    add_weight_column=False,
    description="A dataset of images.",
)
# The resulting table will have the following columns:
# | image |
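To illustrate the folder convention used in the first example, the sketch below recreates a small labelled layout in a temporary directory and lists (image path, label) pairs using only the standard library. This is not how 3LC implements Table.from_image_folder. The scan_image_folder helper is hypothetical and only shows the usual interpretation, in which the label is the name of the top-level subfolder and files with other extensions are skipped.

```python
import tempfile
from pathlib import Path


def scan_image_folder(root, extensions=("jpg", "png")):
    """Return sorted (relative_path, label) pairs for matching files.

    The label is the name of the top-level subfolder, or None for files
    that sit directly in the root folder.
    """
    root = Path(root)
    suffixes = {"." + ext.lower().lstrip(".") for ext in extensions}
    rows = []
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            relative = path.relative_to(root)
            label = relative.parts[0] if len(relative.parts) > 1 else None
            rows.append((relative.as_posix(), label))
    return sorted(rows)


# Recreate a tiny cats-and-dogs layout with empty placeholder files.
root = Path(tempfile.mkdtemp())
for name in ("cats/cat1.jpg", "cats/cat2.jpg", "dogs/dog1.jpg", "dogs/notes.txt"):
    file_path = root / name
    file_path.parent.mkdir(parents=True, exist_ok=True)
    file_path.touch()

rows = scan_image_folder(root, extensions=("jpg",))
for row in rows:
    print(row)
```

The text file is filtered out by the extension check, which mirrors the purpose of the extensions argument in the second example.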
YOLO Format¶
For details on how to register datasets in the YOLO format, see the YOLOv5 or YOLO integration documentation.
COCO Format with Detectron2¶
When integrating with the detectron2 framework, the 3LC Python package provides a drop-in replacement for the register_coco_instances function from detectron2. See the detectron2 integration documentation for more details.
Hugging Face Datasets¶
To use datasets from Hugging Face 🤗 Datasets, tlc provides Table.from_hugging_face as an alternative to the datasets.load_dataset function.
See the Hugging Face integration documentation for more details.