Import

A Table can be thought of as a recipe for generating the rows of a dataset. Some Tables refer to input data and produce a Schema based on those inputs; these are commonly referred to as importer tables. The schema can in some cases be overridden or tweaked, but the importer is generally responsible for determining it.

Other Tables are a result of applying an operation to input Table(s), such as filtering away rows, applying edits or joining two tables together. These are referred to as procedural tables, and we will learn more about them in Revisions.

A third category is tables created from live data — for example by iterating over input objects or collecting metrics during training. This is what 3LC uses internally for metrics collection. In these cases schemas can be inferred automatically, but it is usually a good idea to specify schemas explicitly to enable richer views in the Dashboard.

This page serves as an index into the different import methods, with simple examples, caveats and links to the API documentation for more details.

Importer Table.from_* parameters

The importer Table.from_* methods share a nearly identical interface, apart from the parameters specific to each import format. The parameters common to each method are described here:

  • project_name: The project name should be something describing the model and/or dataset you are working on.

  • dataset_name: A descriptive name of a group of samples in your project, such as the split of the dataset.

  • table_name: The revision of your dataset. Defaults to "initial", which is a good value for new Tables; set an explicit name when you want to distinguish multiple tables in the same dataset, or when you have existing version information that is more accurate than "initial".

  • root_url: Override for the configured project root url, to save the Table in a different location.

  • if_exists: What to do if a Table with the same root url, project, dataset and table name already exists. The default is reuse, which returns the existing Table. The other options are raise, which is useful when you expect that no Table exists yet; rename, which adds a suffix like _0000 and creates a new Table; and overwrite, which deletes any existing Table and creates a new one.

  • schema: Override or refine the schema that the importer would otherwise produce. Available on importers where the auto-detected schema can meaningfully be customized (from_dict, from_torch_dataset, from_pandas, from_csv, from_coco, from_parquet, from_ndjson, from_yolo). Not available on importers that fully determine their own schema (e.g. from_hugging_face_hub, from_image_folder, from_yolo_url, from_yolo_ndjson).

  • add_weight_column: Whether to add a column for sample weights, only visible in the Dashboard.

  • weight_column_value: The value to assign to each weight, if a weight column is added.

  • description: A description for the Table, which will be shown in the DESCRIPTION column in the Dashboard. You can think of this as a commit message.

  • extra_columns: The schema of any extra columns to add to the Table.

  • input_tables: URLs to any existing Tables to declare as inputs. These will get arrows to the Table being created in the LINEAGE column in the Dashboard.

  • table_url: An alternative to providing project_name, dataset_name, table_name and (optionally) root_url. It writes the Table to a completely custom location and is mutually exclusive with those parameters.
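The four if_exists options can be illustrated with a small toy model. The sketch below is not 3LC code: resolve_table_name and the set of existing names are hypothetical stand-ins, and the exact suffix numbering used for rename is an assumption based on the _0000 example above.

```python
def resolve_table_name(existing: set[str], name: str, if_exists: str = "reuse") -> tuple[str, bool]:
    """Return (final_name, creates_new_table) for a toy registry of table names."""
    if name not in existing:
        return name, True  # no conflict: a new table is created under the given name
    if if_exists == "reuse":
        return name, False  # the existing table is returned as-is
    if if_exists == "raise":
        raise FileExistsError(f"Table {name!r} already exists")
    if if_exists == "overwrite":
        return name, True  # the existing table is replaced by a new one
    if if_exists == "rename":
        # Assumption: try _0000, _0001, ... until an unused name is found.
        i = 0
        while f"{name}_{i:04d}" in existing:
            i += 1
        return f"{name}_{i:04d}", True
    raise ValueError(f"Unknown if_exists value: {if_exists!r}")


existing = {"initial", "initial_0000"}
print(resolve_table_name(existing, "initial"))               # ('initial', False)
print(resolve_table_name(existing, "initial", "rename"))     # ('initial_0001', True)
print(resolve_table_name(existing, "initial", "overwrite"))  # ('initial', True)
```

With the default reuse, running the same import script twice hands back the Table from the first run instead of creating a duplicate.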

If your dataset is a Python dictionary of Python lists with column values (often called “struct of arrays”), use tlc.Table.from_dict(). Specify a schema to tell 3LC how to interpret the data.

import tlc

data = {"image": ["/path/to/image0.png", "/path/to/image1.png", ...]}

table = tlc.Table.from_dict(
    data=data,
    schema=...,
    project_name="Images From Dict Project",
    dataset_name="train",
)
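If your samples start out as a list of per-row dictionaries, they can be pivoted into the column layout described above before import. This is a minimal sketch using only the standard library; the rows and the label column are made-up illustration data.

```python
from collections import defaultdict

# Hypothetical input: one dict per sample ("array of structs").
rows = [
    {"image": "/path/to/image0.png", "label": 0},
    {"image": "/path/to/image1.png", "label": 1},
]

# Pivot into one list per column ("struct of arrays").
columns = defaultdict(list)
for row in rows:
    for key, value in row.items():
        columns[key].append(value)
data = dict(columns)

# All columns should have the same length for the result to describe a table.
assert len({len(values) for values in data.values()}) == 1
print(data)
```

The resulting data dictionary has the same shape as the data argument in the from_dict() example.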

If your dataset is an iterable of Rows or data in the Sample view (often called “array of structs”), for example a Torch Dataset, use tlc.Table.from_torch_dataset(). Specify a schema to tell 3LC how to interpret the data.

import tlc

table = tlc.Table.from_torch_dataset(
    dataset=data,  # any iterable of rows or samples, for example a Torch Dataset
    schema=...,
    project_name="Images From Iterable Project",
    dataset_name="train",
)

Image Folder datasets store image files in one directory per category, following the layout used by the commonly used torchvision.datasets.ImageFolder, as in the following example:

root/
  ├── dog/
  │  ├── image1.jpg
  │  ├── image2.jpg
  │  └── ...
  └── cat/
     ├── image1.jpg
     ├── image2.jpg
     └── ...

Use the method Table.from_image_folder() to create a tlc.Table from such an image folder dataset.

import tlc

table = tlc.Table.from_image_folder(
    root="root",
    project_name="Image Folder Dataset",
    dataset_name="train",
)
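Before importing, it can help to check which categories and files a folder tree actually contains. The sketch below builds a throwaway tree with empty placeholder files and lists its contents with pathlib. The alphabetical class ordering mirrors torchvision.datasets.ImageFolder; that from_image_folder() assigns indices the same way is an assumption.

```python
import tempfile
from pathlib import Path

# Build a temporary folder tree that mirrors the layout above (empty placeholder files).
root = Path(tempfile.mkdtemp()) / "root"
for category in ("dog", "cat"):
    (root / category).mkdir(parents=True)
    for name in ("image1.jpg", "image2.jpg"):
        (root / category / name).touch()

# Assumption: each immediate subdirectory is one category, sorted alphabetically.
classes = sorted(p.name for p in root.iterdir() if p.is_dir())

# Pair every image (relative to root) with the index of its category.
samples = [
    (path.relative_to(root).as_posix(), classes.index(path.parent.name))
    for path in sorted(root.glob("*/*.jpg"))
]
print(classes)  # ['cat', 'dog']
print(samples)
```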

For datasets in the COCO format, use tlc.Table.from_coco() and provide your task:

import tlc

table = tlc.Table.from_coco(
    annotations_file="path/to/annotations.json",
    image_folder="/path/to/images",
    project_name="My COCO Project",
    dataset_name="train",
    task="detect",
)

The resulting tlc.Table has a column named image that references the images, and a column with ground truth labels whose name, associated Schema and data depend on which task is provided. See Computer Vision Columns for more details.

For datasets in the YOLO format, use tlc.Table.from_yolo_url() and provide the location of your images, the categories and task (defaults to "detect").

import tlc

table = tlc.Table.from_yolo_url(
    images_url="path/to/images",
    categories={0: "cat", 1: "dog"},
    task="detect",
    project_name="My YOLO Project",
    dataset_name="train",
)

The resulting tlc.Table has a column named image that references the images, and a column with ground truth labels whose name, associated Schema and data depend on which task is provided. See Computer Vision Columns for more details.

YOLO Dataset YAML file

If your YOLO dataset has split locations, categories and other metadata (such as keypoint shapes, download code etc.) declared in a YOLO Dataset YAML file, see the Ultralytics YOLO integration for functionality to create Tables from such a file.

Multiple Locations

We recommend creating one tlc.Table for each split. If the images in one split are stored in multiple image folders, pass a list of URLs to the images_url parameter.

Alternatively, create a text file with one image URL on each line, and provide the URL to the text file as the images_url parameter.
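The text file described above can be generated with a few lines of standard-library Python. This is a sketch under assumptions: the folder names batch_a and batch_b, the file name train_images.txt, and the .png extension are all hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical image folders that belong to the same split.
base = Path(tempfile.mkdtemp())
folders = [base / "batch_a", base / "batch_b"]
for folder in folders:
    folder.mkdir()
    (folder / "image0.png").touch()  # empty placeholder file

# Collect every image path and write one per line.
image_paths = sorted(str(p) for folder in folders for p in folder.glob("*.png"))
list_file = base / "train_images.txt"
list_file.write_text("\n".join(image_paths) + "\n", encoding="utf-8")

print(list_file.read_text(encoding="utf-8"))
```

The location of list_file is then the value to provide as images_url.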

Use the method tlc.Table.from_hugging_face_hub() to create a tlc.Table from a dataset available through the Hugging Face datasets package. The Hugging Face Dataset is downloaded, and the Features of the Dataset are mapped to a corresponding tlc.Schema.

import tlc

table = tlc.Table.from_hugging_face_hub(
    path="beans",
    split="train",
    project_name="Beans Project",
    dataset_name="train",
)

If your data is in a CSV (comma-separated values) file, use tlc.Table.from_csv().

import tlc

table = tlc.Table.from_csv(
    csv_file="path/to/file.csv",
    project_name="My CSV Project",
    dataset_name="split",
)

If your data is in a pandas.DataFrame, use tlc.Table.from_pandas(). Provide a schema to indicate what each row contains.

import tlc
import pandas as pd

df = pd.DataFrame(
    data={
        "name": ["Sam Pullweight", "Max Epoch", "Minnie Epoch"],
        "grade": [9, 8, 10],
    }
)

table = tlc.Table.from_pandas(
    df=df,
    schema={"name": tlc.schemas.StringSchema(), "grade": tlc.schemas.Int32Schema()},
    project_name="My Pandas Project",
    dataset_name="train",
)

If your data is in an Apache Parquet file, use tlc.Table.from_parquet().

import tlc

table = tlc.Table.from_parquet(
    parquet_file="path/to/file.parquet",
    project_name="My Parquet Project",
    dataset_name="train",
)

NDJSON (Newline-Delimited JSON) is a format where each line of a file contains a JSON object. For example, the following is a valid NDJSON file:

{"name": "Sam Pullweight", "grade": 9}
{"name": "Max Epoch", "grade": 8}
{"name": "Minnie Epoch", "grade": 10}

To create a Table from such a file, use tlc.Table.from_ndjson().

import tlc

table = tlc.Table.from_ndjson(
    ndjson_file="path/to/file.ndjson",
    project_name="My NDJSON Project",
    dataset_name="split",
)
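If your records live in memory rather than in a file, an NDJSON file can be produced with the json module. This is a minimal sketch that writes the three example records to a temporary location (the file name is arbitrary) and reads them back to confirm that each line parses on its own.

```python
import json
import tempfile
from pathlib import Path

records = [
    {"name": "Sam Pullweight", "grade": 9},
    {"name": "Max Epoch", "grade": 8},
    {"name": "Minnie Epoch", "grade": 10},
]

ndjson_path = Path(tempfile.mkdtemp()) / "file.ndjson"

# One compact JSON object per line; json.dumps escapes newlines inside strings.
with ndjson_path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line to check that every line is valid JSON.
with ndjson_path.open(encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f if line.strip()]

print(parsed == records)  # True
```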

YOLO NDJSON is an alternative way to define datasets for Ultralytics YOLO models. The format stores metadata and labels in a single file, with a defined structure.

The following example shows the format that 3LC expects. The first line is reserved for metadata, and the subsequent lines correspond to images and their labels.

{"task": "detect", "class_names": {"0": "cat", "1": "dog"}, "description": "Cats and Dogs"}
{"file": "image0.png", "width": 1280, "height": 920, "split": "train", "annotations": ["..."]}
{"file": "image1.png", "width": 640, "height": 480, "split": "train", "annotations": ["..."]}
{"file": "image2.png", "width": 1280, "height": 920, "split": "val", "annotations": ["..."]}
"..."

The format of the annotations depends on the task; refer to the Ultralytics Documentation for details.

To create a Table from a YOLO NDJSON file, use tlc.Table.from_yolo_ndjson(). We recommend creating one tlc.Table for each split, in a loop like this:

import tlc

for split in ("train", "val", "test"):
    table = tlc.Table.from_yolo_ndjson(
        ndjson_file="path/to/yolo.ndjson",
        image_folder="path/to/image/folder",
        split=split,
        project_name="My YOLO Project from NDJSON",
        dataset_name=split,
    )

The image_folder parameter only needs to be used when the "file" paths in the NDJSON file are relative to some directory other than the one containing the NDJSON file.

If the JSON object in the first line of the NDJSON file contains a field "description", it will be used for the Table unless a description is provided to tlc.Table.from_yolo_ndjson().
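To find out which splits a YOLO NDJSON file contains before looping over them, the file can be inspected with the standard library. The sketch below writes a small file in the structure shown earlier and then reads the metadata line and counts images per split. The empty annotations lists are placeholders, since the annotation format depends on the task.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical file contents following the structure shown above.
lines = [
    {"task": "detect", "class_names": {"0": "cat", "1": "dog"}, "description": "Cats and Dogs"},
    {"file": "image0.png", "width": 1280, "height": 920, "split": "train", "annotations": []},
    {"file": "image1.png", "width": 640, "height": 480, "split": "train", "annotations": []},
    {"file": "image2.png", "width": 1280, "height": 920, "split": "val", "annotations": []},
]
ndjson_path = Path(tempfile.mkdtemp()) / "yolo.ndjson"
ndjson_path.write_text("".join(json.dumps(obj) + "\n" for obj in lines), encoding="utf-8")

with ndjson_path.open(encoding="utf-8") as f:
    metadata = json.loads(next(f))  # first line: dataset metadata
    # Remaining lines: one image per line, each tagged with its split.
    split_counts = Counter(json.loads(line)["split"] for line in f if line.strip())

print(metadata["task"], metadata["class_names"])
print(dict(split_counts))  # {'train': 2, 'val': 1}
```

Iterating over the keys of split_counts instead of a fixed tuple lets the import loop cover exactly the splits present in the file.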

A tlc.TableWriter is a way of producing a tlc.Table from rows or batches of Python data. The TableWriter accepts Python objects (PIL Images, NumPy arrays, dataclasses) and handles serialization and externalization automatically based on the schema.

import tlc
from PIL import Image

writer = tlc.TableWriter(
    schema={"image": tlc.schemas.ImageSchema(), "label": tlc.schemas.CategoricalLabelSchema(classes=["cat", "dog"])},
    project_name="Image Classification",
    dataset_name="train",
)

for i in range(5):
    writer.add_row({"image": Image.new("RGB", (100, 100)), "label": i % 2})

table = writer.finalize()

For scalar or batch data, add_batch() writes multiple rows at once:

import tlc

table_writer = tlc.TableWriter(
    schema={"my_float": tlc.schemas.Float32Schema()},
    project_name="Data from Table Writer Project",
    dataset_name="train",
)

table_writer.add_batch({"my_float": [0.0, 1.0, 2.0]})
table_writer.add_batch({"my_float": [3.0, 4.0, 5.0]})

table = table_writer.finalize()
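When values arrive as one long stream, they can be grouped into column dictionaries of a fixed size before being handed to add_batch(). The helper below is a hypothetical convenience function, not part of tlc, and uses only the standard library.

```python
from itertools import islice
from typing import Iterable, Iterator


def column_batches(values: Iterable[float], column: str, batch_size: int) -> Iterator[dict[str, list[float]]]:
    """Yield {column: [values...]} dictionaries with at most batch_size values each."""
    iterator = iter(values)
    # islice pulls up to batch_size items; an empty list means the stream is exhausted.
    while batch := list(islice(iterator, batch_size)):
        yield {column: batch}


batches = list(column_batches([0.0, 1.0, 2.0, 3.0, 4.0], column="my_float", batch_size=3))
print(batches)  # [{'my_float': [0.0, 1.0, 2.0]}, {'my_float': [3.0, 4.0]}]
```

Each yielded dictionary has the same shape as the arguments passed to add_batch() in the example above.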

Tip

Pass sample-form values (e.g. PIL.Image objects) for any column with a sample type; TableWriter will serialize them. If you already have the row-form value on hand (e.g. a URL string pointing to an image file for an ImageSchema column), you can pass it directly and it will be stored without re-encoding the data — the URL is still made table-relative, or rewritten to an alias if one applies:

writer = tlc.TableWriter(
    schema={"image": tlc.schemas.ImageSchema()},
    project_name="Images From URLs",
)
writer.add_row({"image": "/path/to/existing.png"})

Keep each batch uniform — either all sample-form or all row-form per column. That’s the happy path.