Import
Tables can be thought of as a recipe to generate the rows of a dataset. Some Tables refer to input
data and produce a Schema based on those inputs — these are commonly referred to as importer tables. The
schema can in some cases be overridden or tweaked, but it is generally the importer that has responsibility for
determining the schema.
Other Tables are a result of applying an operation to input Table(s), such as filtering away rows, applying edits or
joining two tables together. These are referred to as procedural tables, and we will learn more about them in
Revisions.
A third category is tables created from live data — for example by iterating over input objects or collecting metrics during training. This is what 3LC uses internally for metrics collection. In these cases schemas can be inferred automatically, but it is usually a good idea to specify schemas explicitly to enable richer views in the Dashboard.
This page serves as an index into the different import methods, with simple examples, caveats and links to the API documentation for more details.
Importer Table.from_* parameters
The importer Table.from_* methods all offer a more or less identical interface outside of the parameters specific to
that import format. The parameters common to each method are described here:
project_name: The project name should describe the model and/or dataset you are working on.
dataset_name: A descriptive name for a group of samples in your project, such as the split of the dataset.
table_name: The revision of your dataset. Defaults to "initial", which is a good value for new Tables; set an explicit name when you want to distinguish multiple tables in the same dataset, or when you have existing version information that is more accurate than "initial".
root_url: Override for the configured project root URL, to save the Table in a different location.
if_exists: What to do if a Table with the same root URL, project, dataset and table name already exists. The default is reuse, which returns the existing Table. The other options are raise, which is useful when you expect that no such Table exists; rename, which adds a suffix like _0000 and creates a new Table; and overwrite, which deletes any existing Table and creates a new one.
schema: Override or refine the schema that the importer would otherwise produce. Available on importers where the auto-detected schema can meaningfully be customized (from_dict, from_torch_dataset, from_pandas, from_csv, from_coco, from_parquet, from_ndjson, from_yolo). Not available on importers that fully determine their own schema (e.g. from_hugging_face_hub, from_image_folder, from_yolo_url, from_yolo_ndjson).
add_weight_column: Whether to add a column for sample weights, only visible in the Dashboard.
weight_column_value: The value to assign to each weight, if a weight column is added.
description: A description for the Table, which will be shown in the DESCRIPTION column in the Dashboard. You can think of this as a commit message.
extra_columns: The schema of any extra columns to add to the Table.
input_tables: URLs to any existing Tables to declare as inputs. These will get arrows to the Table being created in the LINEAGE column in the Dashboard.
table_url: Instead of providing a project_name, dataset_name, table_name and (optionally) a root_url, you can provide the mutually exclusive table_url, which writes the Table to a completely custom location.
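The if_exists options can be easier to compare side by side in code. The following is a minimal sketch of the behavior described above, written in plain Python. The helper resolve_table_name is hypothetical and is not part of the tlc API; the exact suffix numbering and the exception type are assumptions made for illustration.

```python
from typing import Set, Tuple


def resolve_table_name(name: str, existing: Set[str], if_exists: str = "reuse") -> Tuple[str, bool]:
    """Hypothetical helper: return (final_name, create_new) for a requested table name."""
    if name not in existing:
        return name, True  # no conflict: a new table is always created
    if if_exists == "reuse":
        return name, False  # hand back the existing table
    if if_exists == "raise":
        raise ValueError(f"Table {name!r} already exists")
    if if_exists == "overwrite":
        return name, True  # replace the existing table under the same name
    if if_exists == "rename":
        # Assumed numbering: try _0000, _0001, ... until a free name is found.
        suffix = 0
        while f"{name}_{suffix:04d}" in existing:
            suffix += 1
        return f"{name}_{suffix:04d}", True
    raise ValueError(f"Unknown if_exists value: {if_exists!r}")


existing = {"initial", "initial_0000"}
print(resolve_table_name("initial", existing))               # ('initial', False)
print(resolve_table_name("initial", existing, "rename"))     # ('initial_0001', True)
print(resolve_table_name("initial", existing, "overwrite"))  # ('initial', True)
```

With the default reuse, running the same import script twice yields the same Table rather than a duplicate, which is why it is a convenient default for notebooks and repeated runs.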
If your dataset is a Python dictionary of Python lists with column values (often called “struct of arrays”), use
tlc.Table.from_dict(). Specify a
schema to tell 3LC how to interpret the data.
import tlc

data = {"image": ["/path/to/image0.png", "/path/to/image1.png", ...]}

table = tlc.Table.from_dict(
    data=data,
    schema=...,
    project_name="Images From Dict Project",
    dataset_name="train",
)
If your dataset is an iterable of Rows or data in the Sample view (often called “array of structs”), for example a
Torch Dataset, use tlc.Table.from_torch_dataset(). Specify a
schema to tell 3LC how to interpret the data.
import tlc

# `data` is your iterable of rows or samples, for example a Torch Dataset.
table = tlc.Table.from_torch_dataset(
    dataset=data,
    schema=...,
    project_name="Images From Iterable Project",
    dataset_name="train",
)
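The two layouts mentioned above, "struct of arrays" and "array of structs", carry the same information and can be converted into each other. The sketch below shows the conversion in plain Python; the helper names columns_to_rows and rows_to_columns are illustrative and not part of tlc.

```python
from typing import Any, Dict, List


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a struct of arrays (dict of lists) into an array of structs (list of dicts)."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("All columns must have the same number of values")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn an array of structs into a struct of arrays, using the first row's keys."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


columns = {"image": ["a.png", "b.png"], "label": [0, 1]}
rows = columns_to_rows(columns)
print(rows)  # [{'image': 'a.png', 'label': 0}, {'image': 'b.png', 'label': 1}]
print(rows_to_columns(rows) == columns)  # True
```

A dict of lists is the shape from_dict expects, while a list of dicts is one example of the iterable-of-rows shape described for from_torch_dataset.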
Image Folder datasets are structured as image files in directories for each category in the dataset, like what is
defined in the commonly used torchvision.datasets.ImageFolder, and in the following example:
root/
├── dog/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── cat/
    ├── image1.jpg
    ├── image2.jpg
    └── ...
Use the method Table.from_image_folder() to create a
tlc.Table from such an image folder dataset.
import tlc

table = tlc.Table.from_image_folder(
    root="root",
    project_name="Image Folder Dataset",
    dataset_name="train",
)
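To make the image folder convention concrete, here is a minimal sketch of how such a layout maps to (image path, class index) pairs. It sorts class names alphabetically, as torchvision.datasets.ImageFolder does. The function scan_image_folder and the accepted suffixes are assumptions for illustration and do not describe how from_image_folder is implemented.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumption: extend as needed


def scan_image_folder(root: str) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Return (class_names, samples), where samples are (image_path, class_index)."""
    root_path = Path(root)
    # Each immediate subdirectory is one category; its index is its sorted position.
    class_names = sorted(p.name for p in root_path.iterdir() if p.is_dir())
    samples = []
    for index, class_name in enumerate(class_names):
        for image in sorted((root_path / class_name).rglob("*")):
            if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((str(image), index))
    return class_names, samples


# Build a tiny throwaway folder with empty files to demonstrate the mapping.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dog/image1.jpg", "dog/image2.jpg", "cat/image1.jpg"):
        path = Path(tmp) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    class_names, samples = scan_image_folder(tmp)

print(class_names)   # ['cat', 'dog']
print(len(samples))  # 3
```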
For datasets in the COCO format, use tlc.Table.from_coco() and
provide your task:
import tlc

table = tlc.Table.from_coco(
    annotations_file="path/to/annotations.json",
    image_folder="/path/to/images",
    project_name="My COCO Project",
    dataset_name="train",
    task="detect",
)
The resulting tlc.Table has a column named image that references the images,
and a column with ground truth labels whose name, associated Schema and data depend on which task is
provided. See Computer Vision Columns for more details.
For datasets in the YOLO format, use
tlc.Table.from_yolo_url() and provide the location of your
images, the categories and task (defaults to "detect").
import tlc

table = tlc.Table.from_yolo_url(
    images_url="path/to/images",
    categories={0: "cat", 1: "dog"},
    task="detect",
    project_name="My YOLO Project",
    dataset_name="train",
)
The resulting tlc.Table has a column named image that references the images,
and a column with ground truth labels whose name, associated Schema and data depend on which task is
provided. See Computer Vision Columns for more details.
YOLO Dataset YAML file
If your YOLO dataset has split locations, categories and other metadata (such as keypoint shapes, download code etc.)
declared in a YOLO Dataset YAML file, see the Ultralytics YOLO integration for functionality to
create Tables from such a file.
Multiple Locations
We recommend creating one tlc.Table for each split. If the images in one split
are stored in multiple image folders, pass a list of URLs to the images_url parameter.
Alternatively, create a text file with one image URL on each line, and provide the URL to the text file
as the images_url parameter.
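As a rough illustration of the text-file option, the sketch below writes a list file with one image URL per line and reads it back. The helper names, the file name train_images.txt and the image paths are all hypothetical; only the one-URL-per-line layout comes from the description above.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_image_list(image_urls: Iterable[str], list_file: Path) -> None:
    """Write one image URL per line."""
    list_file.write_text("".join(f"{url}\n" for url in image_urls), encoding="utf-8")


def read_image_list(list_file: Path) -> List[str]:
    """Read the URLs back, skipping blank lines."""
    lines = list_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Hypothetical image locations spread over two folders.
urls = ["folder_a/image0.png", "folder_a/image1.png", "folder_b/image2.png"]

with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "train_images.txt"
    write_image_list(urls, list_file)
    loaded = read_image_list(list_file)

print(loaded == urls)  # True
```

The URL of a file written this way is what would be passed as the images_url parameter.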
Use the method tlc.Table.from_hugging_face_hub() to
create a tlc.Table from a dataset available through the Hugging Face
datasets package. The Hugging Face Dataset is downloaded, and the
Features of the Dataset are mapped to a corresponding
tlc.Schema.
import tlc

table = tlc.Table.from_hugging_face_hub(
    path="beans",
    split="train",
    project_name="Beans Project",
    dataset_name="train",
)
If your data is in a CSV (comma-separated values) file, use tlc.Table.from_csv().
import tlc

table = tlc.Table.from_csv(
    csv_file="path/to/file.csv",
    project_name="My CSV Project",
    dataset_name="split",
)
If your data is in a pandas.DataFrame, use tlc.Table.from_pandas().
Provide a schema to indicate what each row contains.
import tlc
import pandas as pd

df = pd.DataFrame(
    data={
        "name": ["Sam Pullweight", "Max Epoch", "Minnie Epoch"],
        "grade": [9, 8, 10],
    }
)

table = tlc.Table.from_pandas(
    df=df,
    schema={"name": tlc.schemas.StringSchema(), "grade": tlc.schemas.Int32Schema()},
    project_name="My Pandas Project",
    dataset_name="train",
)
If your data is in an Apache Parquet file, use tlc.Table.from_parquet().
import tlc

table = tlc.Table.from_parquet(
    parquet_file="path/to/file.parquet",
    project_name="My Parquet Project",
    dataset_name="train",
)
NDJSON (Newline-Delimited JSON) is a format where each line of a file contains a JSON object. For example, the following is a valid NDJSON file:
{"name": "Sam Pullweight", "grade": 9}
{"name": "Max Epoch", "grade": 8}
{"name": "Minnie Epoch", "grade": 10}
To create a Table from such a file, use tlc.Table.from_ndjson().
import tlc

table = tlc.Table.from_ndjson(
    ndjson_file="path/to/file.ndjson",
    project_name="My NDJSON Project",
    dataset_name="split",
)
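If your records currently live in Python, an NDJSON file is straightforward to produce with the standard json module. The sketch below serializes a list of dicts to NDJSON text and parses it back; the helper names dumps_ndjson and loads_ndjson are illustrative and not part of tlc.

```python
import json
from typing import Any, Dict, Iterable, List


def dumps_ndjson(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize records to NDJSON: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def loads_ndjson(text: str) -> List[Dict[str, Any]]:
    """Parse NDJSON text, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = [
    {"name": "Sam Pullweight", "grade": 9},
    {"name": "Max Epoch", "grade": 8},
    {"name": "Minnie Epoch", "grade": 10},
]

text = dumps_ndjson(records)
print(text.splitlines()[0])           # {"name": "Sam Pullweight", "grade": 9}
print(loads_ndjson(text) == records)  # True
```

Writing text to a file gives an input suitable for the ndjson_file parameter.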
YOLO NDJSON is an alternative way to define datasets for Ultralytics YOLO models. The format stores metadata and labels in a single file, with a defined structure.
The following example shows the layout that 3LC expects. The first line is reserved for metadata, and the subsequent lines correspond to images and their labels.
{"task": "detect", "class_names": {"0": "cat", "1": "dog"}, "description": "Cats and Dogs"}
{"file": "image0.png", "width": 1280, "height": 920, "split": "train", "annotations": ["..."]}
{"file": "image1.png", "width": 640, "height": 480, "split": "train", "annotations": ["..."]}
{"file": "image2.png", "width": 1280, "height": 920, "split": "val", "annotations": ["..."]}
"..."
The format of the annotations depends on the task; refer to the Ultralytics Documentation for details.
To create a Table from a YOLO NDJSON file, use tlc.Table.from_yolo_ndjson().
We recommend creating one tlc.Table for each split, in a loop like this:
import tlc

for split in ("train", "val", "test"):
    table = tlc.Table.from_yolo_ndjson(
        ndjson_file="path/to/yolo.ndjson",
        image_folder="path/to/image/folder",
        split=split,
        project_name="My YOLO Project from NDJSON",
        dataset_name=split,
    )
The image_folder parameter only needs to be used when the "file" paths in the NDJSON file are relative to some
directory other than the one containing the NDJSON file.
If the JSON object in the first line of the NDJSON file contains a field "description", it will be used for the
Table unless a description is provided to tlc.Table.from_yolo_ndjson().
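The structure described above (a metadata line first, then one record per image with a "split" field, and a description that falls back to the metadata) can be sketched in plain Python as follows. This is an illustration of the file layout only, not how from_yolo_ndjson is implemented; parse_yolo_ndjson is a hypothetical helper, and the annotations are left empty because their format depends on the task.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


def parse_yolo_ndjson(
    text: str, description: Optional[str] = None
) -> Tuple[Dict[str, Any], Dict[str, List[Dict[str, Any]]], str]:
    """Return (metadata, records grouped by split, effective description)."""
    lines = [line for line in text.splitlines() if line.strip()]
    metadata = json.loads(lines[0])  # the first line is reserved for metadata
    by_split: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for line in lines[1:]:
        record = json.loads(line)
        by_split[record["split"]].append(record)
    # An explicitly provided description wins over the one in the metadata line.
    if description is None:
        description = metadata.get("description", "")
    return metadata, dict(by_split), description


text = "\n".join([
    '{"task": "detect", "class_names": {"0": "cat", "1": "dog"}, "description": "Cats and Dogs"}',
    '{"file": "image0.png", "width": 1280, "height": 920, "split": "train", "annotations": []}',
    '{"file": "image1.png", "width": 640, "height": 480, "split": "train", "annotations": []}',
    '{"file": "image2.png", "width": 1280, "height": 920, "split": "val", "annotations": []}',
])

metadata, by_split, description = parse_yolo_ndjson(text)
print(sorted(by_split))        # ['train', 'val']
print(len(by_split["train"]))  # 2
print(description)             # Cats and Dogs
```

Grouping by split in this way mirrors the recommendation to create one Table per split.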
A tlc.TableWriter is a way of producing a
tlc.Table from rows or batches of Python data. The TableWriter accepts
Python objects (PIL Images, NumPy arrays, dataclasses) and handles serialization and externalization automatically
based on the schema.
import tlc
from PIL import Image

writer = tlc.TableWriter(
    schema={"image": tlc.schemas.ImageSchema(), "label": tlc.schemas.CategoricalLabelSchema(classes=["cat", "dog"])},
    project_name="Image Classification",
    dataset_name="train",
)

# Add one row at a time; PIL Images are serialized according to the schema.
for i in range(5):
    writer.add_row({"image": Image.new("RGB", (100, 100)), "label": i % 2})

table = writer.finalize()
For scalar or batch data, add_batch() writes multiple rows at once:
import tlc

table_writer = tlc.TableWriter(
    schema={"my_float": tlc.schemas.Float32Schema()},
    project_name="Data from Table Writer Project",
    dataset_name="train",
)

table_writer.add_batch({"my_float": [0.0, 1.0, 2.0]})
table_writer.add_batch({"my_float": [3.0, 4.0, 5.0]})

table = table_writer.finalize()
Tip
Pass sample-form values (e.g. PIL.Image objects) for any column with a sample type; TableWriter will
serialize them. If you already have the row-form value on hand (e.g. a URL string pointing to an image file
for an ImageSchema column), you can pass it directly
and it will be stored without re-encoding the data — the URL is still made table-relative, or rewritten to
an alias if one applies:
writer = tlc.TableWriter(
    schema={"image": tlc.schemas.ImageSchema()},
    project_name="Images From URLs",
)

writer.add_row({"image": "/path/to/existing.png"})
Keep each batch uniform — either all sample-form or all row-form per column. That’s the happy path.
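One way to catch mixed batches early is a small pre-check in your own code before calling add_batch(). The sketch below is a hypothetical helper, not part of tlc, and it rests on a simplifying assumption: for an image column, row-form values are URL strings and anything else (for example a PIL.Image) is sample-form.

```python
from typing import Any, Sequence


def column_form(values: Sequence[Any]) -> str:
    """Hypothetical check: classify a column's batch values as 'row', 'sample' or 'empty'.

    Assumption for this sketch: row-form values are strings (URLs);
    every other type counts as sample-form.
    """
    forms = {"row" if isinstance(value, str) else "sample" for value in values}
    if len(forms) > 1:
        raise ValueError("Mixed sample-form and row-form values in one column")
    return forms.pop() if forms else "empty"


print(column_form(["/path/to/a.png", "/path/to/b.png"]))  # row
print(column_form([object(), object()]))                  # sample
```

The string test only makes sense for columns whose sample type is not itself a string, so a real check would need to take the column's schema into account.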