Custom Sample Types¶
Do you need this page?
If you’re using built-in types — images, arrays, bounding boxes, keypoints, segmentation — you don’t need custom sample types. The built-in schemas handle everything automatically. This page is for defining new data types that aren’t covered by the built-ins.
Dashboard features depend on built-in schemas
Custom sample types let you round-trip any Python object through 3LC Tables, but the Dashboard will only display the raw stored data (URLs, numbers, lists) for columns using custom schemas. Specialized Dashboard features — image rendering, annotation editing, label management, dimensionality reduction — are tied to the built-in schema patterns.
This is perfectly fine for many use cases: an external column will store URLs in the row, and you’ll still see those URLs in the Dashboard. Any metrics you collect on the table still work, so you can identify interesting samples even without a specialized widget. But if you need rich Dashboard interaction for a column, prefer a built-in schema where one exists.
A sample type converts between two forms of data:
Sample form: Python objects used in ML code — NumPy arrays, dataclasses, etc.
Row form: Serializable primitives stored in table rows — numbers, strings, nested lists.
There are two base classes, matching the two storage tracks:
Inline: Python object --- to_row() --> row data --- from_row() --> Python object
External: Python object --- save(url) --> resource --- load(url) --> Python object
Inline columns store data directly in the row. Subclass SampleType and implement to_row() / from_row().
External columns store data as individual resources (files, S3 objects, etc.) referenced by URL in the row. Subclass ExternalSampleType and implement save() / load().
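The two tracks can be sketched without 3LC at all. The snippet below is only an illustration of the idea, using plain functions and a hypothetical `Point` dataclass rather than the real `SampleType` / `ExternalSampleType` base classes: the inline track turns the object into primitives that sit in the row, while the external track writes a resource and leaves only a reference in the row.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Point:
    """Hypothetical sample-form object, used only for this illustration."""
    x: float
    y: float


# Inline track: the object becomes plain primitives stored in the row itself.
def to_row(sample: Point) -> dict:
    return asdict(sample)


def from_row(data: dict) -> Point:
    return Point(**data)


# External track: the object becomes a resource; the row keeps its location.
def save(sample: Point, path: Path) -> None:
    path.write_text(json.dumps(asdict(sample)))


def load(path: Path) -> Point:
    return Point(**json.loads(path.read_text()))


p = Point(1.0, 2.0)

inline_row = {"point": to_row(p)}
inline_back = from_row(inline_row["point"])

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "0000000.json"
    save(p, target)
    external_row = {"point": str(target)}  # only a reference lives in the row
    external_back = load(Path(external_row["point"]))
```

In both cases the value read back equals the original object; what differs is where the bytes live.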
External Sample Type¶
External storage is the right choice when your data naturally lives as individual files — it’s how images, large arrays, and other bulk data work internally. Here’s a complete example: a sample type that stores audio waveforms as WAV files.
Step 1: Define the sample type¶
Subclass ExternalSampleType and implement save(), load(), and accepts(), plus a schema() classmethod that builds a Schema matching your type:
import io
from dataclasses import dataclass
from typing import Any

import numpy as np
import tlc

# UrlSchema is assumed to be in scope (its import is omitted here).


@dataclass
class AudioWaveform:
    """Sample-form audio clip (fields inferred from how the example uses it)."""

    waveform: np.ndarray
    sample_rate: int


@tlc.sample_types.register_sample_type("wav_audio")
class WavAudioSampleType(tlc.sample_types.ExternalSampleType):
    """Stores audio waveforms as WAV files, returns ``AudioWaveform`` in sample view."""

    file_extension = ".wav"

    def save(self, sample: AudioWaveform, url: tlc.Url) -> None:
        import soundfile as sf

        # Encode to an in-memory buffer, then write the bytes to the target URL.
        buf = io.BytesIO()
        sf.write(buf, sample.waveform, sample.sample_rate, format="WAV")
        url.write_bytes(buf.getvalue())

    def load(self, url: tlc.Url) -> AudioWaveform:
        import soundfile as sf

        data, sr = sf.read(io.BytesIO(url.read_bytes()), dtype="float32")
        return AudioWaveform(waveform=data, sample_rate=int(sr))

    def accepts(self, value: Any) -> bool:
        # Only live AudioWaveform objects are externalized; URL strings pass through.
        return isinstance(value, AudioWaveform)

    @classmethod
    def schema(
        cls,
        *,
        display_name: str = "",
        description: str = "",
        writable: bool = True,
        visible: bool = True,
        bulk_data_location: str | None = None,
    ) -> tlc.Schema:
        """Factory for a Schema bound to this sample type."""
        return UrlSchema(
            sample_type="wav_audio",
            display_name=display_name,
            description=description,
            writable=writable,
            default_visible=visible,
            bulk_data_location=bulk_data_location,
        )
The schema() classmethod keeps the sample type and its schema definition together in one class, and callers use
MyType.schema(...) wherever a Schema is expected. The schema’s value class
(UrlStringValue here) controls how the Dashboard displays the column; the
sample_type dict carries the registered name plus any per-instance arguments.
Key class attributes:
| Attribute | Default | Description |
|---|---|---|
| file_extension | | File extension for externally stored data |
| is_leaf | | This sample type owns the full conversion; Schema stops recursing at this node |
Override accepts()¶
External sample types should override accepts() to identify sample-form values. The write pipeline calls it on every value in a batch to tell live Python objects (to externalize via save()) apart from pre-externalized URL strings (passed through unchanged). This lets a single column mix both forms in one batch. The audio example uses isinstance(value, AudioWaveform); a subclass that leaves the default will silently skip externalization.
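The dispatch described above can be sketched in plain Python. This is not 3LC's write pipeline: `externalize_batch`, the `Clip` class, and the URL pattern are hypothetical names invented for the illustration. It only shows how an `accepts()` check lets one batch mix live objects and existing URL strings.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Clip:
    """Hypothetical sample-form object (stands in for AudioWaveform)."""
    samples: list[float]
    sample_rate: int


def accepts(value: Any) -> bool:
    # Same idea as the override in the audio example: only live objects match.
    return isinstance(value, Clip)


def externalize_batch(values: list[Any], save: Callable[[Any, str], None]) -> list[Any]:
    """Hypothetical helper: replace live objects by URLs, keep everything else."""
    out = []
    for index, value in enumerate(values):
        if accepts(value):
            url = f"bulk_data/{index:07d}.wav"  # made-up naming scheme
            save(value, url)                    # live object: write it out
            out.append(url)
        else:
            out.append(value)                   # already a URL: pass through
    return out


written: dict[str, Clip] = {}
batch = [Clip([0.0, 0.1], 16000), "existing/clip.wav"]
result = externalize_batch(batch, save=lambda clip, url: written.update({url: clip}))
```

If `accepts()` always returned False, every value would take the pass-through branch and nothing would be saved, which is the silent failure mode the paragraph above warns about.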
Reference existing files with source_url()¶
When a sample is already backed by a file, override source_url() to return its URL. The externalizer will then reference the existing file instead of writing a copy via save(). This is how PILImageSampleType avoids re-encoding images that were loaded from disk.
Leave the default (returns None) if round-tripping through save() is cheap enough not to matter.
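As a rough illustration of that decision, here is a standalone sketch. The `Picture` class, the `externalize` helper, and the exact `source_url` signature are assumptions made for this example; the real method signature and the PILImageSampleType implementation are not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Picture:
    """Hypothetical sample that may remember the file it was loaded from."""
    pixels: bytes
    origin: Optional[str] = None


def source_url(sample: Picture) -> Optional[str]:
    # Return the backing file if there is one; None means "no existing file".
    return sample.origin


def externalize(sample: Picture, fresh_url: str, store: dict) -> str:
    """Hypothetical externalizer: reuse an existing file, else save a copy."""
    existing = source_url(sample)
    if existing is not None:
        return existing                 # reference the original, write nothing
    store[fresh_url] = sample.pixels    # stands in for save(sample, url)
    return fresh_url


store: dict = {}
from_disk = externalize(Picture(b"\x89PNG", origin="images/cat.png"), "bulk_data/0000000.png", store)
in_memory = externalize(Picture(b"\x89PNG"), "bulk_data/0000001.png", store)
```

Only the in-memory picture ends up in `store`; the one loaded from disk keeps pointing at its original file.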
Step 2: Use it¶
writer = tlc.TableWriter(
    project_name="Audio Project",
    schema={
        "audio": WavAudioSampleType.schema(),
        "label": tlc.schemas.CategoricalLabelSchema(classes=["speech", "music", "noise"]),
    },
    if_exists="rename",
)
t = np.linspace(0, 1, 16000, dtype=np.float32)
waveform = 0.5 * np.sin(2 * np.pi * 440 * t) # 1 second of A4 (440 Hz)
writer.add_row({"audio": AudioWaveform(waveform=waveform, sample_rate=16000), "label": 0})
table = writer.finalize()
# Row view: URL reference to the WAV file
table.table_rows[0]
# {'audio': '../../bulk_data/.../0000000.wav', 'label': 0, 'weight': 1.0}
# Sample view: AudioWaveform loaded from the WAV file
table[0]["audio"]
# AudioWaveform(waveform=array([...], dtype=float32), sample_rate=16000)
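To experiment with the encode-to-bytes / decode-from-bytes pattern behind save() and load() without installing soundfile or 3LC, the following sketch swaps in Python's standard-library `wave` module. It is a different encoder from the one in Step 1 (16-bit PCM, mono only), and `encode_wav` / `decode_wav` are hypothetical helper names.

```python
import io
import struct
import wave


def encode_wav(samples: list[float], sample_rate: int) -> bytes:
    """Encode mono float samples in [-1, 1] as 16-bit PCM WAV bytes."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    pcm = struct.pack(f"<{len(ints)}h", *ints)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def decode_wav(data: bytes) -> tuple[list[float], int]:
    """Decode 16-bit mono WAV bytes back to floats and a sample rate."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    ints = struct.unpack(f"<{len(frames) // 2}h", frames)
    return [i / 32767 for i in ints], sample_rate


payload = encode_wav([0.0, 0.5, -1.0], 16000)
decoded, rate = decode_wav(payload)
```

The round trip is lossy only by the 16-bit quantization step; in a sample type, `payload` is what save() would write to the URL and what load() would read back.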
Inline Sample Type¶
For data that fits directly in the row (dicts, nested lists, scalars), implement to_row() / from_row() instead of save() / load(). The row still holds plain inline data, and from_row() applies whatever transform you want on the way out. The example below wraps the row values in a dataclass, but from_row() can return any Python object:
from dataclasses import dataclass
from typing import Any

import tlc


@dataclass
class AudioFeatures:
    """Mel-spectrogram features for an audio clip."""

    mel: list[float]
    sample_rate: int
    duration_ms: float


@tlc.sample_types.register_sample_type("audio_features")
class AudioFeaturesSampleType(tlc.sample_types.SampleType):
    """Converts AudioFeatures to/from serializable dicts."""

    is_leaf = True

    def to_row(self, sample: AudioFeatures) -> dict[str, Any]:
        return {
            "mel": sample.mel,
            "sample_rate": sample.sample_rate,
            "duration_ms": sample.duration_ms,
        }

    def from_row(self, data: Any) -> AudioFeatures:
        return AudioFeatures(
            mel=data["mel"],
            sample_rate=data["sample_rate"],
            duration_ms=data["duration_ms"],
        )

    def accepts(self, value: Any) -> bool:
        return isinstance(value, AudioFeatures)

    @classmethod
    def schema(cls, display_name: str = "Audio Features") -> tlc.Schema:
        """Return a Schema matching the to_row() output format."""
        return tlc.Schema(
            display_name=display_name,
            sample_type="audio_features",
            values={
                "mel": tlc.schemas.Float32Schema(shape=(-1,)),
                "sample_rate": tlc.schemas.Int32Schema(),
                "duration_ms": tlc.schemas.Float64Schema(),
            },
        )
The important contract is that the schema structure must match the to_row() output. Here, the schema() classmethod
is defined on the sample type itself — this is just a convenience to keep the schema and conversion logic centralized,
not a requirement.
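One cheap way to guard that contract is a small test that compares the keys produced by to_row() with the keys the schema declares, and checks that a value survives the round trip. The sketch below is standalone and does not import 3LC; `SCHEMA_KEYS` is a hand-written stand-in for the keys under `values` in the schema, and the conversion functions are simplified copies of the methods in the example.

```python
from dataclasses import asdict, dataclass


@dataclass
class AudioFeatures:
    mel: list[float]
    sample_rate: int
    duration_ms: float


def to_row(sample: AudioFeatures) -> dict:
    return asdict(sample)


def from_row(data: dict) -> AudioFeatures:
    return AudioFeatures(**data)


# Stand-in for the keys declared under ``values`` in the schema.
SCHEMA_KEYS = {"mel", "sample_rate", "duration_ms"}

sample = AudioFeatures(mel=[0.1, 0.2], sample_rate=16000, duration_ms=12.5)
row = to_row(sample)

keys_match = set(row) == SCHEMA_KEYS     # structure agrees with the schema
round_trips = from_row(row) == sample    # nothing is lost on the way back
```

A check like this fails as soon as a field is added to the dataclass but forgotten in the schema, or the other way around.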
Built-in Sample Types¶
3LC ships with sample types for common data types. Most have a corresponding convenience schema that configures everything in one step.
| Sample type name | Python type | Convenience Schema | Storage |
|---|---|---|---|
| | | | external |
| | | any primitive schema with | inline |
| | | | external |
| | | any primitive schema with | inline |
| | | | external |
| | | | inline |
| | | | inline |
| | | | inline |
| | | | inline |
| | | | inline |
| | | | inline |
| | | | inline |
| | | - | inline |
| | (excluded from sample view) | - | - |
You can list all registered sample types at any time:
import tlc
tlc.sample_types.get_sample_types()
# ['bounding_boxes_2d', 'hidden', 'numpy_array', 'pil_image', ...]
Plugin Registration¶
Third-party packages can register sample types so they are available automatically. The recommended approach is to use Python entry points, which lets 3LC discover sample types without requiring any explicit imports:
# pyproject.toml
[project.entry-points."tlc.sample_types"]
my_custom_type = "my_package.sample_types:MyCustomType"
The entry point name (my_custom_type) becomes the registered name, and the class is loaded lazily — only when
first needed. This is the same mechanism used by 3LC’s own built-in sample types.
For a full, installable plugin package — project layout, pyproject.toml, README, and the sample type class — see
3lc-examples/examples/audio-sample-type.
Alternatively, sample types can be registered at runtime using the @tlc.sample_types.register_sample_type decorator. This is
convenient during development, but requires the module to be imported before the sample type is used:
# my_plugin/sample_types.py
import tlc


@tlc.sample_types.register_sample_type("my_custom_type")
class MyCustomType(tlc.sample_types.ExternalSampleType):
    is_leaf = True
    file_extension = ".dat"

    def save(self, sample, url):
        # serialize() stands for your own encoder returning bytes.
        url.write_bytes(serialize(sample))

    def load(self, url):
        # deserialize() stands for the matching decoder.
        return deserialize(url.read_bytes())