Custom Sample Types

Do you need this page?

If you’re using built-in types — images, arrays, bounding boxes, keypoints, segmentation — you don’t need custom sample types. The built-in schemas handle everything automatically. This page is for defining new data types that aren’t covered by the built-ins.

Dashboard features depend on built-in schemas

Custom sample types let you round-trip any Python object through 3LC Tables, but the Dashboard will only display the raw stored data (URLs, numbers, lists) for columns using custom schemas. Specialized Dashboard features — image rendering, annotation editing, label management, dimensionality reduction — are tied to the built-in schema patterns.

This is perfectly fine for many use cases: an external column will store URLs in the row, and you’ll still see those URLs in the Dashboard. Any metrics you collect on the table still work, so you can identify interesting samples even without a specialized widget. But if you need rich Dashboard interaction for a column, prefer a built-in schema where one exists.

A sample type converts between two forms of data:

Sample form: Python objects used in ML code — NumPy arrays, dataclasses, etc.
Row form: Serializable primitives stored in table rows — numbers, strings, nested lists.

There are two base classes, matching the two storage tracks:

  Inline:    Python object --- to_row() --> row data --- from_row() --> Python object
  External:  Python object --- save(url) --> resource --- load(url) --> Python object

Inline columns store data directly in the row. Subclass SampleType and implement to_row() / from_row().

External columns store data as individual resources (files, S3 objects, etc.) referenced by URL in the row. Subclass ExternalSampleType and implement save() / load().
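
The two tracks can be sketched in plain Python without importing 3LC. The classes below are hypothetical stand-ins, not the real SampleType / ExternalSampleType base classes, and the local file path stands in for a URL. They only illustrate what each pair of methods is responsible for.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Point:
    """Sample form: the object ML code works with."""

    x: float
    y: float


class InlinePoint:
    """Stand-in for an inline sample type: the data itself lives in the row."""

    def to_row(self, sample: Point) -> dict:
        return {"x": sample.x, "y": sample.y}

    def from_row(self, data: dict) -> Point:
        return Point(x=data["x"], y=data["y"])


class ExternalPoint:
    """Stand-in for an external sample type: the row only holds a reference."""

    file_extension = ".json"

    def save(self, sample: Point, url: Path) -> None:
        url.write_text(json.dumps({"x": sample.x, "y": sample.y}))

    def load(self, url: Path) -> Point:
        return Point(**json.loads(url.read_text()))


point = Point(1.0, 2.0)

# Inline track: the row cell is the serialized data.
inline = InlinePoint()
inline_cell = inline.to_row(point)
inline_restored = inline.from_row(inline_cell)

# External track: the row cell is a reference to a file holding the data.
external = ExternalPoint()
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / f"0000000{external.file_extension}"
    external.save(point, target)
    external_cell = str(target)
    external_restored = external.load(Path(external_cell))
```

In both cases the round trip returns an object equal to the original; the difference is only in what ends up in the row.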

External Sample Type

External storage is the right choice when your data naturally lives as individual files — it’s how images, large arrays, and other bulk data work internally. Here’s a complete example: a sample type that stores audio waveforms as WAV files.

Step 1: Define the sample type

Subclass ExternalSampleType and implement save(), load(), and accepts(), plus a schema() classmethod that builds a Schema matching your type:

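import io
from dataclasses import dataclass
from typing import Any

import numpy as np
import tlc

# UrlSchema, used in schema() below, also needs to be in scope.


# Minimal sample-form container, assumed from how AudioWaveform is used below.
@dataclass
class AudioWaveform:
    waveform: np.ndarray
    sample_rate: int


@tlc.sample_types.register_sample_type("wav_audio")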
class WavAudioSampleType(tlc.sample_types.ExternalSampleType):
    """Stores audio waveforms as WAV files, returns ``AudioWaveform`` in sample view."""

    file_extension = ".wav"

    def save(self, sample: AudioWaveform, url: tlc.Url) -> None:
        import soundfile as sf

        buf = io.BytesIO()
        sf.write(buf, sample.waveform, sample.sample_rate, format="WAV")
        url.write_bytes(buf.getvalue())

    def load(self, url: tlc.Url) -> AudioWaveform:
        import soundfile as sf

        data, sr = sf.read(io.BytesIO(url.read_bytes()), dtype="float32")
        return AudioWaveform(waveform=data, sample_rate=int(sr))

    def accepts(self, value: Any) -> bool:
        return isinstance(value, AudioWaveform)

    @classmethod
    def schema(
        cls,
        *,
        display_name: str = "",
        description: str = "",
        writable: bool = True,
        visible: bool = True,
        bulk_data_location: str | None = None,
    ) -> tlc.Schema:
        """Factory for a Schema bound to this sample type."""
        return UrlSchema(
            sample_type="wav_audio",
            display_name=display_name,
            description=description,
            writable=writable,
            default_visible=visible,
            bulk_data_location=bulk_data_location,
        )

The schema() classmethod keeps the sample type and its schema definition together in one class, so callers can write WavAudioSampleType.schema(...) wherever a Schema is expected. The schema’s value class (UrlStringValue here) controls how the Dashboard displays the column; the sample_type entry carries the registered name ("wav_audio" in this example) plus, for parameterized types, any per-instance arguments.

Key class attributes:

Attribute	Default	Description
`file_extension`	`""`	File extension for externally stored data (e.g. `".wav"`, `".npy"`)
`is_leaf`	`True` (from `ExternalSampleType`)	This sample type owns the full conversion — Schema stops recursing at this node
`is_included_in_sample`	`True`	`False` excludes the column from sample view

Override `accepts()`

External sample types should override accepts() to identify sample-form values. The write pipeline calls it on every value in a batch to tell live Python objects (which it externalizes via save()) apart from pre-externalized URL strings (which it passes through unchanged), so a single column can mix both forms in one batch. The audio example uses isinstance(value, AudioWaveform). A subclass that leaves the default in place will silently skip externalization.
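
A rough sketch of this dispatch, written without 3LC. The names externalize_batch and FakeAudioType are hypothetical, and the URL pattern is made up for illustration; the real write pipeline is internal to 3LC and may differ in detail.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class AudioWaveform:
    waveform: list[float]
    sample_rate: int


class FakeAudioType:
    """Hypothetical stand-in for an external sample type."""

    def __init__(self) -> None:
        self.saved: list[str] = []

    def accepts(self, value: Any) -> bool:
        # True only for live sample-form objects, never for URL strings.
        return isinstance(value, AudioWaveform)

    def save(self, sample: AudioWaveform, url: str) -> None:
        self.saved.append(url)  # a real type would write the file here


def externalize_batch(sample_type: FakeAudioType, values: list[Any]) -> list[str]:
    """Return row-form values: save accepted objects, pass everything else through."""
    row_values = []
    for index, value in enumerate(values):
        if sample_type.accepts(value):
            url = f"bulk_data/{index:07d}.wav"  # made-up URL pattern
            sample_type.save(value, url)
            row_values.append(url)
        else:
            row_values.append(value)  # assumed to be a pre-externalized URL
    return row_values


audio_type = FakeAudioType()
batch = [AudioWaveform([0.0, 0.1], 16000), "existing/clip.wav"]
rows = externalize_batch(audio_type, batch)
```

If accepts() returned False for every value, the first item would fall into the pass-through branch and never reach save(), which is the silent skip described above.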

Reference existing files with `source_url()`

When a sample is already backed by a file, override source_url() to return its URL — the externalizer will reference the existing file instead of writing a copy via save(). This is how PILImageSampleType avoids re-encoding images that were loaded from disk:

def source_url(self, sample: PIL.Image.Image) -> str | None:
    if hasattr(sample, "_tlc_url") and sample._tlc_url:
        return sample._tlc_url
    if sample.filename:
        return sample.filename
    return None

Leave the default (returns None) if round-tripping through save() is cheap enough not to matter.
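
The choice between referencing and copying can be sketched as follows. This is an illustration in plain Python under assumed names (externalize, FileBackedType, Clip), not 3LC's actual externalizer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clip:
    data: bytes
    origin: Optional[str] = None  # path the clip was loaded from, if any


class FileBackedType:
    """Hypothetical external sample type that can point at existing files."""

    def __init__(self) -> None:
        self.writes = 0

    def source_url(self, sample: Clip) -> Optional[str]:
        return sample.origin

    def save(self, sample: Clip, url: str) -> None:
        self.writes += 1  # a real type would write sample.data to url


def externalize(sample_type: FileBackedType, sample: Clip, new_url: str) -> str:
    """Prefer an existing file; fall back to writing a new one."""
    existing = sample_type.source_url(sample)
    if existing is not None:
        return existing
    sample_type.save(sample, new_url)
    return new_url


clip_type = FileBackedType()
from_disk = externalize(clip_type, Clip(b"...", origin="data/a.wav"), "bulk/0.wav")
in_memory = externalize(clip_type, Clip(b"..."), "bulk/1.wav")
```

Only the in-memory clip triggers a write; the clip that already has a file on disk is stored as a reference to that file.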

Step 2: Use it

writer = tlc.TableWriter(
    project_name="Audio Project",
    schema={
        "audio": WavAudioSampleType.schema(),
        "label": tlc.schemas.CategoricalLabelSchema(classes=["speech", "music", "noise"]),
    },
    if_exists="rename",
)

t = np.linspace(0, 1, 16000, dtype=np.float32)
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)  # 1 second of A4 (440 Hz)
writer.add_row({"audio": AudioWaveform(waveform=waveform, sample_rate=16000), "label": 0})
table = writer.finalize()

# Row view: URL reference to the WAV file
table.table_rows[0]
# {'audio': '../../bulk_data/.../0000000.wav', 'label': 0, 'weight': 1.0}

# Sample view: AudioWaveform loaded from the WAV file
table[0]["audio"]
# AudioWaveform(waveform=array([...], dtype=float32), sample_rate=16000)

Inline Sample Type

For data that fits directly in the row (dicts, nested lists, scalars), implement to_row() / from_row() instead of save() / load(). This lets you apply an arbitrary transform to the underlying inline data. The example below wraps the row values in a dataclass, but from_row() can return any Python object:

from dataclasses import dataclass
from typing import Any

import tlc


@dataclass
class AudioFeatures:
    """Mel-spectrogram features for an audio clip."""

    mel: list[float]
    sample_rate: int
    duration_ms: float


@tlc.sample_types.register_sample_type("audio_features")
class AudioFeaturesSampleType(tlc.sample_types.SampleType):
    """Converts AudioFeatures to/from serializable dicts."""

    is_leaf = True

    def to_row(self, sample: AudioFeatures) -> dict[str, Any]:
        return {
            "mel": sample.mel,
            "sample_rate": sample.sample_rate,
            "duration_ms": sample.duration_ms,
        }

    def from_row(self, data: Any) -> AudioFeatures:
        return AudioFeatures(
            mel=data["mel"],
            sample_rate=data["sample_rate"],
            duration_ms=data["duration_ms"],
        )

    def accepts(self, value: Any) -> bool:
        return isinstance(value, AudioFeatures)

    @classmethod
    def schema(cls, display_name: str = "Audio Features") -> tlc.Schema:
        """Return a Schema matching the to_row() output format."""
        return tlc.Schema(
            display_name=display_name,
            sample_type="audio_features",
            values={
                "mel": tlc.schemas.Float32Schema(shape=(-1,)),
                "sample_rate": tlc.schemas.Int32Schema(),
                "duration_ms": tlc.schemas.Float64Schema(),
            },
        )

The important contract is that the schema structure must match the to_row() output. Here, the schema() classmethod is defined on the sample type itself — this is just a convenience to keep the schema and conversion logic centralized, not a requirement.
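
One way to guard this contract in a unit test is to compare the keys that to_row() emits with the keys declared in the schema. The sketch below uses a plain dict as a stand-in for the values mapping passed to tlc.Schema. The helper check_row_matches is hypothetical and not part of 3LC.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class AudioFeatures:
    mel: list
    sample_rate: int
    duration_ms: float


def to_row(sample: AudioFeatures) -> dict[str, Any]:
    return {
        "mel": sample.mel,
        "sample_rate": sample.sample_rate,
        "duration_ms": sample.duration_ms,
    }


# Stand-in for the schema's `values` mapping: column name -> expected Python type.
SCHEMA_VALUES = {"mel": list, "sample_rate": int, "duration_ms": float}


def check_row_matches(row: dict[str, Any], schema_values: dict[str, type]) -> list[str]:
    """Return a list of mismatches between a row dict and the declared schema."""
    problems = []
    for key in sorted(schema_values.keys() - row.keys()):
        problems.append(f"missing from row: {key}")
    for key in sorted(row.keys() - schema_values.keys()):
        problems.append(f"not in schema: {key}")
    for key in sorted(row.keys() & schema_values.keys()):
        if not isinstance(row[key], schema_values[key]):
            problems.append(f"wrong type for {key}: {type(row[key]).__name__}")
    return problems


row = to_row(AudioFeatures(mel=[0.1, 0.2], sample_rate=16000, duration_ms=1000.0))
ok = check_row_matches(row, SCHEMA_VALUES)
broken = check_row_matches({"mel": [0.1], "rate": 16000}, SCHEMA_VALUES)
```

Here ok is empty, while broken reports the two missing keys and the unexpected "rate" key, the kind of drift that appears when to_row() is edited without updating schema().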

Built-in Sample Types

3LC ships with sample types for common data types. Most have a corresponding convenience schema that configures everything in one step.

Sample type name	Python type	Convenience Schema	Storage
`"pil_image"`	`PIL.Image.Image`	`ImageSchema`	external
`"numpy_array"`	`numpy.ndarray`	any primitive schema with `sample_type="numpy_array"`	inline
`"external_numpy_array"`	`numpy.ndarray`	`ExternalNumpyArraySchema`	external
`"torch_tensor"`	`torch.Tensor`	any primitive schema with `sample_type="torch_tensor"`	inline
`"external_torch_tensor"`	`torch.Tensor`	`ExternalTorchTensorSchema`	external
`"bounding_boxes_2d"`	`BoundingBoxes2D`	`BoundingBoxes2D.schema()`	inline
`"bounding_boxes_3d"`	`BoundingBoxes3D`	`BoundingBoxes3D.schema()`	inline
`"keypoints_2d"`	`Keypoints2D`	`Keypoints2D.schema()`	inline
`"oriented_bounding_boxes_2d"`	`OrientedBoundingBoxes2D`	`OrientedBoundingBoxes2D.schema()`	inline
`"oriented_bounding_boxes_3d"`	`OrientedBoundingBoxes3D`	`OrientedBoundingBoxes3D.schema()`	inline
`"segmentation_polygons"`	`SegmentationPolygons`	`SegmentationPolygons.schema()`	inline
`"segmentation_masks"`	`SegmentationMasks`	`SegmentationMasks.schema()`	inline
`"semantic_segmentation"`	`SemanticSegmentation`	`SemanticSegmentationRleSchema`	inline (RLE)
`"path"`	`str`	—	inline
`"hidden"`	(excluded from sample view)	—	—

You can list all registered sample types at any time:

import tlc
tlc.sample_types.get_sample_types()
# ['bounding_boxes_2d', 'hidden', 'numpy_array', 'pil_image', ...]

Plugin Registration

Third-party packages can register sample types so they are available automatically. The recommended approach is to use Python entry points, which lets 3LC discover sample types without requiring any explicit imports:

# pyproject.toml
[project.entry-points."tlc.sample_types"]
my_custom_type = "my_package.sample_types:MyCustomType"

The entry point name (my_custom_type) becomes the registered name, and the class is loaded lazily — only when first needed. This is the same mechanism used by 3LC’s own built-in sample types.

For a full, installable plugin package — project layout, pyproject.toml, README, and the sample type class — see 3lc-examples/examples/audio-sample-type.

Alternatively, sample types can be registered at runtime using the @tlc.sample_types.register_sample_type decorator. This is convenient during development, but requires the module to be imported before the sample type is used:

# my_plugin/sample_types.py
import tlc

@tlc.sample_types.register_sample_type("my_custom_type")
class MyCustomType(tlc.sample_types.ExternalSampleType):
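    is_leaf = True  # already the default for ExternalSampleType; shown for clarity
    file_extension = ".dat"

    def save(self, sample, url):
        # serialize() is a placeholder for your own encoder (sample -> bytes)
        url.write_bytes(serialize(sample))

    def load(self, url):
        # deserialize() is the matching placeholder decoder (bytes -> sample)
        return deserialize(url.read_bytes())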