Data Layout

The 3LC ecosystem uses a hierarchical structure to organize the data it creates, storing objects like Runs and Tables in a folder-based layout, grouped into Projects and Datasets. The tlc Python package and 3LC Dashboard provide ways to create, modify and retrieve these objects. When creating an object, we therefore specify a project_name, together with a run_name for Runs, or a dataset_name and table_name for Tables.

projects
├── index.3lc.json
└── <project_name>
    ├── index.3lc.json
    ├── default_aliases.3lc.yaml
    ├── datasets
    │   ├── <dataset_1>
    │   │   ├── tables
    │   │   │   ├── <table_1>
    │   │   │   │   ├── object.3lc.json
    │   │   │   │   └── row_cache.parquet
    │   │   │   └── <table_2>
    │   │   │       ...
    │   │   └── bulk_data
    │   └── <dataset_2>
    │       ├── tables
    │       └── bulk_data
    │       ...
    └── runs
        ├── <run_1>
        │   ├── object.3lc.json
        │   └── metric_0000
        │       └── object.3lc.json
        └── <run_2>
            ...
    ...
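
To make the mapping from names to folders concrete, the layout above can be sketched with plain pathlib. This is only an illustration of the tree: the helper names table_dir and run_dir, the root path and the example names are hypothetical and are not part of the tlc package.

```python
from pathlib import Path


def table_dir(root: Path, project_name: str, dataset_name: str, table_name: str) -> Path:
    """Folder holding one Table, following the tree shown above."""
    return root / project_name / "datasets" / dataset_name / "tables" / table_name


def run_dir(root: Path, project_name: str, run_name: str) -> Path:
    """Folder holding one Run, following the tree shown above."""
    return root / project_name / "runs" / run_name


# Hypothetical root and names, used only for illustration.
root = Path("projects")
print(table_dir(root, "my_project", "my_dataset", "initial").as_posix())
# prints projects/my_project/datasets/my_dataset/tables/initial
print(run_dir(root, "my_project", "my_run").as_posix())
# prints projects/my_project/runs/my_run
```

In practice the tlc package and the Dashboard build these locations for you; the sketch only shows where the names end up in the folder structure.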

The root of the 3LC project folder structure is the projects directory. Each project is a subdirectory of projects and contains all the data and metadata associated with a specific machine learning project.

Within each project, there can be any number of datasets. Each dataset holds some number of tables, each of which corresponds to a revision of that dataset.

A run is the 3LC object used to store hyperparameters, sample metrics and any other data related to a process that produces some kind of output, often a training run.

The object.3lc.json files

3LC objects are always represented as folder locations. Internally, the 3LC API uses a file called object.3lc.json to store the serialized object. This is similar to the index.html file used to represent a folder in a web server. The object.3lc.json file is always located in the folder of the object it represents, so the location of an object is the path to the folder containing its object.3lc.json file. The file is automatically created and managed by the 3LC API, and users should not need to interact with it directly.
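
The rule that an object's location is the folder containing its object.3lc.json can be illustrated with a small read-only sketch. It uses only the standard library; the function names and the temporary example tree are hypothetical, and the sketch never parses or modifies the file contents.

```python
import tempfile
from pathlib import Path

OBJECT_FILE = "object.3lc.json"


def object_location(object_file: Path) -> Path:
    """Return the object's location: the folder containing object.3lc.json."""
    if object_file.name != OBJECT_FILE:
        raise ValueError(f"not a serialized 3LC object file: {object_file}")
    return object_file.parent


def find_objects(root: Path) -> list[Path]:
    """List every folder under root that contains an object.3lc.json file."""
    return sorted(object_location(p) for p in root.rglob(OBJECT_FILE))


# Build a throwaway tree with one Table and one Run (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "projects"
    table = root / "my_project" / "datasets" / "my_dataset" / "tables" / "initial"
    run = root / "my_project" / "runs" / "my_run"
    for folder in (table, run):
        folder.mkdir(parents=True)
        (folder / OBJECT_FILE).write_text("{}")  # placeholder content only
    found = [p.relative_to(root).as_posix() for p in find_objects(root)]

print(found)
```

Note that a full recursive walk like this is exactly the kind of work the index files are meant to spare the Object Service; it is shown here only to make the folder convention explicit.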

The index.3lc.json files

The primary job of the Object Service, outside of communicating with the Dashboard, is indexing 3LC objects in the project locations it is configured to scan. To avoid recursing through and opening every file in every project to pick up changes, any 3LC code that produces or edits a 3LC Object will touch the index.3lc.json file for that project. When the Object Service encounters a changed index.3lc.json, it reindexes that location and reflects the changes in the data it sends to the Dashboard. To learn more about the indexing system, see the in-depth documentation of the Object Service. The index.3lc.json files are automatically created and managed by the 3LC API, and users should not need to interact with them directly.
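
The general idea of touching a marker file so that a scanner can detect changes cheaply can be sketched as follows. This is a generic illustration based on file modification times, not the Object Service's actual implementation; the function names and the last_seen dictionary are assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path

INDEX_FILE = "index.3lc.json"


def touch_index(project_dir: Path) -> None:
    """Writer side: create the index file or update its modification time."""
    (project_dir / INDEX_FILE).touch()


def changed_projects(root: Path, last_seen: dict[str, int]) -> list[str]:
    """Scanner side: names of projects whose index file changed since the last scan.

    last_seen maps project name to the last observed mtime in nanoseconds
    and is updated in place.
    """
    changed = []
    for index in sorted(root.glob(f"*/{INDEX_FILE}")):
        mtime = index.stat().st_mtime_ns
        name = index.parent.name
        if last_seen.get(name) != mtime:
            last_seen[name] = mtime
            changed.append(name)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("project_a", "project_b"):
        (root / name).mkdir()
        touch_index(root / name)

    seen: dict[str, int] = {}
    first = changed_projects(root, seen)   # every project is new on the first scan
    second = changed_projects(root, seen)  # nothing was touched in between

    # Simulate a later edit in project_b by moving its mtime forward explicitly,
    # which keeps the example deterministic on coarse-resolution filesystems.
    index_b = root / "project_b" / INDEX_FILE
    later = index_b.stat().st_mtime + 10
    os.utime(index_b, (later, later))
    third = changed_projects(root, seen)

print(first, second, third)
```

The scanner only stats one file per project, so the cost of detecting changes stays independent of how many objects each project contains.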

bulk_data directories

While 3LC aims to avoid copying your data, in some cases the data must be serialized because it exists only in memory and is not backed by a file on persistent storage. For example, when recording predicted semantic segmentation masks, the predictions are stored on disk in PNG files. In these cases, 3LC will store this data under bulk_data.

row_cache.parquet

3LC supports caching the row data of any given Table, which is useful when it is expensive to repeatedly produce the data. An example is when you have made many revisions to your data, so that many tables reference each other and must call into each other to build the current Table. Within a Table's folder, you will therefore sometimes see a file row_cache.parquet, which can be loaded into memory quickly.

default_aliases.3lc.yaml

Any local configuration of 3LC can define a set of aliases. In addition, it is possible to define project default aliases, which apply as a fallback for anyone scanning the project who has not defined the alias themselves. This is useful in settings where multiple people work on a project, and the referenced data (such as images) are available in a shared location referenced by the default alias. Read more about default aliases in the document on Sharing, and see the Deployment Examples for concrete examples.
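
The fallback behaviour described above, where a locally defined alias takes precedence and the project default applies otherwise, can be sketched with two dictionaries. Everything in this example is an assumption made for illustration: the <DATA> token syntax, the paths, and the expand function are hypothetical, and the actual format of default_aliases.3lc.yaml is not shown here.

```python
def expand(path: str, local_aliases: dict[str, str], project_defaults: dict[str, str]) -> str:
    """Replace alias tokens in path, preferring local aliases over project defaults."""
    merged = {**project_defaults, **local_aliases}  # later entries win, so local overrides
    for token, target in merged.items():
        path = path.replace(token, target)
    return path


# Hypothetical project-wide default pointing at a shared location.
project_defaults = {"<DATA>": "/mnt/shared/data"}

# One user defines no aliases and falls back to the project default.
shared = expand("<DATA>/images/0001.png", {}, project_defaults)

# Another user defines the alias locally, which overrides the default.
local = expand("<DATA>/images/0001.png", {"<DATA>": "/home/user/data"}, project_defaults)

print(shared)  # prints /mnt/shared/data/images/0001.png
print(local)   # prints /home/user/data/images/0001.png
```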