3LC Project Structure

Overview

The 3LC ecosystem uses a simple hierarchical structure to organize the data used in machine learning experiments, storing objects like Runs, Tables, and associated bulk data in a folder-based layout. The tlc Python package provides ways to create, modify, and retrieve these objects.

Project Folder Structure

The root of the 3LC project folder structure is the projects directory. Each project is a subdirectory of projects and contains all the data and metadata associated with a specific machine learning project. The data is organized in a specific folder structure for clarity, ease of access, and ease of sharing. The structure is as follows:

projects
└── <project_name>
    ├── runs
    │   ├── <run_1>
    │   │   ├── object.3lc.json
    │   │   └── metric_0000
    │   │       └── object.3lc.json
    │   └── <run_2>
    │       ...
    └── datasets
        ├── <dataset_1>
        │   ├── tables
        │   │   ├── <table_1>
        │   │   │   ├── object.3lc.json
        │   │   │   └── data.parquet
        │   │   └── <table_2>
        │   │       ...
        │   └── bulk_data
        └── <dataset_2>
            ├── tables
            └── bulk_data
            ...
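The layout above maps directly onto path construction. The following is a minimal sketch using only the Python standard library; the helper names run_dir and table_dir are hypothetical and are not part of the tlc package, which provides its own URL helpers.

```python
from pathlib import PurePosixPath

# File name of the serialized object, as shown in the folder tree above.
OBJECT_FILE = "object.3lc.json"


def run_dir(root: str, project_name: str, run_name: str) -> PurePosixPath:
    """Folder of a Run: <root>/<project_name>/runs/<run_name>."""
    return PurePosixPath(root) / project_name / "runs" / run_name


def table_dir(root: str, project_name: str, dataset_name: str, table_name: str) -> PurePosixPath:
    """Folder of a Table: <root>/<project_name>/datasets/<dataset_name>/tables/<table_name>."""
    return PurePosixPath(root) / project_name / "datasets" / dataset_name / "tables" / table_name


if __name__ == "__main__":
    # Hypothetical project, dataset, table, and run names.
    print(run_dir("projects", "my-project", "run-1") / OBJECT_FILE)
    print(table_dir("projects", "my-project", "my-dataset", "train"))
```

Runs sit directly under the project, while tables are nested one level deeper under a dataset.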
The object.3lc.json File

URLs to 3LC objects are always represented as folder paths. Internally, the 3LC API uses a file called object.3lc.json to store the serialized object, much as a web server uses an index.html file to represent a folder. The object.3lc.json file is always located in the folder of the object it represents, and the URL to the object is the path to that folder. The file is created and managed automatically by the 3LC API, so users should not need to interact with it directly.
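To illustrate the folder-as-URL convention, here is a minimal sketch that resolves an object folder to its serialized file, using the file name shown in the folder tree above. It uses only the standard library; load_object is a hypothetical helper, and the JSON payload is a placeholder rather than the real 3LC schema.

```python
import json
import tempfile
from pathlib import Path

OBJECT_FILE = "object.3lc.json"


def load_object(object_url: Path) -> dict:
    """Load the serialized object stored inside the folder that object_url names."""
    with open(object_url / OBJECT_FILE, encoding="utf-8") as f:
        return json.load(f)


def demo() -> dict:
    """Write a placeholder object file into a temporary folder and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        table_url = Path(tmp) / "my-table"  # the object URL is the folder itself
        table_url.mkdir()
        # Placeholder content; the real file layout is defined by 3LC.
        payload = {"example_key": "example_value"}
        (table_url / OBJECT_FILE).write_text(json.dumps(payload), encoding="utf-8")
        return load_object(table_url)
```

In practice the tlc package handles this lookup, so code like this is only useful for understanding what lives on disk.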

In a given tlc instance, multiple project locations can be indexed but only one will be considered the primary project root, which is where new objects will be created by default. The indexing.project-scan-urls configuration variable allows the indexing of multiple project roots, while the indexing.project-root-url configuration variable sets the primary project root location (see default file locations for default values).

Transient Attributes

In addition to normal attributes, 3LC Objects can have “transient” attributes, which are not stored in the object itself. Instead, these are derived from the URL of the object. For example, Table has the transient properties table_name, dataset_name, and project_name, which are derived directly from the location of the table within the project folder.
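The derivation of these names from a table's location can be sketched as follows. This is an illustration of the idea, assuming the standard folder layout; parse_table_url and TableLocation are hypothetical names, not part of the tlc package.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class TableLocation(NamedTuple):
    project_name: str
    dataset_name: str
    table_name: str


def parse_table_url(url: str) -> TableLocation:
    """Derive names from .../<project_name>/datasets/<dataset_name>/tables/<table_name>."""
    parts = PurePosixPath(url).parts
    # The fixed folder names "datasets" and "tables" anchor the three variable parts.
    if len(parts) < 5 or parts[-4] != "datasets" or parts[-2] != "tables":
        raise ValueError(f"not a table URL in the standard layout: {url}")
    return TableLocation(
        project_name=parts[-5],
        dataset_name=parts[-3],
        table_name=parts[-1],
    )


if __name__ == "__main__":
    # Hypothetical location of a table inside a project folder.
    print(parse_table_url("projects/my-project/datasets/my-dataset/tables/train"))
```

Because the names come from the location, moving an object to a different folder changes them without any change to the stored object.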

Working with Object URLs

The tlc Python package provides a standard set of method arguments used for creating or retrieving objects by URLs. These are summarized in the table below:

Parameter            Description
-------------------  -----------
table_name/run_name  The name of the object, corresponding to the last part of the URL.
dataset_name         The dataset name to use. Defaults to default-dataset.
project_name         The project name to use. Defaults to default-project.
root_url             The project root URL to use. Defaults to the PROJECT_ROOT_URL configuration variable.
if_exists            How to handle the case where the object already exists. Typical values are “overwrite”, “reuse”, “rename”, and “raise”.
table_url/run_url    A fully-qualified custom URL to the object, disregarding the project folder structure.
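One way to picture the if_exists options is the sketch below. The behavior of each option is inferred from its name and is an assumption, as is the numeric suffix format used for "rename"; resolve_if_exists is a hypothetical helper, not the tlc implementation.

```python
from typing import Set, Tuple


def resolve_if_exists(name: str, existing: Set[str], if_exists: str) -> Tuple[str, bool]:
    """Return (final_name, create) for a requested object name.

    create is False when the already existing object should be returned as-is.
    """
    if name not in existing:
        return name, True  # nothing to resolve, create at the requested name
    if if_exists == "overwrite":
        return name, True  # replace the existing object
    if if_exists == "reuse":
        return name, False  # hand back the existing object
    if if_exists == "rename":
        # Assumed suffix scheme: append the first free zero-padded counter.
        counter = 1
        while f"{name}_{counter:04d}" in existing:
            counter += 1
        return f"{name}_{counter:04d}", True
    if if_exists == "raise":
        raise FileExistsError(f"object already exists: {name}")
    raise ValueError(f"unknown if_exists value: {if_exists!r}")
```

Under these assumptions, "reuse" is the only option that avoids writing anything when the object is already present.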

Examples

Create a URL to a table or run:

import tlc

table_url = tlc.Url.create_table_url(table_name, dataset_name, project_name)
run_url = tlc.Url.create_run_url(run_name, project_name)

Create or retrieve a table from some input data:

import tlc

data = {
    "column_1": [1, 2, 3],
    "column_2": ["a", "b", "c"]
}
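# Pass the naming parameters by keyword so they cannot be confused with other arguments
table = tlc.Table.from_dict(
    data,
    table_name=table_name,
    dataset_name=dataset_name,
    project_name=project_name,
    if_exists="reuse",
)
# table is now a Table object with a URL of the form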
# <project_root>/<project_name>/datasets/<dataset_name>/tables/<table_name>

Common URL manipulation:

# Some examples of common URL manipulation
import tlc

# Create a unique URL from an existing one
url = tlc.Url.create_table_url(table_name, dataset_name, project_name)
unique_url = url.create_unique() # If a file or folder exists at the URL, appends a unique suffix
assert not unique_url.exists() # The URL is guaranteed to be unique

# Create a URL next to an existing one
url = tlc.Url.create_table_url(table_name, dataset_name, project_name)
next_to_url = url.create_sibling("new_table") # Creates a URL next to the existing one, with the name "new_table"
# Any object created at next_to_url will be in the same folder as the original object,
# and thereby belong to the same project and dataset (if applicable).

Datasets vs. Tables

The primary object for storing tabular data in 3LC is a Table. In many cases, a Table corresponds directly to the common concept of a dataset as used in machine learning and data science. However, because Tables are immutable and new Tables can be derived from existing ones, a dataset will often contain multiple Tables. It is therefore more appropriate to think of a dataset as a collection of Tables rather than a single Table. In 3LC, a dataset is merely a logical way to group Tables and has no special properties or methods of its own.
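Since a dataset is only a grouping folder, the tables it contains can be found by listing directories. The sketch below assumes the standard project layout on a local file system and uses only the standard library; tables_by_dataset is a hypothetical helper, and the dataset and table names are placeholders.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def tables_by_dataset(project_dir: Path) -> Dict[str, List[str]]:
    """Map each dataset folder name to the sorted names of its table folders."""
    result: Dict[str, List[str]] = {}
    datasets_dir = project_dir / "datasets"
    if not datasets_dir.is_dir():
        return result
    for dataset in sorted(p for p in datasets_dir.iterdir() if p.is_dir()):
        tables_dir = dataset / "tables"
        if tables_dir.is_dir():
            result[dataset.name] = sorted(t.name for t in tables_dir.iterdir() if t.is_dir())
        else:
            result[dataset.name] = []
    return result


def demo() -> Dict[str, List[str]]:
    """Build a throwaway project with one dataset holding two tables."""
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "my-project"
        for table in ("initial", "edited"):
            (project / "datasets" / "my-dataset" / "tables" / table).mkdir(parents=True)
        return tables_by_dataset(project)
```

Here a single dataset holds both an original table and one derived from it, which is the typical situation described above.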