Sharing 3LC Tables and Runs#
Sharing Tables and Runs between different users or environments in 3LC is designed to be both efficient and flexible. This section outlines how to share objects (with minimal or no data duplication), manage data through Aliases, and ensure that your data is accessible in different contexts.
Efficiency: No unnecessary data copying occurs during the sharing process. This is particularly advantageous in machine learning workflows where training commonly happens on temporary VM instances that require fast, local data access.
Flexibility: Aliases provide a powerful mechanism to abstract away the physical location of your data. This makes it easier to manage and consume the data, whether it resides in a single location or multiple copies are available.
Real-world workflow#
Imagine a typical machine learning scenario: You’re training a model on a temporary VM. When the VM starts, training data is downloaded from a shared data location, such as an S3 bucket. After the VM is terminated, the results of the Run can still be shared with other users, even though the VM's local copy of the training data no longer exists. Those users can access the necessary data from the original shared location simply by setting the appropriate Aliases.
Aliases#
An Alias is simply a text string that represents a part of a path to an object.
Let’s say user A has run ML training using local data at /data/project/*.jpg, which was initially copied from a shared object store at s3://company-wide-bucket/project/*.jpg.
Prior to training, user A sets the alias PROJECT_DATA to /data/project. Any Table (whether consisting of Examples or Metrics) referencing the data will substitute /data/project with <PROJECT_DATA> before being written to disk.
Specifically, whenever /data/project/1.jpg is written to a Table, it will automatically be replaced with <PROJECT_DATA>/1.jpg.
Thus, if user B sets the alias PROJECT_DATA to s3://company-wide-bucket/project, they will be able to inspect the output of user A’s training run without having to copy the data to their local machine. Internally, when the Dashboard requests the file <PROJECT_DATA>/1.jpg, the object service will translate this to the full S3 URL s3://company-wide-bucket/project/1.jpg and return the file contents.
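The round trip described above can be sketched in plain Python. This is only an illustration of the idea, not 3LC's implementation: the helper names apply_aliases and expand_aliases are hypothetical, and the prefix-matching rules (longest prefix first, match only on path boundaries) are assumptions made for the example.

```python
def apply_aliases(url: str, aliases: dict[str, str]) -> str:
    """Replace a leading path prefix with its <ALIAS> token (done on write)."""
    # Try longer prefixes first so the most specific alias wins (an assumption).
    for name, prefix in sorted(aliases.items(), key=lambda kv: len(kv[1]), reverse=True):
        prefix = prefix.rstrip("/")
        # Only match whole path components, so /data/project2 is left alone.
        if url == prefix or url.startswith(prefix + "/"):
            return f"<{name}>" + url[len(prefix):]
    return url


def expand_aliases(url: str, aliases: dict[str, str]) -> str:
    """Replace a leading <ALIAS> token with the locally configured value (done on read)."""
    for name, prefix in aliases.items():
        token = f"<{name}>"
        if url.startswith(token):
            return prefix.rstrip("/") + url[len(token):]
    return url


# User A writes with a local alias value ...
user_a = {"PROJECT_DATA": "/data/project"}
stored = apply_aliases("/data/project/1.jpg", user_a)

# ... and user B reads with the shared-bucket value.
user_b = {"PROJECT_DATA": "s3://company-wide-bucket/project"}
resolved = expand_aliases(stored, user_b)

print(stored)    # <PROJECT_DATA>/1.jpg
print(resolved)  # s3://company-wide-bucket/project/1.jpg
```

The stored string never contains a machine-specific path, which is why the same Table can be read from either location.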
Alias configuration#
The active Alias substitutions for URLs in 3LC can be set in primary configuration files or in default-/project-alias configuration files as described in the Configuration documentation. The syntax is the same for both kinds:
aliases:
  PROJECT_DATA: /data/project
It is easier to share and maintain data with project default aliases, since these are bundled inside the project.
Aliases can also be set through environment variables. Environment variables with the prefix TLC_ALIAS_ will be picked up during initialization of the tlc module. To define an alias equivalent to the example above, you might for instance run:
export TLC_ALIAS_PROJECT_DATA="/data/project"
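The naming convention can be illustrated with a small sketch that derives alias names from an environment mapping. The helper aliases_from_env is hypothetical and is not part of the tlc module; it only shows how the TLC_ALIAS_ prefix maps to an alias name, under the assumption that the remainder of the variable name is used unchanged.

```python
import os

ALIAS_ENV_PREFIX = "TLC_ALIAS_"  # prefix named in the text above


def aliases_from_env(environ=None):
    """Return {alias_name: value} for every variable starting with TLC_ALIAS_."""
    environ = os.environ if environ is None else environ
    return {
        key[len(ALIAS_ENV_PREFIX):]: value
        for key, value in environ.items()
        # Skip a bare "TLC_ALIAS_" with no alias name after the prefix.
        if key.startswith(ALIAS_ENV_PREFIX) and len(key) > len(ALIAS_ENV_PREFIX)
    }


# A stand-in environment for illustration; pass nothing to read os.environ.
example_env = {"TLC_ALIAS_PROJECT_DATA": "/data/project", "HOME": "/home/user-a"}
env_aliases = aliases_from_env(example_env)
print(env_aliases)  # {'PROJECT_DATA': '/data/project'}
```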
To get a detailed configuration listing, run the following:
3lc config --list --detail
The output will include the origin of each alias: the kind of source it came from (config file, environment variable, etc.) and the name of that file or environment variable.
Note that the above listing will not include default-aliases as they are discovered by the indexer while the system is running.
In an interactive environment, it is possible to list the currently active Aliases with:
import tlc
for alias, value in tlc.get_registered_url_aliases().items():
    print(f"{alias}: {value}")
Persisting project aliases#
The system has support for defining and persisting per-project aliases:
import tlc
# Register a default alias for a project (if no project is given, the currently active project is detected)
tlc.register_project_url_alias(token="PROJECT_XYZ_ALIAS", path="/projects/project-xyz", project="project-xyz")
This command will add the alias to the registry (for the current session) and also persist it in a file inside the given project, e.g. path/to/project-xyz/default_aliases.3lc.yaml, so that it will be reloaded the next time the system is started. Bundling aliases with the project data in this way makes for a flexible sharing setup; see also the Configuration documentation.
Alias Best Practices#
Since Aliases offer a powerful way of abstracting data locations, some care should be taken to avoid confusion between different users.
Here are some best practices:
Use project aliases: Make it a habit to persist project aliases with the data, i.e. in the default_aliases.3lc.yaml file inside the project folder as described above. This allows other users to get a default value for the aliases, while retaining the possibility to override aliases in their local configuration.
Be careful when using very general Aliases, such as DATA or DATA_ROOT. These generic names should be reserved for company-wide shared locations and should typically not be overridden by individual users.
Consequence: Using general aliases can lead to data overlap or collisions, making it difficult to manage data.
Instead, use project-scoped or customer-scoped aliases when relevant, e.g. CUSTOMER_XYZ_DATA.
More targeted names are less likely to be ambiguous.
Keep in mind that aliases should be set for the object service as well as any notebooks or scripts that are used to access the data. This is trivial if they share the same configuration file or environment, but if they don’t, it is recommended to use the same alias names in both locations.
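The relationship between project defaults and local overrides recommended above can be pictured as a simple layered merge. This is a conceptual sketch only: effective_aliases is a hypothetical helper, the override value is made up for the example, and the code does not describe how 3LC resolves its configuration sources internally.

```python
def effective_aliases(project_defaults, local_overrides):
    """Merge alias mappings; values from local_overrides win on conflicts."""
    merged = dict(project_defaults)   # start from the values bundled with the project
    merged.update(local_overrides)    # a user's own configuration takes precedence
    return merged


# Default persisted with the project (value taken from the example above).
project_defaults = {"PROJECT_XYZ_ALIAS": "/projects/project-xyz"}

# A colleague without local data points the same alias elsewhere (hypothetical value).
local_overrides = {"PROJECT_XYZ_ALIAS": "s3://company-wide-bucket/project-xyz"}

effective = effective_aliases(project_defaults, local_overrides)
print(effective["PROJECT_XYZ_ALIAS"])  # s3://company-wide-bucket/project-xyz
```

Users who set no override simply inherit the project default, which is what makes narrowly named, project-scoped aliases safe to share.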
3lc share command-line tool#
The simple aliasing procedure described above does not work for all sharing workflows. For example, if a user wants to share a local run with a colleague, this might not be as simple as sending its JSON file as an email attachment, since a) the JSON file might hold references to other local files, b) the JSON file might reference local data using absolute paths that do not exist on the colleague’s machine, and c) the JSON file might reference local data using aliases that are not set on the colleague’s machine.
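Case c) is something a recipient could check by hand before opening shared objects. The sketch below scans a list of URL strings for <ALIAS> tokens that have no configured value. It is a hypothetical helper, not part of 3LC; the regular expression encodes an assumption about which characters an alias name may contain, and the example URLs are made up.

```python
import re

# Assumed token shape: <NAME>, where NAME looks like an identifier.
ALIAS_TOKEN = re.compile(r"<([A-Za-z_][A-Za-z0-9_]*)>")


def missing_aliases(urls, configured):
    """Return the sorted alias names referenced in urls but absent from configured."""
    referenced = {name for url in urls for name in ALIAS_TOKEN.findall(url)}
    return sorted(referenced - set(configured))


# URLs as they might appear in a shared Table (hypothetical values).
urls = [
    "<PROJECT_DATA>/1.jpg",
    "<PROJECT_DATA>/2.jpg",
    "<CUSTOMER_XYZ_DATA>/labels.json",
]
missing = missing_aliases(urls, {"PROJECT_DATA": "/data/project"})
print(missing)  # ['CUSTOMER_XYZ_DATA']
```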
An upcoming release of 3LC will include a command-line utility for bundling Tables, Runs and their referred-to resources into convenient chunks, as well as automatically setting required aliases and relativizing/absolutizing URLs as necessary.