Sharing 3LC Tables and Runs#
Sharing Tables and Runs between different users or environments in 3LC is designed to be both efficient and flexible. This section outlines how to share objects (with minimal or no data duplication), manage data through Aliases, and ensure that your data is accessible in different contexts.
Efficiency: No unnecessary data copying occurs during the sharing process. This is particularly advantageous in machine learning workflows where training commonly happens on temporary VM instances that require fast, local data access.
Flexibility: Aliases provide a powerful mechanism to abstract away the physical location of your data. This makes it easier to manage and consume the data, whether it resides in a single location or multiple copies are available.
Real-world workflow#
Imagine a typical machine learning scenario: You’re training a model on a temporary VM. When the VM starts, training data is downloaded from a shared data location, such as an S3 bucket. After the VM is terminated, the results of the Run can still be shared with other users, even though the VM's local copy of the training data no longer exists. Those users can access the necessary data from the original shared location simply by setting the appropriate Aliases.
Aliases#
An Alias is simply a text string that represents a part of a path to an object.
Let’s say user A has run ML training using local data at /data/project/*.jpg, which was initially copied from a shared object store at s3://company-wide-bucket/project/*.jpg.
Prior to training, user A sets the alias PROJECT_DATA to /data/project. Any Table (whether consisting of Examples or Metrics) referencing the data will substitute /data/project with <PROJECT_DATA> before being written to disk.
Specifically, whenever /data/project/1.jpg is written to a Table, it will automatically be replaced with <PROJECT_DATA>/1.jpg.
Thus, if user B sets the alias PROJECT_DATA to s3://company-wide-bucket/project, they will be able to inspect the output of user A’s training run without having to copy the data to their local machine. Internally, when the Dashboard requests the file <PROJECT_DATA>/1.jpg, the object service will translate this to the full S3 URL s3://company-wide-bucket/project/1.jpg and return the file contents.
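The substitution described above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not 3LC's implementation: the helper names `apply_aliases` and `expand_aliases` are hypothetical, and the `<NAME>` token syntax simply follows the examples in this section.

```python
def apply_aliases(url: str, aliases: dict[str, str]) -> str:
    """Replace a leading path that matches an alias value with its <ALIAS> token."""
    # Try longer values first so the most specific alias wins.
    for name, value in sorted(aliases.items(), key=lambda kv: len(kv[1]), reverse=True):
        prefix = value.rstrip("/")
        # Only match whole path components, so /data/projects is not rewritten.
        if url == prefix or url.startswith(prefix + "/"):
            return f"<{name}>" + url[len(prefix):]
    return url


def expand_aliases(url: str, aliases: dict[str, str]) -> str:
    """Replace a leading <ALIAS> token with the value registered for it."""
    for name, value in aliases.items():
        token = f"<{name}>"
        if url.startswith(token):
            return value.rstrip("/") + url[len(token):]
    return url


# User A writes a Table while the alias points at the local copy of the data.
user_a = {"PROJECT_DATA": "/data/project"}
stored = apply_aliases("/data/project/1.jpg", user_a)

# User B reads the same Table with the alias pointing at the shared bucket.
user_b = {"PROJECT_DATA": "s3://company-wide-bucket/project"}
resolved = expand_aliases(stored, user_b)
print(stored, resolved)
```

Because only the aliased form is stored, the same Table resolves to different physical locations depending on who reads it.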
Setting Aliases#
Aliases can be set in the main configuration file or in data configuration files (see the Configuration module documentation for details):
aliases:
  PROJECT_DATA: /data/project
Aliases can also be set through environment variables. Environment variables with the prefix TLC_ALIAS_ will be picked up during initialization of the tlc module. To define an alias equivalent to the example above, you might for instance run:
export TLC_ALIAS_PROJECT_DATA="/data/project"
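To illustrate the naming convention, the following standard-library sketch collects alias definitions from an environment mapping. It mirrors the behavior described above but is not the tlc module's actual code; the function name `aliases_from_environment` is hypothetical.

```python
import os

ALIAS_PREFIX = "TLC_ALIAS_"


def aliases_from_environment(environ=None) -> dict[str, str]:
    """Return {alias_name: value} for every variable starting with TLC_ALIAS_."""
    environ = os.environ if environ is None else environ
    return {
        # Strip the prefix: TLC_ALIAS_PROJECT_DATA defines the alias PROJECT_DATA.
        key[len(ALIAS_PREFIX):]: value
        for key, value in environ.items()
        if key.startswith(ALIAS_PREFIX) and len(key) > len(ALIAS_PREFIX)
    }


example_env = {"TLC_ALIAS_PROJECT_DATA": "/data/project", "HOME": "/home/user"}
print(aliases_from_environment(example_env))
```

Since the variables are picked up during initialization of the tlc module, a script that sets them through `os.environ` would presumably need to do so before `import tlc`.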
To list the currently active Aliases, run:
import tlc

# Print every registered alias together with the location it points to.
for alias, value in tlc.get_registered_url_aliases().items():
    print(f"{alias}: {value}")
Alias Best Practices#
Since Aliases offer a powerful way of abstracting data locations, some care should be taken to avoid confusion between different users.
Here are some best practices:
Be careful with very general alias names, such as DATA or DATA_ROOT. These generic names should be reserved for company-wide shared locations and should typically not be overridden by individual users.
Consequence: Using general aliases can lead to data overlap or collisions, making it difficult to manage data.
Instead, use project-scoped or customer-scoped aliases when relevant, e.g. CUSTOMER_XYZ_DATA.
More targeted names are less likely to be ambiguous.
Keep in mind that aliases should be set for the object service as well as for any notebooks or scripts that are used to access the data. This is trivial if they share the same configuration file or environment; if they don’t, make sure the same alias names are defined in both places.
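One way to catch mismatches between environments is to compare their alias names directly. The sketch below assumes each environment's aliases are already available as a dictionary, such as the one returned by `tlc.get_registered_url_aliases()`; the helper `compare_alias_names` is hypothetical and not part of 3LC.

```python
def compare_alias_names(service: dict[str, str], notebook: dict[str, str]) -> dict[str, list[str]]:
    """Report alias names that are defined in one environment but not the other."""
    # Values are allowed to differ (that is the point of aliases); only names are compared.
    return {
        "only_in_service": sorted(service.keys() - notebook.keys()),
        "only_in_notebook": sorted(notebook.keys() - service.keys()),
    }


service_aliases = {"PROJECT_DATA": "s3://company-wide-bucket/project"}
notebook_aliases = {"PROJECT_DATA": "/data/project", "CUSTOMER_XYZ_DATA": "/data/xyz"}
print(compare_alias_names(service_aliases, notebook_aliases))
```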
3lc share command-line tool#
The simple aliasing procedure described above does not cover every sharing workflow. For example, if a user wants to share a local Run with a colleague, sending the JSON file as an email attachment may not be enough, because the JSON file might a) hold references to other local files, b) reference local data using absolute paths that do not exist on the colleague’s machine, or c) reference local data using aliases that are not set on the colleague’s machine.
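Problems b) and c) can be detected ahead of time by scanning a JSON document for references that are unlikely to resolve elsewhere. The following is a rough sketch under stated assumptions: the JSON content and its field names are invented for illustration and do not reflect the real Run format, `find_unportable_references` is a hypothetical helper, and only POSIX-style absolute paths are recognized.

```python
import json
import re

# Matches a leading alias token such as <PROJECT_DATA>.
ALIAS_TOKEN = re.compile(r"^<([A-Za-z0-9_]+)>")


def find_unportable_references(node, known_aliases: set[str]) -> list[str]:
    """Collect string values that start with an unknown alias or an absolute path."""
    problems = []
    if isinstance(node, dict):
        for child in node.values():
            problems += find_unportable_references(child, known_aliases)
    elif isinstance(node, list):
        for child in node:
            problems += find_unportable_references(child, known_aliases)
    elif isinstance(node, str):
        match = ALIAS_TOKEN.match(node)
        if match and match.group(1) not in known_aliases:
            problems.append(node)  # case c: alias not set on this machine
        elif node.startswith("/"):
            problems.append(node)  # case b: absolute path that may not exist here
    return problems


# Hypothetical document; the keys below are made up for this example.
run_json = json.loads(
    '{"input": "<PROJECT_DATA>/1.jpg", "weights": "/home/a/model.pt", "epochs": 3}'
)
print(find_unportable_references(run_json, known_aliases=set()))
```

A check like this only flags candidates; it does not follow references into other files, which is problem a) above.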
The next release of 3LC will include a command-line utility for bundling Tables, Runs and their referred-to resources into convenient chunks, as well as automatically setting required aliases and relativizing/absolutizing URLs as necessary.