URLs¶

3LC objects are identified by URLs, which are represented by the tlc.Url class in the Python API. URLs can refer to files on disk or to objects in cloud storage such as S3. An object’s URL is generally the location from which it was read and/or to which it may be written.

Schemes¶

3LC supports file paths and various cloud storage locations, each with its own scheme. A URL scheme is the first part of the URL, up to the first :. The following cards summarize the available schemes, how to configure credentials, and how to install any necessary dependencies.

File file://

URLs representing file paths may refer to data on a local disk, a mapped network drive, etc. URLs with no scheme are interpreted as file path URLs.
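
As a rough sketch, a file-path URL might be constructed as follows. This assumes the tlc.Url constructor accepts a plain path or URL string; the path used here is purely illustrative:

import tlc

# Illustrative path only; assumes tlc.Url can be constructed from a plain string
plain_path_url = tlc.Url("/data/images/cat.jpg")             # no scheme: interpreted as a file path
explicit_file_url = tlc.Url("file:///data/images/cat.jpg")   # same location, with an explicit file:// scheme

print(plain_path_url.exists())  # True if the file exists on the local file system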

Credentials

Reading from and writing to the file system requires that the user or process running 3LC has the necessary permissions for the relevant files and directories. Make sure it has appropriate read and/or write access to the paths you intend to use (including any referenced bulk data), or file operations may fail.

Amazon S3 s3://

Amazon S3 URLs refer to data stored in S3 buckets.

Credentials

The tlc package generally follows the standard boto3 credential resolution order when accessing data stored on S3. In particular, this means that AWS environment variables take precedence, followed by the shared credentials file (~/.aws/credentials), then the AWS config file (~/.aws/config), and finally the instance metadata service when running on an Amazon EC2 instance with an IAM role configured.
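
For example, a minimal way to satisfy the first step of this order is to set the standard AWS environment variables before running any process that uses tlc. The values below are placeholders; the shared credentials file, the config file, or an EC2 instance role work just as well:

import os

# Placeholder credentials for illustration only
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"  # region of the bucket being accessed
# Processes that use tlc and access s3:// URLs will pick these up through boto3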

Google Cloud Storage gcs://

Google Cloud Storage (GCS) URLs refer to data stored in GCS buckets.

Credentials

The tlc package generally follows Google’s Application Default Credentials (ADC) resolution order when accessing data stored on GCS. In particular, this means that the GOOGLE_APPLICATION_CREDENTIALS environment variable takes precedence, followed by the gcloud application default credentials, and finally the instance metadata service when running on a Google Compute Engine (GCE) instance with an attached service account.
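
For example, a service account key can be supplied through the GOOGLE_APPLICATION_CREDENTIALS variable; the path below is a placeholder:

import os

# Placeholder path to a service account key file (JSON)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
# Processes that use tlc and access GCS URLs will pick this up through the ADC lookup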

Installation

GCS support is not included by default; enable it by installing the 3lc[gcs] extra (for example, pip install "3lc[gcs]").

Azure Blob Storage abfs://

Azure Blob Storage URLs refer to data stored in Azure Blob Storage containers.

Credentials

The tlc package supports access to Azure Blob Storage using AZURE_STORAGE_* environment variables. Common combinations are listed below, followed by a configuration sketch:

  • AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY

  • AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_SAS_TOKEN

  • AZURE_STORAGE_CONNECTION_STRING
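
As a sketch, the first combination above could be configured as follows; the values are placeholders, and a SAS token or connection string can be used instead:

import os

# Placeholder values; use whichever AZURE_STORAGE_* combination fits your setup
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "mystorageaccount"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "<account-key>"
# Processes that use tlc and access abfs:// URLs will pick these up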

Installation

Azure Blob Storage support is not included by default; enable it by installing the 3lc[abfs] extra (for example, pip install "3lc[abfs]").

Cloud credential configuration across multiple processes using tlc

It is common to run multiple processes that each use the tlc Python package independently, such as a training notebook and the 3LC Object Service. For these processes to interoperate correctly with respect to cloud storage URLs, their cloud credentials must be configured in a compatible way. For example, if credentials are supplied via environment variables, the same environment variables should typically be set for each process.

Object URLs¶

The tlc Python package provides a standard set of method arguments used for creating or retrieving objects by URL. These are summarized below:

  • table_name / run_name: The name of the object, corresponding to the last part of the URL.

  • dataset_name: The dataset name to use. Defaults to default-dataset.

  • project_name: The project name to use. Defaults to default-project.

  • root_url: The project root URL to use. Defaults to the PROJECT_ROOT_URL configuration variable.

  • if_exists: How to handle the case where the object already exists. Typical values are “overwrite”, “reuse”, “rename”, and “raise”.

  • table_url / run_url: A fully-qualified custom URL for the object, disregarding the project folder structure.

Examples¶

Create a URL to a table or run:

import tlc

# Placeholder names for illustration
table_name, dataset_name, project_name, run_name = "my-table", "my-dataset", "my-project", "my-run"

table_url = tlc.Url.create_table_url(table_name, dataset_name, project_name)
run_url = tlc.Url.create_run_url(run_name, project_name)

Create or retrieve a table from some input data:

import tlc

# Placeholder names for illustration
table_name, dataset_name, project_name = "my-table", "my-dataset", "my-project"

data = {
    "column_1": [1, 2, 3],
    "column_2": ["a", "b", "c"],
}
table = tlc.Table.from_dict(data, table_name, dataset_name, project_name, if_exists="reuse")
# table is now a Table object with a URL of the form
# <project_root>/<project_name>/datasets/<dataset_name>/tables/<table_name>

Common URL manipulation:

# Some examples of common URL manipulation
import tlc

# Placeholder names for illustration
table_name, dataset_name, project_name = "my-table", "my-dataset", "my-project"

# Create a unique URL from an existing one
url = tlc.Url.create_table_url(table_name, dataset_name, project_name)
unique_url = url.create_unique()  # If a file or folder exists at the URL, appends a unique suffix
assert not unique_url.exists()  # The URL is guaranteed to be unique

# Create a URL next to an existing one
url = tlc.Url.create_table_url(table_name, dataset_name, project_name)
next_to_url = url.create_sibling("new_table")  # Creates a URL next to the existing one, with the name "new_table"
# Any object created at next_to_url will be in the same folder as the original object,
# and thereby belong to the same project and dataset (if applicable).