URLs
3LC objects are identified by URLs, which are represented by the tlc.Url class in the Python
API. URLs can refer to files on disk, or objects on cloud storage such as S3. An object’s URL is generally the location
that it was read from and/or may be written to.
Schemes
3LC supports file paths and various cloud storage locations, each with a different scheme. A URL scheme is the first
part of the URL, up to the first :. The following cards summarize the available schemes, how to configure credentials
and how to install necessary dependencies.
File
file://
URLs representing file paths may refer to data on a local disk, a mapped network drive, etc. URLs with no scheme are interpreted as file path URLs.
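As a rough illustration of the scheme rule, the standalone sketch below takes the text before the first colon as the scheme and falls back to file when no scheme from this page is present. The helper scheme_of and the KNOWN_SCHEMES set are hypothetical names introduced for this example; the sketch is not how tlc.Url parses URLs internally.

```python
# Schemes listed on this page; the set itself is an assumption of this sketch.
KNOWN_SCHEMES = {"file", "s3", "gcs", "abfs"}


def scheme_of(url: str) -> str:
    """Return the scheme of `url`, treating scheme-less strings as file paths."""
    prefix, separator, _ = url.partition(":")
    # Accept the prefix only when it is a scheme listed above. This also keeps
    # Windows drive letters such as "C:" from being mistaken for a scheme.
    # Note: any other prefix (for example "http") is reported as "file" here.
    if separator and prefix.lower() in KNOWN_SCHEMES:
        return prefix.lower()
    return "file"


if __name__ == "__main__":
    for example in ("s3://my-bucket/data", "/home/user/data", "C:\\data\\images"):
        print(example, "->", scheme_of(example))
```

Restricting the check to a known set is a design choice of this sketch: it trades generality for predictable handling of plain paths that happen to contain a colon.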
Credentials
To read from or write to the file system, 3LC requires access to the underlying file system and the necessary permissions for the relevant files and directories. Make sure the user or process running 3LC has appropriate read and/or write permissions for the paths you intend to use (including referenced bulk data), or file operations may fail.
Amazon S3
s3://
Amazon S3 URLs refer to data stored in S3 buckets.
Credentials
The tlc package generally uses the boto3 credentials order when accessing data stored on S3. In particular, AWS
environment variables take precedence, followed by the shared credential file (~/.aws/credentials), the AWS config
file (~/.aws/config), and finally the instance metadata service when running on an Amazon EC2 instance that has an
IAM role configured.
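When S3 access fails, it can help to check which of these sources is likely to be picked up. The sketch below is a simplified diagnostic that uses only the standard library. The function name likely_aws_credential_source is hypothetical, and the sketch assumes the default file locations: it ignores AWS_PROFILE, AWS_SHARED_CREDENTIALS_FILE, and AWS_CONFIG_FILE, and it does not verify that a file actually contains credentials.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def likely_aws_credential_source(
    env: Mapping[str, str] = os.environ,
    home: Optional[Path] = None,
) -> str:
    """Guess which credential source would be consulted first (simplified)."""
    home = Path.home() if home is None else home
    # 1. Environment variables take precedence.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"
    # 2. Shared credential file at its default location.
    if (home / ".aws" / "credentials").is_file():
        return "shared credential file"
    # 3. AWS config file at its default location.
    if (home / ".aws" / "config").is_file():
        return "AWS config file"
    # 4. Nothing found locally; only the instance metadata service remains.
    return "instance metadata service (EC2 with an IAM role), if available"


if __name__ == "__main__":
    print(likely_aws_credential_source())
```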
Google Cloud Storage
gcs://
Google Cloud Storage (GCS) URLs refer to data stored in GCS buckets.
Credentials
The tlc package generally uses Google’s application default credentials order when accessing data stored on GCS. In
particular, the GOOGLE_APPLICATION_CREDENTIALS environment variable takes precedence, followed by the gcloud
application default credentials, and finally the instance metadata service when running on a Google Compute Engine
(GCE) instance with an attached service account.
Installation
GCS support is not enabled by default and may be enabled by installing the 3lc[gcs] extra.
Azure Blob Storage
abfs://
Azure Blob Storage URLs refer to data stored in Azure Blob containers.
Credentials
The tlc package supports access to Azure Blob Storage using AZURE_STORAGE environment variables. Common variations
include:
- AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY
- AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_SAS_TOKEN
- AZURE_STORAGE_CONNECTION_STRING
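To check which of these variations is configured in the current environment, a small standard-library sketch could look like the following. The names AZURE_VARIANTS and configured_azure_variant are hypothetical, and the order in which the variations are tested is an assumption of this example, not a documented precedence.

```python
import os
from typing import Mapping, Optional, Tuple

# The three variations listed above, each as a tuple of required variables.
AZURE_VARIANTS: Tuple[Tuple[str, ...], ...] = (
    ("AZURE_STORAGE_ACCOUNT_NAME", "AZURE_STORAGE_ACCOUNT_KEY"),
    ("AZURE_STORAGE_ACCOUNT_NAME", "AZURE_STORAGE_SAS_TOKEN"),
    ("AZURE_STORAGE_CONNECTION_STRING",),
)


def configured_azure_variant(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, ...]]:
    """Return the first variation whose variables are all set and non-empty."""
    for variant in AZURE_VARIANTS:
        if all(env.get(name) for name in variant):
            return variant
    return None


if __name__ == "__main__":
    variant = configured_azure_variant()
    print("Configured:", " and ".join(variant) if variant else "none")
```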
Installation
Azure Blob Storage support is not enabled by default and may be enabled by installing the 3lc[abfs] extra.
Cloud credential configuration across multiple processes using tlc
It is common with 3LC to run multiple processes that each use the tlc Python package independently, such as a training
notebook and the 3LC Object Service. In order for those different components in different processes to interoperate
correctly with respect to cloud storage URLs, it is important to configure their cloud credentials in a compatible way.
For example, if cloud credentials are configured via environment variables, it is likely that the same environment
variables should be set for each process.
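One way to keep environment-based credentials consistent is to pass the same variables explicitly when launching a second process. The sketch below assumes credentials are configured through environment variables whose names start with the prefixes in CREDENTIAL_PREFIXES. That tuple and the helpers credential_env and run_with_credentials are hypothetical names for this example, and the command to launch is left to the caller.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping

# Prefixes of the credential-related variables mentioned on this page.
CREDENTIAL_PREFIXES = ("AWS_", "AZURE_STORAGE_", "GOOGLE_APPLICATION_CREDENTIALS")


def credential_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Select only the credential-related variables from `env`."""
    return {
        name: value
        for name, value in env.items()
        if name.startswith(CREDENTIAL_PREFIXES)
    }


def run_with_credentials(
    command: List[str], source_env: Mapping[str, str]
) -> subprocess.CompletedProcess:
    """Run `command` with the credential variables of `source_env` applied."""
    # Start from the current environment so the child can still find its
    # interpreter and libraries, then overlay the credential variables.
    child_env = {**os.environ, **credential_env(source_env)}
    return subprocess.run(
        command, env=child_env, capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    show = "import os; print(os.environ.get('AZURE_STORAGE_ACCOUNT_NAME'))"
    result = run_with_credentials(
        [sys.executable, "-c", show], {"AZURE_STORAGE_ACCOUNT_NAME": "demo"}
    )
    print(result.stdout.strip())
```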
Object URLs
The tlc Python package provides a standard set of method arguments used for creating or retrieving objects by URLs.
These are summarized in the table below:
| Parameter | Description |
|---|---|
| table_name / run_name | The name of the object, corresponding to the last part of the URL. |
| dataset_name | The dataset name to use. Defaults to |
| project_name | The project name to use. Defaults to |
| | The project root URL to use. Defaults to the |
| if_exists | How to handle the case where the object already exists. Typical values are “overwrite”, “reuse”, “rename”, and “raise”. |
| | A fully-qualified custom URL to the object, disregarding the project folder structure. |
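To make the four typical values for handling an existing object more concrete, the sketch below applies them to a plain dictionary that stands in for storage. The function store, the numeric rename suffix, and the exact behavior of each policy are assumptions made for illustration; the tlc package may implement these policies differently.

```python
from typing import Any, Dict


def store(objects: Dict[str, Any], name: str, value: Any, if_exists: str = "raise") -> str:
    """Store `value` under `name` and return the name that now refers to it."""
    if name not in objects:
        objects[name] = value
        return name
    if if_exists == "overwrite":
        # Replace the existing object.
        objects[name] = value
        return name
    if if_exists == "reuse":
        # Keep the existing object and ignore the new value.
        return name
    if if_exists == "rename":
        # Find the first free name with a numeric suffix (format is assumed).
        counter = 1
        while f"{name}_{counter:04d}" in objects:
            counter += 1
        new_name = f"{name}_{counter:04d}"
        objects[new_name] = value
        return new_name
    if if_exists == "raise":
        raise FileExistsError(f"Object {name!r} already exists")
    raise ValueError(f"Unknown if_exists policy: {if_exists!r}")


if __name__ == "__main__":
    objects: Dict[str, Any] = {}
    print(store(objects, "train", [1, 2, 3]))
    print(store(objects, "train", [4, 5, 6], if_exists="rename"))
```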
Examples
Create a URL to a table or run:
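import tlc
# Placeholder names, chosen here for illustration only
table_name, dataset_name, project_name, run_name = "my-table", "my-dataset", "my-project", "my-run"
table_url = tlc.Url.create_table_url(table_name, dataset_name, project_name)
run_url = tlc.Url.create_run_url(run_name, project_name)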
Create or retrieve a table from some input data:
import tlc
data = {
    "column_1": [1, 2, 3],
    "column_2": ["a", "b", "c"],
}
table = tlc.Table.from_dict(data, table_name, dataset_name, project_name, if_exists="reuse")
# table is now a Table object with a URL of the form
# <project_root>/<project_name>/datasets/<dataset_name>/tables/<table_name>
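The folder layout mentioned in the comment above can be spelled out with plain string handling. The sketch below only composes a URL string following that pattern; the helper table_url_layout and the example values are hypothetical, and real code would normally obtain URLs from tlc.Url rather than build them by hand.

```python
import posixpath


def table_url_layout(
    project_root: str, project_name: str, dataset_name: str, table_name: str
) -> str:
    """Compose <project_root>/<project_name>/datasets/<dataset_name>/tables/<table_name>."""
    # posixpath.join always uses forward slashes, which suits URL-like strings.
    return posixpath.join(
        project_root, project_name, "datasets", dataset_name, "tables", table_name
    )


if __name__ == "__main__":
    print(table_url_layout("s3://my-bucket/3lc", "my-project", "my-dataset", "train"))
```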
Common URL manipulation:
# Some examples of common URL manipulation
import tlc
# Create a unique URL from an existing one
url = tlc.Url.create_table_url(table_name, dataset_name, project_name)
unique_url = url.create_unique() # If a file or folder exists at the URL, appends a unique suffix
assert not unique_url.exists() # The URL is guaranteed to be unique
# Create a URL next to an existing one
url = tlc.Url.create_table_url(table_name, dataset_name, project_name)
next_to_url = url.create_sibling("new_table") # Creates a URL next to the existing one, with the name "new_table"
# Any object created at next_to_url will be in the same folder as the original object,
# and thereby belong to the same project and dataset (if applicable).
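For intuition about what the unique and sibling operations do, here is a local-filesystem analogue written with pathlib only. It is a sketch under stated assumptions: the helpers unique_path and sibling_path are hypothetical, the numeric suffix format is invented for this example, and neither function reflects how tlc.Url is implemented.

```python
from pathlib import Path


def unique_path(path: Path) -> Path:
    """Return `path` if nothing exists there, else a variant with a numeric suffix."""
    if not path.exists():
        return path
    counter = 1
    while True:
        candidate = path.with_name(f"{path.name}_{counter:04d}")
        if not candidate.exists():
            return candidate
        counter += 1


def sibling_path(path: Path, name: str) -> Path:
    """Return a path in the same folder as `path`, with `name` as its last part."""
    return path.with_name(name)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp) / "tables" / "train"
        table.mkdir(parents=True)
        print(unique_path(table).name)         # differs from "train"
        print(sibling_path(table, "val").parent == table.parent)
```

Unlike a real storage backend, this check is not atomic: another process could create the candidate path between the existence test and its later use.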