3LC Background Indexing System

3LC maintains an index over all known projects and objects. This is kept up to date using a background indexing system and normally has little direct impact on 3LC usage.

The index is used for two purposes:

  1. To serve the 3LC Dashboard with the available Tables and Runs (see the Object Service)

  2. To make it possible to deduce dataset lineage and keep track of revisions, as in Table.latest

Background indexing is not started until requested, either explicitly during Object Service startup, or lazily whenever Table.latest is called. Indexing can be started and stopped without impact. It will be restarted by any calls that depend on the latest index status.
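
The start-on-demand behavior described above can be pictured with a small, generic sketch. This is not the 3LC implementation: the LazyIndexer class, its method names, and the scan counter are all hypothetical, and the "scan" is a placeholder. It only illustrates a background worker that is created on first use, can be stopped safely, and is started again by the next call that needs it.

```python
import threading


class LazyIndexer:
    """Hypothetical stand-in for a background indexer (not the 3LC API)."""

    def __init__(self, interval=0.05):
        self._interval = interval
        self._lock = threading.Lock()
        self._thread = None
        self._stop_event = None
        self.scan_count = 0

    def _run(self, stop_event):
        # Repeat a placeholder "scan" until this worker is asked to stop.
        while not stop_event.is_set():
            self.scan_count += 1
            stop_event.wait(self._interval)

    def is_running(self):
        """Report whether a worker thread currently exists and is alive."""
        return self._thread is not None and self._thread.is_alive()

    def ensure_running(self):
        """Start the worker only if it is not already running."""
        with self._lock:
            if self.is_running():
                return
            self._stop_event = threading.Event()
            self._thread = threading.Thread(
                target=self._run, args=(self._stop_event,), daemon=True
            )
            self._thread.start()

    def stop(self):
        """Stop the worker; a later ensure_running() starts a fresh one."""
        with self._lock:
            thread, stop_event = self._thread, self._stop_event
        if thread is not None:
            stop_event.set()
            thread.join()

    def latest(self):
        """A call that depends on the index starts indexing on demand."""
        self.ensure_running()
        return self.scan_count
```

In this sketch, constructing LazyIndexer() starts nothing. The first latest() call starts the worker, stop() ends it, and the next latest() call starts it again.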

Indexing Configuration

To create the index, 3LC recursively scans all configured project locations to discover Tables, Runs, and other objects. The following configuration settings are important:

  • indexing.project-root-url: # a single URL

  • indexing.project-scan-urls: # a list of URLs

See the configuration documentation for more information.

When changes occur in any indexed location, scanning must be (re-)performed in order to keep the index up-to-date. However, scanning project locations (especially remote cloud storage) can take time and incur expense, so 3LC uses a timestamp-based optimization system to minimize unnecessary scans.

Hierarchical timestamp files

3LC automatically creates and manages index.3lc.json timestamp files containing UTC timestamps that track when each location was last modified. This optimization means:

Faster change detection:

  • If a location has not changed since the last timestamp, the indexer skips re-scanning

  • When you call Table.latest(), it can quickly determine if new revisions exist

  • For cloud storage (S3, GCS, etc.), this dramatically reduces API calls and costs

  • Changes are quickly propagated to the 3LC Dashboard (from the Object Service)
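
As a rough illustration of the skip logic, the sketch below compares a stored modification time with the time of the last scan. It assumes a local directory and a hypothetical file layout in which index.3lc.json holds a JSON object with a "timestamp" key in ISO 8601 format. The real file contents are not documented here, and the helper names needs_rescan and write_timestamp are invented for this example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

TIMESTAMP_FILENAME = "index.3lc.json"


def write_timestamp(location, when=None):
    """Record when `location` was last modified (assumed file layout)."""
    when = when or datetime.now(timezone.utc)
    path = Path(location) / TIMESTAMP_FILENAME
    path.write_text(json.dumps({"timestamp": when.isoformat()}))
    return path


def needs_rescan(location, last_scan):
    """Return True if `location` should be scanned again."""
    ts_file = Path(location) / TIMESTAMP_FILENAME
    if last_scan is None or not ts_file.is_file():
        # No earlier scan or no timestamp file: fall back to a full scan.
        return True
    try:
        payload = json.loads(ts_file.read_text())
        modified = datetime.fromisoformat(payload["timestamp"])
        # Rescan only if the location changed after the last scan.
        return modified > last_scan
    except (ValueError, KeyError, TypeError):
        # Unreadable or unexpected contents: scan to stay on the safe side.
        return True
```

With this layout, a location whose recorded timestamp is older than the last scan is skipped, while a missing or unreadable file results in a full scan.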

To make this possible, the creation and modification of Runs, Tables, and revisions are tracked by the Python package, and timestamps are updated automatically. Edits from the Dashboard or other 3LC processes are handled in the same way.

Behind the scenes

  • Small index.3lc.json timestamp files appear in your project directories (safe to ignore in version control)

  • Missing timestamp files are handled gracefully with automatic fallback to full scans

  • The system uses debounced writing to coalesce closely-timed similar updates and minimize storage operations
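
Debounced writing can be sketched generically as follows. This is an illustration of the technique only, not the code 3LC uses: the DebouncedWriter class, its delay parameter, and the write callback are hypothetical. Each request restarts a short timer, so a burst of closely timed updates produces a single write with the most recent value.

```python
import threading


class DebouncedWriter:
    """Hypothetical helper that coalesces bursts of updates into one write."""

    def __init__(self, write, delay=0.05):
        self._write = write  # callable that performs the actual write
        self._delay = delay  # quiet period, in seconds, before writing
        self._lock = threading.Lock()
        self._timer = None
        self._pending = None
        self._generation = 0

    def request(self, value):
        """Schedule `value` to be written once updates stop arriving."""
        with self._lock:
            self._pending = value
            self._generation += 1
            if self._timer is not None:
                self._timer.cancel()  # a newer request supersedes the old one
            self._timer = threading.Timer(
                self._delay, self._flush, args=(self._generation,)
            )
            self._timer.daemon = True
            self._timer.start()

    def _flush(self, generation):
        with self._lock:
            if generation != self._generation:
                return  # superseded while waiting for the lock
            value = self._pending
            self._pending = None
            self._timer = None
        self._write(value)
```

For example, five request() calls made in quick succession lead to one call of the write callback, carrying the value from the last request.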

Important Limitations for Python Users

External file changes: If you manually copy, move, or modify 3LC project files outside of the Python package (e.g., using a file explorer, the command line, or other tools), these changes won’t be detected automatically. The following options are available:

  1. Wait for a re-index that runs with force=True, which ignores timestamp files for a specific object type (Table, Run, etc.).

  2. Delete the relevant timestamp files for the project and the root.

  3. Restart the Python process.
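
Deleting the timestamp files, as listed above, can be done by hand or with a small script. The sketch below is a hypothetical helper for a local project root. It assumes the index.3lc.json files sit directly in the root folder and in the project folder, and it defaults to a dry run so that nothing is deleted until you have checked the list of files.

```python
from pathlib import Path

TIMESTAMP_FILENAME = "index.3lc.json"


def remove_timestamp_files(root, project_name, dry_run=True):
    """List, and optionally delete, timestamp files for a root and a project.

    Assumes the files live directly in `root` and in `root/project_name`.
    """
    root = Path(root)
    candidates = [
        root / TIMESTAMP_FILENAME,                 # root location
        root / project_name / TIMESTAMP_FILENAME,  # project folder
    ]
    found = [path for path in candidates if path.is_file()]
    if not dry_run:
        for path in found:
            path.unlink()
    return found
```

Calling remove_timestamp_files(root, "project_x") only reports the files it would delete. Passing dry_run=False removes them, after which the next scan of those locations cannot rely on a timestamp and falls back to a full scan.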

Read-only data sources and indexing

When publishing read-only datasets to locations where a running 3LC Object Service cannot write timestamp files (e.g., shared cloud buckets), creating the timestamp files manually can prevent unnecessary re-scanning and improve performance. The TimestampHelper.instance() singleton has methods for handling timestamp operations:

# Url and TimestampHelper come from the 3LC Python package (imports omitted here).
scan_url = Url("s3://mybucket/3lc-projects")
TimestampHelper.instance().process_timestamp(scan_url)  # root location / scan URL
TimestampHelper.instance().process_timestamp(scan_url / "project_x")  # project folder location