Tables#

Tables are the primary way to store data in 3LC. Whether it is a dataset, an overview of your Runs, or a stream of collected metrics, data contained within a Table can be visualized, analyzed and sometimes even modified in the 3LC Dashboard.

While Tables themselves are always immutable, 3LC enables non-destructive editing of all Tables through sparsely represented Revisions. Much like a version-control system such as Git, this lets you change your data without copying it, while maintaining a full, reversible history of every change.
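As a rough illustration (plain Python, not the 3LC API), a sparse revision can be modeled as an overlay that stores only the changed rows and defers everything else to its parent:

```python
# Illustrative sketch only -- not the 3LC implementation.
# A revision stores just the rows that differ from its parent,
# so editing never copies or mutates the original data.

class Revision:
    def __init__(self, parent=None, base_rows=None):
        self.parent = parent        # previous revision (None for the root)
        self.base_rows = base_rows  # full data, only held by the root
        self.edits = {}             # sparse overrides: row index -> new row

    def with_edit(self, index, row):
        """Return a NEW revision; the current one stays immutable."""
        child = Revision(parent=self)
        child.edits = {index: row}
        return child

    def row(self, index):
        if index in self.edits:
            return self.edits[index]
        if self.parent is not None:
            return self.parent.row(index)
        return self.base_rows[index]

root = Revision(base_rows=[{"label": "cat"}, {"label": "dog"}])
rev1 = root.with_edit(1, {"label": "wolf"})

print(root.row(1))  # original data is untouched: {'label': 'dog'}
print(rev1.row(1))  # the revision sees the edit: {'label': 'wolf'}
```

Because each revision only holds its own edits, the full history stays cheap to store, and any earlier revision can still be read in full.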

3LC contains several utilities for creating Tables from popular frameworks and formats such as Pandas, PyTorch, HuggingFace and COCO. Tables can also be serialized and cached to disk as parquet files.

[2]:
import tlc
[12/05 13:58:54 tlc.core.objects.tables.from_table]: UMAPTable imported; umap-learn must be installed
[12/05 13:58:55 tlc.integration]: integration/HuggingFace imported
[12/05 13:58:55 tlc.integration]: integration/detectron2 not installed, skipping

Let’s create a simple table from a list of columns and inspect its properties.

For educational purposes, our table will be created so that it can return its data either as a tuple or as a dictionary. This highlights the duality between the "table row view" of the table and the "sample view" of the table.

[3]:
# Samples should be represented as tuples of ints when used in ML training
sample_type = tlc.SampleType.from_structure((tlc.Int("col_1"), tlc.Int("col_2")))

# TableFromPydict allows you to create a table directly from a python dictionary
table = tlc.TableFromPydict(
    data={"col_1": [1, 2, 3], "col_2": [4, 5, 6]},
    override_table_rows_schema=sample_type.schema,
)
print(table)
{
  "type":"TableFromPydict",
  "created":"2023-12-05 12:58:55.952041+00:00",
  "row_cache_populated":false,
  "override_table_rows_schema":{
    "display_name":"value",
    "sample_type":"horizontal_tuple",
    "values":{
      "col_1":{
        "display_name":"col_1",
        "sample_type":"int",
        "value":{
          "type":"int32"
        }
      },
      "col_2":{
        "display_name":"col_2",
        "sample_type":"int",
        "value":{
          "type":"int32"
        }
      }
    }
  },
  "row_count":-1,
  "input_data":{
    "col_1":[
      1,
      2,
      3
    ],
    "col_2":[
      4,
      5,
      6
    ]
  }
}
[4]:
# Print the columns of the table
print(table.columns)
['col_1', 'col_2']
[5]:
# Print the row schema of the table
print(table.row_schema)
{
  "display_name":"Row schema",
  "description":"Row schema for TableFromPydict",
  "sample_type":"horizontal_tuple",
  "values":{
    "col_1":{
      "display_name":"col_1",
      "sample_type":"int",
      "value":{
        "type":"int32"
      }
    },
    "col_2":{
      "display_name":"col_2",
      "sample_type":"int",
      "value":{
        "type":"int32"
      }
    }
  }
}

Data Access#

Accessing data within a Table can be done in several ways:

Sample View: When indexed, table[i] returns the “sample view” of the data. This format is suitable for feeding directly into a machine learning pipeline, making tlc.Table a convenient drop-in replacement for a PyTorch dataset. For more details, refer to the SampleType documentation.

Table Row View: Accessing table.table_rows[i] provides the “table view” of the data, consisting of lightweight references to the underlying data. This is the format that is serialized and sent to the Dashboard.

Iteration: Iterating over a table directly yields the “sample view” of the data.

Bulk Data Serialization: All the rows in a table can be serialized into a single bytestream using the get_rows_as_binary() instance method. This is useful for transferring or storing table data. It is the “table view” of the data that is serialized.
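To make the two views concrete, here is a plain-Python sketch (not the tlc implementation) of both views over the same underlying columns, assuming the tuple order follows the schema's column order:

```python
# Illustrative sketch of the two views -- not the tlc implementation.
columns = {"col_1": [1, 2, 3], "col_2": [4, 5, 6]}
column_order = ["col_1", "col_2"]  # assumed to follow the row schema

def sample_view(i):
    """Tuple form, suitable for feeding into an ML pipeline."""
    return tuple(columns[name][i] for name in column_order)

def table_row_view(i):
    """Dict form, as serialized and sent to the Dashboard."""
    return {name: columns[name][i] for name in column_order}

print(sample_view(0))     # (1, 4)
print(table_row_view(0))  # {'col_1': 1, 'col_2': 4}
```

Both views read from the same column storage; only the presentation differs, which is exactly the duality the cells below demonstrate with the real Table.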

[6]:
print("All rows in the table (sample view):")
for sample in table:
    print(sample)

print("First row in the table (sample view):")
print(table[0])
All rows in the table (sample view):
(1, 4)
(2, 5)
(3, 6)
First row in the table (sample view):
(1, 4)
[7]:
print("All rows in the table (table row view):")
for row in table.table_rows:
    print(row)

print("First row in the table (table row view):")
print(table.table_rows[0])
All rows in the table (table row view):
{'col_1': 1, 'col_2': 4}
{'col_1': 2, 'col_2': 5}
{'col_1': 3, 'col_2': 6}
First row in the table (table row view):
{'col_1': 1, 'col_2': 4}

Serialization is the process of converting the table to a lightweight format for storage or transmission. The result is lightweight because, whenever possible, bulk data is stored by reference, specifically through Urls. This is not to be confused with the serialization of the Table object itself, which is simply its JSON representation.
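The effect of storing bulk data by reference can be sketched as follows; the URL and field names here are hypothetical, not tlc types:

```python
# Illustrative sketch: a row can embed bulk data, or reference it by URL.
# The "image" URL below is hypothetical.
import json

row_embedded = {"label": "cat", "image": [0] * 100_000}              # pixels inlined
row_by_reference = {"label": "cat", "image": "s3://bucket/cat.png"}  # Url reference

embedded_size = len(json.dumps(row_embedded))
reference_size = len(json.dumps(row_by_reference))
print(reference_size < embedded_size)  # True: the referenced row is far smaller
```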

The serialized form of a table can be retrieved by calling get_rows_as_binary():

[8]:
serialized_bytes = table.get_rows_as_binary()
print(f"Read {len(serialized_bytes)} bytes from the table")
Read 869 bytes from the table

Row Schema#

The row_schema attribute allows the table to describe the schema of its rows, which is essential for sharing tables with the 3LC Dashboard.

{
  "display_name":"Row schema",
  "description":"Row schema for TableFromPydict",
  "sample_type":"horizontal_tuple",
  "values":{
    "col_1":{
      "display_name":"col_1",
      "sample_type":"int",
      "value":{
        "type":"int32"
      }
    },
    "col_2":{
      "display_name":"col_2",
      "sample_type":"int",
      "value":{
        "type":"int32"
      }
    }
  }
}

Row Cache#

While the actual serialization format is an implementation detail, we currently use Parquet as the default format.

If the row_cache_url attribute of a Table is set, a call to get_rows_as_binary will first check if the data is already cached at the given URL. If so, the cached data is returned. Otherwise, the data is serialized and cached at the given URL.
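The caching behavior described above follows a common check-then-populate pattern, sketched here with a local file standing in for the cache URL (plain Python, not the tlc internals):

```python
# Illustrative check-then-populate cache, mirroring the described behavior of
# get_rows_as_binary: return cached bytes if present, else serialize and cache.
import os
import tempfile

def get_rows_as_binary(rows, cache_path=None):
    if cache_path and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()            # cache hit: return the cached data
    data = repr(rows).encode()         # stand-in for Parquet serialization
    if cache_path:
        with open(cache_path, "wb") as f:
            f.write(data)              # populate the cache at the given URL
    return data

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "row_cache.bin")
    first = get_rows_as_binary([{"col_1": 1}], cache_path=path)
    second = get_rows_as_binary([{"col_1": 1}], cache_path=path)  # served from cache
    print(first == second)  # True
```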

[9]:
# Observe that the row cache url is empty and the row cache is not populated
print(f"Table row cache: {table.row_cache_url}")
print(f"Row cache is populated? {table.row_cache_populated}")

table.set_row_cache_url(tlc.Url("../row_cache.parquet"))
table.get_rows_as_binary()

# After the above call, the row cache is populated
print(f"Table row cache: {table.row_cache_url}")
print(f"Row cache is populated? {table.row_cache_populated}")

Table row cache:
Row cache is populated? False
Table row cache: ../row_cache.parquet
Row cache is populated? False

Table Operations#

Tables are immutable, which ensures that the lineage of a dataset is always kept intact and makes it safe to experiment with different dataset revisions. In practice, modifications are performed by creating new tables that derive from input tables.
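A minimal sketch of this derive-don't-mutate pattern (plain Python, not the actual EditedTable implementation):

```python
# Illustrative sketch: "adding a column" creates a derived view over an
# immutable input table rather than modifying it.
class DerivedTable:
    def __init__(self, input_rows, new_column_name, new_column_values):
        self.input_rows = input_rows  # the input data is never touched
        self.new_column_name = new_column_name
        self.new_column_values = new_column_values

    def __getitem__(self, i):
        # Merge on access; the input rows stay as they were.
        return {**self.input_rows[i], self.new_column_name: self.new_column_values[i]}

rows = [{"col_1": 1, "col_2": 4}, {"col_1": 2, "col_2": 5}]
derived = DerivedTable(rows, "new_column", [7, 8])

print(derived[0])  # {'col_1': 1, 'col_2': 4, 'new_column': 7}
print(rows[0])     # input unchanged: {'col_1': 1, 'col_2': 4}
```

The derived table only stores the new column and a reference to its input, which is what keeps the modification sparse.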

[10]:
# An example of adding a column to a table, which creates a new table inheriting from the input table
column_added_table = table.add_column("new_column", [7, 8, 9])
print(f"Created a new table of type {type(column_added_table)}")
print(f"First row of the new table: {column_added_table.table_rows[0]}")

Created a new table of type <class 'tlc.core.objects.tables.from_table.edited_table.EditedTable'>
First row of the new table: {'col_1': 1, 'col_2': 4, 'new_column': 7}

These "table operations" are the primary mechanism that enables 3LC to store and process sparse modifications to datasets.

Derived tables can also be the result of filtering, subsetting, and other transformations.

Finally, Tables are the primary mechanism by which 3LC stores metrics.

[11]:
# Creates a SubsetTable which includes each row of the input table with probability 0.75
subset_table = tlc.SubsetTable(
    input_table_url=column_added_table,
    range_factor_min=0.0,
    range_factor_max=1.0,
    include_probability=0.75,
)
print(f"Length of the input table: {len(column_added_table)}")
print(f"Length of the subset table: {len(subset_table)}")
Length of the input table: 3
Length of the subset table: 2
[12]:
# Creates a FilteredTable which includes only rows where the value in the "new_column" column is greater than 7
filtered_table = tlc.FilteredTable(
    input_table_url=column_added_table,
    filter_criterion=tlc.NumericRangeFilterCriterion(
        attribute="new_column",
        min_value=7,
        max_value=10,
    ),
)
print(f"Length of the input table: {len(column_added_table)}")
print(f"Length of the filtered table: {len(filtered_table)}")
Length of the input table: 3
Length of the filtered table: 2