Tables#

Tables are the primary way to store data in 3LC. Whether it is a dataset, an overview of your Runs, or a stream of collected metrics, data contained within a Table can be visualized, analyzed and sometimes even modified in the 3LC Dashboard.

While Tables themselves are always immutable, 3LC enables non-destructive editing of all Tables through sparsely represented Revisions. Much like a version-control system such as Git, this lets you change your data without copying it, while maintaining a full, reversible history of every change.
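As a rough illustration (plain Python, not the 3LC API), a sparse revision can be modeled as an overlay that stores only the changed rows and defers everything else to its parent:

```python
# Illustrative sketch only -- not the 3LC implementation.
# A revision stores just the rows that differ from its parent,
# so editing never copies or mutates the original data.

class Revision:
    def __init__(self, parent=None, base_rows=None):
        self.parent = parent        # previous revision (None for the root)
        self.base_rows = base_rows  # full data, only held by the root
        self.edits = {}             # sparse overrides: row index -> new row

    def with_edit(self, index, row):
        """Return a NEW revision; the current one stays immutable."""
        child = Revision(parent=self)
        child.edits = {index: row}
        return child

    def row(self, index):
        if index in self.edits:
            return self.edits[index]
        if self.parent is not None:
            return self.parent.row(index)
        return self.base_rows[index]

root = Revision(base_rows=[{"label": "cat"}, {"label": "dog"}])
rev1 = root.with_edit(1, {"label": "wolf"})

print(root.row(1))  # original data is untouched: {'label': 'dog'}
print(rev1.row(1))  # the revision sees the edit: {'label': 'wolf'}
```

Because each revision only holds its own edits, the full history stays cheap to store, and any earlier revision can still be read in full.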

3LC contains several utilities for creating Tables from popular frameworks and formats such as Pandas, PyTorch, HuggingFace and COCO. Tables can also be serialized and cached to disk as parquet files.

[2]:
import tlc
[12/05 13:58:54 tlc.core.objects.tables.from_table]: UMAPTable imported; umap-learn must be installed
[12/05 13:58:55 tlc.integration]: integration/HuggingFace imported
[12/05 13:58:55 tlc.integration]: integration/detectron2 not installed, skipping

Let’s create a simple table from a list of columns and inspect its properties.

For educational purposes, our table will be created so that it can return its data either as a tuple or as a dictionary. This highlights the duality between the "table row view" of the table and the "sample view" of the table.

[3]:
# Samples should be represented as tuples of ints when used in ML training
sample_type = tlc.SampleType.from_structure((tlc.Int("col_1"), tlc.Int("col_2")))

# TableFromPydict allows you to create a table directly from a python dictionary
table = tlc.TableFromPydict(
    data={"col_1": [1, 2, 3], "col_2": [4, 5, 6]},
    override_table_rows_schema=sample_type.schema,
)
print(table)
{
  "type":"TableFromPydict",
  "created":"2023-12-05 12:58:55.952041+00:00",
  "row_cache_populated":false,
  "override_table_rows_schema":{
    "display_name":"value",
    "sample_type":"horizontal_tuple",
    "values":{
      "col_1":{
        "display_name":"col_1",
        "sample_type":"int",
        "value":{
          "type":"int32"
        }
      },
      "col_2":{
        "display_name":"col_2",
        "sample_type":"int",
        "value":{
          "type":"int32"
        }
      }
    }
  },
  "row_count":-1,
  "input_data":{
    "col_1":[
      1,
      2,
      3
    ],
    "col_2":[
      4,
      5,
      6
    ]
  }
}
[4]:
# Print the columns of the table
print(table.columns)
['col_1', 'col_2']
[5]:
# Print the row schema of the table
print(table.row_schema)
{
  "display_name":"Row schema",
  "description":"Row schema for TableFromPydict",
  "sample_type":"horizontal_tuple",
  "values":{
    "col_1":{
      "display_name":"col_1",
      "sample_type":"int",
      "value":{
        "type":"int32"
      }
    },
    "col_2":{
      "display_name":"col_2",
      "sample_type":"int",
      "value":{
        "type":"int32"
      }
    }
  }
}

Data Access#

Accessing data within a Table can be done in several ways:

Sample View: When indexed, table[i] returns the “sample view” of the data. This format is suitable for feeding directly into a machine learning pipeline, making tlc.Table a convenient drop-in replacement for a PyTorch dataset. For more details, refer to the SampleType documentation.

Table Row View: Accessing table.table_rows[i] provides the “table view” of the data, consisting of lightweight references to the underlying data. This is the format that is serialized and sent to the Dashboard.

Iteration: Iterating over a table directly yields the “sample view” of the data.

Bulk Data Serialization: All the rows in a table can be serialized into a single bytestream using the get_rows_as_binary() instance method. This is useful for transferring or storing table data. It is the “table view” of the data that is serialized.
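To make the two views concrete, here is a plain-Python sketch (not the tlc implementation) of both views over the same underlying columns, assuming the tuple order follows the schema's column order:

```python
# Illustrative sketch of the two views -- not the tlc implementation.
columns = {"col_1": [1, 2, 3], "col_2": [4, 5, 6]}
column_order = ["col_1", "col_2"]  # assumed to follow the row schema

def sample_view(i):
    """Tuple form, suitable for feeding into an ML pipeline."""
    return tuple(columns[name][i] for name in column_order)

def table_row_view(i):
    """Dict form, as serialized and sent to the Dashboard."""
    return {name: columns[name][i] for name in column_order}

print(sample_view(0))     # (1, 4)
print(table_row_view(0))  # {'col_1': 1, 'col_2': 4}
```

Both views read from the same column storage; only the presentation differs, which is exactly the duality the cells below demonstrate with the real Table.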

[6]:
print("All rows in the table (sample view):")
for sample in table:
    print(sample)

print("First row in the table (sample view):")
print(table[0])
All rows in the table (sample view):
(1, 4)
(2, 5)
(3, 6)
First row in the table (sample view):
(1, 4)
[7]:
print("All rows in the table (table row view):")
for row in table.table_rows:
    print(row)

print("First row in the table (table row view):")
print(table.table_rows[0])
All rows in the table (table row view):
{'col_1': 1, 'col_2': 4}
{'col_1': 2, 'col_2': 5}
{'col_1': 3, 'col_2': 6}
First row in the table (table row view):
{'col_1': 1, 'col_2': 4}

Serialization is the process of converting the table to a lightweight format for storage or transmission. The result is lightweight because, whenever possible, bulk data is stored by reference, specifically through Urls. This is not to be confused with the serialization of the Table object itself, which is simply its JSON representation.
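The effect of storing bulk data by reference can be sketched as follows; the URL and field names here are hypothetical, not tlc types:

```python
# Illustrative sketch: a row can embed bulk data, or reference it by URL.
# The "image" URL below is hypothetical.
import json

row_embedded = {"label": "cat", "image": [0] * 100_000}              # pixels inlined
row_by_reference = {"label": "cat", "image": "s3://bucket/cat.png"}  # Url reference

embedded_size = len(json.dumps(row_embedded))
reference_size = len(json.dumps(row_by_reference))
print(reference_size < embedded_size)  # True: the referenced row is far smaller
```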

The serialized form of a table can be retrieved by calling get_rows_as_binary():

[8]:
serialized_bytes = table.get_rows_as_binary()
print(f"Read {len(serialized_bytes)} bytes from the table")
Read 869 bytes from the table

Row Schema#

The row_schema attribute allows the table to describe the schema of its rows, which is essential for sharing tables with the 3LC Dashboard.

{
  "display_name":"Row schema",
  "description":"Row schema for TableFromPydict",
  "sample_type":"horizontal_tuple",
  "values":{
    "col_1":{
      "display_name":"col_1",
      "sample_type":"int",
      "value":{
        "type":"int32"
      }
    },
    "col_2":{
      "display_name":"col_2",
      "sample_type":"int",
      "value":{
        "type":"int32"
      }
    }
  }
}

Row Cache#

While the actual serialization format is an implementation detail, we currently use Parquet as the default format.

If the row_cache_url attribute of a Table is set, a call to get_rows_as_binary will first check if the data is already cached at the given URL. If so, the cached data is returned. Otherwise, the data is serialized and cached at the given URL.
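The caching behavior described above follows a common check-then-populate pattern, sketched here with a local file standing in for the cache URL (plain Python, not the tlc internals):

```python
# Illustrative check-then-populate cache, mirroring the described behavior of
# get_rows_as_binary: return cached bytes if present, else serialize and cache.
import os
import tempfile

def get_rows_as_binary(rows, cache_path=None):
    if cache_path and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()            # cache hit: return the cached data
    data = repr(rows).encode()         # stand-in for Parquet serialization
    if cache_path:
        with open(cache_path, "wb") as f:
            f.write(data)              # populate the cache at the given URL
    return data

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "row_cache.bin")
    first = get_rows_as_binary([{"col_1": 1}], cache_path=path)
    second = get_rows_as_binary([{"col_1": 1}], cache_path=path)  # served from cache
    print(first == second)  # True
```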

[9]:
# Observe that the row cache url is empty and the row cache is not populated
print(f"Table row cache: {table.row_cache_url}")
print(f"Row cache is populated? {table.row_cache_populated}")

table.set_row_cache_url(tlc.Url("../row_cache.parquet"))
table.get_rows_as_binary()

# After the above call, the row cache is populated
print(f"Table row cache: {table.row_cache_url}")
print(f"Row cache is populated? {table.row_cache_populated}")

Table row cache:
Row cache is populated? False
Table row cache: ../row_cache.parquet
Row cache is populated? False

Table Operations#

Tables are immutable, which ensures that the lineage of a dataset is always kept intact and makes it safe to experiment with different dataset revisions. In practice, modifications are performed by creating new tables that derive from input tables.
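A minimal sketch of this derive-don't-mutate pattern (plain Python, not the actual EditedTable implementation):

```python
# Illustrative sketch: "adding a column" creates a derived view over an
# immutable input table rather than modifying it.
class DerivedTable:
    def __init__(self, input_rows, new_column_name, new_column_values):
        self.input_rows = input_rows  # the input data is never touched
        self.new_column_name = new_column_name
        self.new_column_values = new_column_values

    def __getitem__(self, i):
        # Merge on access; the input rows stay as they were.
        return {**self.input_rows[i], self.new_column_name: self.new_column_values[i]}

rows = [{"col_1": 1, "col_2": 4}, {"col_1": 2, "col_2": 5}]
derived = DerivedTable(rows, "new_column", [7, 8])

print(derived[0])  # {'col_1': 1, 'col_2': 4, 'new_column': 7}
print(rows[0])     # input unchanged: {'col_1': 1, 'col_2': 4}
```

The derived table only stores the new column and a reference to its input, which is what keeps the modification sparse.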

[10]:
# An example of adding a column to a table, which creates a new table inheriting from the input table
column_added_table = table.add_column("new_column", [7, 8, 9])
print(f"Created a new table of type {type(column_added_table)}")
print(f"First row of the new table: {column_added_table.table_rows[0]}")

Created a new table of type <class 'tlc.core.objects.tables.from_table.edited_table.EditedTable'>
First row of the new table: {'col_1': 1, 'col_2': 4, 'new_column': 7}

These "table operations" are the primary mechanism that enables 3LC to store and process sparse modifications to datasets.

Derived tables can also be the result of filtering, subsetting, and other transformations.

Finally, Tables are the primary mechanism by which 3LC stores metrics.

[11]:
# Creates a SubsetTable which includes each row of the input table with probability 0.75
subset_table = tlc.SubsetTable(
    input_table_url=column_added_table,
    range_factor_min=0.0,
    range_factor_max=1.0,
    include_probability=0.75,
)
print(f"Length of the input table: {len(column_added_table)}")
print(f"Length of the subset table: {len(subset_table)}")
Length of the input table: 3
Length of the subset table: 2
[12]:
# Creates a FilteredTable which includes only rows where the value in the "new_column" column is greater than 7
filtered_table = tlc.FilteredTable(
    input_table_url=column_added_table,
    filter_criterion=tlc.NumericRangeFilterCriterion(
        attribute="new_column",
        min_value=7,
        max_value=10,
    ),
)
print(f"Length of the input table: {len(column_added_table)}")
print(f"Length of the filtered table: {len(filtered_table)}")
Length of the input table: 3
Length of the filtered table: 2