Tables¶
Introduction¶
Tables are the primary way to store tabular data in 3LC. There are many different types of Tables, each with their own unique features and capabilities. When you create a Table, you are telling the 3LC system how to interpret and interact with your data. This allows you to easily visualize, analyze, and modify your data in the 3LC Dashboard. Crucially, this allows any modifications made in the Dashboard to be consumed directly in your Python code.
While Tables themselves are always immutable, 3LC enables non-destructive editing of Tables through sparsely represented Table Revisions. Similar to a versioning system like Git, this allows you to make changes to your data without having to copy it, and at the same time have a full, reversible history of all changes made.
3LC contains several utilities for creating Tables from popular frameworks and formats such as Pandas, PyTorch, HuggingFace, and COCO.
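For instance, a Table can be created directly from a Pandas DataFrame. A minimal sketch, assuming the from_pandas constructor accepts a DataFrame and a table name (the names used here are illustrative):
[ ]:
import pandas as pd
import tlc

# Wrap an in-memory DataFrame in a 3LC Table (names here are illustrative)
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})
df_table = tlc.Table.from_pandas(df, table_name="from_dataframe")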
Data Access¶
Accessing data in a Table can be done in several ways:
Sample View: When indexed, table[i] returns the “sample view” of the data. This format is suitable for feeding directly into a machine learning pipeline, making tlc.Table a drop-in replacement for a PyTorch dataset. For more details, refer to the SampleType documentation.
Table Row View: Accessing table.table_rows[i] provides the “table row view” of the data, which contains only lightweight references to external data such as images and files. This is the format that is serialized and sent to the Dashboard.
Iteration: Iterating over a table directly yields the “sample view” of the data, as defined above.
Bulk Data Serialization: All the rows in a table can be serialized into a single bytestream using the get_rows_as_binary() method. This is useful for transferring or storing table data. It is the “table row view” of the data that is serialized.
Serialization is the process of converting the table to a lightweight format for storage or transmission. The format is lightweight because, whenever possible, bulk data is stored by reference through Urls. This should not be confused with the serialization of the Table object itself, which is simply its JSON representation.
Examples¶
Create a Table¶
Let’s create a simple table from a list of columns and inspect its properties.
For educational purposes, our table will be created so that it can return its data either as a tuple or as a dictionary. This highlights the duality between the “table row view” and the “sample view” of the table.
[47]:
import tlc
# Samples should be represented as tuples of ints when used in ML training
sample_structure = (tlc.Int("col_1"), tlc.Int("col_2"))
# Table.from_dict allows you to create a table directly from a Python dictionary
table = tlc.Table.from_dict(
    data={"col_1": [1, 2, 3], "col_2": [4, 5, 6]},
    structure=sample_structure,
    table_name="sample_table",
)
Access a row by index¶
[48]:
print(table[0])
(1, 4)
Iterate over the rows in the table (sample view)¶
[49]:
for sample in table:
    print(sample)
(1, 4)
(2, 5)
(3, 6)
Access a row by index (table row view)¶
[50]:
table.table_rows[0]
[50]:
ImmutableDict({'col_1': 1, 'col_2': 4, 'weight': 1.0})
Iterate over the rows in the table (table row view)¶
[51]:
for row in table.table_rows:
    print(row.keys())
dict_keys(['col_1', 'col_2', 'weight'])
dict_keys(['col_1', 'col_2', 'weight'])
dict_keys(['col_1', 'col_2', 'weight'])
Serialization¶
[52]:
# Serialize all rows of the table into a single bytestream
table_bytes = table.get_rows_as_binary()
print(f"Read {len(table_bytes)} bytes from the table")
Read 1195 bytes from the table
Row Schema¶
The schema of the rows attribute of a table is essential for describing how tables are shared with the 3LC Dashboard, and how the table data is represented in the sample view. The row schema can be accessed through the rows_schema attribute of the table.
[53]:
table.rows_schema
[53]:
{
"display_name":"Rows",
"description":"Schema describing the column layout of the rows within this table",
"sample_type":"horizontal_tuple",
"size0":{
"type":"int32",
"min":0,
"max":1000000000,
"enforce_min":true,
"number_role":"table_row_index",
"display_name":"Table row"
},
"values":{
"col_1":{
"sample_type":"int",
"value":{
"type":"int32"
}
},
"col_2":{
"sample_type":"int",
"value":{
"type":"int32"
}
},
"weight":{
"display_name":"Weight",
"description":"The weights of the samples in this table.",
"sample_type":"hidden",
"default_visible":false,
"value":{
"default_value":1.0,
"min":0,
"enforce_min":true,
"number_role":"sample_weight"
}
}
}
}
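Individual column schemas can also be inspected programmatically. A small sketch, assuming the Schema object exposes the sub-schemas shown above through its values attribute:
[ ]:
# Look up the schema of a single column by name (assumes a `values` mapping
# of column names to sub-schemas, as suggested by the output above)
col_1_schema = table.rows_schema.values["col_1"]
print(col_1_schema.sample_type)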
Row Cache¶
While the actual serialization format is an implementation detail, we currently use Parquet as the default format.
If the row_cache_url attribute of a Table is set, a call to get_rows_as_binary will first check whether the data is already cached at the given URL. If so, the cached data is returned. Otherwise, the data is serialized and cached at the given URL.
[54]:
# Observe that the row cache url is empty and the row cache is not populated
print(f"Table row cache: {table.row_cache_url}")
print(f"Row cache is populated? {table.row_cache_populated}")
table.set_row_cache_url(tlc.Url("../row_cache.parquet"))
table.get_rows_as_binary()
# After the above call, the row cache is populated
print(f"Table row cache: {table.row_cache_url}")
print(f"Row cache is populated? {table.row_cache_populated}")
Table row cache: ../row_cache.parquet
Row cache is populated? True
Table row cache: ../row_cache.parquet
Row cache is populated? True
Table Operations¶
Tables are immutable. This ensures that the lineage of a dataset is always kept intact, which is important for experimenting with different dataset revisions. In practice, modifications are performed by creating new tables that derive from input tables.
[55]:
# An example of adding a column to a table, which creates a new table inheriting from the input table
column_added_table = table.add_column("new_column", [7, 8, 9])
These “table operations” are the primary mechanism that enables 3LC to store and process sparse modifications to datasets.
Derived tables can also be the result of filtering, subsetting, and other transformations.
Finally, Tables are the primary mechanism by which 3LC stores metrics.
[56]:
# Creates a SubsetTable which includes each row of the input table with 75% probability
subset_table = tlc.SubsetTable(
    input_table_url=column_added_table,
    range_factor_min=0.0,
    range_factor_max=1.0,
    include_probability=0.75,
)
[57]:
# Creates a FilteredTable which includes only rows where the value in the "new_column" column is between 7 and 10
filtered_table = tlc.FilteredTable(
    input_table_url=column_added_table,
    filter_criterion=tlc.NumericRangeFilterCriterion(
        attribute="new_column",
        min_value=7,
        max_value=10,
    ),
)
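Derived tables behave like any other table, so the result of a filter or subset can be inspected with the same access patterns described earlier:
[ ]:
# Inspect the rows that survived the filter
for row in filtered_table.table_rows:
    print(row)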
Table Revisions¶
3LC maintains an index over all known projects and objects, including tables. This index is kept up to date by a background indexing system. The most up-to-date revision of any given dataset can be queried with the Table.latest method.
[ ]:
from tlc import Table
initial_url = "..."
table = Table.from_url(initial_url)
latest_rev = table.latest() # make sure the latest revision is in use
Using Table.latest() in Your Code¶
The Table.latest() method is particularly important when working with evolving datasets, especially during iterative training workflows. When you make changes to your data in the Dashboard (such as adjusting sample weights, adding labels, or filtering data), these changes create new table revisions, and training should (almost) always be performed on the most up-to-date revision.
[ ]:
import tlc
# Load a table and ensure you're working with the most recent version
table = tlc.Table.from_names(project_name="my_project", dataset_name="training_data", table_name="images")
# Get the latest revision (includes any Dashboard edits)
latest_table = table.latest()
# Use in training - this will include the most recent weights/filters
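Since a Table yields the sample view when indexed and iterated, the latest revision can be passed straight to a data loader. A minimal sketch, assuming a PyTorch setup (the batch size and training step are illustrative):
[ ]:
from torch.utils.data import DataLoader

# The Table acts as a map-style dataset, so it can back a DataLoader directly
train_loader = DataLoader(latest_table, batch_size=32, shuffle=True)

for batch in train_loader:
    ...  # run a training step on the most recent revision of the data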
The Table.latest method relies on the 3LC indexing system to track relationships between different versions of your files.