tlc.core.objects.table#
The abstract base class for all Table types.
Module Contents#
Classes#
- ImmutableDict: An immutable access interface to a nested dictionary representing a TableRow.
- TableRows: An immutable access interface to the rows of a Table object.
- Table: The abstract base class for all Table types.
Functions#
- Sort a list of tables chronologically.
- Create a copy of this table where all lineage is squashed.
Data#
- TableRow: Generic type for a row of a table.
API#
- tlc.core.objects.table.TableRow = None#
Generic type for a row of a table.
- class tlc.core.objects.table.ImmutableDict(*args: Any, **kwargs: Any)#
Bases: typing.Dict[str, object]
An immutable access interface to a nested dictionary representing a TableRow.
This class is used to make access to table rows immutable, and to provide a consistent interface for accessing nested column data.
- class tlc.core.objects.table.TableRows(table: tlc.core.objects.table.Table)#
An immutable access interface to the rows of a Table object.
- class tlc.core.objects.table.Table(url: tlc.core.url.Url | None = None, created: str | None = None, description: str | None = None, row_cache_url: tlc.core.url.Url | None = None, row_cache_populated: bool | None = None, override_table_rows_schema: Any = None, init_parameters: Any = None, input_tables: list[tlc.core.url.Url] | None = None)#
Bases:
tlc.core.object.Object
The abstract base class for all Table types.
A Table is an object with two specific responsibilities:
- Creating table rows on demand, either through the row-based access interface `table_rows` or through the sample-based access interface provided by `__getitem__`.
- Creating a schema which describes the type of the produced rows (through the `rows_schema` property).
Both types of produced data are determined by immutable properties defined by each particular Table type.
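The two responsibilities above can be illustrated with a minimal stand-in class. The names mirror the documented interface, but this is a sketch for illustration, not the actual tlc implementation:

```python
# Minimal stand-in for the dual access pattern: row-based access via
# table_rows, sample-based access via __getitem__, plus a rows schema.
# This mirrors the documented interface only; it is not tlc's Table.

class MiniTable:
    def __init__(self, values):
        self._values = values  # the immutable "recipe" of this toy table

    @property
    def rows_schema(self):
        # The real property returns a tlc Schema object; a dict stands in.
        return {"value": "int"}

    @property
    def table_rows(self):
        # Row-based access: rows are created on demand, in row representation.
        return tuple({"value": v} for v in self._values)

    def __getitem__(self, index):
        # Sample-based access: the same data in sample representation.
        return self._values[index]

t = MiniTable([1, 2, 3])
print(t.table_rows[0])  # {'value': 1}
print(t[0])             # 1
```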
ALTERNATIVE INTERFACE/CACHING:
A full representation of all table rows can, for performance reasons, also be retrieved through the `get_rows_as_binary` method.
This method will try to retrieve a cached version of the table rows if `row_cache_url` is non-empty AND `row_cache_populated` is True.
When this is the case, it is guaranteed that the `schema` property of the table is fully populated, including the nested 'rows_schema' property which defines the layout of all table rows.
When this cached version is NOT defined, however, `get_rows_as_binary` needs to iterate over all rows to produce the data.
If `row_cache_url` is non-empty, the produced binary data will be cached to the specified location. After successful caching, the updated Table object will be written to its backing URL exactly once, now with 'row_cache_populated' set to True and with the schema fully updated. Also, the `row_count` property is guaranteed to be correct at this time.
Whether accessing data from a Table object later refers to this cached version (or produces the data itself) is implementation specific.
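The caching flow described above can be sketched as follows. The class and attribute handling are illustrative assumptions, not the actual implementation:

```python
# Hedged sketch of the row-cache behavior: fast path when a populated cache
# exists, slow path (producing all rows) otherwise, with write-once caching.

class CacheDemo:
    def __init__(self, row_cache_url=""):
        self.row_cache_url = row_cache_url
        self.row_cache_populated = False
        self._cache = None
        self.produced = 0  # counts slow-path row productions

    def _produce_rows(self):
        self.produced += 1
        return b"parquet-bytes"  # stands in for a binary Parquet buffer

    def get_rows_as_binary(self):
        # Fast path: non-empty cache url AND populated flag set.
        if self.row_cache_url and self.row_cache_populated:
            return self._cache
        # Slow path: iterate over all rows to produce the data.
        data = self._produce_rows()
        if self.row_cache_url:
            # Cache the produced bytes and mark the table as populated;
            # the real Table also rewrites its backing JSON exactly once here.
            self._cache = data
            self.row_cache_populated = True
        return data

t = CacheDemo(row_cache_url="cache.parquet")
t.get_rows_as_binary()  # slow path, populates the cache
t.get_rows_as_binary()  # fast path, served from the cache
print(t.produced)       # 1
```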
STATE MUTABILITY:
As described above, Tables are constrained in how they are allowed to change state:
The data production parameters (“recipe”) of a table are immutable
The persisted JSON representation of a Table (e.g. on disk) can take on three different states, and each state can be written only once:
Bare-bones recipe
Bare-bones recipe + full schema + ‘row_count’ (‘row_cache_populated’ = False)
Bare-bones recipe + full schema + ‘row_count’ (‘row_cache_populated’ = True)
- Parameters:
url – The URL of the table.
created – The creation time of the table.
description – The description of the table.
row_cache_url – The URL of the row cache.
row_cache_populated – Whether the row cache is populated.
override_table_rows_schema – The schema to override the table rows schema.
init_parameters – The initial parameters of the table.
input_tables – A list of Table URLs that are considered direct predecessors in this table’s lineage. This parameter serves as an explicit mechanism for tracking table relationships beyond the automatic lineage tracing typically managed by subclasses.
- copy(table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, rename, overwrite] = 'raise', *, destination_url: tlc.core.url.Url | None = None) tlc.core.objects.table.Table #
Create a copy of this table.
The copy is performed to:
- A URL derived from the given table_name, dataset_name, project_name, and root_url, if given
- destination_url, if given
- A generated URL derived from the table's URL, if none of the above are given
- Parameters:
table_name – The name of the table to copy to.
dataset_name – The name of the dataset to copy to.
project_name – The name of the project to copy to.
root_url – The root URL to copy to.
if_exists – The behavior to use if the destination URL already exists.
destination_url – The URL to copy the table to.
- Returns:
The copied table.
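The destination precedence might be sketched like this. The URL-joining and the generated-URL suffix are simplifications and assumptions, not the actual tlc URL logic:

```python
# Illustrative precedence for choosing a copy destination:
# 1) name components, 2) explicit destination_url, 3) generated from the
# table's own URL. Joining with "/" is a simplification.

def resolve_destination(table_url, table_name=None, dataset_name=None,
                        project_name=None, root_url=None, destination_url=None):
    if any([table_name, dataset_name, project_name, root_url]):
        parts = [root_url or "root", project_name or "project",
                 dataset_name or "dataset", table_name or "table"]
        return "/".join(parts)
    if destination_url:
        return destination_url
    return table_url + "_copy"  # assumed fallback derivation

print(resolve_destination("proj/ds/t1", table_name="t2"))
print(resolve_destination("proj/ds/t1", destination_url="s3://bucket/t2"))
print(resolve_destination("proj/ds/t1"))
```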
- ensure_dependent_properties() None #
Ensure that the table has set row_count, as required to reach the fully defined state.
- ensure_data_production_is_ready() None #
A method that ensures that the table is ready to produce data.
This method is called before any access to the Table's data is made. It is used to ensure that the Table has performed any necessary data production steps. Normally, Tables don't produce data until it is requested, but this method can be called to force data production.
Note that subsequent applications of this method will not change the data, as a Table is immutable.
- collection_mode() Iterator[None] #
Enable metrics-collection mode on the Table.
When metrics-collection mode is enabled, only maps defined by calls to `map_collect_metrics()` are applied to the table rows.
- property row_schema: tlc.core.schema.Schema#
Returns the schema for a single row of this table.
- property rows_schema: tlc.core.schema.Schema#
Returns the schema for all rows of this table.
- property table_rows: tlc.core.objects.table.TableRows#
Access the rows of this table as an immutable mapping.
- set_row_cache_url(row_cache_url: tlc.core.url.Url | str) bool #
Assign a new row_cache_url value.
Will set row_cache_populated to False if the cache file has changed.
- Parameters:
row_cache_url – The new row_cache_url value.
- Returns:
True if the row_cache_url value was changed, False otherwise.
- static transform_value(schema: tlc.core.schema.Schema | None, item: object) object #
Transform a single table value according to the schema.
3LC currently only uses pure string representations of datetime values. This helper function is used to convert any timestamps to strings.
- Parameters:
schema – The schema corresponding to the column of the value.
item – The value to transform.
- write_to_row_cache(create_url_if_empty: bool = False, overwrite_if_exists: bool = True) None #
Cache the table rows to the row cache Url.
If the table is already cached, or the Url of the Table is an API-Url, this method does nothing.
If self.row_cache_url is empty, a new Url will be created and assigned to self.row_cache_url when create_url_if_empty is True; otherwise a ValueError will be raised.
- Parameters:
create_url_if_empty – Whether to create a new row cache Url if self.row_cache_url is empty.
overwrite_if_exists – Whether to overwrite the row cache file if it already exists.
- get_rows_as_binary(exclude_bulk_data: bool = False) bytes #
Return all rows of the table as a binary Parquet buffer
This method will return the ‘Table-representation’ of the table, which is the most efficient representation, since only references to the input data are stored.
- Parameters:
exclude_bulk_data – Whether to exclude bulk data columns from the serialized rows. This filter only applies to Tables that are fully cached on disk.
- Returns:
The rows of the table as a binary Parquet buffer.
- should_include_schema_in_json(schema: tlc.core.schema.Schema) bool #
Only include the schema in the JSON representation if it is not empty.
- latest(use_new_columns: bool = True, wait_for_rescan: bool = True, timeout: float | None = None) tlc.core.objects.table.Table #
Return the most recent version of the table, as indexed by the TableIndexingTable indexing mechanism.
This function retrieves the latest version of this table that has been indexed or exists in the ObjectRegistry. If desired it is possible to wait for the next indexing run to complete by setting wait_for_rescan to True together with a timeout in seconds.
- Example:
  table_instance = Table()
  ...  # work with the table
  latest_table = table_instance.latest()
- Parameters:
use_new_columns – If new columns have been added to the latest revision of the Table, whether to include these values in the sample-view of the Table. Defaults to True.
wait_for_rescan – Whether to rescan the TableIndexingTable (lineage) before trying to resolve the latest revision. Defaults to True.
timeout – The timeout in seconds to block when waiting for the next indexing run to complete. Defaults to None, meaning that the wait can continue indefinitely.
- Returns:
The latest version of the table.
- Raises:
ValueError – If the latest version of the table cannot be found in the dataset or if an error occurs when attempting to create an object from the latest Url.
- revision(tag: Literal[latest] | None = None, table_url: tlc.core.url.Url | str = '', table_name: str = '') tlc.core.objects.table.Table #
Return a specific revision of the table.
This function retrieves a specific revision of this table. The revision can be specified by tag, table_url, or table_name. If no arguments are provided, the current table is returned.
- Parameters:
tag – The tag of the revision to return. Currently only ‘latest’ is supported.
table_url – The URL of the revision to return.
table_name – The name of the revision to return.
- squash(output_url: tlc.core.url.Url, dataset_name: str | None = None, project_name: str | None = None) tlc.core.objects.table.Table #
Create a copy of this table where all lineage is squashed.
A squashed table is a table where all lineage is merged. This is useful for creating a table that is independent of its parent tables. This function creates a new table with the same rows as the original table, but with no lineage. The new table is written to the output url.
- Example:
  table = Table()
  ...  # work with the table
  squashed_table = table.squash(Url("s3://bucket/path/to/table"), dataset_name="my_dataset_v2")
- Parameters:
output_url – The output url for the squashed table.
dataset_name – The dataset name to use for the squashed table. If not provided, the dataset_name of the original table is used.
project_name – The project name to use for the squashed table. If not provided, the project_name of the original table is used.
- Returns:
The squashed table.
- property pyarrow_schema: pyarrow.Schema | None#
Returns a pyarrow schema for this table
- property bulk_data_url: tlc.core.url.Url#
Return the bulk data url for this table.
The bulk data url is the url to the folder containing any bulk data for this table.
- to_pandas() pandas.DataFrame #
Return a pandas DataFrame for this table.
- Returns:
A pandas DataFrame populated from the rows of this table.
- add_column(column_name: str, values: list[object] | object, schema: tlc.core.schema.Schema | None = None, url: tlc.core.url.Url | None = None) tlc.core.objects.table.Table #
Create a derived table with a column added.
This method creates and returns a new revision of the table with a new column added.
- Parameters:
column_name – The name of the column to add.
values – The values to add to the column. This can be a list of values, or a single value to be added to all rows.
schema – The schema of the column to add. If not provided, the schema will be inferred from the values.
url – The url to write the new table to. If not provided, the new table will be located next to the current table.
- Returns:
A new table with the column added.
- set_value_map(value_path: str, value_map: dict[float, Any], *, edited_table_url: tlc.core.url.Url | str = '') tlc.core.objects.table.Table #
Set a value map for a specified numeric value within the schema of the Table.
This method creates and returns a new revision of the table with an overridden value map for a specific numeric value.
Any item in a `Schema` of type `NumericValue` can have a value map. A value map is a mapping from a numeric value to a `MapElement`, where a `MapElement` contains metadata about a categorical value such as category names and IDs.
Partial Value Maps:
Value maps may be partial, i.e. they may only contain a mapping for a subset of the possible numeric values. The keys may also be floating point values, which can be useful for annotating continuous variables with categorical metadata, such as color or label.
For more fine-grained control over value map editing, see `Table.set_value_map_item`, `Table.add_value_map_item`, and `Table.delete_value_map_item`.
- Parameters:
value_path – The path to the value to set the value map on. Can be the name of a column, or a dot-separated path to a sub-value in a composite column.
value_map – The value map to set on the value. The value will be converted to a dictionary mapping from floating point values to `MapElement` if it is not already.
edited_table_url – The url of the edited table. If not provided, the new table will be located next to the current table.
- Returns:
A new table with the value map set.
- Raises:
ValueError – If the value path does not exist or is not a `NumericValue`.
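A value map is plain data, so its shape can be sketched with dictionaries standing in for `MapElement` (per the `value_map` parameter description, the real API converts such input for you):

```python
# Value maps map numeric values to categorical metadata. Plain dicts stand
# in for tlc's MapElement objects here.

value_map = {
    0.0: {"internal_name": "cat", "display_name": "Cat"},
    1.0: {"internal_name": "dog", "display_name": "Dog"},
}

# Value maps may be partial and may use floating point keys, e.g. to
# annotate a continuous variable at a single value:
threshold_map = {0.5: {"internal_name": "decision_threshold"}}

label_for = {k: v["internal_name"] for k, v in value_map.items()}
print(label_for[0.0])  # cat
```

With a real table, such a dictionary would be passed as e.g. `table.set_value_map("label", value_map)` following the signature above; the column name "label" is hypothetical.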
- delete_value_map(value_path: str, *, edited_table_url: tlc.core.url.Url | str = '') tlc.core.objects.table.Table #
Delete a value map for a specified numeric value within the schema of the Table.
This method creates and returns a new revision of the Table with a deleted value map for a specific numeric value.
- Parameters:
value_path – The path to the value to delete the value map from. Can be the name of a column, or a dot-separated path to a sub-value in a composite column.
edited_table_url – The url of the edited table. If not provided, the new table will be located next to the current table.
- Returns:
A new table with the value map deleted.
- Raises:
ValueError – If the value path does not exist or is not a `NumericValue`.
- set_value_map_item(value_path: str, value: float | int, internal_name: str, display_name: str = '', description: str = '', display_color: str = '', url: tlc.core.url.Url | str = '', *, edited_table_url: tlc.core.url.Url | str = '') tlc.core.objects.table.Table #
Update an existing value map item for a specified numeric value within the schema of the Table.
This method creates and returns a new revision of the table with a value map item added to a value in a column.
- Example:
  table = Table.from_url("cats-and-dogs")
  new_table = table.set_value_map_item("label", 0, "cat")
  # new_table is now a new revision of the table with an updated value map item for the value 0 in the column
  assert table.latest() == new_table, "The new table is the latest revision of the table."
To add a new value map item at the next available value in the value map, see `Table.add_value_map_item`. To delete a value map item, see `Table.delete_value_map_item`.
- Parameters:
value_path – The path to the value to add the value map item to. Can be the name of a column, or a dot-separated path to a sub-value in a composite column.
value – The numeric value to add the value map item to. If the value already exists, the value map item will be updated.
internal_name – The internal name of the value map item. This is the primary identifier of the value map item.
display_name – The display name of the value map item.
description – The description of the value map item.
display_color – The display color of the value map item.
url – The url of the value map item.
edited_table_url – The url of the edited table. If not provided, the new table will be located next to the current table.
- Raises:
ValueError – If the value path does not exist or is not a `NumericValue`.
- add_value_map_item(value_path: str, internal_name: str, display_name: str = '', description: str = '', display_color: str = '', url: tlc.core.url.Url | str = '', *, value: float | int | None = None, edited_table_url: tlc.core.url.Url | str = '') tlc.core.objects.table.Table #
Add a value map item for a specified numeric value within the schema of the Table.
Adds a new value map item to the schema of the Table without overwriting existing items.
If the specified value or internal name already exists in the value map, this method will raise an error to prevent overwriting.
For more details on value maps, refer to the documentation for `Table.set_value_map`.
- Parameters:
value_path – The path to the value to add the value map item to. Can be the name of a column, or a dot-separated path to a sub-value in a composite column.
internal_name – The internal name of the value map item. This is the primary identifier of the value map item.
display_name – The display name of the value map item.
description – The description of the value map item.
display_color – The display color of the value map item.
url – The url of the value map item.
value – The numeric value to add the value map item to. If not provided, the value will be the next available value in the value map (starting from 0).
edited_table_url – The url of the edited table. If not provided, the new table will be located next to the current table.
- Returns:
A new table with the value map item added.
- Raises:
ValueError – If the value path does not exist or is not a `NumericValue`, or if the value or internal name already exists in the value map.
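The "next available value in the value map (starting from 0)" rule can be read as the smallest non-negative integer value not already mapped. This sketch implements that reading, which is an assumption since the docs do not spell out the search order:

```python
# Assumed reading of "next available value (starting from 0)": the first
# non-negative integer value not already present in the map.

def next_available_value(value_map):
    value = 0.0
    while value in value_map:
        value += 1.0
    return value

print(next_available_value({0.0: "cat", 1.0: "dog"}))  # 2.0
print(next_available_value({}))                        # 0.0
```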
- delete_value_map_item(value_path: str, *, value: float | int | None = None, internal_name: str = '', edited_table_url: tlc.core.url.Url | str = '') tlc.core.objects.table.Table #
Delete a value map item for a specified numeric value within the schema of the Table.
Deletes a value map item from the schema of the Table, by numeric value or internal name.
For more details on value maps, refer to the documentation for `Table.set_value_map`.
- Parameters:
value_path – The path to the value to delete the value map item from. Can be the name of a column, or a dot-separated path to a sub-value in a composite column.
value – The numeric value of the value map item to delete. If not provided, the value map item will be deleted by internal name.
internal_name – The internal name of the value map item to delete. If not provided, the value map item will be deleted by numeric value.
edited_table_url – The url of the edited table. If not provided, the new table will be located next to the current table.
- Returns:
A new table with the value map item deleted.
- Raises:
ValueError – If the value path does not exist or is not a `NumericValue`, or if the value or internal name does not exist in the value map.
- get_value_map(value_path: str) dict[float, tlc.core.schema.MapElement] | None #
Get the value map for a value path.
- Parameters:
value_path – The path to the value to get the value map for. Can be the name of a column, or a dot-separated path to a sub-value in a composite column.
- Returns:
A value map for the value, or None if the value does not exist or does not have a value map.
- export(output_url: tlc.core.url.Url | str | pathlib.Path, format: str | None = None, weight_threshold: float = 0.0, **kwargs: object) None #
Export this table to the given output url.
- Parameters:
output_url – The output url to export to.
format – The format to export to. If not provided, the format will be inferred from the table and the output url.
weight_threshold – The weight threshold to use for exporting. If the table has a weights column, rows with a weight below this threshold will be excluded from the export.
kwargs – Additional arguments to pass to the exporter. Which arguments are valid depends on the format. See the documentation for the subclasses of Exporter for more information.
- is_descendant_of(other: tlc.core.objects.table.Table) bool #
Return True if this table is a descendant of the provided table.
- Parameters:
other – The table to check if this table is a descendant of.
- Returns:
True if this table is a descendant of the provided table, False otherwise.
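A descendant check over explicit `input_tables` lineage links might look like the following. The dict-based representation and the traversal are assumptions for illustration, not tlc internals:

```python
# Hypothetical descendant check: walk the input_tables links of a table
# transitively, looking for the other table's url.

def is_descendant_of(table, other, tables_by_url):
    stack = list(table["input_tables"])
    while stack:
        url = stack.pop()
        if url == other["url"]:
            return True
        stack.extend(tables_by_url[url]["input_tables"])
    return False

a = {"url": "a", "input_tables": []}
b = {"url": "b", "input_tables": ["a"]}
c = {"url": "c", "input_tables": ["b"]}
tables = {t["url"]: t for t in (a, b, c)}
print(is_descendant_of(c, a, tables))  # True
print(is_descendant_of(a, c, tables))  # False
```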
- get_foreign_table_url(column: str = FOREIGN_TABLE_ID) tlc.core.url.Url | None #
Return the input table URL referenced by this table.
This method is intended for tables that reference a single input table. Typically, this would be a metrics table of per-example metrics collected using another table.
If the table contains a column named 'input_table_id' with a value map indicating that it references an input table by Url, this method returns the Url of that input table.
- Parameters:
column – The name of the column to check for a foreign key.
- Returns:
The URL of the foreign table, or None if no input table is found.
- property weights_column_name: str | None#
Return the name of the column containing the weights for this table, or None if no such column exists.
- create_sampler(exclude_zero_weights: bool = True, weighted: bool = True, shuffle: bool = True, repeat_by_weight: bool = False) torch.utils.data.sampler.Sampler[int] #
Returns a sampler based on the weights column of the table. The type and behavior of the returned Sampler also depends on the values of the argument flags.
- Parameters:
exclude_zero_weights – If True, rows with a weight of zero will be excluded from the sampler. This is useful for reducing the length of the sampler for datasets with zero-weighted samples, and thus the length of an epoch when using a PyTorch DataLoader.
weighted – If True, the sampler will use sample weights (beyond the exclusion of zero-weighted rows) to ensure that the distribution of the sampled rows matches the distribution of the weights. When `weighted` is set to True, you are no longer guaranteed that every row in the table will be sampled in a single epoch, even if all weights are equal.
shuffle – If False, the valid indices will be returned in sequential order. A value of False is mutually exclusive with the `weighted` flag.
repeat_by_weight – If True, the sampler will repeat the indices based on the weights. This is useful for ensuring that the distribution of the sampled rows matches the distribution of the weights, while still sampling every row in the table (with weight > 1) in a single epoch. The number of repeats of samples with fractional weights will be determined probabilistically. A value of True will set the length of the sampler (and thus an epoch) to the sum of the weights. This flag requires values of True for both `weighted` and `exclude_zero_weights`.
- Returns:
A Sampler based on the weights column of the table.
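The flag semantics can be imitated with plain Python instead of the torch Sampler the real method returns; `sample_indices` is a hypothetical helper, not part of the API:

```python
import random

# Imitates the documented flag semantics with plain Python; the real method
# returns a torch.utils.data Sampler instead.

def sample_indices(weights, exclude_zero_weights=True, weighted=True, seed=0):
    rng = random.Random(seed)
    # Zero-weighted rows are dropped, shortening the "epoch".
    indices = [i for i, w in enumerate(weights)
               if not (exclude_zero_weights and w == 0)]
    if weighted:
        # Draw with replacement, proportional to weight: there is no
        # guarantee that every row appears, even with equal weights.
        pool = [weights[i] for i in indices]
        return rng.choices(indices, weights=pool, k=len(indices))
    # shuffle=False analogue: valid indices in sequential order.
    return indices

print(sample_indices([1.0, 0.0, 2.0], weighted=False))  # [0, 2]
```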
- map(func: Callable[[Any], object]) tlc.core.objects.table.Table #
Add a function to the list of functions to be applied to each sample in the table before it is returned by the `__getitem__` method when not doing metrics collection.
- Parameters:
func – The function to apply to each sample when not doing metrics collection.
- Returns:
The table with the function added to the list of functions to apply to each sample when not doing metrics collection.
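The behavior of chained map functions can be sketched with a stand-in. This mirrors the documented contract (functions applied per sample on access), not tlc internals:

```python
# Stand-in showing how registered map functions are applied, in order,
# to each sample when it is accessed via __getitem__.

class MappedSamples:
    def __init__(self, samples):
        self._samples = samples
        self._maps = []

    def map(self, func):
        self._maps.append(func)
        return self  # the real method returns the Table, enabling chaining

    def __getitem__(self, index):
        sample = self._samples[index]
        for func in self._maps:  # applied in registration order
            sample = func(sample)
        return sample

data = MappedSamples([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
print(data[0])  # 11
```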
- map_collect_metrics(func: Callable[[Any], object]) tlc.core.objects.table.Table #
Add a function to the list of functions to be applied to each sample in the table before it is returned by the `__getitem__` method when doing metrics collection. If this list is empty, the `map` functions will be used instead.
- Parameters:
func – The function to apply to each sample when doing metrics collection.
- Returns:
The table with the function added to the list of functions to apply to each sample when doing metrics collection.
- static from_url(url: tlc.core.url.Url | str) tlc.core.objects.table.Table #
Create a table from a url.
- Parameters:
url – The url to create the table from
- Returns:
A concrete Table subclass
- Raises:
ValueError – If the url does not point to a table.
FileNotFoundError – If the url cannot be found.
- static from_names(table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None) tlc.core.objects.table.Table #
Create a table from the names specifying its url.
- Parameters:
table_name – The name of the table.
dataset_name – The name of the dataset.
project_name – The name of the project.
root_url – The root url.
- Returns:
The table at the resulting url.
- static from_torch_dataset(dataset: torch.utils.data.Dataset, structure: tlc.client.sample_type._SampleTypeStructure | None = None, table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, reuse, rename, overwrite] = 'reuse', add_weight_column: bool = True, all_arrays_are_fixed_size: bool = False, description: str | None = None, extra_columns: dict[str, tlc.client.sample_type._SampleTypeStructure] | None = None, input_tables: list[tlc.core.url.Url | str | pathlib.Path] | None = None, *, table_url: tlc.core.url.Url | pathlib.Path | str | None = None) tlc.core.objects.tables.from_python_object.TableFromTorchDataset #
Create a Table from a Torch Dataset.
- Parameters:
dataset – The Torch Dataset to create the table from.
structure – The structure of a single sample in the table. This is used to infer the schema of the table, and perform any necessary conversions between the row representation and the sample representation of the data. If not provided, the structure will be inferred from the first sample in the table.
table_name – The name of the table.
dataset_name – The name of the dataset.
project_name – The name of the project.
root_url – The root url of the table.
if_exists – What to do if the table already exists at the provided url.
add_weight_column – Whether to add a column of sampling weights to the table, all initialized to 1.0.
all_arrays_are_fixed_size – Whether all arrays (tuples, lists, etc.) in the dataset are fixed size. This parameter is only used when generating a SampleType from a single sample in the dataset when no `structure` is provided.
description – A description of the table.
extra_columns – A dictionary of extra columns to add to the table. The keys are the column names, and the values are the structures of the columns. These can typically be expressed as `Schemas`, `ScalarValues`, or `SampleTypes`.
table_url – A custom Url for the table, mutually exclusive with {root_url, project_name, dataset_name, table_name}.
- Returns:
A TableFromTorchDataset instance.
- static from_pandas(df: pandas.DataFrame, structure: tlc.client.sample_type._SampleTypeStructure | None = None, table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, reuse, rename, overwrite] = 'reuse', add_weight_column: bool = True, description: str | None = None, extra_columns: dict[str, tlc.client.sample_type._SampleTypeStructure] | None = None, input_tables: list[tlc.core.url.Url | str | pathlib.Path] | None = None, *, table_url: tlc.core.url.Url | pathlib.Path | str | None = None) tlc.core.objects.tables.from_python_object.TableFromPandas #
Create a Table from a Pandas DataFrame.
- Parameters:
df – The Pandas DataFrame to create the table from.
structure – The structure of a single sample in the table. This is used to infer the schema of the table, and perform any necessary conversions between the row representation and the sample representation of the data. If not provided, the structure will be inferred from the first sample in the table.
table_name – The name of the table.
dataset_name – The name of the dataset.
project_name – The name of the project.
root_url – The root url of the table.
if_exists – What to do if the table already exists at the provided url.
add_weight_column – Whether to add a column of sampling weights to the table, all initialized to 1.0.
description – A description of the table.
extra_columns – A dictionary of extra columns to add to the table. The keys are the column names, and the values are the structures of the columns. These can typically be expressed as `Schemas`, `ScalarValues`, or `SampleTypes`.
table_url – A custom Url for the table, mutually exclusive with {root_url, project_name, dataset_name, table_name}.
- Returns:
A TableFromPandas instance.
- static from_dict(data: typing.Mapping[str, object], structure: tlc.client.sample_type._SampleTypeStructure | None = None, table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, reuse, rename, overwrite] = 'reuse', add_weight_column: bool = True, description: str | None = None, extra_columns: dict[str, tlc.client.sample_type._SampleTypeStructure] | None = None, input_tables: list[tlc.core.url.Url | str | pathlib.Path] | None = None, *, table_url: tlc.core.url.Url | pathlib.Path | str | None = None) tlc.core.objects.tables.from_python_object.TableFromPydict #
Create a Table from a dictionary.
- Parameters:
data – The dictionary to create the table from.
structure – The structure of a single sample in the table. This is used to infer the schema of the table, and perform any necessary conversions between the row representation and the sample representation of the data. If not provided, the structure will be inferred from the first sample in the table.
table_name – The name of the table.
dataset_name – The name of the dataset.
project_name – The name of the project.
root_url – The root url of the table.
if_exists – What to do if the table already exists at the provided url.
add_weight_column – Whether to add a column of sampling weights to the table, all initialized to 1.0.
description – A description of the table.
extra_columns – A dictionary of extra columns to add to the table. The keys are the column names, and the values are the structures of the columns. These can typically be expressed as `Schemas`, `ScalarValues`, or `SampleTypes`.
table_url – A custom Url for the table, mutually exclusive with {root_url, project_name, dataset_name, table_name}.
- Returns:
A TableFromPydict instance.
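The kind of column-oriented dictionary one might hand to `Table.from_dict` can be sketched as follows. Only the plain-dict handling runs here; the tlc call is left as a comment because the table, dataset, and project names are hypothetical:

```python
# A column-oriented dictionary: keys are column names, values are the
# per-column data.

data = {
    "image": ["img_0.png", "img_1.png"],
    "label": [0, 1],
}

# table = Table.from_dict(
#     data, table_name="my-table",
#     dataset_name="my-dataset", project_name="my-project",
# )

# Column-oriented data implies equal-length columns (an assumption about
# the expected input shape, not a documented constraint):
row_counts = {len(values) for values in data.values()}
print(sorted(data))  # ['image', 'label']
print(row_counts)    # {2}
```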
- static from_csv(csv_file: str | pathlib.Path | tlc.core.url.Url, structure: tlc.client.sample_type._SampleTypeStructure | None = None, table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, reuse, rename, overwrite] = 'reuse', add_weight_column: bool = True, description: str | None = None, extra_columns: dict[str, tlc.client.sample_type._SampleTypeStructure] | None = None, input_tables: list[tlc.core.url.Url | str | pathlib.Path] | None = None, *, table_url: tlc.core.url.Url | pathlib.Path | str | None = None) tlc.core.objects.tables.from_url.TableFromCsv #
Create a Table from a .csv file.
- Parameters:
csv_file – The url of the .csv file.
structure – The structure of a single sample in the table. This is used to infer the schema of the table, and perform any necessary conversions between the row representation and the sample representation of the data. If not provided, the structure will be inferred from the first sample in the table.
table_name – The name of the table.
dataset_name – The name of the dataset.
project_name – The name of the project.
root_url – The root url of the table.
if_exists – What to do if the table already exists at the provided url.
add_weight_column – Whether to add a column of sampling weights to the table, all initialized to 1.0.
description – A description of the table.
extra_columns – A dictionary of extra columns to add to the table. The keys are the column names, and the values are the structures of the columns. These can typically be expressed as `Schemas`, `ScalarValues`, or `SampleTypes`.
table_url – A custom Url for the table, mutually exclusive with {root_url, project_name, dataset_name, table_name}.
- Returns:
A TableFromCsv instance.
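A minimal input file for this method can be written with the standard library; the column names and table name below are assumptions for illustration, and the tlc call is shown only as a comment:

```python
import csv
import pathlib
import tempfile

# Write a small .csv file of the kind from_csv ingests (column names are
# illustrative).
csv_path = pathlib.Path(tempfile.mkdtemp()) / "samples.csv"
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["image_path", "label"])
    writer.writeheader()
    writer.writerow({"image_path": "img_000.png", "label": "cat"})
    writer.writerow({"image_path": "img_001.png", "label": "dog"})

# With the 3LC package installed, the table could then be created roughly as:
#   import tlc
#   table = tlc.Table.from_csv(csv_path, table_name="csv-demo", if_exists="overwrite")
```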
- static from_coco(annotations_file: str | pathlib.Path | tlc.core.url.Url, image_folder: str | pathlib.Path | tlc.core.url.Url | None = None, structure: tlc.client.sample_type._SampleTypeStructure | None = None, table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, reuse, rename, overwrite] = 'reuse', add_weight_column: bool = True, description: str | None = None, extra_columns: dict[str, tlc.client.sample_type._SampleTypeStructure] | None = None, input_tables: list[tlc.core.url.Url | str | pathlib.Path] | None = None, *, table_url: tlc.core.url.Url | pathlib.Path | str | None = None) tlc.core.objects.tables.from_url.TableFromCoco #
Create a Table from a COCO annotations file.
- Parameters:
annotations_file – The url of the COCO annotations file.
image_folder – The url of the folder containing the images referenced in the COCO annotations file. If not provided, the image paths in the annotations file are assumed to be either absolute or relative to the annotations file.
structure – The structure of a single sample in the table. This is used to infer the schema of the table, and perform any necessary conversions between the row representation and the sample representation of the data. If not provided, the structure will be inferred from the first sample in the table.
table_name – The name of the table.
dataset_name – The name of the dataset.
project_name – The name of the project.
root_url – The root url of the table.
if_exists – What to do if the table already exists at the provided url.
add_weight_column – Whether to add a column of sampling weights to the table, all initialized to 1.0.
description – A description of the table.
extra_columns – A dictionary of extra columns to add to the table. The keys are the column names, and the values are the structures of the columns. These can typically be expressed as `Schemas`, `ScalarValues`, or `SampleTypes`.
table_url – A custom Url for the table, mutually exclusive with {root_url, project_name, dataset_name, table_name}
- Returns:
A TableFromCoco instance.
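A minimal COCO annotations file can be sketched with the standard library; the image and category entries below are illustrative, and the tlc call is shown only as a comment since it requires the 3LC package and real image files:

```python
import json
import pathlib
import tempfile

# A minimal COCO annotations file with the three standard top-level keys
# (contents are illustrative; bbox is [x, y, width, height]).
coco = {
    "images": [{"id": 1, "file_name": "img_001.jpg", "width": 640, "height": 480}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [10, 20, 100, 80], "area": 8000, "iscrowd": 0}
    ],
    "categories": [{"id": 1, "name": "cat"}],
}
folder = pathlib.Path(tempfile.mkdtemp())
annotations_file = folder / "annotations.json"
annotations_file.write_text(json.dumps(coco))

# With the 3LC package installed, and the referenced images present:
#   import tlc
#   table = tlc.Table.from_coco(annotations_file, image_folder=folder)
```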
- static from_parquet(parquet_file: str | pathlib.Path | tlc.core.url.Url, structure: tlc.client.sample_type._SampleTypeStructure | None = None, table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, reuse, rename, overwrite] = 'reuse', add_weight_column: bool = True, description: str | None = None, extra_columns: dict[str, tlc.client.sample_type._SampleTypeStructure] | None = None, input_tables: list[tlc.core.url.Url | str | pathlib.Path] | None = None, *, table_url: tlc.core.url.Url | pathlib.Path | str | None = None) tlc.core.objects.tables.from_url.TableFromParquet #
Create a Table from a Parquet file.
- Parameters:
parquet_file – The url of the Parquet file.
structure – The structure of a single sample in the table. This is used to infer the schema of the table, and perform any necessary conversions between the row representation and the sample representation of the data. If not provided, the structure will be inferred from the first sample in the table.
table_name – The name of the table.
dataset_name – The name of the dataset.
project_name – The name of the project.
root_url – The root url of the table.
if_exists – What to do if the table already exists at the provided url.
add_weight_column – Whether to add a column of sampling weights to the table, all initialized to 1.0.
description – A description of the table.
extra_columns – A dictionary of extra columns to add to the table. The keys are the column names, and the values are the structures of the columns. These can typically be expressed as `Schemas`, `ScalarValues`, or `SampleTypes`.
table_url – A custom Url for the table, mutually exclusive with {root_url, project_name, dataset_name, table_name}
- Returns:
A TableFromParquet instance.
- static from_yolo(dataset_yaml_file: str | pathlib.Path | tlc.core.url.Url, split: str = 'train', datasets_dir: str | pathlib.Path | tlc.core.url.Url | None = None, override_split_path: str | pathlib.Path | tlc.core.url.Url | typing.Iterable[str | pathlib.Path | tlc.core.url.Url] | None = None, structure: tlc.client.sample_type._SampleTypeStructure | None = None, table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, reuse, rename, overwrite] = 'reuse', add_weight_column: bool = True, description: str | None = None, extra_columns: dict[str, tlc.client.sample_type._SampleTypeStructure] | None = None, input_tables: list[tlc.core.url.Url | str | pathlib.Path] | None = None, *, table_url: tlc.core.url.Url | pathlib.Path | str | None = None) tlc.core.objects.tables.from_url.TableFromYolo #
Create a Table from a YOLO annotations file.
- Parameters:
dataset_yaml_file – The url of the YOLO dataset .yaml file.
split – The split to load from the dataset.
datasets_dir – If `path` in the dataset_yaml_file is relative, this directory will be prepended to it. Not used if `path` is absolute. If `path` is relative and datasets_dir is not provided, an error is raised.
override_split_path – If provided, this will be used as the path to the directory with images and labels instead of the one specified in the dataset_yaml_file. Can be an iterable of such paths.
structure – The structure of a single sample in the table. This is used to infer the schema of the table, and perform any necessary conversions between the row representation and the sample representation of the data. If not provided, the structure will be inferred from the first sample in the table.
table_name – The name of the table.
dataset_name – The name of the dataset.
project_name – The name of the project.
root_url – The root url of the table.
if_exists – What to do if the table already exists at the provided url.
add_weight_column – Whether to add a column of sampling weights to the table, all initialized to 1.0.
description – A description of the table.
extra_columns – A dictionary of extra columns to add to the table. The keys are the column names, and the values are the structures of the columns. These can typically be expressed as `Schemas`, `ScalarValues`, or `SampleTypes`.
table_url – A custom Url for the table, mutually exclusive with {root_url, project_name, dataset_name, table_name}
- Returns:
A TableFromYolo instance.
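The shape of the dataset .yaml this method reads can be sketched as follows; the field names follow the common Ultralytics convention, the paths and class names are illustrative, and the tlc call is shown only as a comment:

```python
import pathlib
import tempfile

# A minimal YOLO dataset .yaml of the kind from_yolo reads (paths and class
# names are illustrative).
dataset_root = pathlib.Path(tempfile.mkdtemp())
yaml_text = """\
path: .
train: images/train
val: images/val
names:
  0: cat
  1: dog
"""
dataset_yaml_file = dataset_root / "dataset.yaml"
dataset_yaml_file.write_text(yaml_text)

# With the 3LC package installed (and images/labels in place), the 'train'
# split could be loaded roughly as:
#   import tlc
#   table = tlc.Table.from_yolo(dataset_yaml_file, split="train", datasets_dir=dataset_root)
```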
- static from_hugging_face(path: str, name: str | None = None, split: str = 'train', table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, reuse, rename, overwrite] = 'reuse', add_weight_column: bool = True, description: str | None = None, extra_columns: dict[str, tlc.client.sample_type._SampleTypeStructure] | None = None, input_tables: list[tlc.core.url.Url | str | pathlib.Path] | None = None, *, table_url: tlc.core.url.Url | pathlib.Path | str | None = None) tlc.integration.hugging_face.TableFromHuggingFace #
Create a Table from a Hugging Face Hub dataset, similar to the `datasets.load_dataset` function.
- Parameters:
path – Path or name of the dataset to load, same as in `datasets.load_dataset`.
name – Name of the dataset to load, same as in `datasets.load_dataset`.
split – The split to load, same as in `datasets.load_dataset`.
table_name – The name of the table. If not provided, the `table_name` is set to `split`.
dataset_name – The name of the dataset. If not provided, `dataset_name` is set to `path` if `name` is not provided, or to `{path}-{name}` if `name` is provided.
project_name – The name of the project. If not provided, `project_name` is set to `hf-{path}`.
root_url – The root url of the table.
if_exists – What to do if the table already exists at the provided url.
add_weight_column – Whether to add a column of sampling weights to the table, all initialized to 1.0.
description – A description of the table.
extra_columns – A dictionary of extra columns to add to the table. The keys are the column names, and the values are the structures of the columns. These can typically be expressed as `Schemas`, `ScalarValues`, or `SampleTypes`.
table_url – A custom Url for the table, mutually exclusive with {root_url, project_name, dataset_name, table_name}
- Returns:
A TableFromHuggingFace instance.
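The naming defaults described above can be collected into a small helper; this is only a sketch of the documented rules, not part of the tlc API:

```python
# Default table/dataset/project names for Table.from_hugging_face, as documented
# above (a sketch for illustration, not a tlc function).
def default_names(path, name=None, split="train"):
    return {
        "table_name": split,                                        # defaults to the split
        "dataset_name": path if name is None else f"{path}-{name}",  # path or {path}-{name}
        "project_name": f"hf-{path}",                                # hf-{path}
    }

names = default_names("glue", name="mrpc")
# names["dataset_name"] == "glue-mrpc", names["project_name"] == "hf-glue"
```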
- static from_image_folder(root: str | pathlib.Path | tlc.core.url.Url, image_column_name: str = 'image', label_column_name: str = 'label', include_label_column: bool = True, extensions: str | tuple[str, ...] | None = None, table_name: str | None = None, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, reuse, rename, overwrite] = 'reuse', add_weight_column: bool = True, description: str | None = None, extra_columns: dict[str, tlc.client.sample_type._SampleTypeStructure] | None = None, label_overrides: dict[str, tlc.core.schema.MapElement | str] | None = None, input_tables: list[tlc.core.url.Url | str | pathlib.Path] | None = None, *, table_url: tlc.core.url.Url | pathlib.Path | str | None = None) tlc.core.objects.table.Table #
Create a Table from an image folder.
This function can be used to load a folder containing subfolders where each subfolder represents a label, or to recursively load all matching images in a folder structure without labels. It extends the functionality of torchvision.datasets.ImageFolder.
When `include_label_column` is True, the dataset elements are returned as tuples of a `PIL.Image` and the integer class label. When `include_label_column` is False, `PIL.Image`s are returned without labels. In this case, `root` will be recursively scanned.
- Parameters:
root – The root directory of the image folder.
image_column_name – The name of the column containing the images.
label_column_name – The name of the column containing the class labels.
include_label_column – Whether to include a column of class labels in the table.
extensions – A list of allowed image extensions. If not provided, a default list of image extensions is used.
table_name – The name of the table.
dataset_name – The name of the dataset.
project_name – The name of the project.
root_url – The root url of the table.
if_exists – What to do if the table already exists at the provided url.
add_weight_column – Whether to add a column of sampling weights to the table, all initialized to 1.0.
description – A description of the table.
extra_columns – A dictionary of extra columns to add to the table. The keys are the column names, and the values are the structures of the columns. These can typically be expressed as `Schemas`, `ScalarValues`, or `SampleTypes`.
label_overrides – A sparse mapping of class names (the directory names) to new class names. A new class name can be given as a string or as a `MapElement`.
table_url – A custom Url for the table, mutually exclusive with {root_url, project_name, dataset_name, table_name}
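The class-per-subfolder layout this method expects can be sketched with the standard library; the class and file names below are illustrative (and the files are left empty), with the tlc call shown only as a comment:

```python
import pathlib
import tempfile

# Lay out the class-per-subfolder structure from_image_folder expects
# (class and file names are illustrative; files are left empty here).
root = pathlib.Path(tempfile.mkdtemp())
for class_name in ("cat", "dog"):
    (root / class_name).mkdir()
    (root / class_name / "000.jpg").touch()

classes = sorted(p.name for p in root.iterdir() if p.is_dir())

# With the 3LC package installed (and real image files in place):
#   import tlc
#   table = tlc.Table.from_image_folder(root, label_overrides={"cat": "feline"})
```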
- static join_tables(tables: typing.Sequence[tlc.core.objects.table.Table] | typing.Sequence[tlc.core.url.Url | str | pathlib.Path], table_name: str = _3LC_FALLBACK_JOINED_TABLE_NAME, dataset_name: str | None = None, project_name: str | None = None, root_url: tlc.core.url.Url | str | None = None, if_exists: typing.Literal[raise, reuse, rename, overwrite] = 'reuse', add_weight_column: bool = True, description: str | None = None, extra_columns: dict[str, tlc.client.sample_type._SampleTypeStructure] | None = None, input_tables: list[tlc.core.url.Url | str | pathlib.Path] | None = None, *, table_url: tlc.core.url.Url | str | pathlib.Path | None = None) tlc.core.objects.table.Table #
Join multiple tables into a single table.
The tables will be joined vertically, meaning that the rows of the resulting table will be the concatenation of the rows of the input tables, in the order they are provided.
The schemas of the tables must be compatible for joining. If the tables have different schemas, an attempt will be made to merge them, and an error will be raised if merging is not possible.
- Parameters:
tables – A list of Table instances to join.
table_name – The name of the table.
dataset_name – The name of the dataset.
project_name – The name of the project.
root_url – The root url of the table.
if_exists – What to do if the table already exists at the provided url.
add_weight_column – Whether to add a column of sampling weights to the table, all initialized to 1.0.
description – A description of the table.
extra_columns – A dictionary of extra columns to add to the table. The keys are the column names, and the values are the structures of the columns. These can typically be expressed as `Schemas`, `ScalarValues`, or `SampleTypes`.
table_url – A custom Url for the table, mutually exclusive with {root_url, project_name, dataset_name, table_name}
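The vertical-join semantics can be sketched with plain lists of row dicts (the row contents here are illustrative); the real call, shown only as a comment, operates on Table instances or their Urls:

```python
# Rows of the joined table are the input tables' rows concatenated in the
# order the tables are passed in (row dicts are illustrative).
table_a_rows = [{"image": "a0.png", "weight": 1.0}, {"image": "a1.png", "weight": 1.0}]
table_b_rows = [{"image": "b0.png", "weight": 1.0}]
joined_rows = table_a_rows + table_b_rows  # a's rows first, then b's

# With the 3LC package installed:
#   import tlc
#   joined = tlc.Table.join_tables([table_a, table_b], table_name="joined")
```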
- tlc.core.objects.table.sort_tables_chronologically(tables: list[tlc.core.objects.table.Table], reverse: bool = False) list[tlc.core.objects.table.Table] #
Sort a list of tables chronologically.
- Parameters:
tables – A list of tables to sort chronologically.
- Returns:
A list of tables sorted chronologically.
- tlc.core.objects.table.squash_table(table: tlc.core.objects.table.Table | tlc.core.url.Url, output_url: tlc.core.url.Url) tlc.core.objects.table.Table #
Create a copy of this table where all lineage is squashed.
- Example:
table_instance = Table()
...  # work with the table
squashed_table = squash_table(table_instance, Url("s3://bucket/path/to/table"))
- Parameters:
table – The table to squash.
output_url – The output url for the squashed table.
- Returns:
The squashed table.