Weighted Table Subset Selection

This notebook demonstrates how to apply zero weights to a subset of table rows for selective data processing.

This technique is particularly useful in active learning and data labeling workflows, where each iteration trains on, or considers for labeling, only a subset of rows.

Specifically, this example performs balanced coreset selection on the CIFAR-10 training set and sets the weight of every non-coreset row to zero. The selection strategy can be swapped for other approaches, such as random sampling, uncertainty-based sampling, or other model-driven selection criteria.
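
As an illustration of swapping in a different strategy, here is a minimal sketch of uncertainty-based selection in plain Python. The function `least_confident_split` and the toy probabilities are hypothetical and not part of `tlc` or `tlc_tools`; the sketch only shows how a selection rule can produce the same pair of index lists (selected rows, remaining rows) that the rest of this notebook works with. Unlike the balanced coreset, it makes no attempt to balance classes.

```python
def least_confident_split(probabilities, budget):
    """Split row indices into (selected, rest) by lowest top-class probability."""
    if not 0 <= budget <= len(probabilities):
        raise ValueError("budget must be between 0 and the number of rows")
    # Confidence of a row = probability the model assigns to its top class.
    confidence = [max(row) for row in probabilities]
    # Rank row indices from least to most confident; ties keep index order.
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    selected = sorted(ranked[:budget])
    rest = sorted(ranked[budget:])
    return selected, rest


# Toy predictions for four rows over three classes (made-up numbers).
toy_probabilities = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
]
selected, rest = least_confident_split(toy_probabilities, budget=2)
print(selected, rest)  # [1, 3] [0, 2]
```

The `rest` list plays the role of the non-coreset indices: those are the rows whose weight would be set to zero.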

Install dependencies

[ ]:
%pip install 3lc
%pip install git+https://github.com/3lc-ai/3lc-examples.git

Imports

[ ]:
import tlc

from tlc_tools.split import get_balanced_coreset_indices, set_value_in_column_to_fixed_value

Project setup

[ ]:
PROJECT_NAME = "3LC Tutorials - CIFAR-10"
DATASET_NAME = "CIFAR-10-train"
TABLE_NAME = "initial"

Load input table

This assumes CIFAR-10-train has been created by running the notebook create-table-from-torch.ipynb.

[ ]:
table = tlc.Table.from_names(TABLE_NAME, DATASET_NAME, PROJECT_NAME)

Compute coreset

[ ]:
# This function ensures the coreset is exactly balanced in terms of the split_by column.
# The size parameter is the fraction of the minority class that should be included in the coreset.
coreset_indices, non_coreset_indices = get_balanced_coreset_indices(
    table,
    size=0.01,  # CIFAR-10-train has 5000 samples per class, so 0.01 yields 50 samples per class (500 in total)
    split_by="Label",
    random_seed=42,
)
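
To make the arithmetic concrete, the following is a rough sketch of balanced random selection in plain Python. It is not the implementation of `get_balanced_coreset_indices`; the function `balanced_random_split`, the rounding rule, and the synthetic labels are assumptions made for illustration. Each class contributes the same number of rows, computed as a fraction of the smallest class. With 5000 rows per class and a fraction of 0.01, that is 50 rows per class, or 500 rows across 10 classes.

```python
import random
from collections import defaultdict


def balanced_random_split(labels, size, seed=0):
    """Return (coreset, non_coreset) index lists with equal counts per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Every class contributes the same number of rows:
    # a fraction of the smallest class (rounding is a choice of this sketch).
    per_class = round(size * min(len(rows) for rows in by_class.values()))
    rng = random.Random(seed)
    coreset = []
    for label in sorted(by_class):
        coreset.extend(rng.sample(by_class[label], per_class))
    chosen = set(coreset)
    non_coreset = [i for i in range(len(labels)) if i not in chosen]
    return sorted(coreset), non_coreset


# Synthetic labels shaped like CIFAR-10-train: 10 classes, 5000 rows each.
labels = [i % 10 for i in range(50_000)]
coreset, non_coreset = balanced_random_split(labels, size=0.01, seed=42)
print(len(coreset), len(non_coreset))  # 500 49500
```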

Weight non-coreset rows to 0

[ ]:
coreset_table = set_value_in_column_to_fixed_value(
    table,
    "weight",
    non_coreset_indices,
    0.0,
)
[ ]:
coreset_table
[ ]:
# During training, we can now use a sampler that only samples non-zero weight rows
sampler = coreset_table.create_sampler(
    exclude_zero_weights=True,
)
print(len(sampler))
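
Conceptually, the two steps above amount to zeroing the weights of the non-coreset rows and then restricting sampling to rows whose weight is non-zero. The sketch below mimics that on a plain Python list. It says nothing about how 3LC stores weights or builds its sampler, and the helpers `zero_out_weights` and `nonzero_weight_indices` are hypothetical names. The input list is copied rather than modified, mirroring the way the notebook binds the result to a new name, `coreset_table`.

```python
def zero_out_weights(weights, indices):
    """Return a copy of weights with the given row indices set to 0.0."""
    updated = list(weights)
    for i in indices:
        updated[i] = 0.0
    return updated


def nonzero_weight_indices(weights):
    """Row indices that remain eligible once zero-weight rows are excluded."""
    return [i for i, w in enumerate(weights) if w != 0.0]


weights = [1.0] * 8  # toy table with eight rows, all starting at weight 1.0
non_coreset_rows = [0, 2, 3, 5, 6]
new_weights = zero_out_weights(weights, non_coreset_rows)
eligible = nonzero_weight_indices(new_weights)
print(eligible)  # [1, 4, 7]
```

The number of eligible rows equals the coreset size, which is what one would expect `len(sampler)` to report in the cell above.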

Remove non-coreset samples

[ ]:
from tlc_tools.split import keep_indices

subset = keep_indices(
    table, coreset_indices, table_name="balanced-subset", table_description="Keep only a size 500 coreset"
)