Add new data to existing Table lineage¶
Adding new data to an existing dataset is a common task: as more data is collected, we want to use it to improve the model. This notebook demonstrates how to add new data to an existing 3LC dataset by creating a new table that merges two or more existing tables.
We will cover two examples:
Adding new data with the same classes.
Adding new data with different classes, requiring a new, merged schema.
Project setup¶
[ ]:
DATA_PATH = "../../data"
PROJECT_NAME = "3LC Tutorials - Cats & Dogs"
DATASET_NAME = "cats-and-dogs"
Install dependencies¶
[ ]:
%pip install 3lc
%pip install git+https://github.com/3lc-ai/3lc-examples.git
Imports¶
[ ]:
from pathlib import Path
import tlc
Add new data with the same classes¶
We will reuse the cats and dogs dataset from the previous section and add a new batch of data.
Before we add it, we need to create a Table with the new data. Note that we set weight_column_value=0.0 so that the newly added samples can be identified in the resulting table.
[ ]:
data_path = Path(DATA_PATH) / "more-cats-and-dogs"
assert data_path.exists()
[ ]:
new_data_table = tlc.Table.from_image_folder(
data_path,
table_name="new-data",
dataset_name=DATASET_NAME,
project_name=PROJECT_NAME,
add_weight_column=True,
weight_column_value=0.0,
if_exists="overwrite",
)
new_data_table
Let’s also get the cats and dogs dataset from the notebook create-table-from-image-folder.ipynb to use as a base for the new data.
[ ]:
initial_table = tlc.Table.from_names(table_name="initial-cls", dataset_name=DATASET_NAME, project_name=PROJECT_NAME)
initial_table
Now that we have the two tables, we are ready to combine them using Table.join_tables(). We specify a list of tables to join, and the name of the new table resulting from joining them.
[ ]:
joined_table = tlc.Table.join_tables([initial_table, new_data_table], table_name="added-more-data")
joined_table
[ ]:
for row in joined_table.table_rows:
print(row)
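Because the new batch was created with weight_column_value=0.0, the added rows can be picked out of the joined table by their weight. The following is a minimal sketch that uses plain dictionaries in place of real table rows. The file names are made up, and the assumption that the original rows carry a weight of 1.0 is illustrative only and not taken from the 3LC API.

```python
# Stand-in rows; in the notebook these would come from joined_table.table_rows.
# Assumption: original rows have weight 1.0, new rows were written with 0.0.
rows = [
    {"image": "cats/0001.jpg", "label": 0, "weight": 1.0},
    {"image": "dogs/0001.jpg", "label": 1, "weight": 1.0},
    {"image": "cats/new-0001.jpg", "label": 0, "weight": 0.0},
    {"image": "dogs/new-0001.jpg", "label": 1, "weight": 0.0},
]


def split_by_weight(rows, new_weight=0.0):
    """Return (original_rows, added_rows) based on the weight marker."""
    original = [r for r in rows if r["weight"] != new_weight]
    added = [r for r in rows if r["weight"] == new_weight]
    return original, added


original_rows, added_rows = split_by_weight(rows)
print(f"{len(original_rows)} original rows, {len(added_rows)} added rows")
```

The same filter could be applied while iterating over the joined table, since each row exposes its weight under the "weight" key.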
Add new data with different classes¶
We will now create a new image folder table containing animals in the categories “bats” and “frogs”. The new table assigns these labels the values 0 and 1, which collide with “cats” and “dogs” in the existing table. Before the two tables can be joined, we therefore need to extend the value map and remap the row values for “bats” and “frogs”.
[ ]:
data_path = Path(DATA_PATH) / "bats-and-frogs"
more_new_data_table = tlc.Table.from_image_folder(
data_path,
table_name="more-new-data",
dataset_name=DATASET_NAME,
project_name=PROJECT_NAME,
add_weight_column=True,
weight_column_value=0.0,
if_exists="overwrite",
)
more_new_data_table
[ ]:
# Update the value map
remap_value_map_table = more_new_data_table.set_value_map("label", {0: "cats", 1: "dogs", 2: "bats", 3: "frogs"})
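The value map above is written out by hand. With more classes it may be easier to derive it from the two class lists. The sketch below does this in plain Python; merge_class_names is a hypothetical helper, not part of 3LC, and it assumes that the new table numbers its classes in the order given ("bats" as 0, "frogs" as 1).

```python
def merge_class_names(existing, new):
    """Append unseen class names to `existing`.

    Returns the merged value map (index -> name) and a mapping from each
    index in `new` to its index in the merged list.
    """
    merged = list(existing)
    for name in new:
        if name not in merged:
            merged.append(name)
    value_map = dict(enumerate(merged))
    index_remap = {old: merged.index(name) for old, name in enumerate(new)}
    return value_map, index_remap


value_map, index_remap = merge_class_names(["cats", "dogs"], ["bats", "frogs"])
print(value_map)
print(index_remap)
```

The first result has the same shape as the dictionary passed to set_value_map, and the second describes which row values need to change.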
[ ]:
import numpy as np
from tlc_tools.split import set_value_in_column_to_fixed_value
# Update the row values: 0->2 and 1->3
label_column = remap_value_map_table.get_column("label").to_numpy()
zero_indices = np.where(label_column == 0)[0].tolist()
one_indices = np.where(label_column == 1)[0].tolist()
remapped_bats_table = set_value_in_column_to_fixed_value(remap_value_map_table, "label", zero_indices, 2)
remapped_frogs_table = set_value_in_column_to_fixed_value(remapped_bats_table, "label", one_indices, 3)
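Note that the cell above collects zero_indices and one_indices before any value is changed, so the second update cannot pick up rows already rewritten by the first. The effect on the label column can be illustrated with a single-pass lookup on a plain list. This is only a sketch with made-up labels, and it does not use the 3LC API.

```python
# Old index -> merged index: bats 0 -> 2, frogs 1 -> 3.
index_remap = {0: 2, 1: 3}

# Made-up label column for illustration.
labels = [0, 1, 1, 0, 1]

# Each label is looked up exactly once, so no value is remapped twice.
remapped = [index_remap[label] for label in labels]
print(remapped)
```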
We now create yet another table by joining the previous joined table with the remapped bats and frogs.
[ ]:
joined_again_table = tlc.Table.join_tables([joined_table, remapped_frogs_table], table_name="added-bats-and-frogs-data")
The first joined table and the final joined table have different value maps. Let’s inspect both:
[ ]:
joined_table.get_simple_value_map("label")
[ ]:
final_value_map = joined_again_table.get_simple_value_map("label")
final_value_map
Now inspect the row data of the final joined table.
[ ]:
for i, row in enumerate(joined_again_table.table_rows):
image_path = row["image"]
label = row["label"]
weight = row["weight"]
print(f"Row {i}: {image_path}, {final_value_map[label]}, weight: {weight}")