Write a Table Directly from Row Data

In this sample, we write a tlc.Table directly by adding rows, one at a time, from data loaded from a JSON file.

We will use a tlc.TableWriter instance to build up the table, and then write it to a file.

The written table will be of type tlc.TableFromParquet, and its data will be backed by a Parquet file.

Project Setup

[2]:

PROJECT_NAME = "Table Writer Examples"
DATASET_NAME = "Mammoth"
TABLE_NAME = "mammoth-10k"
TLC_PUBLIC_EXAMPLES_DEVELOPER_MODE = True
INSTALL_DEPENDENCIES = False

[4]:

%%capture
if INSTALL_DEPENDENCIES:
    %pip --quiet install torch --index-url https://download.pytorch.org/whl/cu118
    %pip --quiet install torchvision --index-url https://download.pytorch.org/whl/cu118
    %pip --quiet install tlc

Download Source Data

We will use the 3D mammoth data as an example. This is a popular toy dataset in the dimensionality reduction literature.

The original data can be found in the PaCMAP GitHub repository.

[7]:

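import requests

# Download the raw JSON; fail early on network or HTTP errors instead of parsing an error page.
url = "https://raw.githubusercontent.com/YingfanWang/PaCMAP/master/data/mammoth_3d.json"
response = requests.get(url, timeout=30)
response.raise_for_status()
input_data = response.json()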

[8]:

# The input data is represented as a JSON list of lists, where each sublist is a 3D point:
print(type(input_data))
print(len(input_data))
print(len(input_data[0]))

<class 'list'>
10000
3
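
Before writing rows, it can help to confirm that every entry really is a list of three finite numbers, since the check above only looks at the first one. The following is a minimal sketch of such a check; `validate_points` is a hypothetical helper introduced here for illustration and is not part of tlc.

```python
import math
from numbers import Real


def validate_points(points):
    """Return the indices of entries that are not sequences of three finite numbers.

    Hypothetical helper for illustration; not part of the tlc API.
    """
    bad_indices = []
    for index, point in enumerate(points):
        is_valid = (
            isinstance(point, (list, tuple))
            and len(point) == 3
            and all(
                isinstance(value, Real)
                and not isinstance(value, bool)
                and math.isfinite(value)
                for value in point
            )
        )
        if not is_valid:
            bad_indices.append(index)
    return bad_indices
```

Calling `validate_points(input_data)` should return an empty list if the data matches the description above; any returned index points at a row worth inspecting before it is added to the table.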

Writing the Table

We construct a TableWriter, which will determine the URL and the schema of the table we want to write. In our case, the table will contain a single column of 3-vectors.

[9]:

import tlc

column_name = "points"

table_writer = tlc.TableWriter(
    dataset_name=DATASET_NAME,
    project_name=PROJECT_NAME,
    table_name=TABLE_NAME,
    description="A table containing 10,000 3D points of a mammoth",
    if_exists="overwrite",
    column_schemas={column_name: tlc.FloatVector3Schema("3D Points")},
)

[10]:

# Next we add the data to the table row by row.
for point in input_data:
    table_writer.add_row({column_name: point})

# Finally, we finalize the table writer, which writes all data to disk and returns the table.
table = table_writer.finalize()

[11]:

# Inspect the first row of the table:
table[0]

[11]:

{'points': [430.82598876953125, 106.86399841308594, 24.492000579833984]}
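
The long decimal expansions in this row are consistent with values stored as 32-bit floats and widened back to Python's 64-bit floats on read. The sketch below illustrates that effect using only the standard library. It assumes, without having checked the source file, that the original JSON values were the three-decimal numbers shown in `assumed_source_point`, and that the vector schema stores 32-bit floats.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Assumption: the stored row above, rounded to three decimals, is what the JSON contained.
assumed_source_point = [430.826, 106.864, 24.492]

# Round-trip each coordinate through 32-bit precision.
stored_like = [to_float32(x) for x in assumed_source_point]
print(stored_like)
```

If both assumptions hold, the printed values match the row shown above, so the extra digits reflect storage precision rather than a change to the data.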

Next Steps

The written table can now be viewed in the 3LC Dashboard.

Some ideas for further exploration:

Visualize the data in a scatter plot
Apply dimensionality reduction to the data
Segment the data into clusters
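
As a starting point for the dimensionality reduction idea, the sketch below finds the first principal component of a list of 3D points with power iteration and projects the points onto it. It is a dependency-free illustration, not the method used by 3LC or PaCMAP; the function names are hypothetical, and a real analysis would more likely use a library such as scikit-learn, UMAP, or PaCMAP.

```python
import math


def first_principal_component(points, iterations=100):
    """Estimate the mean and the dominant direction of variance of 3D points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    # 3x3 covariance matrix of the centered points.
    cov = [
        [sum(c[i] * c[j] for c in centered) / n for j in range(3)]
        for i in range(3)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    direction = [1.0 / math.sqrt(3.0)] * 3
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # No variance in the data; keep the current direction.
        direction = [x / norm for x in product]
    return mean, direction


def project(points, mean, direction):
    """Return the 1D coordinate of each point along the given direction."""
    return [
        sum((p[i] - mean[i]) * direction[i] for i in range(3))
        for p in points
    ]
```

With the data from this sample, `mean, direction = first_principal_component(input_data)` followed by `project(input_data, mean, direction)` would give one coordinate per point along the axis of greatest variance.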