Combining Data and Models in 3LC

In this article, we take a closer look at how to combine data and models for metrics collection in 3LC, and how to make sure that all the pieces of the data pipeline fit together.

Standard PyTorch Data Pipeline

We assume the reader is familiar with PyTorch basics, such as Dataset, DataLoader, and Module. For more information on these topics, please refer to the PyTorch data documentation.

We also assume the reader is familiar with the tlc.Table object, in particular the “sample-view” vs. “row-view” of the Table data, as well as using maps to modify data on-the-fly. We refer to the table and sample-type user guide pages for more information.

The plain data pipeline is as follows: individual samples are fetched from a Dataset, batched by a DataLoader, and sent to a Module for prediction. Ensuring the batched output is compatible with the signature of the model’s forward method is crucial, and might require some preprocessing in the form of transforms or custom collate functions.
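
For reference, here is a minimal sketch of this plain pipeline. The dataset and model are illustrative stand-ins:

import torch

# Hypothetical dataset: each sample is an (input, target) tuple
dataset = torch.utils.data.TensorDataset(
    torch.arange(10, dtype=torch.float32).unsqueeze(1),  # inputs
    torch.arange(10, dtype=torch.float32).unsqueeze(1),  # targets
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=2)

model = torch.nn.Linear(1, 1)  # a stand-in Module

inputs, targets = next(iter(dataloader))  # the DataLoader collates samples into batches
predictions = model(inputs)  # the batch must match the signature of forward()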

Data Pipeline Using 3LC

The pipeline in 3LC is similar to the standard PyTorch pipeline described above, with some important differences:

  • Instead of using a Dataset, a tlc.Table is used.

  • Instead of using a Module, a tlc.Predictor is used.

When collecting metrics using tlc.collect_metrics(), your model will be wrapped in a tlc.Predictor object, which handles calling the model’s forward method and converting the output to a standard format that is passed on to the metrics collectors.

Because models can expect different input signatures in their forward methods, datasets can return samples in different formats, and collation can be done in different ways, the tlc.Predictor object provides a set of extra parameters to control how the model is called. We illustrate this in several examples below.
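
To make the correspondence with the standard pipeline concrete, here is a minimal sketch of the 3LC pipeline. The Doubler model and the column name are illustrative; the Predictor options are explained in the examples below:

import tlc
import torch

class Doubler(torch.nn.Module):
    """Illustrative model: doubles the values in column 'x'"""
    def forward(self, x=None):
        return x * 2

# A tlc.Table replaces the Dataset; it can be passed directly to a DataLoader
table = tlc.Table.from_dict({"x": [1, 2, 3, 4]}, table_name="sketch")
dataloader = torch.utils.data.DataLoader(table, batch_size=2)

# A tlc.Predictor replaces the bare Module
predictor = tlc.Predictor(Doubler(), unpack_dicts=True)

for batch in dataloader:
    output = predictor(batch)  # a PredictorOutput; see the examples below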

Examples

For the following examples, we will work with three different models and pass data from two different tables to them.

First, we will work with an unmapped table returning samples as plain dictionaries. Next, we will map the table so that it instead returns tuples. We will batch the data from both tables and create suitable Predictors for each model. These examples do not exhaust all the possible combinations of models and data but should give a good idea of how to customize the tlc.Predictor object.

First, we define the models:

import torch

class MyModelWithKeywordArgs(torch.nn.Module):
    """Model that sums two arguments passed as keyword arguments"""
    def forward(self, arg_1=None, arg_2=None):
        return arg_1 + arg_2

class MyModelWithPositionalArgs(torch.nn.Module):
    """Model that sums two arguments passed as positional arguments"""
    def forward(self, arg_1, arg_2):
        return arg_1 + arg_2

class MyModelWithSingleArgument(torch.nn.Module):
    """Model that sums two arguments passed as a single tuple argument"""
    def forward(self, arg):
        arg_1, arg_2 = arg
        return arg_1 + arg_2

my_model_with_keyword_args = MyModelWithKeywordArgs()
my_model_with_positional_args = MyModelWithPositionalArgs()
my_model_with_single_argument = MyModelWithSingleArgument()

Next, we define the base table:

"""Define a table with two columns of integer data"""
import tlc

table_data = {
    "arg_1": list(range(10)),
    "arg_2": list(range(10, 20)),
}

table = tlc.Table.from_dict(
    table_data,
    table_name="table",
)

print("First row of table:")
print(table[0])

# Output:
# First row of table:
# {'arg_1': 0, 'arg_2': 10}

PyTorch’s default collation of dictionary samples is to return a dictionary of tensors:

"""Pass the table to a DataLoader and inspect the batch"""
import torch

dataloader = torch.utils.data.DataLoader(table, batch_size=2)

batch = next(iter(dataloader))
print(batch)

# Output:
# {'arg_1': tensor([0, 1]), 'arg_2': tensor([10, 11])}

We will now see how we can send this collated batch to each of the three models.

For the model with keyword arguments, we can use the unpack_dicts parameter to unpack the dictionary into keyword arguments:

predictor_with_keyword_args = tlc.Predictor(
    my_model_with_keyword_args,
    unpack_dicts=True,
)
output = predictor_with_keyword_args(batch)
print(output)

# Output:
# PredictorOutput(forward=tensor([10, 12]), hidden_layers={}, metadata=None)
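
The returned tlc.PredictorOutput bundles the model’s output with any captured activations. Assuming attribute access matching the repr above, the individual fields can be read directly:

forward_result = output.forward  # tensor([10, 12])
hidden = output.hidden_layers  # empty here; populated when layers are requested (see below)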

For the model with positional arguments, we can use a custom preprocess function to unpack the dictionary into a tuple, and then use the unpack_lists parameter to unpack it into positional arguments:

def predictor_preprocess_fn(batch):
    return (batch["arg_1"], batch["arg_2"])

predictor_with_positional_args = tlc.Predictor(
    my_model_with_positional_args,
    unpack_lists=True,
    preprocess_fn=predictor_preprocess_fn,
)
output = predictor_with_positional_args(batch)
print(output)

# Output:
# PredictorOutput(forward=tensor([10, 12]), hidden_layers={}, metadata=None)

Finally, for the model with a single argument, we use a preprocess function to unpack the dictionary into a tuple, which is sent directly into the model:

def predictor_preprocess_fn(batch):
    return (batch["arg_1"], batch["arg_2"])

predictor_with_single_argument = tlc.Predictor(
    my_model_with_single_argument,
    preprocess_fn=predictor_preprocess_fn,
)
output = predictor_with_single_argument(batch)
print(output)

# Output:
# PredictorOutput(forward=tensor([10, 12]), hidden_layers={}, metadata=None)

Now we will map the table so that it returns tuples instead of dictionaries:

"""Map the Table so that it returns tuples instead of dictionaries"""
table.map(lambda row: (row["arg_1"], row["arg_2"]))

print("First row of table:")
print(table[0])

# Output:
# First row of table:
# (0, 10)

The collated batch will now be a list of tensors, one per tuple element:

dataloader = torch.utils.data.DataLoader(table, batch_size=2)

batch = next(iter(dataloader))
print(batch)

# Output:
# [tensor([0, 1]), tensor([10, 11])]

We can now define a new set of Predictors to handle the new data format.

First, we use a custom preprocessor to wrap the batch in a dictionary, and then use the unpack_dicts parameter to unpack the dictionary into keyword arguments:

def predictor_preprocess_fn(batch):
    return {"arg_1": batch[0], "arg_2": batch[1]}

predictor_with_keyword_args = tlc.Predictor(
    my_model_with_keyword_args,
    unpack_dicts=True,
    preprocess_fn=predictor_preprocess_fn,
)
output = predictor_with_keyword_args(batch)
print(output)

# Output:
# PredictorOutput(forward=tensor([10, 12]), hidden_layers={}, metadata=None)

For the model with positional arguments, we can use the unpack_lists parameter to unpack the list into positional arguments. Note that we also need to disable the Predictor’s default preprocessing: when it encounters a list-like batch, the default is to return only the first element. This is convenient for the common practice of returning samples as (input, target) tuples, where only the input should be sent to the model, but it is not what we want here:

predictor_with_positional_args = tlc.Predictor(
    my_model_with_positional_args,
    unpack_lists=True,
    disable_preprocess=True,
)
output = predictor_with_positional_args(batch)
print(output)

# Output:
# PredictorOutput(forward=tensor([10, 12]), hidden_layers={}, metadata=None)
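
For contrast, here is a sketch of the default preprocessing behavior described above: with an (input, target)-style batch and no disable_preprocess, only the first element of the list reaches the model. The model here is illustrative:

class MyModelWithInputOnly(torch.nn.Module):
    """Model that expects only the input tensor, not the target"""
    def forward(self, x):
        return x * 2

input_target_batch = [torch.tensor([0, 1]), torch.tensor([10, 11])]

predictor_with_default_preprocess = tlc.Predictor(MyModelWithInputOnly())
output = predictor_with_default_preprocess(input_target_batch)  # forward() receives only torch.tensor([0, 1])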

For the final case, we again need to disable default preprocessing to ensure the batch is sent to the model as-is:

predictor_with_single_argument = tlc.Predictor(
    my_model_with_single_argument,
    disable_preprocess=True,
)
output = predictor_with_single_argument(batch)
print(output)

# Output:
# PredictorOutput(forward=tensor([10, 12]), hidden_layers={}, metadata=None)

Intermediate Activations and Embeddings

A tlc.Predictor can also be used to extract intermediate activations from your model. By passing the layers argument, you specify which layer indices to capture activations from:

predictor = tlc.Predictor(model, layers=(0, 2))  # Get activations from the first and third layers
output = predictor(batch)
print(output)

# Output:
# PredictorOutput(forward=tensor([...]), hidden_layers={0: tensor([...]), 2: tensor([...])}, metadata=None)

The tlc.EmbeddingsMetricsCollector expects the predictor to have produced such hidden_layers output when it is used.
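
A minimal sketch of this pairing, assuming tlc.EmbeddingsMetricsCollector accepts a layers argument matching the predictor’s:

predictor = tlc.Predictor(model, layers=(0, 2))
embeddings_collector = tlc.EmbeddingsMetricsCollector(layers=[0, 2])  # reads hidden_layers[0] and hidden_layers[2]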

Under the hood, forward hooks are added to the model to extract the intermediate activations during a forward pass. 3LC only applies these hooks while metrics are being collected, since a model instance with hooks attached cannot be serialized.

When using a tlc.Predictor directly in your own code, it is recommended to use the predictor.with_hooks() context manager to ensure hooks are removed after use:

predictor = tlc.Predictor(model, layers=(0, 2))  # Get activations from the first and third layers

with predictor.with_hooks():  # Add and remove hooks once
    for batch in batches:
        output = predictor(batch)

If the predictor is used repeatedly to collect metrics without the with_hooks() context manager, the hooks are added and removed for each forward pass, which hurts performance.

Conclusion

In this article, we have seen how to combine data and models in 3LC so that all the pieces of the data pipeline fit together. Using the tlc.Predictor object, we can customize how the model is called and how data is passed to it. We have also seen how a tlc.Predictor can be used to extract hidden layer activations from a model.