Integrating 3LC with 🤗

This document describes how to integrate 3LC into projects that use Hugging Face (🤗) Python packages. Hugging Face hosts an ever-growing set of datasets, pre-trained models, and other functionality on its platform, made available through Python packages such as 🤗 Datasets and 🤗 Transformers. 3LC provides a workflow that integrates with these libraries.

The core of the integration is the Table.from_hugging_face method, which connects 🤗 Hub datasets to 3LC by returning a corresponding Table.

See our 🤗 examples for full working examples of using 3LC with Hugging Face.

Note

To use the Hugging Face 🤗 integration, the 🤗 Datasets and 🤗 Transformers packages must be installed in the Python environment. You can install them directly or install the 3lc[huggingface] extra, which installs them for you. If they are not installed, the modules in tlc.integration.hugging_face will not be available.
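As a quick way to confirm the prerequisite above, the following minimal sketch checks whether the two packages can be found in the current environment. It uses only the standard library; the helper name missing_packages is hypothetical and not part of 3LC, and the import names datasets and transformers are assumed to match the installed distributions.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for 🤗 Datasets and 🤗 Transformers.
missing = missing_packages(["datasets", "transformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("Hugging Face packages found.")
```

If anything is reported as missing, installing the 3lc[huggingface] extra mentioned in the note is one way to resolve it.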

Datasets

To use a datasets.Dataset in 3LC, call Table.from_hugging_face the same way you would call datasets.load_dataset. The difference is that the dataset is registered with 3LC and a Table is returned instead of a datasets.Dataset.

As a result, code that is specific to datasets.Dataset, such as preprocessing and transforms, may need to change. Table.map can be used in place of Dataset.map.

In the following example we load the train split of the IMDb dataset:

import tlc

train_dataset = tlc.Table.from_hugging_face("imdb", split="train", table_name="imdb-train")

Any arguments accepted by datasets.load_dataset can be provided. These are then forwarded to the internal TableFromHuggingFace.

Note

Datasets that contain features 3LC does not yet support, such as datasets.Audio, will raise an error when used with the integration.

Warning

Note that while Dataset.map and Table.map share the same name and are intended to be used in similar contexts, their behavior differs slightly.

For Dataset objects, .map retains any columns that are not explicitly modified by the transformation. In contrast, when working with a Table, the transformation passed to .map must provide a full row where all columns used in training, including those that remain unchanged, are explicitly defined in the output.
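The difference described in this warning can be illustrated without either library. The sketch below is plain Python: the row layout and the column names "text" and "label" are assumptions, both transform functions are hypothetical, and the dictionary merge only simulates how Dataset.map keeps untouched columns.

```python
# Hypothetical row; "text" and "label" are assumed column names.
row = {"text": "A great movie", "label": 1}


def add_tokens_partial(row):
    # Dataset.map style: return only the new column.
    return {"tokens": row["text"].lower().split()}


def add_tokens_full(row):
    # Table.map style: return every column that training needs,
    # including the ones that are passed through unchanged.
    return {
        "text": row["text"],
        "label": row["label"],
        "tokens": row["text"].lower().split(),
    }


# Dataset.map merges the returned columns into the existing row
# (simulated here with a dict merge).
dataset_style_row = {**row, **add_tokens_partial(row)}

# A transform for Table.map has to produce the full row by itself.
table_style_row = add_tokens_full(row)
```

Both approaches end with the same row, but only add_tokens_full produces it without relying on a merge, which is the shape of transform that Table.map expects according to the warning above.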

Collecting metrics

The nature of metrics collection depends on how the 🤗 Transformers library is used.

Custom workflows

For custom training loops that do not use a transformers.Trainer, refer to the metrics collection documentation.

Trainer

The transformers.Trainer class covers most standard use cases. To integrate 3LC, use TLCTrainer as a replacement and pass its constructor a function that accepts batched model outputs and targets and returns the desired per-sample metrics for the batch.

For the train_dataset and eval_dataset parameters, provide instances of Table. They will typically be loaded from the Hugging Face Hub as illustrated above.

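from torch import nn
from transformers import TrainingArguments
from tlc.integration.hugging_face import TLCTrainer

def compute_tlc_metrics(logits, labels):
    # reduction="none" keeps one loss value per sample instead of averaging over the batch
    loss = nn.CrossEntropyLoss(reduction="none")(logits, labels)
    predicted = logits.argmax(dim=-1)  # predicted class index per sample
    correct = labels.eq(predicted)  # per-sample boolean: prediction matches label
    return {
        "loss": loss,
        "predicted": predicted,
        "correct": correct,
    }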

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
)

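# model, tokenizer, data_collator, eval_dataset and compute_metrics are assumed to be defined elsewhere
trainer = TLCTrainer(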
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_hf_metrics=compute_metrics,
    compute_tlc_metrics=compute_tlc_metrics,
    compute_tlc_metrics_on_train_begin=True,
    compute_tlc_metrics_on_train_end=False,
    tlc_metrics_collection_start=0,
    tlc_metrics_collection_epoch_frequency=1,
)

trainer.train()
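To make the per-sample contract of compute_tlc_metrics concrete, here is a framework-free analogue that works on nested Python lists instead of tensors. It is only a sketch: the function name per_sample_metrics and the toy batch are hypothetical, and the real function passed to TLCTrainer would operate on batched tensors as in the example above.

```python
import math


def per_sample_metrics(logits, labels):
    """Compute cross-entropy loss, predicted class and correctness for each sample."""
    losses, predicted, correct = [], [], []
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp over the class scores of this sample.
        top = max(row)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        # Cross-entropy for one sample: -log softmax(row)[label].
        losses.append(log_sum_exp - row[label])
        # Index of the highest score (first index on ties).
        pred = max(range(len(row)), key=row.__getitem__)
        predicted.append(pred)
        correct.append(pred == label)
    return {"loss": losses, "predicted": predicted, "correct": correct}


# Toy batch of three samples with two classes each.
batch_logits = [[2.0, 0.5], [0.1, 1.2], [0.3, 0.3]]
batch_labels = [0, 1, 1]
metrics = per_sample_metrics(batch_logits, batch_labels)
```

Every entry in the returned dictionary holds exactly one value per sample in the batch, which is the property that distinguishes per-sample metrics from aggregate metrics such as a mean loss.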

The integration overrides the .evaluate() method so that it runs evaluation on both the training and evaluation datasets. It is called at the same frequency as in a regular transformers.Trainer. In the example above, evaluation is therefore performed every epoch, with a batch size of per_device_eval_batch_size as defined in transformers.TrainingArguments.