Integrating 3LC with 🤗

This document describes how to integrate 3LC into projects that use Hugging Face (🤗) Python packages. Hugging Face hosts an ever-growing set of datasets, pre-trained models, and other functionality on its platform, made available through Python packages such as 🤗 Datasets and Transformers. 3LC provides a convenient workflow that integrates with these libraries.

The core functionality comes from the methods Table.from_hugging_face_hub and Table.from_hugging_face_dataset, which connect 🤗 datasets with the 3LC universe by returning a corresponding tlc.Table.

Note

To use the Hugging Face 🤗 integration, the 🤗 Datasets and Transformers packages must be installed in the Python environment. You can install them directly or install the 3lc[huggingface] extra, which installs them for you. If they are not installed, the modules in tlc.integration.hugging_face will not be available.
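Before importing the integration, it can be useful to confirm that both packages are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of 3LC.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two packages the note above lists as requirements.
required = ["datasets", "transformers"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If either package is reported as missing, installing it directly or through the 3lc[huggingface] extra should resolve it.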

Datasets

Loading from the Hugging Face Hub

To load a dataset directly from the Hugging Face Hub, use tlc.Table.from_hugging_face_hub() similarly to how you would call datasets.load_dataset. The difference is that the dataset is registered with 3LC and a Table is returned instead of a datasets.Dataset.

As a result, you may have to change code that is specific to datasets.Dataset, such as preprocessing and transforms. tlc.Table.with_transform() can be used instead to obtain a non-mutating view that applies a transform on read.

In the following example we load the train split of the IMDb dataset:

import tlc

train_dataset = tlc.Table.from_hugging_face_hub("imdb", split="train", table_name="imdb-train")
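As a rough illustration of the kind of transform mentioned above, the sketch below defines a pure function that returns a modified copy of a sample instead of changing it in place. It assumes each sample is a dict with "text" and "label" keys, as in IMDb; the function name truncate_text and the commented-out call are hypothetical, and the exact signature of with_transform should be checked against the 3LC API reference.

```python
def truncate_text(sample, max_chars=1000):
    """Return a copy of the sample with its text cut to at most max_chars characters."""
    transformed = dict(sample)  # shallow copy, so the input sample is left unchanged
    transformed["text"] = sample["text"][:max_chars]
    return transformed


# Stand-in sample, assumed to have the same keys as an IMDb row.
sample = {"text": "x" * 1500, "label": 1}
short = truncate_text(sample)
print(len(sample["text"]), len(short["text"]))

# Assumed usage, not verified against the 3LC API:
# transformed_view = train_dataset.with_transform(truncate_text)
```

Keeping the transform free of side effects matches the non-mutating behavior described for the returned view.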

Using an in-memory Dataset

If you already have an in-memory datasets.Dataset (e.g. constructed programmatically, filtered, or loaded locally), use Table.from_hugging_face_dataset:

import datasets
import tlc

hf_dataset = datasets.load_dataset("imdb", split="train")
hf_dataset_filtered = hf_dataset.filter(lambda x: len(x["text"]) < 1000)

train_dataset = tlc.Table.from_hugging_face_dataset(hf_dataset_filtered, table_name="imdb-short-train")

Note

The integration raises an error for datasets that contain features 3LC does not yet support, such as datasets.Audio.
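To catch such columns before converting a dataset, you can inspect its features mapping first. The sketch below is one possible approach: the helper find_features_by_type is hypothetical, it only looks at top-level columns, and "Audio" is the only unsupported type named in this document. The Audio and Value classes are stand-ins so the example runs without 🤗 Datasets installed.

```python
def find_features_by_type(features, type_names=("Audio",)):
    """Return the column names whose feature class name appears in type_names."""
    return [
        name
        for name, feature in features.items()
        if type(feature).__name__ in type_names
    ]


# Stand-in feature classes; with 🤗 Datasets installed you would pass
# hf_dataset.features instead of this hand-built mapping.
class Audio:
    pass


class Value:
    pass


example_features = {"speech": Audio(), "label": Value()}
print(find_features_by_type(example_features))
```

Columns reported this way could be removed or converted before calling Table.from_hugging_face_dataset.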

Collecting metrics

The nature of metrics collection depends on how the Transformers library is used.

Custom workflows

For custom training loops which do not use a transformers.Trainer, refer to the metrics collection documentation.

Trainer

The transformers.Trainer class can be used for most standard use cases. To integrate 3LC, use tlc.integration.hugging_face.trainer.Trainer as a replacement for it, and pass its constructor a function that accepts batched model outputs and targets and returns the desired per-sample metrics for the batch.

For the train_dataset and eval_dataset parameters, provide instances of Table. They will typically be loaded from the Hugging Face Hub as illustrated above.

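from torch import nn
from transformers import TrainingArguments
from tlc.integration.hugging_face.trainer import Trainer

def compute_tlc_metrics(logits, labels):
    # reduction="none" keeps one loss value per sample instead of averaging over the batch
    loss = nn.CrossEntropyLoss(reduction="none")(logits, labels)
    # Index of the highest-scoring class for each sample
    predicted = logits.argmax(dim=-1)
    # Boolean per sample: does the prediction match the label?
    correct = labels.eq(predicted)
    return {
        "loss": loss,
        "predicted": predicted,
        "correct": correct,
    }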

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_strategy="epoch",
)

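# model, eval_dataset, tokenizer, data_collator, and compute_metrics are assumed to be defined earlier
trainer = Trainer(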
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_hf_metrics=compute_metrics,
    compute_tlc_metrics=compute_tlc_metrics,
    compute_tlc_metrics_on_train_begin=True,
    compute_tlc_metrics_on_train_end=False,
    tlc_metrics_collection_start=0,
    tlc_metrics_collection_epoch_frequency=1,
)

trainer.train()

The integration overrides the .evaluate() method and runs evaluation on both the training and evaluation datasets. Evaluation is triggered as often as it would be for a regular transformers.Trainer. In the example above, it therefore runs once per epoch with a batch size of per_device_eval_batch_size, as set in transformers.TrainingArguments.
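To make "per-sample metrics" concrete, the following dependency-free sketch reproduces what compute_tlc_metrics in the Trainer example returns, using plain Python lists instead of tensors. It is only an illustration of the shapes involved: every returned value holds one entry per sample in the batch. The function name per_sample_metrics and the example numbers are hypothetical.

```python
import math


def per_sample_metrics(logits, labels):
    """Return one loss, one predicted class, and one correctness flag per sample."""
    losses, predicted, correct = [], [], []
    for row, label in zip(logits, labels):
        # Cross-entropy for one sample: log-sum-exp of the logits minus the
        # logit of the true class (the maximum is subtracted for stability).
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(value - peak) for value in row))
        losses.append(log_norm - row[label])

        # Argmax over the class dimension; ties resolve to the lowest index.
        best = max(range(len(row)), key=row.__getitem__)
        predicted.append(best)
        correct.append(best == label)
    return {"loss": losses, "predicted": predicted, "correct": correct}


# A batch of three samples with two classes each.
batch_logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]
batch_labels = [0, 1, 0]

metrics = per_sample_metrics(batch_logits, batch_labels)
print(metrics["predicted"], metrics["correct"])
```

Because nothing is averaged over the batch, each value can be attached to the individual sample it came from.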