Integrating 3LC with 🤗#
This document describes how to integrate 3LC in projects using Hugging Face (🤗) Python packages. Hugging Face hosts an ever-growing collection of datasets, pre-trained models, and other functionality on its platform, made available through Python packages such as 🤗 Datasets and 🤗 Transformers. 3LC provides a convenient workflow that integrates with these libraries.
The core concept is the method Table.from_hugging_face, which connects 🤗 Hub datasets with the 3LC universe by returning a corresponding Table.
See our 🤗 examples for full working examples of using 3LC with Hugging Face.
Note
In order to use the Hugging Face 🤗 integration, the 🤗 Datasets and 🤗 Transformers packages must be installed in the Python environment. You may install them directly or install the 3lc[huggingface] extra, which will install them for you. If they are not installed, the modules in tlc.integration.hugging_face will not be available.
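For example, the extra can be installed with pip (adapt the command to your own environment and package manager):
pip install "3lc[huggingface]"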
Datasets#
In order to use a datasets.Dataset in 3LC, you can use Table.from_hugging_face similarly to how you would call datasets.load_dataset. The difference is that the dataset is registered with 3LC and a Table is returned instead of a datasets.Dataset.
Therefore, you may have to change some of your code which is specific to datasets.Dataset, such as preprocessing and transforms. Table.map can be used instead.
In the following example we load the train split of the IMDb dataset:
import tlc
train_dataset = tlc.Table.from_hugging_face("imdb", split="train", table_name="imdb-train")
Any arguments accepted by datasets.load_dataset can be provided. These are then forwarded to the internal TableFromHuggingFace.
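For example, a dataset configuration name can be passed along with the other arguments. The following is a minimal sketch; the dataset and configuration names are purely illustrative, and the extra keyword arguments are simply forwarded to datasets.load_dataset:
import tlc

# "glue" / "mrpc" are illustrative; any arguments accepted by
# datasets.load_dataset, such as the configuration name, are forwarded as-is.
val_table = tlc.Table.from_hugging_face(
    "glue",
    name="mrpc",
    split="validation",
    table_name="glue-mrpc-validation",
)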
Note
Using the integration with datasets containing features not yet supported by 3LC, such as datasets.Audio, will raise an error.
Warning
While Dataset.map and Table.map share the same name and are intended to be used in similar contexts, their behavior differs slightly. For Dataset objects, .map retains any columns that are not explicitly modified by the transformation. In contrast, when working with a Table, the transformation passed to .map must provide a full row, where all columns used in training, including those that remain unchanged, are explicitly defined in the output.
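As an illustration, a mapping function for the IMDb table loaded above might look like the following. This is a minimal sketch: it assumes a 🤗 tokenizer has already been loaded, and that the text and label columns of IMDb are used during training.
def tokenize_row(row):
    # `tokenizer` is assumed to be a 🤗 tokenizer loaded earlier,
    # e.g. with AutoTokenizer.from_pretrained(...).
    tokenized = tokenizer(row["text"], truncation=True)
    # Unlike Dataset.map, every column needed for training must be returned,
    # including the unchanged "text" and "label" columns.
    return {
        "text": row["text"],
        "label": row["label"],
        "input_ids": tokenized["input_ids"],
        "attention_mask": tokenized["attention_mask"],
    }

train_dataset = train_dataset.map(tokenize_row)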
Collecting metrics#
The nature of metrics collection depends on how the 🤗 Transformers library is used.
Custom workflows#
For custom training loops which do not use a transformers.Trainer, refer to the metrics collection documentation.
Trainer#
The transformers.Trainer class can be used for most standard use cases. To integrate 3LC, use TLCTrainer as a replacement, and provide the TLCTrainer constructor with a function that accepts batched model outputs and targets and returns the desired per-sample metrics for the batch.
For the train_dataset and eval_dataset parameters, provide instances of Table. They will typically be loaded from the Hugging Face Hub as illustrated above.
from torch import nn
from transformers import TrainingArguments

from tlc.integration.hugging_face import TLCTrainer

# `model`, `tokenizer`, `data_collator`, and `compute_metrics` are assumed to be
# set up as in a standard 🤗 Transformers fine-tuning script.

def compute_tlc_metrics(logits, labels):
    """Return per-sample metrics for a batch of model outputs and targets."""
    loss = nn.CrossEntropyLoss(reduction="none")(logits, labels)
    predicted = logits.argmax(dim=-1)
    correct = labels.eq(predicted)
    return {
        "loss": loss,
        "predicted": predicted,
        "correct": correct,
    }

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
)

trainer = TLCTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_hf_metrics=compute_metrics,
    compute_tlc_metrics=compute_tlc_metrics,
    compute_tlc_metrics_on_train_begin=True,
    compute_tlc_metrics_on_train_end=False,
    tlc_metrics_collection_start=0,
    tlc_metrics_collection_epoch_frequency=1,
)

trainer.train()
The integration overrides the .evaluate() method and runs evaluation on both the training and evaluation datasets. The frequency with which this is called is the same as for a regular transformers.Trainer. Therefore, in the above example, evaluation is performed every epoch with a batch size of per_device_eval_batch_size, as defined in transformers.TrainingArguments.