Integrating 3LC with 🤗#
This document describes how to integrate 3LC in projects using Hugging Face (🤗) Python packages. Hugging Face hosts an ever-growing collection of datasets, pre-trained models, and other functionality on their platform, made available through Python packages such as 🤗 Datasets and 🤗 Transformers. 3LC provides a convenient workflow that integrates with these libraries.
The core concept is the method Table.from_hugging_face, which connects 🤗 Hub datasets with the 3LC universe by returning a corresponding Table.
For complete working examples of using 3LC with Hugging Face, see our 🤗 examples.
Note
In order to use the Hugging Face 🤗 integration, the 🤗 Datasets and 🤗 Transformers packages must be installed in the Python environment. You may install them directly or install the 3lc[huggingface] extra, which will install them for you. If they are not installed, the modules in tlc.integration.hugging_face will not be available.
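The extra can be installed with pip (quoting the requirement keeps the brackets intact in most shells):

pip install "3lc[huggingface]"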
Datasets#
In order to use a datasets.Dataset in 3LC, you can use Table.from_hugging_face similarly to how you would call datasets.load_dataset. The difference is that the dataset is registered with 3LC and a Table is returned instead of a datasets.Dataset. Therefore, you may have to change some of your code that is specific to datasets.Dataset, such as preprocessing and transforms; Table.map can be used instead, as sketched after the loading example below.
In the following example we load the train split of the IMDb dataset:
import tlc
train_dataset = tlc.Table.from_hugging_face("imdb", split="train", table_name="imdb-train")
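Code that previously relied on datasets.Dataset.map for preprocessing can attach a per-sample transform with Table.map instead. A minimal sketch, assuming the map function is applied to each sample on access (the model checkpoint is an arbitrary choice; consult the Table.map reference for the exact semantics):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(sample):
    # Merge the tokenizer output (input_ids, attention_mask, ...) into the sample.
    return {**sample, **tokenizer(sample["text"], truncation=True)}

train_dataset = train_dataset.map(tokenize)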
Any arguments accepted by datasets.load_dataset can be provided. These are then forwarded to the internal TableFromHuggingFace.
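For instance, a dataset configuration name can be passed positionally, just as with datasets.load_dataset. A hypothetical example using the GLUE/MRPC configuration:

val_table = tlc.Table.from_hugging_face("glue", "mrpc", split="validation", table_name="mrpc-val")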
Note
Using the integration with datasets containing features not yet supported by 3LC, such as datasets.Audio, will raise an error.
Collecting metrics#
The nature of metrics collection depends on how the 🤗 Transformers library is used.
Custom workflows#
For custom training loops which do not use a transformers.Trainer, refer to the metrics collection documentation.
Trainer#
The transformers.Trainer class can be used for most standard use cases. To integrate 3LC, use TLCTrainer as a drop-in replacement, and pass a function to its constructor that accepts batched model outputs and targets and returns the desired per-sample metrics for the batch.
For the train_dataset and eval_dataset parameters, provide instances of Table. They will typically be loaded from the Hugging Face Hub as illustrated above.
import torch.nn as nn
from transformers import TrainingArguments
from tlc.integration.hugging_face import TLCTrainer

def compute_tlc_metrics(logits, labels):
    # reduction="none" keeps one loss value per example instead of averaging.
    loss = nn.CrossEntropyLoss(reduction="none")(logits, labels)
    predicted = logits.argmax(dim=-1)
    correct = labels.eq(predicted)
    return {
        "loss": loss,
        "predicted": predicted,
        "correct": correct,
    }
# model, tokenizer, data_collator, and compute_metrics are assumed to be
# defined earlier in the script, alongside train_dataset and eval_dataset.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
)

trainer = TLCTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_hf_metrics=compute_metrics,
    compute_tlc_metrics=compute_tlc_metrics,
    compute_tlc_metrics_on_train_begin=True,
    compute_tlc_metrics_on_train_end=False,
    tlc_metrics_collection_start=0,
    tlc_metrics_collection_epoch_frequency=1,
)
trainer.train()
The integration overrides the .evaluate() method and runs evaluation on both the training and evaluation datasets. The frequency with which this is called is the same as for a regular transformers.Trainer. Therefore, in the above example, evaluation is performed every epoch with a batch size of per_device_eval_batch_size as defined in transformers.TrainingArguments.
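Metrics can also be collected on demand by calling the overridden method directly, just as with a regular transformers.Trainer (a minimal usage sketch):

trainer.evaluate()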