Fine-tuning a model with the 🤗 TLC Trainer API¶
This notebook demonstrates how to use our Hugging Face TLCTrainer API to fine-tune the bert-base-uncased model.
[ ]:
PROJECT_NAME = "bert-base-uncased"
RUN_NAME = "finetuning-run"
DESCRIPTION = "Fine-tuning BERT on MRPC"
TRAIN_DATASET_NAME = "hugging-face-train"
VAL_DATASET_NAME = "hugging-face-val"
CHECKPOINT = "bert-base-uncased"
DEVICE = None
TRAIN_BATCH_SIZE = 64
EVAL_BATCH_SIZE = 256
EPOCHS = 4
OPTIMIZER = "adamw_torch"
TRANSIENT_DATA_PATH = "../transient_data"
TLC_PUBLIC_EXAMPLES_DEVELOPER_MODE = True
INSTALL_DEPENDENCIES = False
[ ]:
%%capture
if INSTALL_DEPENDENCIES:
    %pip --quiet install torch --index-url https://download.pytorch.org/whl/cu118
    %pip --quiet install torchvision --index-url https://download.pytorch.org/whl/cu118
    %pip --quiet install accelerate
    %pip --quiet install scikit-learn
    %pip --quiet install 3lc[huggingface]
[ ]:
import os
import datasets
import evaluate
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, TrainingArguments
import tlc
os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "true" # Removing BertTokenizerFast tokenizer warning
datasets.utils.logging.disable_progress_bar()
[ ]:
if DEVICE is None:
    if torch.cuda.is_available():
        DEVICE = "cuda"
    elif torch.backends.mps.is_available():
        DEVICE = "mps"
    else:
        DEVICE = "cpu"
Initialize a 3LC Run¶
We initialize a Run with a call to tlc.init, and add the configuration to the Run object.
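The cell below is a minimal sketch of this step, assuming the standard tlc.init() signature; the parameters argument used to attach the notebook's configuration constants to the Run is an assumption here.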
[ ]:
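# Minimal sketch: create the Run and attach the configuration constants.
# Assumption: tlc.init() accepts project_name, run_name, description, and a
# parameters dict for run configuration.
config = {
    "checkpoint": CHECKPOINT,
    "train_batch_size": TRAIN_BATCH_SIZE,
    "eval_batch_size": EVAL_BATCH_SIZE,
    "epochs": EPOCHS,
    "optimizer": OPTIMIZER,
}

run = tlc.init(
    project_name=PROJECT_NAME,
    run_name=RUN_NAME,
    description=DESCRIPTION,
    parameters=config,
)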
With the 3LC integration, you can use tlc.Table.from_hugging_face() as a drop-in replacement for datasets.load_dataset() to create a tlc.Table. Notice .latest(), which gets the latest version of the 3LC dataset.
[ ]:
tlc_train_dataset = tlc.Table.from_hugging_face(
    "glue",
    "mrpc",
    split="train",
    project_name=PROJECT_NAME,
    dataset_name=TRAIN_DATASET_NAME,
    if_exists="overwrite",
).latest()

tlc_val_dataset = tlc.Table.from_hugging_face(
    "glue",
    "mrpc",
    split="validation",
    project_name=PROJECT_NAME,
    dataset_name=VAL_DATASET_NAME,
    if_exists="overwrite",
).latest()
Table provides a map method to apply both preprocessing and on-the-fly transforms to your data before it is sent to the model. This differs from Hugging Face datasets, where map() directly generates a new dataset containing the transformed examples.
[ ]:
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def tokenize_function_tlc(example):
    return {**example, **tokenizer(example["sentence1"], example["sentence2"], truncation=True)}

tlc_tokenized_dataset_train = tlc_train_dataset.map(tokenize_function_tlc)
tlc_tokenized_dataset_val = tlc_val_dataset.map(tokenize_function_tlc)
[ ]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Here we define our model with two output labels.
[ ]:
# For demonstration purposes, we use the bert-base-uncased model with a newly initialized
# classification head. As a result, there will be a warning that the classifier weights
# are newly initialized. This is expected and can be ignored.
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
Setup Metrics Collection¶
Metrics collection is set up by implementing a function that returns the per-sample metrics you would like to see in the 3LC Dashboard.
This differs from the usual compute_metrics in Hugging Face, which computes metrics in aggregate rather than per sample; here we want results at per-sample granularity.
[ ]:
def compute_tlc_metrics(logits, labels):
    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    predictions = logits.argmax(dim=-1)
    # Per-sample loss: reduction="none" keeps one value per example.
    loss = torch.nn.functional.cross_entropy(logits, labels, reduction="none")
    # Probability assigned to the predicted class.
    confidence = probabilities.gather(dim=-1, index=predictions.unsqueeze(-1)).squeeze()
    return {
        "predicted": predictions,
        "loss": loss,
        "confidence": confidence,
    }

id2label = {0: "not_equivalent", 1: "equivalent"}

# Schemas tell the 3LC Dashboard how to display each metrics column.
schemas = {
    "predicted": tlc.CategoricalLabelSchema(
        display_name="Predicted Label", class_names=id2label.values(), display_importance=4005
    ),
    "loss": tlc.Schema(display_name="Loss", writable=False, value=tlc.Float32Value()),
    "confidence": tlc.Schema(display_name="Confidence", writable=False, value=tlc.Float32Value()),
}
compute_tlc_metrics.column_schemas = schemas
[ ]:
# Add references to the input datasets used by the Run.
run.add_input_table(tlc_train_dataset)
run.add_input_table(tlc_val_dataset)
Train the model with TLCTrainer¶
To perform model training, we replace the usual Trainer with TLCTrainer and provide our per-sample metrics collection function.
In this example, we still compute the aggregate GLUE MRPC metrics via the compute_hf_metrics argument (renamed from compute_metrics to avoid confusion). We also compute our special per-sample 3LC metrics via the compute_tlc_metrics argument.
With the latter, we can choose when metrics collection starts, here at epoch 2 (indexed from 0, using tlc_metrics_collection_start), with a frequency of one epoch (using tlc_metrics_collection_epoch_frequency).
You can also switch the metrics collection strategy to “steps” via eval_strategy and specify the frequency with eval_steps. In that case, if you use tlc_metrics_collection_start, it should be a multiple of eval_steps; note that tlc_metrics_collection_epoch_frequency is disabled in this mode because the original eval_steps variable is used instead.
We also specify that we would like to collect metrics prior to training with compute_tlc_metrics_on_train_begin.
[ ]:
from tlc.integration.hugging_face import TLCTrainer
def compute_metrics(eval_preds):
metric = evaluate.load("glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
training_args = TrainingArguments(
output_dir=TRANSIENT_DATA_PATH,
per_device_train_batch_size=TRAIN_BATCH_SIZE,
per_device_eval_batch_size=EVAL_BATCH_SIZE,
optim=OPTIMIZER,
num_train_epochs=EPOCHS,
report_to="none", # Disable wandb logging
use_cpu=DEVICE == "cpu",
eval_strategy="epoch",
disable_tqdm=True,
# eval_strategy="steps", # For running metrics on steps
# eval_steps=20, # For running metrics on steps
)
trainer = TLCTrainer(
model=model,
args=training_args,
train_dataset=tlc_tokenized_dataset_train,
eval_dataset=tlc_tokenized_dataset_val,
tokenizer=tokenizer,
data_collator=data_collator,
compute_hf_metrics=compute_metrics,
compute_tlc_metrics=compute_tlc_metrics,
compute_tlc_metrics_on_train_begin=True,
compute_tlc_metrics_on_train_end=False,
tlc_metrics_collection_start=2,
tlc_metrics_collection_epoch_frequency=1,
)
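Finally, we launch training. TLCTrainer inherits the standard Trainer interface, so training is started with the usual train() call, sketched below.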
[ ]:
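# Run training; per-sample 3LC metrics are collected according to the settings above.
trainer.train()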