Fine-tuning a model with the 🤗 TLC Trainer API#

This notebook demonstrates how to use the 3LC Hugging Face Trainer API (TLCTrainer) to fine-tune the bert-base-uncased model on the GLUE MRPC dataset.

[2]:
PROJECT_NAME = "bert-base-uncased"
RUN_NAME = "finetuning-run"
DESCRIPTION = "Fine-tuning BERT on MRPC"
TRAIN_DATASET_NAME = "hugging-face-train"
VAL_DATASET_NAME = "hugging-face-val"
CHECKPOINT = "bert-base-uncased"
DEVICE = None
TRAIN_BATCH_SIZE = 64
EVAL_BATCH_SIZE = 256
EPOCHS = 4
OPTIMIZER = "adamw_torch"
TRANSIENT_DATA_PATH = "../transient_data"
TLC_PUBLIC_EXAMPLES_DEVELOPER_MODE = True
INSTALL_DEPENDENCIES = False
[4]:
%%capture
if INSTALL_DEPENDENCIES:
    %pip --quiet install torch --index-url https://download.pytorch.org/whl/cu118
    %pip --quiet install torchvision --index-url https://download.pytorch.org/whl/cu118
    %pip --quiet install accelerate
    %pip --quiet install scikit-learn
    %pip --quiet install 3lc[huggingface]
[7]:
import os

import datasets
import evaluate
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, TrainingArguments

import tlc

os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "true"  # Removing BertTokenizerFast tokenizer warning

datasets.utils.logging.disable_progress_bar()
[8]:
if DEVICE is None:
    if torch.cuda.is_available():
        DEVICE = "cuda"
    elif torch.backends.mps.is_available():
        DEVICE = "mps"
    else:
        DEVICE = "cpu"

Initialize a 3LC Run#

We initialize a Run with a call to tlc.init, and add the configuration to the Run object.

[9]:
run = tlc.init(
    project_name=PROJECT_NAME,
    run_name=RUN_NAME,
    description=DESCRIPTION,
    if_exists="overwrite",
)

With the 3LC integration, you can use tlc.Table.from_hugging_face() as a drop-in replacement for datasets.load_dataset() to create a tlc.Table. Notice .latest(), which gets the latest version of the 3LC dataset.

[10]:
tlc_train_dataset = tlc.Table.from_hugging_face(
    "glue",
    "mrpc",
    split="train",
    project_name=PROJECT_NAME,
    dataset_name=TRAIN_DATASET_NAME,
    if_exists="overwrite",
).latest()

tlc_val_dataset = tlc.Table.from_hugging_face(
    "glue",
    "mrpc",
    split="validation",
    project_name=PROJECT_NAME,
    dataset_name=VAL_DATASET_NAME,
    if_exists="overwrite",
).latest()

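Table provides a map method that applies preprocessing and on-the-fly transforms to your data before it is sent to the model.

This differs from Hugging Face's Dataset.map, which returns a new dataset with the function's output merged into each example. Here the mapped function returns the full sample itself, so the function below merges the original example with the tokenizer output.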

[11]:
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)


def tokenize_function_tlc(example):
    return {**example, **tokenizer(example["sentence1"], example["sentence2"], truncation=True)}


tlc_tokenized_dataset_train = tlc_train_dataset.map(tokenize_function_tlc)
tlc_tokenized_dataset_val = tlc_val_dataset.map(tokenize_function_tlc)
[12]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
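
Because the tokenizer is called with truncation=True but without padding, each tokenized sample keeps its own length, and DataCollatorWithPadding pads every batch to the length of its longest sample. The following is a dependency-free sketch of both steps. Note that fake_tokenizer, tokenize_example and pad_batch are hypothetical stand-ins for illustration, not the transformers or 3LC implementations.

```python
def fake_tokenizer(sentence1, sentence2):
    """Hypothetical stand-in for a tokenizer: one integer id per whitespace-separated word."""
    words = sentence1.split() + sentence2.split()
    input_ids = [len(word) for word in words]  # arbitrary ids, for illustration only
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


def tokenize_example(example):
    # Same merge pattern as tokenize_function_tlc: keep the original columns, add the new ones.
    return {**example, **fake_tokenizer(example["sentence1"], example["sentence2"])}


def pad_batch(samples, pad_id=0):
    """Pad each sequence to the longest one in the batch (dynamic padding)."""
    max_len = max(len(s["input_ids"]) for s in samples)
    padded = []
    for s in samples:
        n_pad = max_len - len(s["input_ids"])
        padded.append(
            {
                "input_ids": s["input_ids"] + [pad_id] * n_pad,
                "attention_mask": s["attention_mask"] + [0] * n_pad,
            }
        )
    return padded


examples = [
    {"sentence1": "A short pair", "sentence2": "of sentences", "label": 1},
    {"sentence1": "A somewhat longer first sentence", "sentence2": "and a second one", "label": 0},
]
tokenized = [tokenize_example(e) for e in examples]
batch = pad_batch(tokenized)

print(sorted(tokenized[0]))                  # original columns plus tokenizer output
print([len(b["input_ids"]) for b in batch])  # every sample now has the batch's maximum length
```

Padding per batch rather than to a fixed global length keeps short batches small, which is why padding is left to the collator instead of being done at tokenization time.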

Here we load the model with a two-label classification head.

[13]:
# For demonstration purposes, we load bert-base-uncased with a new two-label classification head that
# the checkpoint does not contain. As a result, there will be a warning that the classifier weights
# are newly initialized. This is expected and can be ignored.
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Setup Metrics Collection#

Metrics are computed by a function that returns the per-sample values you would like to see in the 3LC Dashboard.

This differs from Hugging Face's compute_metrics, which returns aggregate metrics over the evaluated data. Here we want results at per-sample granularity.

[14]:
def compute_tlc_metrics(logits, labels):
    probabilities = torch.nn.functional.softmax(logits, dim=-1)

    predictions = logits.argmax(dim=-1)
    loss = torch.nn.functional.cross_entropy(logits, labels, reduction="none")
    confidence = probabilities.gather(dim=-1, index=predictions.unsqueeze(-1)).squeeze(-1)

    return {
        "predicted": predictions,
        "loss": loss,
        "confidence": confidence,
    }


id2label = {0: "not_equivalent", 1: "equivalent"}
schemas = {
    "predicted": tlc.CategoricalLabelSchema(
        display_name="Predicted Label", class_names=id2label.values(), display_importance=4005
    ),
    "loss": tlc.Schema(display_name="Loss", writable=False, value=tlc.Float32Value()),
    "confidence": tlc.Schema(display_name="Confidence", writable=False, value=tlc.Float32Value()),
}
compute_tlc_metrics.column_schemas = schemas
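
To see what per-sample granularity adds, the sketch below recomputes the three metrics returned by compute_tlc_metrics for a toy batch in plain Python and contrasts them with a single aggregate accuracy. The logits and labels are made-up values, and softmax and per_sample_metrics are hypothetical helpers written for this illustration.

```python
import math


def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def per_sample_metrics(logits, labels):
    """Plain-Python analogue of compute_tlc_metrics: one value per sample for each metric."""
    predicted, loss, confidence = [], [], []
    for row, label in zip(logits, labels):
        probs = softmax(row)
        pred = max(range(len(probs)), key=probs.__getitem__)
        predicted.append(pred)
        loss.append(-math.log(probs[label]))  # cross-entropy without reduction
        confidence.append(probs[pred])        # probability of the predicted class
    return {"predicted": predicted, "loss": loss, "confidence": confidence}


# Made-up logits for three samples and two classes.
logits = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.0]]
labels = [0, 1, 1]

metrics = per_sample_metrics(logits, labels)
accuracy = sum(p == y for p, y in zip(metrics["predicted"], labels)) / len(labels)

print(metrics["predicted"])  # one prediction per sample
print(metrics["loss"])       # one loss per sample
print(accuracy)              # a single aggregate number for the whole batch
```

The aggregate accuracy only says that one of the three samples is wrong; the per-sample loss identifies which one.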
[15]:
# Add references to the input datasets used by the Run.
run.add_input_table(tlc_train_dataset)
run.add_input_table(tlc_val_dataset)

Train the model with TLCTrainer#

To perform model training, we replace the usual Trainer with TLCTrainer and provide the per-sample metrics collection function.

In this example, we still compute the GLUE MRPC metrics through the compute_hf_metrics argument (TLCTrainer renames Trainer's compute_metrics argument to compute_hf_metrics to avoid confusion).

The per-sample 3LC metrics are computed by the function passed as compute_tlc_metrics.

For the latter, we can choose when metrics collection starts, here at epoch 2 (tlc_metrics_collection_start, indexed from 0), and how often it runs, here every epoch (tlc_metrics_collection_epoch_frequency).

You can also switch the evaluation strategy to “steps” with evaluation_strategy and set the frequency with eval_steps. In that case, tlc_metrics_collection_start, if used, should be a multiple of eval_steps, and tlc_metrics_collection_epoch_frequency is disabled because the frequency comes from eval_steps.

We also specify that we would like to collect metrics prior to training with compute_tlc_metrics_on_train_begin.
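
One way to read the epoch-based settings is as a simple schedule over zero-indexed epochs. The sketch below assumes that collection happens at the end of every epoch e with e >= tlc_metrics_collection_start and (e - start) divisible by the frequency. This reading is an assumption made for illustration, not taken from the TLCTrainer source, and collection_epochs is a hypothetical helper.

```python
def collection_epochs(num_epochs, start=0, frequency=1):
    """Zero-indexed epochs after which per-sample metrics would be collected.

    Hypothetical helper: assumes collection at every `frequency`-th epoch from `start` on.
    """
    if frequency < 1:
        raise ValueError("frequency must be at least 1")
    return [e for e in range(num_epochs) if e >= start and (e - start) % frequency == 0]


print(collection_epochs(num_epochs=4, start=2, frequency=1))   # [2, 3]
print(collection_epochs(num_epochs=10, start=2, frequency=3))  # [2, 5, 8]
```

With the values used in this notebook (4 epochs, start 2, frequency 1), the sketch yields epochs 2 and 3, the last two epochs of training.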

[16]:
from tlc.integration.hugging_face import TLCTrainer


def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


training_args = TrainingArguments(
    output_dir=TRANSIENT_DATA_PATH,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    optim=OPTIMIZER,
    num_train_epochs=EPOCHS,
    report_to="none",  # Disable wandb logging
    use_cpu=DEVICE == "cpu",
    evaluation_strategy="epoch",
    disable_tqdm=True,
    # evaluation_strategy="steps",  # For running metrics on steps
    # eval_steps=20,  # For running metrics on steps
)

trainer = TLCTrainer(
    model=model,
    args=training_args,
    train_dataset=tlc_tokenized_dataset_train,
    eval_dataset=tlc_tokenized_dataset_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_hf_metrics=compute_metrics,
    compute_tlc_metrics=compute_tlc_metrics,
    compute_tlc_metrics_on_train_begin=True,
    compute_tlc_metrics_on_train_end=False,
    tlc_metrics_collection_start=2,
    tlc_metrics_collection_epoch_frequency=1,
)
/home/build/ado/w/1/huggingface-finetuning_venv/lib/python3.9/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
<frozen tlc.integration.hugging_face.trainer>:88: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `TLCTrainer.__init__`. Use `processing_class` instead.
[17]:
trainer.train()
{'eval_loss': 0.6377372741699219, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.6747546346782988, 'eval_f1': 0.8057319654779352, 'eval_runtime': 16.0147, 'eval_samples_per_second': 229.04, 'eval_steps_per_second': 0.937}
{'eval_loss': 0.6342514753341675, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.6838235294117647, 'eval_f1': 0.8122270742358079, 'eval_runtime': 1.7978, 'eval_samples_per_second': 226.945, 'eval_steps_per_second': 1.112}
{'eval_loss': 0.3899233937263489, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.8186274509803921, 'eval_f1': 0.8664259927797834, 'eval_runtime': 1.79, 'eval_samples_per_second': 227.936, 'eval_steps_per_second': 1.117, 'epoch': 1.0}
{'eval_loss': 0.3683539628982544, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.8455882352941176, 'eval_f1': 0.8941176470588236, 'eval_runtime': 1.725, 'eval_samples_per_second': 236.516, 'eval_steps_per_second': 1.159, 'epoch': 2.0}
{'eval_loss': 0.062457989901304245, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.9869138495092693, 'eval_f1': 0.9902794653705954, 'eval_runtime': 16.056, 'eval_samples_per_second': 228.451, 'eval_steps_per_second': 0.934, 'epoch': 3.0}
{'eval_loss': 0.3645003139972687, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.8700980392156863, 'eval_f1': 0.9065255731922398, 'eval_runtime': 1.737, 'eval_samples_per_second': 234.881, 'eval_steps_per_second': 1.151, 'epoch': 3.0}
{'eval_loss': 0.044442228972911835, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.9899127589967285, 'eval_f1': 0.9925448317549869, 'eval_runtime': 16.5243, 'eval_samples_per_second': 221.976, 'eval_steps_per_second': 0.908, 'epoch': 4.0}
{'eval_loss': 0.4523918628692627, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.8676470588235294, 'eval_f1': 0.9078498293515358, 'eval_runtime': 1.709, 'eval_samples_per_second': 238.735, 'eval_steps_per_second': 1.17, 'epoch': 4.0}
{'train_runtime': 217.0511, 'train_samples_per_second': 67.597, 'train_steps_per_second': 1.069, 'train_loss': 0.26430149736075564, 'epoch': 4.0}
[17]:
TrainOutput(global_step=232, training_loss=0.26430149736075564, metrics={'train_runtime': 217.0511, 'train_samples_per_second': 67.597, 'train_steps_per_second': 1.069, 'train_loss': 0.26430149736075564, 'epoch': 4.0})
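
The eval_accuracy and eval_f1 values in the log are the two numbers reported by the GLUE MRPC metric. As a dependency-free sketch of how they are defined, the snippet below computes both for a classifier that predicts "equivalent" for every pair. The label counts (408 validation pairs, 279 of them equivalent) are an assumption, not stated in this notebook; they were chosen because they reproduce the second log line, which suggests that the freshly initialized classifier predicts a single class before training. accuracy_and_f1 is a hypothetical helper, not the evaluate implementation.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy and binary F1 for the positive class, computed from scratch."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Assumed label distribution: 279 "equivalent" (1) and 129 "not_equivalent" (0) pairs.
references = [1] * 279 + [0] * 129
predictions = [1] * len(references)  # always predict "equivalent"

result = accuracy_and_f1(predictions, references)
print(round(result["accuracy"], 4), round(result["f1"], 4))  # 0.6838 0.8122
```

An F1 above 0.8 from a constant prediction shows why accuracy and F1 on an imbalanced set are best read against such a baseline.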