🤗 and 3LC example on the IMDb dataset

This notebook demonstrates fine-tuning a pretrained DistilBERT model from transformers on the IMDb dataset, using the 3LC integrations with Trainer and datasets from Hugging Face. 3LC metrics are collected before and after one epoch of training.

The notebook covers:

  • Creating a Table from a datasets dataset.

  • Fine-tuning a pretrained transformers model on the IMDb dataset with TLCTrainer.

  • Using a custom function for metrics collection.

Project Setup

[2]:
EPOCHS = 1
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 256
TRAIN_DATASET_NAME = "hf-imdb-train"
EVAL_DATASET_NAME = "hf-imdb-test"
TRANSIENT_DATA_PATH = "../transient_data"
DEVICE = "cuda:0"
PROJECT_NAME = "hf-imdb"
RUN_NAME = "Train DistilBERT on IMDB"
DESCRIPTION = "Example notebook for training a DistilBERT model on the IMDB dataset"
TLC_PUBLIC_EXAMPLES_DEVELOPER_MODE = True
INSTALL_DEPENDENCIES = False
[4]:
%%capture
if INSTALL_DEPENDENCIES:
    %pip --quiet install torch --index-url https://download.pytorch.org/whl/cu118
    %pip --quiet install torchvision --index-url https://download.pytorch.org/whl/cu118
    %pip --quiet install accelerate
    %pip --quiet install scikit-learn
    %pip --quiet install 3lc[huggingface]

Imports

[8]:
import os

import datasets
import evaluate
import numpy as np
import tlc
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, TrainingArguments

os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "true"  # Suppress the DistilBertTokenizerFast advisory warning

datasets.utils.logging.disable_progress_bar()

Initialize a 3LC Run

We initialize a Run with a call to tlc.init, and add the configuration to the Run object.

[9]:
config = {
    "epochs": EPOCHS,
    "train_batch_size": TRAIN_BATCH_SIZE,
    "eval_batch_size": EVAL_BATCH_SIZE,
}

run = tlc.init(
    project_name=PROJECT_NAME,
    run_name=RUN_NAME,
    description=DESCRIPTION,
    parameters=config,
    if_exists="overwrite",
)

With the 3LC integration, you can use tlc.Table.from_hugging_face() as a drop-in replacement for datasets.load_dataset() to create a tlc.Table. Note the trailing .latest() call, which gets the latest version of the 3LC dataset.

[10]:
train_dataset = tlc.Table.from_hugging_face(
    "imdb",
    split="train",
    project_name=PROJECT_NAME,
    dataset_name=TRAIN_DATASET_NAME,
    description="IMDB train dataset",
    if_exists="overwrite",
).latest()

eval_dataset = tlc.Table.from_hugging_face(
    "imdb",
    split="test",
    project_name=PROJECT_NAME,
    dataset_name=EVAL_DATASET_NAME,
    description="IMDB test dataset",
    if_exists="overwrite",
).latest()

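You can use the data produced by these Tables just as you would use a 🤗 dataset. The next two cells show that the first sample is identical whether it is read from the 🤗 dataset or from the Table.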

[11]:
train_dataset_hf = datasets.load_dataset("imdb", split="train")
train_dataset_hf[0]
[11]:
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.',
 'label': 0}
[12]:
train_dataset[0]
[12]:
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.',
 'label': 0}

Table provides a map method for applying both preprocessing and on-the-fly transforms to your data before it is sent to the model. Here we use it to tokenize the text of each sample.

[13]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", model_max_length=512)
def tokenize(sample):
    # Keep the original columns and add the tokenizer outputs alongside them.
    return {**sample, **tokenizer(sample["text"], truncation=True)}
[14]:
train_tokenized = train_dataset.map(tokenize)
eval_tokenized = eval_dataset.map(tokenize)
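To make the effect of the tokenize step concrete, here is a minimal sketch using a toy whitespace tokenizer. VOCAB and toy_tokenizer are hypothetical stand-ins introduced only for illustration; the real DistilBERT tokenizer uses WordPiece and a much larger vocabulary. The point is that the mapped function keeps the original text and label fields and adds the tokenizer outputs next to them.

```python
# Hypothetical stand-in for a Hugging Face tokenizer: splits on whitespace
# and looks up ids in a tiny vocabulary (unknown words map to [UNK]).
VOCAB = {"[UNK]": 0, "a": 1, "great": 2, "movie": 3}


def toy_tokenizer(text, truncation=True, max_length=4):
    ids = [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]
    if truncation:
        ids = ids[:max_length]  # drop tokens beyond the maximum length
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def tokenize(sample):
    # Merge the original columns with the tokenizer outputs.
    return {**sample, **toy_tokenizer(sample["text"])}


sample = {"text": "A great movie overall really", "label": 1}
tokenized = tokenize(sample)
print(tokenized)
```

The five-word review is truncated to four tokens, and the text and label entries pass through unchanged.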
[15]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
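DataCollatorWithPadding pads each batch to the length of its longest sequence instead of padding every sample to a global maximum. The sketch below is a simplified pure-Python illustration of that idea, not the library's implementation. The pad_batch helper, the token ids, and the pad_token_id value of 0 are assumptions made for the example.

```python
def pad_batch(samples, pad_token_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": []}
    for s in samples:
        n_pad = longest - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get mask 0 so the model can ignore them.
        batch["attention_mask"].append(s["attention_mask"] + [0] * n_pad)
    return batch


# Two tokenized samples of different lengths (illustrative ids).
samples = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2023, 3185, 2001, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(samples)
print(batch)
```

Because padding is decided per batch, batches of short reviews stay short, which saves computation compared with padding everything to 512 tokens.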
[16]:
id2label = {0: "neg", 1: "pos"}
label2id = {"neg": 0, "pos": 1}

# The distilbert-base-uncased checkpoint does not include a sequence classification head, so the
# classifier and pre_classifier weights are newly initialized and a warning is printed.
# This is expected and can be ignored, since these weights are trained during fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Set Up Metrics Collection

Metrics are computed by a function that returns the per-sample metrics you would like to see in the 3LC Dashboard. Each returned value holds one entry per sample in the batch.

We also keep the usual Hugging Face metrics function (compute_metrics) so that the intermediate aggregate metrics are still reported during training.

Metrics that need special handling are given a schema. For example, we specify that the predicted category should be shown as a CategoricalLabel.

[17]:
accuracy = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)


def compute_tlc_metrics(logits, labels):
    probabilities = torch.nn.functional.softmax(logits, dim=-1)

    predictions = logits.argmax(dim=-1)
    loss = torch.nn.functional.cross_entropy(logits, labels, reduction="none")
    confidence = probabilities.gather(dim=-1, index=predictions.unsqueeze(-1)).squeeze(-1)

    return {
        "predicted": predictions,
        "loss": loss,
        "confidence": confidence,
    }


compute_tlc_metrics.column_schemas = {
    "predicted": tlc.CategoricalLabelSchema(
        display_name="Predicted Label", class_names=id2label.values(), display_importance=4005
    ),
    "loss": tlc.Schema(display_name="Loss", writable=False, value=tlc.Float32Value()),
    "confidence": tlc.Schema(display_name="Confidence", writable=False, value=tlc.Float32Value()),
}
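As a worked example of the per-sample metrics above, the following dependency-free sketch repeats the same arithmetic (softmax, argmax, cross-entropy, confidence) for individual samples and then aggregates accuracy over the batch. The per_sample_metrics helper and the logits are made up for illustration; the notebook itself computes these values with torch on whole batches.

```python
import math


def per_sample_metrics(logits, label):
    """Predicted class, cross-entropy loss, and confidence for one sample."""
    # Numerically stable softmax: subtract the max logit before exponentiating.
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    probabilities = [e / sum(exps) for e in exps]

    predicted = max(range(len(logits)), key=lambda i: logits[i])
    loss = -math.log(probabilities[label])  # cross-entropy for a single sample
    confidence = probabilities[predicted]  # probability of the predicted class
    return {"predicted": predicted, "loss": loss, "confidence": confidence}


# Made-up logits for two reviews; index 0 is "neg", index 1 is "pos".
batch_logits = [[2.0, 0.0], [0.5, 1.5]]
batch_labels = [0, 0]

metrics = [per_sample_metrics(row, y) for row, y in zip(batch_logits, batch_labels)]
accuracy = sum(m["predicted"] == y for m, y in zip(metrics, batch_labels)) / len(batch_labels)

for m in metrics:
    print(m)
print("accuracy:", accuracy)
```

The second sample is misclassified with fairly high confidence, so it gets a large loss. Samples like this are the ones that per-sample metrics make easy to find, whereas the aggregate accuracy only reports that half of the batch was correct.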
[18]:
# Add references to the input datasets used by the Run.
run.add_input_table(train_dataset)
run.add_input_table(eval_dataset)

Train the model with TLCTrainer

To perform model training, we replace the usual Trainer with TLCTrainer and provide the per-sample metrics collection function. We also specify that we would like to collect metrics prior to training.

[19]:
from tlc.integration.hugging_face import TLCTrainer

training_args = TrainingArguments(
    output_dir=TRANSIENT_DATA_PATH,
    learning_rate=2e-5,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    report_to="none",  # Disable reporting to external loggers such as wandb
    evaluation_strategy="epoch",
    no_cuda=DEVICE == "cpu",
    disable_tqdm=True,
)

trainer = TLCTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_hf_metrics=compute_metrics,
    compute_tlc_metrics=compute_tlc_metrics,
    compute_tlc_metrics_on_train_begin=True,
)
[20]:
trainer.train()
{'eval_loss': 0.6962834596633911, 'eval_accuracy': 0.42024, 'eval_runtime': 390.6603, 'eval_samples_per_second': 63.994, 'eval_steps_per_second': 0.251}
{'eval_loss': 0.6968120336532593, 'eval_accuracy': 0.40564, 'eval_runtime': 391.1293, 'eval_samples_per_second': 63.917, 'eval_steps_per_second': 0.251}
{'loss': 0.3229, 'grad_norm': 17.37851333618164, 'learning_rate': 1.3602047344849649e-05, 'epoch': 0.3198976327575176}
{'loss': 0.2408, 'grad_norm': 7.709912300109863, 'learning_rate': 7.204094689699297e-06, 'epoch': 0.6397952655150352}
{'loss': 0.2172, 'grad_norm': 10.758538246154785, 'learning_rate': 8.061420345489445e-07, 'epoch': 0.9596928982725528}
{'eval_loss': 0.1934163123369217, 'eval_accuracy': 0.92732, 'eval_runtime': 389.3663, 'eval_samples_per_second': 64.207, 'eval_steps_per_second': 0.252, 'epoch': 1.0}
{'train_runtime': 1481.545, 'train_samples_per_second': 16.874, 'train_steps_per_second': 1.055, 'train_loss': 0.2584903842733216, 'epoch': 1.0}
[20]:
TrainOutput(global_step=1563, training_loss=0.2584903842733216, metrics={'train_runtime': 1481.545, 'train_samples_per_second': 16.874, 'train_steps_per_second': 1.055, 'train_loss': 0.2584903842733216, 'epoch': 1.0})
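The step count, epoch fractions, and learning rates in the training log can be checked with a little arithmetic. The sketch below assumes the TrainingArguments defaults that the notebook does not override (a linear learning-rate schedule with no warmup, and logging every 500 steps), a single device without gradient accumulation, and the 25,000-sample IMDb train split. The linear_lr helper is written for this illustration and is not part of transformers.

```python
import math

NUM_TRAIN_SAMPLES = 25_000  # size of the IMDb train split
TRAIN_BATCH_SIZE = 16
EPOCHS = 1
INITIAL_LR = 2e-5

# The last, partially filled batch still counts as one optimizer step.
steps_per_epoch = math.ceil(NUM_TRAIN_SAMPLES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step):
    """Learning rate after `step` optimizer steps: linear decay to zero, no warmup."""
    return INITIAL_LR * (total_steps - step) / total_steps


print("total steps:", total_steps)
for step in (500, 1000, 1500):
    print(step, linear_lr(step), step / steps_per_epoch)
```

Under these assumptions the sketch gives 1563 total steps, matching global_step in the TrainOutput, and the learning rates and epoch fractions at steps 500, 1000, and 1500 agree with the three logged training lines up to floating-point rounding.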