
HuggingFace Ecosystem

Reading time: ~35 minutes | Level: ML with Python | Role: MLE, AI Engineer, Research Engineer


The Bloomberg NLP Team, 2022

Picture an NLP team at Bloomberg in early 2022. They need to fine-tune a financial sentiment model - something that can distinguish between "revenue exceeded expectations" (bullish) and "management issued a profit warning" (bearish). Financial text is different enough from general-domain text that off-the-shelf sentiment models miss the nuance.

Before HuggingFace was the standard: first, you track down the paper - probably FinBERT, a BERT variant pre-trained on Reuters and SEC filings. The paper links to a university server hosting the weights. You download them, pray the server is still up. The checkpoint is in a TensorFlow format but your stack is PyTorch, so you find a conversion script on GitHub, discover it was written for TensorFlow 1.x, and spend a day making it work. The tokenizer was custom - the paper describes it but doesn't release code. You implement it from scratch, test it against the paper's examples, find a discrepancy, dig through the financial wordpiece vocabulary file, fix the bug. Then you adapt a generic BERT fine-tuning script, wire up your DataLoader for the Bloomberg financial corpus, set up evaluation metrics, debug NaN losses (forgot to scale the learning rate for the larger batch). Two weeks of engineering work before a single training epoch.

With HuggingFace in 2022: three lines to load the model, three lines to tokenize, one Trainer call. The fine-tuned checkpoint is running in staging before lunch.

# The 2022 workflow - everything the 2019 workflow replaced
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert", num_labels=3)
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
trainer = Trainer(model=model, args=training_args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()

This is what ecosystem design looks like when it works. The HuggingFace ecosystem - transformers, datasets, peft, evaluate, and the Hub - reduced a two-week task to two hours not by making ML easier to think about, but by standardizing every layer of the workflow so that the engineering work disappears and only the ML decisions remain.

This lesson covers the ecosystem from first principles: how the tokenizers actually work, what the AutoModel family does under the hood, the full Trainer API with all its important parameters, PEFT and LoRA with the math, and the Hub as a deployment target. By the end you will be able to fine-tune any model on any dataset and share it in a reproducible way.

Core Libraries

pip install transformers datasets peft accelerate evaluate bitsandbytes

| Library | What it provides |
| --- | --- |
| transformers | 200+ pretrained models, tokenizers, AutoClass API |
| datasets | Standardized dataset loading, streaming, batched mapping |
| peft | Parameter-Efficient Fine-Tuning (LoRA, prefix tuning, adapters) |
| accelerate | Distributed training across GPUs/TPUs with minimal code change |
| evaluate | Standard metrics (BLEU, F1, AUC) with consistent API |
| bitsandbytes | Quantization (8-bit, 4-bit) for loading large models |

The HuggingFace Hub

The Hub at huggingface.co is the central registry for models, datasets, and applications. As of 2024, it hosts over 500,000 model checkpoints across virtually every architecture and modality.

What Lives on the Hub

Every model repository on the Hub contains:

  • config.json - the model architecture specification (hidden size, number of layers, vocab size, etc.)
  • pytorch_model.bin or model.safetensors - the actual weights (safetensors is faster to load and safer)
  • tokenizer.json and tokenizer_config.json - the tokenizer state
  • vocab.txt or merges.txt - vocabulary files depending on tokenizer type
  • README.md - the model card describing intended use, training data, limitations, and evaluation results

Model cards are not just documentation - they are the metadata that the Hub uses to surface models in search, filter by task type, and display evaluation benchmarks. A well-written model card includes training data description, evaluation metrics on standard benchmarks, intended use cases, out-of-scope uses, and known biases.

from huggingface_hub import HfApi, ModelCard

api = HfApi()

# Search for financial NLP models
models = api.list_models(
filter="text-classification",
sort="downloads",
direction=-1,
limit=10,
)
for m in models:
    print(m.modelId, m.downloads)

# Fetch a model card
card = ModelCard.load("ProsusAI/finbert")
print(card.content[:500])

# List datasets
from huggingface_hub import list_datasets
datasets = list(list_datasets(filter="finance", limit=5))

Spaces

Spaces are hosted ML demos running on HuggingFace infrastructure. You can deploy a Gradio or Streamlit app that loads your model and serves inference, with zero infrastructure management. This is the fastest path from "trained model" to "shareable demo".

Pushing to the Hub

from huggingface_hub import login

# Authenticate (save token to ~/.cache/huggingface/token)
login(token="hf_your_token_here")

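# After trainer.train(), push the model and the tokenizer to the same repo.
# Trainer.push_to_hub() takes a commit message as its first argument, not a repo id
# (its target repo comes from TrainingArguments(hub_model_id=...)), so pushing the
# objects directly is the explicit route.
trainer.model.push_to_hub("your-username/finbert-bloomberg-sentiment")
tokenizer.push_to_hub("your-username/finbert-bloomberg-sentiment")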

# Push a PEFT (LoRA) adapter - only a few MB, not the full model
peft_model.push_to_hub("your-username/finbert-lora-adapter")

# Push a dataset you curated
from datasets import Dataset
ds = Dataset.from_dict({"text": [...], "label": [...]})
ds.push_to_hub("your-username/bloomberg-financial-sentiment")

# Load back from anywhere
model = AutoModelForSequenceClassification.from_pretrained(
"your-username/finbert-bloomberg-sentiment"
)

Under the hood, the Hub uses git-lfs (Large File Storage) - model weights are stored as LFS objects, while config files are regular git objects. This means model versioning is real git versioning: you can checkout specific commits, diff configurations, and roll back.

Tokenizers: Deep Dive

A tokenizer converts raw text to integer token IDs that the model can process. This seems simple but the details matter for model performance, handling rare words, and multi-lingual text. There are three dominant tokenization algorithms in modern NLP.

WordPiece (BERT)

WordPiece builds a vocabulary by iteratively merging the pair of symbols that maximizes the likelihood of the training data under a language model. Unlike BPE which maximizes frequency, WordPiece maximizes the likelihood improvement per merge.

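The key behavioral difference: every piece that continues a word (any piece except the first) is prefixed with ## to mark it as a continuation token. A name like "Bloomberg" stays a single token if it is in the vocabulary and otherwise splits into pieces such as ['bloom', '##berg']. Rarer financial terms like "securitization" might become ['secur', '##iti', '##zation'].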

from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The Bloomberg earnings securitization report was positive."
tokens = bert_tok.tokenize(text)
print(tokens)
# ['the', 'bloomberg', 'earnings', 'secur', '##iti', '##zation', 'report', 'was', 'positive', '.']

ids = bert_tok.encode(text)
print(ids)
# [101, 1996, 16386, 14596, 24667, 3989, 18418, 3189, 2001, 3893, 1012, 102]
# 101 = [CLS], 102 = [SEP]
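
The ## continuation behavior comes from how WordPiece splits a word at inference time: greedy longest-match-first against the vocabulary. The sketch below is a simplified illustration, not the transformers implementation; the function name and the toy vocabulary are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word (simplified sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to mirror the example in the text
toy_vocab = {"secur", "##iti", "##zation", "bloomberg", "[UNK]"}
pieces = wordpiece_tokenize("securitization", toy_vocab)
print(pieces)  # ['secur', '##iti', '##zation']
```

A word with no matching piece at some position collapses to a single [UNK], which is why vocabulary coverage matters for domain text.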

BPE - Byte Pair Encoding (GPT-2, RoBERTa, LLaMA)

BPE starts with a character-level vocabulary and iteratively merges the most frequent adjacent pair of tokens. The final vocabulary contains whole words for common words and sub-word pieces for rare ones.

GPT-2 uses byte-level BPE - the base vocabulary is 256 byte values, so every possible character sequence can be encoded. There are no unknown tokens. This is why GPT models handle arbitrary text (code, emoji, foreign scripts) without special treatment.

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

text = "The Bloomberg earnings securitization report was positive."
tokens = gpt2_tok.tokenize(text)
print(tokens)
# ['The', 'ĠBloomberg', 'Ġearnings', 'Ġsecurit', 'ization', 'Ġreport', 'Ġwas', 'Ġpositive', '.']
# Ġ = space before token (GPT-2 encodes spaces as part of the following token)

# GPT-2 has no [CLS]/[SEP] - it's a causal LM
ids = gpt2_tok.encode(text)
print(ids)
# [464, 44831, 16803, 3218, 1634, 945, 373, 3967, 13]
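
The merge loop that BPE training performs can be sketched in a few lines. This is a character-level toy under stated assumptions (a hypothetical word-frequency table, ties broken lexicographically for determinism), not GPT-2's byte-level trainer.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} table (toy sketch)."""
    # Each word starts as a tuple of single characters
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties resolved by lexicographic order
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical corpus statistics
merges, final_vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
```

Frequent suffixes merge first, so common words end up as single tokens while rare words stay split into reusable pieces.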

SentencePiece (T5, mT5, LLaMA-2)

SentencePiece treats the input as a raw stream of Unicode characters, whitespace included - no pre-tokenization (no splitting on spaces). It trains either a BPE or a unigram language model over this stream. The advantage: it is language-agnostic. Japanese, Arabic, and English are all treated uniformly.

LLaMA-2 uses SentencePiece with a BPE model and byte-level fallback for characters outside the vocabulary. T5 uses SentencePiece with a unigram model. The ▁ prefix (U+2581, which looks like an underscore) indicates a word-initial token.

t5_tok = AutoTokenizer.from_pretrained("t5-base")

text = "The Bloomberg earnings report was positive."
tokens = t5_tok.tokenize(text)
print(tokens)
# ['▁The', '▁Bloomberg', '▁earnings', '▁report', '▁was', '▁positive', '.']

Slow vs Fast Tokenizers

HuggingFace ships two tokenizer implementations for most models:

  • Slow tokenizer - pure Python, implemented in transformers. Easy to debug, slower.
  • Fast tokenizer - backed by the tokenizers library written in Rust. 10-100x faster for large batches. Returns offset mappings for token-to-character alignment (essential for NER).
# Fast tokenizer (default when available)
fast_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Returns offset_mapping - which character positions each token covers
encoding = fast_tok(
"The Bloomberg report.",
return_offsets_mapping=True,
)
print(encoding["offset_mapping"])
# [(0,0), (0,3), (4,13), (14,20), (20,21), (0,0)]
# (0,0) = special tokens [CLS] and [SEP]
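
To show why offset mappings matter for NER, here is a minimal sketch that maps a character-level entity span onto token indices. The helper name is hypothetical, and the offsets are copied from the example output above rather than computed by a tokenizer.

```python
def tokens_for_char_span(offsets, span_start, span_end):
    """Return indices of tokens whose character range overlaps [span_start, span_end)."""
    hits = []
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:
            continue  # special tokens are reported as (0, 0)
        if tok_start < span_end and tok_end > span_start:
            hits.append(i)
    return hits


text = "The Bloomberg report."
offsets = [(0, 0), (0, 3), (4, 13), (14, 20), (20, 21), (0, 0)]

entity_start = text.index("Bloomberg")
entity_end = entity_start + len("Bloomberg")
entity_tokens = tokens_for_char_span(offsets, entity_start, entity_end)
print(entity_tokens)  # [2]
```

The same overlap test works in the other direction: predicted token labels can be projected back onto character spans of the original text.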

Special Tokens

Every model has a set of special tokens that carry structural meaning:

| Token | Models | Purpose |
| --- | --- | --- |
| [CLS] | BERT family | Classification token - first position, used for sequence-level tasks |
| [SEP] | BERT family | Separator between segments or end of sequence |
| [PAD] | BERT family | Padding to fixed length |
| [MASK] | BERT | Masked token for masked language modeling |
| <s> | RoBERTa, LLaMA | Beginning of sequence |
| </s> | RoBERTa, T5 | End of sequence |
| <pad> | T5, RoBERTa | Padding |
| <\|endoftext\|> | GPT-2 | End of sequence; GPT-2 has no pad token, so this one is commonly reused as pad |

print(bert_tok.special_tokens_map)
# {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]',
# 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

print(bert_tok.cls_token_id) # 101
print(bert_tok.sep_token_id) # 102
print(bert_tok.pad_token_id) # 0

Handling Long Documents

BERT-style models have a maximum sequence length (512 tokens for BERT). Financial filings, legal documents, and research papers far exceed this. Two strategies:

Strategy 1 - Sliding window with stride

def tokenize_long_doc(text, tokenizer, max_length=512, stride=128):
    """
    Split a long document into overlapping chunks for BERT.
    Returns list of encodings, each fitting in max_length tokens.
    """
    encoding = tokenizer(
        text,
        max_length=max_length,
        stride=stride, # number of tokens shared by consecutive windows
        truncation=True,
        return_overflowing_tokens=True, # key: returns all windows
        return_offsets_mapping=True,
        padding="max_length",
    )
    # encoding["input_ids"] is now a list of windows, not a single sequence
    print(f"Document split into {len(encoding['input_ids'])} windows")
    return encoding

long_text = "..." * 5000 # placeholder text: 15,000 characters, far beyond the 512-token limit
windows = tokenize_long_doc(long_text, bert_tok)
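
The window arithmetic can be reasoned about without a tokenizer. The sketch below is an approximation under stated assumptions: two special tokens per window and `stride` meaning the number of overlapping tokens. The function name is hypothetical, and exact boundaries produced by a real tokenizer may differ.

```python
def window_spans(num_tokens, max_length=512, stride=128, num_special=2):
    """Approximate (start, end) content-token spans of overlapping windows."""
    capacity = max_length - num_special  # content tokens that fit in one window
    step = capacity - stride             # how far each new window advances
    if step <= 0:
        raise ValueError("stride must be smaller than the window capacity")
    spans, start = [], 0
    while True:
        end = min(start + capacity, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break
        start += step
    return spans


spans_1000 = window_spans(1000)
print(spans_1000)  # [(0, 510), (382, 892), (764, 1000)]
```

Larger strides give each window more shared context at the cost of more windows per document, so inference cost grows roughly with `num_tokens / (capacity - stride)`.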

Strategy 2 - Truncate to first N tokens (appropriate for tasks where the answer is likely in the first paragraph)

encoding = tokenizer(
text,
max_length=512,
truncation=True, # silently truncates anything beyond max_length
padding="max_length",
return_tensors="pt",
)

encode() vs tokenizer() - what is the difference

# tokenizer.encode() - returns list of token IDs, no tensors
ids = tokenizer.encode("Hello world", add_special_tokens=True)
print(type(ids), ids) # list, [101, 7592, 2088, 102]

# tokenizer() - returns a BatchEncoding with all fields
out = tokenizer("Hello world", return_tensors="pt", return_attention_mask=True)
print(out.keys()) # input_ids, attention_mask, token_type_ids
print(out["input_ids"]) # tensor([[101, 7592, 2088, 102]])

# tokenizer() handles batches, padding, truncation together
batch = tokenizer(
["Hello world", "The Bloomberg report was excellent."],
padding=True, # pad shorter sequences to length of longest
truncation=True,
max_length=32,
return_tensors="pt",
)
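
What `padding=True` does to a batch can be sketched in plain Python: shorter sequences are right-padded to the longest one, and the attention mask marks real tokens with 1 and padding with 0. The helper name is hypothetical, and right-padding with pad id 0 is an assumption matching BERT's defaults.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[101, 7592, 2088, 102], [101, 7592, 102]])
print(padded["input_ids"])       # [[101, 7592, 2088, 102], [101, 7592, 102, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```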

Model Architectures in Transformers

The AutoModel Family

The AutoClass API selects the correct model class from config.json. This is what makes checkpoint swaps a one-line change.

from transformers import (
AutoModel, # base model, no task head
AutoModelForSequenceClassification, # +linear head for classification
AutoModelForTokenClassification, # +linear head per token (NER)
AutoModelForCausalLM, # +LM head for next-token prediction
AutoModelForSeq2SeqLM, # encoder-decoder (T5, BART)
AutoModelForQuestionAnswering, # +span prediction head (start/end)
AutoModelForMaskedLM, # +MLM head for BERT-style pretraining
)

# Classification - adds a dropout + linear layer on top of [CLS]
clf = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=3, # negative / neutral / positive
id2label={0: "NEG", 1: "NEU", 2: "POS"},
label2id={"NEG": 0, "NEU": 1, "POS": 2},
)

# Causal LM - GPT-2, LLaMA
causal = AutoModelForCausalLM.from_pretrained("gpt2")

# Seq2Seq - T5 (input: "summarize: ...", output: summary)
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# NER
ner_model = AutoModelForTokenClassification.from_pretrained(
"dslim/bert-base-NER",
num_labels=9,
)

Modifying Model Config

You can change the architecture before loading weights, or inspect what the current config looks like:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.hidden_size) # 768
print(config.num_hidden_layers) # 12
print(config.num_attention_heads) # 12
print(config.intermediate_size) # 3072

# Smaller model for fast experimentation
config.hidden_size = 256
config.num_hidden_layers = 4
config.num_attention_heads = 4
config.intermediate_size = 512

from transformers import BertModel
tiny_bert = BertModel(config) # random weights, your architecture
print(sum(p.numel() for p in tiny_bert.parameters()) / 1e6, "M params")

Loading in Lower Precision

Large models (7B+) cannot fit in standard GPU memory at full float32 precision. HuggingFace provides several options:

import torch
from transformers import AutoModelForCausalLM

# float16 - halves memory, fast on modern GPUs (A100, H100)
model_fp16 = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto", # automatically distributes across available GPUs
)

# bfloat16 - better numerical stability than float16 (recommended for training)
model_bf16 = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.bfloat16,
device_map="auto",
)

from transformers import BitsAndBytesConfig

# 8-bit quantization (bitsandbytes)
# ~14GB (float16) → ~7GB for a 7B model
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit quantization (QLoRA's base model)
# ~14GB (float16) → ~3.5GB for a 7B model
# The bnb_4bit_* options belong to BitsAndBytesConfig rather than from_pretrained itself
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True, # nested quantization saves ~0.4 bits/param
        bnb_4bit_quant_type="nf4", # NormalFloat4 - best for normally distributed weights
    ),
    device_map="auto",
)

The memory formula for a model with $P$ parameters:

  • float32: $4P$ bytes
  • float16 / bfloat16: $2P$ bytes
  • int8: $P$ bytes
  • int4 (4-bit): $0.5P$ bytes

For a 7B parameter model: 28GB → 14GB → 7GB → 3.5GB.
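
The formula is easy to turn into a quick estimator. This sketch covers weights only (no activations, gradients, or optimizer states) and uses 1 GB = 10^9 bytes; the names are illustrative.

```python
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params, dtype):
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


estimates = {
    dtype: weight_memory_gb(7e9, dtype)
    for dtype in ("float32", "float16", "int8", "int4")
}
for dtype, gb in estimates.items():
    print(f"{dtype:>8}: {gb:.1f} GB")
```

Real usage is higher than these figures because quantization constants, activations, and the KV cache also occupy memory.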

Fine-Tuning with Trainer: Full Parameter Reference

The Trainer API wraps the training loop, evaluation, checkpointing, logging, and distributed training. The full TrainingArguments class has 100+ parameters - these are the ones that matter in practice.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
# --- Output ---
output_dir="./results", # where to save checkpoints
run_name="finbert-bloomberg-v1", # for W&B / logging

# --- Training schedule ---
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
gradient_accumulation_steps=4, # effective batch = 16 * 4 = 64
max_steps=-1, # if > 0, overrides num_train_epochs

# --- Optimizer ---
learning_rate=2e-5,
weight_decay=0.01, # L2 regularization on non-bias params
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-8,
max_grad_norm=1.0, # gradient clipping

# --- Learning rate schedule ---
lr_scheduler_type="linear", # "cosine", "cosine_with_restarts", "polynomial"
warmup_steps=500, # steps where LR linearly increases from 0; takes precedence over warmup_ratio
warmup_ratio=0.06, # alternative: 6% of total steps for warmup (only used when warmup_steps=0)

# --- Evaluation and saving ---
evaluation_strategy="epoch", # "steps" or "epoch"; newer transformers releases rename this to eval_strategy
eval_steps=500, # only used if evaluation_strategy="steps"
save_strategy="epoch", # must match evaluation_strategy for best model
save_steps=500,
save_total_limit=3, # keep only 3 most recent checkpoints
load_best_model_at_end=True, # restore best checkpoint after training
metric_for_best_model="f1", # which metric to use for best model
greater_is_better=True, # higher f1 = better

# --- Mixed precision ---
fp16=True, # float16 training (NVIDIA GPU)
bf16=False, # bfloat16 (A100/H100/TPU preferred)
fp16_opt_level="O1", # apex optimization level

# --- Logging ---
logging_dir="./logs",
logging_steps=100,
logging_first_step=True,
report_to="wandb", # "tensorboard", "wandb", "none"

# --- Distributed / hardware ---
dataloader_num_workers=4,
dataloader_pin_memory=True,
group_by_length=True, # batch sequences of similar length (faster)
ddp_find_unused_parameters=False, # speed up DDP when all params used
)
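
To make the interaction between `warmup_steps` and `lr_scheduler_type="linear"` concrete, here is a standalone sketch of a linear warmup followed by linear decay to zero. It mirrors the shape of the schedule, not the Trainer's internal code; the function name and the total step count are assumptions.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed run length of 10,000 optimizer steps
base_lr, warmup, total = 2e-5, 500, 10_000
for step in (0, 250, 500, 5_250, 10_000):
    print(step, linear_schedule_lr(step, base_lr, warmup, total))
```

Note that steps here are optimizer steps: with `gradient_accumulation_steps=4`, one step consumes four micro-batches per device.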

compute_metrics - Custom Evaluation

import evaluate
import numpy as np

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

def compute_metrics(eval_pred):
    """
    eval_pred is an EvalPrediction namedtuple with:
    - predictions: numpy array of logits, shape (N, num_labels)
    - label_ids: numpy array of true labels, shape (N,)
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    return {
        "accuracy": accuracy_metric.compute(
            predictions=predictions, references=labels
        )["accuracy"],
        "f1": f1_metric.compute(
            predictions=predictions, references=labels, average="weighted"
        )["f1"],
        "precision": precision_metric.compute(
            predictions=predictions, references=labels, average="weighted"
        )["precision"],
        "recall": recall_metric.compute(
            predictions=predictions, references=labels, average="weighted"
        )["recall"],
    }

Custom Trainer Subclass

When the default Trainer loop does not fit - multi-task learning, custom loss, curriculum learning - subclass it:

from transformers import Trainer
import torch
import torch.nn as nn

class WeightedLossTrainer(Trainer):
    """
    Custom Trainer that uses class-weighted cross-entropy.
    Useful for imbalanced datasets where negative examples far outnumber positives.
    """

    def __init__(self, class_weights, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = torch.tensor(class_weights, dtype=torch.float)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments (such as num_items_in_batch)
        # that newer transformers releases pass to compute_loss
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

        # Move weights to same device as logits
        weights = self.class_weights.to(logits.device)
        loss_fn = nn.CrossEntropyLoss(weight=weights)
        loss = loss_fn(logits, labels)

        return (loss, outputs) if return_outputs else loss


# Imbalanced dataset: 80% negative, 15% neutral, 5% positive
# Balanced inverse-frequency weights: 1 / (num_classes * class_frequency)
class_weights = [0.42, 2.22, 6.67]

weighted_trainer = WeightedLossTrainer(
class_weights=class_weights,
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
)
weighted_trainer.train()
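
Class weights are usually derived from label counts rather than typed by hand. One common convention, assumed here, is the "balanced" heuristic `total / (num_classes * count)`; the helper name and the counts are illustrative.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights, normalized so a uniform dataset gives all 1.0."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]


# Hypothetical label counts matching an 80% / 15% / 5% split
weights = balanced_class_weights([800, 150, 50])
print([round(w, 2) for w in weights])  # [0.42, 2.22, 6.67]
```

The list order must follow the model's label ids, since `nn.CrossEntropyLoss(weight=...)` indexes weights by class id.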

PEFT: Parameter-Efficient Fine-Tuning

Fine-tuning a 7B model at float32 requires approximately 28GB for weights plus another 84GB for gradients (28GB) and Adam optimizer states (momentum + variance = 2x weights, 56GB). Full fine-tuning of 70B models is out of reach for any single GPU. Parameter-efficient methods freeze the base model and train only a small set of added or selected parameters.

LoRA: Low-Rank Adaptation

The core insight: weight updates during fine-tuning have low intrinsic rank. Rather than updating $W$ directly, LoRA parameterizes the update as a product of two low-rank matrices.

For a weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA introduces:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$.

The number of trainable parameters with LoRA is $r(d + k)$ instead of $dk$.

For BERT-large with $d = k = 1024$ and $r = 8$:

  • Full fine-tuning: $1024 \times 1024 = 1{,}048{,}576$ parameters per matrix
  • LoRA with $r = 8$: $8 \times (1024 + 1024) = 16{,}384$ parameters - a 64x reduction

The scaling factor lora_alpha controls how strongly the LoRA update contributes relative to the frozen base weights. In practice the update is scaled before it is added, so the effective update is $\frac{\alpha}{r} \cdot BA$. Setting $\alpha = 2r$ (e.g., r=8, alpha=16) is a common starting point.

At initialization: $A$ is sampled from a Gaussian, $B$ is initialized to zero. This ensures $\Delta W = BA = 0$ at step 0 - LoRA starts as the original model and adapts from there.
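
The parameter count and the zero-initialization property can both be checked with a tiny pure-Python sketch. The dimensions are deliberately small and the helper names are hypothetical; this is an illustration of the equations above, not the peft implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_param_count(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


def lora_forward(W0, A, B, x, alpha, r):
    """Compute W0 x + (alpha / r) * B (A x) without forming the d x k update."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, k, r, alpha = 6, 5, 2, 4
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # Gaussian init
B = [[0.0] * r for _ in range(d)]                                   # zero init
x = [random.gauss(0, 1) for _ in range(k)]

y_base = matvec(W0, x)
y_lora = lora_forward(W0, A, B, x, alpha, r)
print(y_lora == y_base)                  # True: the adapter is a no-op at step 0
print(lora_param_count(1024, 1024, 8))   # 16384
```

Computing `B (A x)` as two thin products is also why LoRA adds little compute during training: the full `d x k` update matrix is never materialized.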

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
import torch

base_model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token

# Inspect module names to find attention matrices
for name, module in model.named_modules():
    if hasattr(module, "weight") and "attn" in name:
        print(name)

lora_config = LoraConfig(
    r=8, # rank - 4, 8, 16, 32, 64 are common
    lora_alpha=16, # scaling: the update is multiplied by alpha / r
    target_modules=["c_attn", "c_proj"], # which modules to apply LoRA to
    lora_dropout=0.05, # dropout on LoRA weights
    bias="none", # "none", "all", or "lora_only"
    task_type=TaskType.CAUSAL_LM,
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# With target_modules=["c_attn"] alone, r=8 gives 8 * (768 + 2304) * 12 = 294,912 trainable params (~0.24%).
# "c_proj" also matches the attention and MLP output projections, raising the count to 811,008.

# Only LoRA matrices are trainable
for name, param in peft_model.named_parameters():
    if param.requires_grad:
        print(f"TRAINABLE: {name}")

QLoRA: Quantized LoRA

QLoRA (Dettmers et al., 2023) combines 4-bit quantization of the base model with LoRA adapters kept in 16-bit precision. A 65B model that required 780GB of GPU memory for full fine-tuning can be fine-tuned on a single 48GB GPU.

The three components of QLoRA:

  1. 4-bit NormalFloat (NF4) - a quantization format optimized for normally distributed weights (which neural network weights typically are)
  2. Double quantization - quantize the quantization constants themselves, saving ~0.37 bits per parameter
  3. Paged optimizers - use CUDA unified memory to page optimizer states to CPU when GPU memory is insufficient

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Step 1: Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)

# Step 2: Prepare for k-bit training (enables gradient checkpointing, etc.)
base_model = prepare_model_for_kbit_training(base_model)

# Step 3: Add LoRA adapters in float16
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # LLaMA attention
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)

qlora_model = get_peft_model(base_model, lora_config)
qlora_model.print_trainable_parameters()
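# With r=16 on the four 4096x4096 attention projections in each of the 32 layers,
# the adapters add 16 * (4096 + 4096) * 4 * 32 = 16,777,216 trainable parameters.
# The reported "all params" total depends on how the installed peft version counts 4-bit weights.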
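
The "~0.37 bits per parameter" saved by double quantization follows from simple bookkeeping on the quantization constants. The block sizes and bit widths below are the values reported in the QLoRA paper and are treated as assumptions here; the function names are illustrative.

```python
def plain_overhead_bits(block_size=64, constant_bits=32):
    """Per-parameter cost of one 32-bit constant per block of weights."""
    return constant_bits / block_size


def double_quant_overhead_bits(
    block_size=64,          # weights per first-level block
    first_level_bits=8,     # constants re-quantized to 8 bits
    second_block_size=256,  # first-level constants per second-level block
    second_level_bits=32,   # second-level constants kept in 32 bits
):
    """Per-parameter cost once the constants themselves are quantized."""
    first = first_level_bits / block_size
    second = second_level_bits / (block_size * second_block_size)
    return first + second


plain = plain_overhead_bits()
double = double_quant_overhead_bits()
saved = plain - double
print(f"plain: {plain:.3f}, double: {double:.3f}, saved: {saved:.3f} bits/param")
```

For a 7B model that saving is roughly 7e9 * 0.373 / 8 bytes, on the order of a few hundred megabytes.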

Prefix Tuning and Prompt Tuning

Two lighter-weight alternatives when LoRA is still too large:

Prefix tuning prepends trainable "virtual token" embeddings to every layer's key-value cache. The model learns to condition on these prefixes without changing any weights. Trainable parameters: num_layers × prefix_length × 2 × hidden_size (2 for key and value).

Prompt tuning is even lighter - it only prepends trainable embeddings to the input layer (not every layer). Effective for very large models (11B+) where even 1% of parameters is too many.

from peft import PrefixTuningConfig, PromptTuningConfig, TaskType, get_peft_model

# Prefix tuning
prefix_config = PrefixTuningConfig(
task_type=TaskType.SEQ_CLS,
num_virtual_tokens=20, # 20 trainable prefix tokens per layer
encoder_hidden_size=768,
)

# Prompt tuning
prompt_config = PromptTuningConfig(
task_type=TaskType.CAUSAL_LM,
num_virtual_tokens=8, # 8 trainable tokens prepended to input only
prompt_tuning_init="TEXT",
prompt_tuning_init_text="Classify the financial sentiment: ",
tokenizer_name_or_path="gpt2",
)
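
The parameter formulas for the two methods can be compared directly. This sketch applies the counts stated in the text to a BERT-base-sized model (12 layers, hidden size 768, assumed here); the totals peft reports may differ, for example when prefix reparameterization is enabled.

```python
def prefix_tuning_params(num_layers, prefix_length, hidden_size):
    """Trainable values for per-layer key and value prefixes."""
    return num_layers * prefix_length * 2 * hidden_size


def prompt_tuning_params(num_virtual_tokens, hidden_size):
    """Trainable values for virtual tokens at the input layer only."""
    return num_virtual_tokens * hidden_size


prefix_count = prefix_tuning_params(num_layers=12, prefix_length=20, hidden_size=768)
prompt_count = prompt_tuning_params(num_virtual_tokens=8, hidden_size=768)
print(prefix_count)  # 368640
print(prompt_count)  # 6144
```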

Datasets Library: Deep Dive

The datasets library stores data in the Apache Arrow columnar format. Arrow uses memory mapping - the dataset "lives" on disk and is accessed in memory as if it were already loaded, without actually copying it into RAM. A 100GB dataset is readable on a machine with 16GB of RAM because only the pages you access are loaded.

from datasets import load_dataset, Dataset, DatasetDict, concatenate_datasets
import pandas as pd

# Load from Hub - downloads and caches locally (~/.cache/huggingface/datasets)
imdb = load_dataset("imdb")
print(imdb)
# DatasetDict({
# train: Dataset({features: ['text', 'label'], num_rows: 25000})
# test: Dataset({features: ['text', 'label'], num_rows: 25000})
# })

# Load specific split
train_only = load_dataset("imdb", split="train")

# Load with percentage split
small_train = load_dataset("imdb", split="train[:10%]") # first 10%
val_split = load_dataset("imdb", split="train[80%:90%]") # rows 80-90%

# Load from local files
custom = load_dataset(
"csv",
data_files={"train": "train.csv", "val": "val.csv", "test": "test.csv"},
)

# From Pandas
df = pd.read_csv("bloomberg_financial.csv")
ds = Dataset.from_pandas(df)

# From dict
ds = Dataset.from_dict({
"text": ["Revenue beat expectations", "Profit warning issued"],
"label": [1, 0],
})

map() for Preprocessing

map() is the workhorse of dataset preprocessing. With batched=True, it processes rows in chunks and is significantly faster because tokenizers are optimized for batch inputs.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    """
    `examples` is a dict of lists when batched=True.
    e.g., examples["text"] = ["text1", "text2", ...]
    Returns a dict of lists - merged back into the dataset.
    """
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=256,
        padding="max_length",
    )

tokenized = imdb.map(
tokenize_function,
batched=True,
batch_size=1000,
num_proc=4, # parallel across 4 CPU cores
remove_columns=["text"], # drop original text column
desc="Tokenizing dataset", # progress bar label
)

# Caching: map() caches its output to disk
# Second call with same function is instant (reads cache)
tokenized_cached = imdb.map(tokenize_function, batched=True) # loads from cache

# filter() - keep only examples meeting a condition
long_examples = imdb["train"].filter(lambda x: len(x["text"].split()) > 100)

# select() - keep specific indices
first_1000 = imdb["train"].select(range(1000))

# sort() - by a column value
sorted_ds = imdb["train"].sort("label")

# Set format for PyTorch training
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Push to Hub
tokenized.push_to_hub("your-username/imdb-tokenized-bert")

Streaming for Large Datasets

Some datasets (The Pile, C4, LAION) are hundreds of gigabytes. streaming=True lets you iterate without downloading:

# C4 - 750GB, 156B tokens (hosted on the Hub as "allenai/c4")
stream_ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Returns an IterableDataset - not Arrow, but a Python generator
for i, example in enumerate(stream_ds):
    print(example["text"][:100])
    if i >= 4:
        break

# Apply map to streaming datasets (same API, different execution)
tokenized_stream = stream_ds.map(tokenize_function, batched=True, batch_size=1000)

# Shuffle streaming dataset with a buffer
shuffled = stream_ds.shuffle(seed=42, buffer_size=10_000)
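
A streaming dataset cannot be shuffled globally, so shuffling relies on a fixed-size buffer. The generator below sketches the general idea under simple assumptions; it is not the datasets library's code, and the function name is hypothetical.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Approximate shuffle of a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        # Emit a random buffered item and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled_items = list(buffered_shuffle(range(100), buffer_size=10, seed=42))
print(shuffled_items[:10])
```

An item can only move a limited distance from its original position, so a small buffer over sorted data yields a weak shuffle; larger buffers trade memory for randomness.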

Pipelines for Inference

Pipelines wrap model + tokenizer + post-processing into a single callable. They are the fastest path to inference.

from transformers import pipeline

# Text classification
clf = pipeline(
"text-classification",
model="ProsusAI/finbert",
device=0, # GPU 0, or -1 for CPU
)
results = clf(["Revenue exceeded expectations.", "Profit warning issued."])
# [{'label': 'positive', 'score': 0.987}, {'label': 'negative', 'score': 0.991}]

# Batch inference - pass a list, pipeline handles batching
texts = ["text1", "text2", ...] * 100
results = clf(texts, batch_size=32)

# Question answering
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
answer = qa(
question="What was the revenue growth?",
context="Bloomberg reported 12% revenue growth in Q3, beating analyst estimates.",
)
print(answer) # e.g. {'answer': '12%', 'start': 19, 'end': 22, 'score': 0.94} (start/end are character offsets into context)

# Text generation with streaming
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
import threading

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

inputs = tokenizer("The future of AI is", return_tensors="pt")

# Run generation in a thread so we can stream in main thread
thread = threading.Thread(
target=model.generate,
kwargs={
**inputs,
"streamer": streamer,
"max_new_tokens": 100,
"do_sample": True,
"temperature": 0.8,
},
)
thread.start()

# Stream output token by token
for token_text in streamer:
    print(token_text, end="", flush=True)
thread.join()
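
For the text-classification pipeline shown earlier in this section, the post-processing step amounts to a softmax over the logits followed by a label lookup. The sketch below illustrates that step in isolation; the logits and label mapping are made up, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, id2label):
    """Turn raw logits into a {'label', 'score'} dict like a pipeline result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits and label mapping
prediction = classify([4.1, -1.3, 0.2], {0: "positive", 1: "negative", 2: "neutral"})
print(prediction)
```

The score is a softmax probability, which is not the same thing as a calibrated confidence; treat thresholds on it as something to validate on held-out data.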

The Full Fine-Tuning Pipeline - Diagram

End-to-End: BERT for Financial Sentiment Classification

"""
Complete, runnable fine-tuning of FinBERT on a financial sentiment dataset.
Covers: data loading, tokenization, LoRA fine-tuning, evaluation, Hub push.
"""

from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments,
Trainer,
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
import evaluate
import numpy as np
import torch

# --- Configuration ---
MODEL_NAME = "ProsusAI/finbert"
NUM_LABELS = 3 # positive, negative, neutral
MAX_LENGTH = 128
OUTPUT_DIR = "./finbert-bloomberg"
HUB_REPO = "your-username/finbert-bloomberg-v1"

# --- Data ---
# Using financial_phrasebank - 4,845 financial news sentences with sentiment labels
raw = load_dataset("financial_phrasebank", "sentences_50agree")
# Split 80/20 - dataset only has train split
train_test = raw["train"].train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def preprocess(batch):
    return tokenizer(
        batch["sentence"],
        truncation=True,
        max_length=MAX_LENGTH,
        padding="max_length",
    )

tokenized = train_test.map(preprocess, batched=True, remove_columns=["sentence"])
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch")

# --- Model with LoRA ---
base_model = AutoModelForSequenceClassification.from_pretrained(
MODEL_NAME,
num_labels=NUM_LABELS,
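    # financial_phrasebank encodes labels as 0=negative, 1=neutral, 2=positive.
    # This differs from the order in ProsusAI/finbert's own config, so the mapping
    # follows the dataset to keep the fine-tuned head's label names correct.
    id2label={0: "negative", 1: "neutral", 2: "positive"},
    label2id={"negative": 0, "neutral": 1, "positive": 2},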
)

lora_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["query", "key", "value"], # BERT attention names
lora_dropout=0.1,
bias="none",
task_type=TaskType.SEQ_CLS,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# --- Metrics ---
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels, average="weighted")["f1"],
    }

# --- Training ---
args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,  # effective train batch size: 16 * 2 = 32 per device
    learning_rate=3e-4,  # higher LR is typical for LoRA, since the base weights stay frozen
    warmup_ratio=0.06,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",  # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=torch.cuda.is_available(),
    logging_steps=50,
    report_to="wandb",  # requires the wandb package and a W&B login
    run_name="finbert-lora-bloomberg",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)

trainer.train()

# --- Evaluate ---
results = trainer.evaluate()
print(f"Final F1: {results['eval_f1']:.4f}")
print(f"Final Accuracy: {results['eval_accuracy']:.4f}")

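# --- Save and Push ---
model.save_pretrained(OUTPUT_DIR)  # for a PEFT model this writes the adapter, not the full base model
tokenizer.save_pretrained(OUTPUT_DIR)
# Trainer.push_to_hub() takes a commit message as its first argument, not a repo id,
# so push the adapter and tokenizer to HUB_REPO directly (requires a prior Hub login).
model.push_to_hub(HUB_REPO)
tokenizer.push_to_hub(HUB_REPO)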

Production Engineering Notes

:::tip gradient_accumulation_steps

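When GPU memory prevents large batch sizes, use gradient_accumulation_steps. With per_device_train_batch_size=8 and gradient_accumulation_steps=8, each optimizer step uses the gradient of 64 samples per device without holding 64 samples in GPU memory at once. The number of forward and backward passes per epoch does not change, but each pass runs on a smaller batch, so an optimizer step usually takes longer in wall-clock time than it would with a true batch of 64. In exchange, peak memory stays at the batch-size-8 level.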

:::
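The equivalence behind gradient accumulation can be checked on a toy problem. The sketch below uses a one-parameter linear model in plain Python (an illustrative stand-in, not how Trainer is implemented) and assumes equal-sized micro-batches and a mean-reduced loss, with each micro-batch contribution scaled by 1/accumulation steps.

```python
import random

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1/accum_steps."""
    starts = range(0, len(xs), micro_batch_size)
    accum_steps = len(starts)
    total = 0.0
    for s in starts:
        g = mse_grad(w, xs[s:s + micro_batch_size], ys[s:s + micro_batch_size])
        total += g / accum_steps
    return total

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(64)]
ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]

full = mse_grad(0.5, xs, ys)                                # one batch of 64
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=8)   # 8 micro-batches of 8

# per_device_train_batch_size * gradient_accumulation_steps (single device assumed)
effective_batch_size = 8 * 8
print(full, accum, effective_batch_size)
```

The two gradients agree up to floating-point rounding, which is why the optimizer sees the same update signal while only 8 samples are resident at a time.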

:::note When to merge LoRA adapters

After fine-tuning with LoRA, you have two options for deployment:

  1. Keep adapters separate: load base model + adapter at inference. Requires PEFT library. Adapter is a few MB - easy to version and swap.
  2. Merge and unload: merged = lora_model.merge_and_unload(). Merges $W_0 + BA$ into a single weight matrix. No PEFT dependency at inference. Cannot be unmerged.

For production serving, merge if you need maximum inference speed and don't plan to swap adapters. Keep separate if you maintain multiple task-specific adapters on the same base model.

:::
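To see why merging does not change predictions, here is a minimal numeric sketch in plain Python. It follows the $W = W_0 + BA$ form used in this lesson and leaves out the lora_alpha / r scaling factor that PEFT applies to the update; the tiny dimensions and random values are illustrative assumptions.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    """Multiply matrix P (n x m) by matrix Q (m x p)."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

rng = random.Random(0)
d, k, r = 4, 3, 2
W0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # frozen base weight, d x k
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]   # LoRA factor, d x r
A = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(r)]   # LoRA factor, r x k
x = [rng.gauss(0, 1) for _ in range(k)]

# Option 1: keep the adapter separate and add its output to the base output.
y_separate = [b + a for b, a in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Option 2: fold BA into the base weight once, then use a single matrix.
BA = matmul(B, A)
W_merged = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
y_merged = matvec(W_merged, x)

print(max(abs(s - m) for s, m in zip(y_separate, y_merged)))
```

The outputs match up to rounding. The merged path does one matrix multiply per layer instead of three, which is where the inference speed benefit comes from.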

:::danger Flash Attention compatibility

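attn_implementation="flash_attention_2" selects a faster attention kernel; it is an optimization and is not what makes gradients correct. It needs the flash-attn package, a supported GPU, and half-precision (float16 or bfloat16) activations. When combining it with bitsandbytes quantization (load_in_4bit=True or load_in_8bit=True), make sure the compute dtype is float16 or bfloat16, check the model's documentation for the attention backends it supports, and set attn_implementation explicitly rather than relying on the default.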

:::

:::warning tokenizer.pad_token for GPT-style models

GPT-2 and LLaMA do not have a dedicated [PAD] token. If you try to batch sequences without setting a pad token, the tokenizer raises a ValueError. The standard fix:

tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

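This works but means the model sees [EOS] both as actual end-of-sequence and as padding. For fine-tuning, always exclude pad positions from the loss. DataCollatorForLanguageModeling with mlm=False does this automatically by setting the label to -100 wherever the input id equals pad_token_id. Note that when the pad token is the EOS token, this also masks genuine EOS tokens, so the model gets no training signal for ending a sequence. If that matters, add a dedicated pad token or build the labels from the attention mask instead.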

:::
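The difference between masking labels by pad id and masking by attention mask is easy to see on a single padded sequence. This is a plain-Python sketch with hypothetical helper names (labels_from_pad_id, labels_from_attention_mask); it mirrors the idea rather than the collator's actual code, and assumes the pad id has been set to GPT-2's end-of-text id.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default
EOS_ID = 50256       # GPT-2's end-of-text id, reused here as the pad id

def labels_from_pad_id(input_ids, pad_id):
    """Mask every position whose token equals the pad id."""
    return [IGNORE_INDEX if t == pad_id else t for t in input_ids]

def labels_from_attention_mask(input_ids, attention_mask):
    """Mask only positions the attention mask marks as padding."""
    return [t if m == 1 else IGNORE_INDEX for t, m in zip(input_ids, attention_mask)]

# Two real tokens, one genuine EOS, then two padding positions.
input_ids = [10, 11, EOS_ID, EOS_ID, EOS_ID]
attention_mask = [1, 1, 1, 0, 0]

by_pad_id = labels_from_pad_id(input_ids, pad_id=EOS_ID)
by_mask = labels_from_attention_mask(input_ids, attention_mask)
print(by_pad_id)  # [10, 11, -100, -100, -100]
print(by_mask)    # [10, 11, 50256, -100, -100]
```

Masking by attention mask keeps the genuine EOS in the loss, so the model still learns when to stop.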

YouTube Resources

| Video | Creator | What You'll Learn |
| --- | --- | --- |
| HuggingFace Transformers Tutorial | Andrej Karpathy | State of GPT - model architecture and fine-tuning philosophy |
| Fine-Tuning BERT | Abhishek Thakur | Complete BERT fine-tuning walkthrough |
| LoRA and PEFT | HuggingFace | Parameter-efficient fine-tuning in depth |
| HuggingFace Datasets | Lysandre Debut | Datasets library tour and Arrow format |

Interview Q&A

Q1: What is the difference between WordPiece, BPE, and SentencePiece tokenization?

All three are subword tokenization schemes that handle out-of-vocabulary words by splitting them into known pieces, so an [UNK] token is rarely or never needed. They differ in how they build and apply the vocabulary.

WordPiece (BERT) builds the vocabulary by choosing the merges that most increase the likelihood of the training data, rather than simply merging the most frequent pair. At inference, it uses a greedy longest-match-first algorithm and marks continuation tokens with ##.

BPE (GPT-2, RoBERTa) iteratively merges the most frequent adjacent pair of symbols. GPT-2 and RoBERTa use byte-level BPE, whose base vocabulary is the 256 byte values, which guarantees no unknown tokens for any Unicode input.

SentencePiece (T5, LLaMA) is a tokenizer library rather than a single algorithm: it implements both unigram (used by T5) and BPE (used by LLaMA 1 and 2). It works directly on raw text without language-specific pre-tokenization, treating whitespace as an ordinary symbol (▁), which makes it language-agnostic.

In practice: if you are working with English financial text, the differences are minor. If you are working with multilingual or code-heavy text, SentencePiece or byte-level BPE is more robust.
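The longest-match-first step that WordPiece applies at inference time can be sketched in a few lines of plain Python. The vocabulary here is a made-up toy set, and the function is an illustration of the greedy matching idea, not the tokenizers library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches, so the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##afford", "##able", "profit", "##s"}  # hypothetical vocabulary
print(wordpiece_tokenize("unaffordable", toy_vocab))  # ['un', '##afford', '##able']
print(wordpiece_tokenize("profits", toy_vocab))       # ['profit', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Byte-level BPE avoids the last case entirely, because every input can be expressed with the 256 base byte tokens.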

Q2: Explain the LoRA math and calculate the parameter savings for a BERT-large attention layer.

LoRA (Low-Rank Adaptation) freezes the pretrained weight matrix $W_0$ and adds a trainable low-rank decomposition: $W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$.

BERT-large has $d = k = 1024$ (the query, key, value projection matrices). With rank $r = 8$:

  • Full fine-tuning trainable params per matrix: $1024 \times 1024 = 1{,}048{,}576$
  • LoRA trainable params per matrix: $r(d+k) = 8 \times (1024 + 1024) = 16{,}384$
  • Reduction: 64x fewer trainable parameters per matrix

BERT-large has 24 layers × 3 matrices (Q, K, V) = 72 such matrices. Fully fine-tuning just these matrices means about 75.5M trainable parameters (full fine-tuning of the whole model updates all of its roughly 340M parameters); LoRA with $r = 8$ trains about 1.2M. The frozen weights still participate in the forward pass; only $B$ and $A$ receive gradients, so the optimizer state (Adam's momentum and variance) for these matrices also shrinks by 64x. This is a large part of why LoRA can fit on a single GPU where full fine-tuning may not.
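The arithmetic above is simple enough to verify directly. A small sketch, with lora_savings as a hypothetical helper name, counting only the Q, K, V projection weights (no biases, no other layers):

```python
def lora_savings(d, k, r, num_matrices):
    """Trainable-parameter counts for full fine-tuning vs LoRA on d x k matrices."""
    full_per_matrix = d * k
    lora_per_matrix = r * (d + k)  # B is d x r, A is r x k
    return {
        "full_per_matrix": full_per_matrix,
        "lora_per_matrix": lora_per_matrix,
        "reduction": full_per_matrix / lora_per_matrix,
        "full_total": full_per_matrix * num_matrices,
        "lora_total": lora_per_matrix * num_matrices,
    }

# BERT-large: 24 layers, 3 projection matrices (Q, K, V) per layer, d = k = 1024.
stats = lora_savings(d=1024, k=1024, r=8, num_matrices=24 * 3)
print(stats["full_total"])  # 75497472
print(stats["lora_total"])  # 1179648
print(stats["reduction"])   # 64.0
```

Because the LoRA count grows linearly in r, doubling the rank to 16 halves the reduction to 32x.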

Q3: When would you use Trainer vs a custom training loop?

Use Trainer for: standard supervised fine-tuning (classification, NER, sequence-to-sequence), when you want built-in distributed training with minimal code change, and when W&B or TensorBoard integration via report_to is sufficient.

Write a custom loop when: you need loss functions that combine multiple model outputs in ways that go beyond overriding Trainer.compute_loss in a subclass (e.g., contrastive loss + classification loss with extra forward passes), you are doing reinforcement learning from human feedback (RLHF) where the training loop interleaves model inference and reward computation, you need curriculum learning (dynamically changing what examples the model sees based on current performance), or you need fine-grained per-layer learning rates and would rather own the optimizer setup than pass a hand-built optimizer to Trainer through its optimizers argument. The Accelerator from accelerate is worth using in custom loops - it handles device placement and distributed training without the Trainer abstraction.

Q4: What is QLoRA and how does it differ from regular LoRA in memory savings?

LoRA trains adapter matrices $B$ and $A$ while keeping the base model frozen in its original dtype (usually float16 or bfloat16). For a 7B model at float16, the base weights alone take about 14GB of GPU memory, plus the small LoRA adapter weights.

QLoRA quantizes the base model to 4-bit (NF4 format) before adding LoRA adapters. The base weights now take about 3.5GB instead of 14GB. The LoRA adapters stay in 16-bit precision (float16 or bfloat16) for numerical stability during training. In the forward and backward passes, the 4-bit weights are dequantized to the 16-bit compute dtype only for the matrix multiplication; gradients flow through that computation to the adapters, while the quantized base weights themselves are never updated.

The practical result: the QLoRA paper reports that fine-tuning a 65B model drops from more than 780GB of GPU memory with regular 16-bit fine-tuning to under 48GB, which fits on a single 48GB GPU. The accuracy tradeoff is small - the paper reports results on par with 16-bit fine-tuning on its benchmarks.
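The memory figures above follow from bytes-per-parameter arithmetic. The sketch below counts weights only, using 1GB = 10^9 bytes. The 12 bytes per parameter for full fine-tuning (16-bit weights, 16-bit gradients, two 32-bit Adam states) is an assumed accounting that happens to reproduce the 780GB figure, not a breakdown taken from this lesson. Activations, quantization constants, and adapter weights are ignored.

```python
GB = 1e9

def weight_memory_gb(num_params, bits_per_param):
    """Memory for the model weights alone, in GB."""
    return num_params * bits_per_param / 8 / GB

fp16_7b = weight_memory_gb(7e9, 16)   # frozen base model for plain LoRA, ~14 GB
nf4_7b = weight_memory_gb(7e9, 4)     # 4-bit base model for QLoRA, ~3.5 GB

# Assumed accounting for full 16-bit fine-tuning with Adam:
# 2 bytes weights + 2 bytes gradients + 8 bytes optimizer state = 12 bytes/param.
BYTES_PER_PARAM_FULL_FT = 2 + 2 + 8
full_ft_65b = 65e9 * BYTES_PER_PARAM_FULL_FT / GB   # ~780 GB
nf4_65b = weight_memory_gb(65e9, 4)                 # ~32.5 GB, leaving headroom on 48 GB

print(fp16_7b, nf4_7b, full_ft_65b, nf4_65b)
```

The gap between the 32.5GB of 4-bit weights and the 48GB budget is what has to hold the adapters, their optimizer state, and activations.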

Q5: How do you handle a document that is 2000 tokens with a BERT model that has a 512-token limit?

Several approaches depending on the task:

Truncation - If the answer or label is likely in the first part of the document (e.g., news sentiment from the headline), simply truncate to 512. Fast and effective for many real-world cases.

Sliding window - Split into overlapping windows of length 512, with 128 tokens shared between consecutive windows. Each window gets a prediction; aggregate by averaging logits or taking a majority vote. The tokenizer does the splitting when called with truncation=True, max_length=512, stride=128, and return_overflowing_tokens=True (in this API, stride is the number of overlapping tokens). Best for tasks where relevant information can be anywhere (e.g., legal document NER).

Hierarchical models - Encode each sentence independently with BERT to get sentence embeddings, then feed the sequence of sentence embeddings to a second (often smaller) model. Handles arbitrarily long documents but is more complex.

Long-context models - Use a model with a longer context window: Longformer (4096 tokens), BigBird (4096), or Mistral/LLaMA variants with RoPE scaling (8192+). For new projects, prefer a long-context model over sliding-window engineering.
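For the sliding-window approach, the windowing and aggregation logic looks roughly like this. It is a plain-Python sketch over a list of token ids with hypothetical helper names (sliding_windows, mean_logits). It ignores the special tokens ([CLS], [SEP]) that a real tokenizer adds to each window, and it stands in for, rather than reproduces, what the tokenizer does with return_overflowing_tokens=True.

```python
def sliding_windows(token_ids, max_length=512, overlap=128):
    """Split token_ids into windows of up to max_length that share `overlap` tokens."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be in [0, max_length)")
    step = max_length - overlap
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return windows

def mean_logits(per_window_logits):
    """Average class logits across windows to get one document-level prediction."""
    n = len(per_window_logits)
    return [sum(col) / n for col in zip(*per_window_logits)]

doc = list(range(2000))            # stand-in for a 2000-token document
windows = sliding_windows(doc)
print(len(windows))                # 5
print([len(w) for w in windows])   # [512, 512, 512, 512, 464]
print(mean_logits([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```

Averaging logits treats every window equally; for tasks where one window holds the decisive evidence, taking the maximum per class is a common alternative.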

Q6: Walk me through the workflow for fine-tuning a model and making it publicly available on the Hub.

  1. Authenticate: huggingface_hub.login(token="hf_...") - saves token to ~/.cache/huggingface/token.
  2. Load: AutoModelForSequenceClassification.from_pretrained(base_checkpoint, num_labels=N) and matching tokenizer.
  3. Prepare data: load_dataset(), dataset.map(tokenize_function, batched=True), set PyTorch format.
  4. Configure: TrainingArguments(output_dir=..., push_to_hub=True, hub_model_id="username/repo").
  5. Train: Trainer(...).train() - saves checkpoints to output_dir locally and pushes to Hub at each save.
  6. Evaluate: trainer.evaluate() - returns a metrics dict and logs it to the backends configured in report_to (for example W&B).
  7. Push final: trainer.push_to_hub() - uploads the final model weights, config, a generated draft model card, and the tokenizer files if the tokenizer was passed to the Trainer. The Hub creates or updates the git repository.
  8. Model card: Edit README.md on the Hub to describe training data, metrics, intended use, and limitations. Task filters on the Hub rely on the card's metadata, so a missing or empty card can make the model much harder to discover.

The result is easy to share: anyone with the repo ID can call from_pretrained("username/repo") and get the same weights.


© 2026 EngineersOfAI. All rights reserved.