Chapter 7: Finetuning

Finetuning Overview

Finetuning = adapting a model by further training its weights. Part of transfer learning — transferring knowledge from pre-training to a new task.

Transfer learning improves sample efficiency: training a legal QA model from scratch might need millions of examples, whereas finetuning a good base model might need only a few hundred.

Types of finetuning:

  • Continued pre-training (self-supervised): finetune on raw domain data first (legal docs, Vietnamese text) before expensive instruction finetuning
  • Supervised finetuning (SFT): (input, output) pairs — teach the model behavior
  • Preference finetuning: (instruction, winning, losing) — align with human preference
  • Infilling finetuning: predict missing tokens (for text editing, code debugging)
  • Long-context finetuning: usually requires modifying positional embeddings; harder than the other types; can degrade performance on shorter sequences
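The data each type consumes can be sketched as plain records. This is only an illustration of the shapes named in the list above; every field name and example value is hypothetical rather than a format required by any particular library.

```python
# Illustrative record shapes only; all field names and values are hypothetical.

# Continued pre-training: raw domain text, no labels.
continued_pretraining = {"text": "The lessee shall return the premises in good condition."}

# Supervised finetuning: (input, output) pairs.
sft = {
    "input": "Summarize the clause: The lessee shall return the premises in good condition.",
    "output": "The tenant must hand the property back undamaged.",
}

# Preference finetuning: (instruction, winning, losing) triples.
preference = {
    "instruction": "Explain what a lessee is.",
    "winning": "A lessee is the party that rents property from its owner.",
    "losing": "A lessee is a kind of contract.",
}

# Infilling: the model predicts the missing span between a prefix and a suffix.
infilling = {"prefix": "def add(a, b):\n    return ", "missing": "a + b", "suffix": "\n"}
```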

When to Finetune

Reasons to Finetune

  • Improve task-specific performance (model wasn’t sufficiently trained on your task)
  • Common dialect/syntax not well covered (proprietary SQL dialect, DSL)
  • Improve instruction-following (structured output formats)
  • Bias mitigation (finetune on dataset with female CEOs to reduce gender bias)
  • Distillation (small model imitates large model): Grammarly's finetuned Flan-T5 models outperformed a GPT-3 variant despite being 60× smaller

Reasons Not to Finetune

  • May degrade performance on other tasks (catastrophic forgetting risk)
  • Expensive data: annotated instruction data is slow and costly
  • Requires ML expertise: optimizer choice, learning rate, overfitting diagnosis
  • Serving complexity: need to host and optimize finetuned models
  • Maintenance burden: new base models may outperform your finetuned model

Rule: try prompting → add examples → RAG → then finetune. Many teams jump to finetuning prematurely, after only unsystematic prompting experiments.

Finetuning vs. RAG

Failure type                                     Solution
Information-based (factually wrong, outdated)    RAG
Behavior-based (wrong format, irrelevant style)  Finetuning

RAG is for facts; finetuning is for form.

Ovadia et al. (2024): for current-events QA, base model + RAG outperforms finetuned model alone. RAG is generally more impactful per unit of effort.

Combining RAG + finetuning can boost performance further, but not reliably: it helped 43% of the time in the same experiment.

Recommended workflow:

  1. Prompting + best practices
  2. Add few-shot examples
  3. Connect to data sources (term-based retrieval first)
  4. If info failures → advanced RAG; if behavior failures → finetuning
  5. Combine RAG + finetuning for max performance

Memory Bottlenecks

Backpropagation and Trainable Parameters

Training requires two phases:

  1. Forward pass: compute output from input
  2. Backward pass: compute gradients, update weights via optimizer

Each trainable parameter needs: its gradient value + optimizer states.

  • SGD: 0 optimizer states
  • Momentum: 1 optimizer state
  • Adam: 2 optimizer states (used by most transformer models)

So: 3 values total (gradient + 2 Adam states) per trainable parameter.

Memory Math

Inference: N × M × 1.2, where N = number of parameters and M = bytes per parameter (the 1.2 adds ~20% for activations and KV cache)
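As a rough sketch, the inference estimate can be written as a one-line function, reading N as the parameter count and M as the bytes per parameter. The function name and the choice of 1 GB = 1e9 bytes are assumptions made for illustration.

```python
def inference_memory_gb(num_params: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Estimate inference memory as N x M x 1.2, in GB (assuming 1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param * overhead / 1e9


# A 7B-parameter model stored in FP16 (2 bytes per parameter): roughly 16.8 GB.
estimate = inference_memory_gb(7e9, 2)
print(f"{estimate:.1f} GB")
```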

Training: weights + activations + gradients + optimizer states

Example: 7B model, FP16 (2 bytes), full finetuning with Adam:

  • Weights: 7B × 2 = 14 GB
  • Gradients + Adam states: 7B × 3 × 2 = 42 GB
  • Total: 56 GB before activations (exceeds most consumer GPUs)
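The arithmetic above can be reproduced with a small calculator. This is a sketch under simplifying assumptions: gradients and optimizer states are stored at the same byte width as the weights, activations are excluded, and the `trainable_fraction` parameter is a hypothetical knob added to show why freezing most weights shrinks the total.

```python
# Optimizer states kept per trainable parameter.
OPTIMIZER_STATES = {"sgd": 0, "momentum": 1, "adam": 2}


def training_memory_gb(num_params, bytes_per_value, optimizer="adam", trainable_fraction=1.0):
    """Weights + gradients + optimizer states in GB (1 GB = 1e9 bytes); activations excluded."""
    weights = num_params * bytes_per_value
    # One gradient value plus the optimizer states for each trainable parameter.
    values_per_trainable = 1 + OPTIMIZER_STATES[optimizer]
    grads_and_states = num_params * trainable_fraction * values_per_trainable * bytes_per_value
    return {
        "weights": weights / 1e9,
        "grads_and_optimizer": grads_and_states / 1e9,
        "total": (weights + grads_and_states) / 1e9,
    }


full = training_memory_gb(7e9, 2, optimizer="adam")
print(full)  # weights ~14 GB, gradients + Adam states ~42 GB, total ~56 GB
```

Setting `trainable_fraction` to a small value (for example 0.01) leaves the weight term unchanged and shrinks only the gradient and optimizer term, which is the lever parameter-efficient methods pull.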

Activation memory can dwarf model weight memory — use gradient checkpointing (recompute activations instead of storing) to reduce at the cost of more compute.

Numerical Representations

Format  Bits  Bytes  Notes
FP64    64    8      Rarely used in NN
FP32    32    4      Standard precision
TF32    19    ~2.4   NVIDIA-designed, GPU-friendly
BF16    16    2      More range than FP16, less precision; Google-designed for TPU
FP16    16    2      More precision than BF16, less range
FP8     8     1      Emerging
INT8    8     1      Integer
INT4    4     0.5    Integer

Warning: loading a model trained in BF16 (e.g., Llama 2) in FP16 can cause quality degradation, because FP16 covers a smaller range of values.
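The range-versus-precision trade-off between FP16 and BF16 can be seen with the standard library alone. The sketch below round-trips values through IEEE half precision using `struct`, and simulates BF16 by truncating an FP32 encoding to its top 16 bits. Truncation is a simplification (real conversions typically round), and both helper names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; report overflow as infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Simulate BF16 by keeping only the top 16 bits of the FP32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Range: 70000 is beyond FP16's largest finite value but fits in BF16 (coarsely).
print(to_fp16(70000.0), to_bf16(70000.0))  # inf 69632.0

# Precision: FP16 keeps more mantissa bits than BF16 near 1.0.
print(to_fp16(1.001), to_bf16(1.001))  # 1.0009765625 1.0
```

A value that BF16 stores coarsely but finitely can overflow in FP16, which is one way loading BF16-trained weights in FP16 can go wrong.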

Quantization

Quantization = converting to lower precision to reduce memory footprint and speed up inference.

Post-training quantization (PTQ): quantize after training; most common; PyTorch/HuggingFace support it out-of-the-box.
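To make the idea concrete, here is a minimal sketch of one simple post-training scheme: symmetric absmax quantization to INT8 with a single scale per tensor. It is an assumption chosen for illustration, not the scheme any particular library applies by default, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared absmax scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for all-zero input
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)  # [50, -127, 0, 100]
print(max(abs(w - r) for w, r in zip(weights, restored)))  # small rounding error
```

Each value now needs 1 byte instead of 2 or 4, at the cost of a rounding error of at most half the scale per value.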

Training with quantization:

  • QAT (Quantization-Aware Training): simulates low-precision during training → better inference quality in low precision, but no speed-up during training
  • Mixed precision training: weights kept in higher precision; gradients/activations in lower precision; standard in modern training

1-bit LLMs: BitNet b1.58 (2024) — 1.58 bits per parameter; comparable to 16-bit Llama 2 up to 3.9B parameters.


PEFT (Parameter-Efficient Finetuning)

Full finetuning → all parameters trainable → too much memory.
PEFT → performance close to full finetuning with orders of magnitude fewer trainable parameters.

Houlsby et al. (2019): insert adapter modules into transformer blocks → 0.4% performance difference from full finetuning using only 3% of parameters.

Two PEFT families:

  1. Adapter-based (additive): add trainable modules to the architecture
  2. Soft prompt-based: insert trainable continuous tokens into the input

LoRA (Low-Rank Adaptation)

Most popular PEFT technique. No extra inference latency (adapters merge back into base weights).

How it works: For weight matrix W (n×m):

  1. Create A (n×r) and B (r×m), where r = LoRA rank (much smaller than n or m)
  2. New weight: W’ = W + (α/r) × A×B
  3. During finetuning: only A and B are updated; W is frozen
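The three steps above can be sketched with plain Python lists standing in for tensors. The helper names are hypothetical, and the tiny matrices are chosen only so the arithmetic is easy to check by hand.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_merge(W, A, B, alpha):
    """Return W' = W + (alpha / r) * A x B, where r is the shared inner dimension."""
    r = len(B)  # B is r x m, so its row count is the LoRA rank
    scale = alpha / r
    AB = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, AB)]


def lora_param_counts(n, m, r):
    """Trainable parameters: full matrix (n*m) versus the LoRA factors (r*(n+m))."""
    return n * m, r * (n + m)


W = [[1, 0], [0, 1]]   # frozen 2x2 weight matrix
A = [[1], [2]]         # 2x1 trainable factor (rank r = 1)
B = [[3, 4]]           # 1x2 trainable factor
print(lora_merge(W, A, B, alpha=2))      # [[7.0, 8.0], [12.0, 17.0]]
print(lora_param_counts(4096, 4096, 8))  # (16777216, 65536)
```

Because the merged matrix has the same shape as W, it can replace W at inference time, which is why merged LoRA adds no extra latency.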

GPT-3 175B: LoRA with 4.7M trainable params (0.0027% of full) → comparable performance.

Why LoRA works: Pre-training implicitly compresses a model's intrinsic dimension. Better-trained models have lower intrinsic dimensions → easier to finetune with fewer parameters. Counterintuitively, larger models tend to be easier to adapt with PEFT than smaller ones.

LoRA configurations:

  • Apply to Q, K, V, O matrices (attention) and/or feedforward layers
  • Databricks: biggest boost from feedforward layers; but in memory-constrained settings, attention matrices typically offer better bang-for-buck
  • Small rank (r = 4–64) usually sufficient; higher rank rarely improves performance and can overfit
  • Common α/r ratio: between 1:8 and 8:1

Multi-LoRA serving: Keep W separate from A, B → share base model across 100 customers with only 100 tiny adapters (~6 MB each vs. 26 GB each)
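A minimal sketch of the serving idea: one shared base matrix plus a dictionary of small per-customer adapters, with the low-rank path computed at request time instead of merged in. The class and method names are hypothetical, and lists stand in for tensors.

```python
def vecmat(x, M):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(xi * row[j] for xi, row in zip(x, M)) for j in range(len(M[0]))]


class MultiLoRALinear:
    """One frozen base matrix W shared by many (A, B) adapters."""

    def __init__(self, W):
        self.W = W
        self.adapters = {}

    def add_adapter(self, name, A, B, alpha):
        # Store only the small factors and the alpha / r scale.
        self.adapters[name] = (A, B, alpha / len(B))

    def forward(self, x, adapter=None):
        y = vecmat(x, self.W)  # shared base path
        if adapter is not None:
            A, B, scale = self.adapters[adapter]
            delta = vecmat(vecmat(x, A), B)  # low-rank path: (x A) B
            y = [yi + scale * di for yi, di in zip(y, delta)]
        return y


layer = MultiLoRALinear([[1, 0], [0, 1]])
layer.add_adapter("customer_a", A=[[1], [2]], B=[[3, 4]], alpha=2)
print(layer.forward([1, 1]))                        # [1, 1]
print(layer.forward([1, 1], adapter="customer_a"))  # [19.0, 25.0]
```

The trade-off is that unmerged adapters add a little computation per request, in exchange for storing the large base weights only once.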

QLoRA: store model weights in 4-bit (NF4 format) during finetuning; dequantize to BF16 for the forward/backward pass. Enables finetuning a 65B model on a single 48 GB GPU. Guanaco 65B (finetuned with QLoRA) was often preferred over ChatGPT in human evaluation.

Soft Prompts

  • Hard prompts: human-readable discrete tokens
  • Soft prompts: trainable continuous embedding vectors prepended to input
  • Prefix tuning: prepend soft tokens at every transformer layer
  • Prompt tuning: prepend soft tokens only at the input embedding layer
  • Less popular than LoRA but useful when you want more than prompting but less than full finetuning
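Prompt tuning (soft tokens at the input embedding layer only) can be sketched as a list concatenation: trainable vectors are prepended to the frozen embeddings of the real tokens. The embedding table, its values, and the function name are all hypothetical.

```python
# Frozen token embeddings (hypothetical 2-dimensional values).
embedding_table = {"translate": [0.1, 0.2], "hello": [0.3, 0.4]}

# Two trainable soft-prompt vectors; only these would be updated during finetuning.
soft_prompt = [[0.0, 0.0], [0.0, 0.0]]


def build_model_input(tokens):
    """Prepend the soft-prompt vectors to the embeddings of the hard-prompt tokens."""
    return soft_prompt + [embedding_table[t] for t in tokens]


model_input = build_model_input(["translate", "hello"])
print(len(model_input))  # 4 vectors: 2 soft + 2 real tokens
```

Prefix tuning would repeat this prepending step at every transformer layer rather than only at the input.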

Model Merging

Combining multiple models into one → better performance or reduced memory footprint.

Use cases:

  • Combine complementary models (e.g., one answers the first 60% of questions well and another the last 60%, so a merge may cover more than either alone)
  • Multi-task finetuning without catastrophic forgetting (finetune separately then merge)
  • On-device deployment (merge multiple small models into one)
  • Federated learning (merge copies trained on different user data)
  • Model upscaling (create larger model from existing model)

Merging without GPU is possible (no backprop needed) → attractive for indie developers.

Three merging approaches:

1. Summing:

  • Linear combination (weighted average): W’ = (wA × A + wB × B) / (wA + wB)
  • Task arithmetic (Ilharco et al., 2022): task vector = finetuned model − base model; add task vectors to combine capabilities; subtract to remove capabilities
  • SLERP (spherical linear interpolation): merge along geodesic on parameter sphere; works for exactly 2 models
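The three summing variants can be sketched on flattened parameter vectors. This is a simplified illustration: real merging tools typically work tensor by tensor, and the function names here are hypothetical.

```python
import math


def linear_merge(a, b, w_a=1.0, w_b=1.0):
    """Weighted average of two parameter vectors."""
    return [(w_a * x + w_b * y) / (w_a + w_b) for x, y in zip(a, b)]


def task_vector(finetuned, base):
    """Task vector = finetuned model minus base model."""
    return [f - b for f, b in zip(finetuned, base)]


def add_task_vectors(base, vectors, sign=1.0):
    """Add task vectors to the base (sign=-1.0 subtracts them instead)."""
    return [b + sign * sum(parts) for b, parts in zip(base, zip(*vectors))]


def slerp(a, b, t):
    """Spherical linear interpolation between two vectors, t in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    omega = math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))
    s = math.sin(omega)
    if s < 1e-6:  # (anti)parallel vectors: fall back to plain linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    c_a = math.sin((1 - t) * omega) / s
    c_b = math.sin(t * omega) / s
    return [c_a * x + c_b * y for x, y in zip(a, b)]


base = [1.0, 1.0]
model_a, model_b = [2.0, 1.0], [1.0, 3.0]
print(linear_merge(model_a, model_b))  # [1.5, 2.0]
vectors = [task_vector(model_a, base), task_vector(model_b, base)]
print(add_task_vectors(base, vectors))  # [2.0, 3.0]
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))  # both components close to 0.7071
```

In the last example, linear averaging would give [0.5, 0.5], a shorter vector, while SLERP keeps the result on the unit circle.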

Pruning before merging:

  • TIES (Yadav et al., 2023) and DARE (Yu et al., 2023): prune redundant task-vector parameters before merging (keeping only the top 20% may be sufficient) → improves merged model quality
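The pruning idea can be sketched as a magnitude trim on a flattened task vector: keep the largest entries and zero the rest. This shows only a trimming step in the spirit of TIES; the full methods do more (TIES also resolves sign conflicts between task vectors, and DARE drops entries at random and rescales the survivors), and the function name is hypothetical.

```python
def trim_task_vector(task_vector, keep_fraction=0.2):
    """Keep the largest-magnitude entries of a task vector and zero out the rest."""
    k = max(1, int(round(len(task_vector) * keep_fraction)))
    by_magnitude = sorted(range(len(task_vector)), key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(by_magnitude[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]


tv = [0.01, -0.9, 0.02, 0.5, -0.03, 0.0, 0.04, 0.7, -0.05, 0.06]
print(trim_task_vector(tv, keep_fraction=0.2))
# Only -0.9 and 0.7 survive; the other eight entries become 0.0.
```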

2. Layer stacking (frankenmerging):

  • Take layers from different models and stack them
  • Creates unique architectures; needs further finetuning
  • Used to create MoE models (Sparse Upcycling)
  • Related ensembling result (model outputs are combined rather than weights merged): Together AI's Mixture-of-Agents matched GPT-4o using 6 weaker open source models
  • SOLAR 10.7B: created from 7B model by stacking/summing layers
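Layer stacking can be sketched with lists of layer names standing in for real layers. The layer counts and slice ranges below are illustrative assumptions, not a recipe from any specific model, and the function name is hypothetical.

```python
def stack_layers(model_a_layers, model_b_layers, a_range, b_range):
    """Build a new layer stack from a slice of model A followed by a slice of model B."""
    a_start, a_stop = a_range
    b_start, b_stop = b_range
    return model_a_layers[a_start:a_stop] + model_b_layers[b_start:b_stop]


# Two hypothetical 32-layer models with compatible layer shapes.
model_a = [f"A.layer{i}" for i in range(32)]
model_b = [f"B.layer{i}" for i in range(32)]

merged = stack_layers(model_a, model_b, (0, 24), (8, 32))
print(len(merged))             # 48
print(merged[23], merged[24])  # A.layer23 B.layer8
```

The seam between the two slices is where the stacked model has never been trained end to end, which is why further finetuning is usually needed.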

3. Concatenation: Merge LoRA adapters by concatenating (rank = r1 + r2). Not recommended (increases memory like serving two separate models).
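A small check of why concatenation behaves like serving both adapters: placing the A factors side by side and the B factors on top of each other gives a product equal to the sum of the two individual updates, at rank r1 + r2. The sketch assumes any α/r scaling is already folded into the factors; helper names are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def concat_adapters(A1, B1, A2, B2):
    """Concatenate A factors column-wise and B factors row-wise."""
    A = [r1 + r2 for r1, r2 in zip(A1, A2)]  # n x (r1 + r2)
    B = B1 + B2                              # (r1 + r2) x m
    return A, B


A1, B1 = [[1], [2]], [[3, 4]]  # adapter 1, rank 1
A2, B2 = [[0], [1]], [[5, 6]]  # adapter 2, rank 1

A, B = concat_adapters(A1, B1, A2, B2)
summed = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(matmul(A1, B1), matmul(A2, B2))]
print(matmul(A, B))  # [[3, 4], [11, 14]]
print(summed)        # [[3, 4], [11, 14]]
print(len(B))        # 2, the combined rank r1 + r2
```

Since the concatenated adapter stores every parameter of both originals, it saves nothing over keeping the two adapters separately.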


Finetuning Tactics

Choosing a base model:

  • Start with the strongest model to establish upper bound
  • Progression path: cheap model → medium model → strong model → map price/performance frontier
  • Distillation path: start with strong model + small data → generate more data → train cheaper model

Finetuning method selection:

  • Start with LoRA; attempt full finetuning later
  • Full finetuning: needs thousands to millions of examples; best performance
  • LoRA: works well with hundreds to thousands of examples; enables efficient multi-model serving

Key hyperparameters:

  • Learning rate: 1e-7 to 1e-3; common: last pre-training LR × 0.1 to 1.0; use learning rate schedules
  • Batch size: ≥8 for stable training; use gradient accumulation if memory-constrained
  • Number of epochs: 1–2 for millions of examples; 4–10 for thousands; watch training vs. validation loss for overfitting
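  • Prompt loss weight: how much prompt tokens contribute to the loss relative to response tokens; 0% means the model learns only from responses, 100% means prompt tokens count as much as response tokens; default ~10%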
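Two of these knobs reduce to simple arithmetic. The sketch below shows how many micro-batches gradient accumulation needs to reach a target batch size, and how a prompt loss weight changes a per-sequence loss. The weighted-average normalization is one possible convention, assumed here for illustration, and both function names are hypothetical.

```python
import math


def accumulation_steps(target_batch_size: int, micro_batch_size: int) -> int:
    """Micro-batches to accumulate before each optimizer step."""
    return math.ceil(target_batch_size / micro_batch_size)


def weighted_sequence_loss(token_losses, is_prompt, prompt_loss_weight=0.1):
    """Weighted average of per-token losses; prompt tokens are down-weighted."""
    weights = [prompt_loss_weight if p else 1.0 for p in is_prompt]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, token_losses)) / total


print(accumulation_steps(8, 2))  # 4

token_losses = [2.0, 2.0, 1.0, 1.0]      # first two tokens belong to the prompt
is_prompt = [True, True, False, False]
print(weighted_sequence_loss(token_losses, is_prompt, prompt_loss_weight=0.0))  # 1.0
print(weighted_sequence_loss(token_losses, is_prompt, prompt_loss_weight=1.0))  # 1.5
```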

Key Takeaways

  • Finetuning is for form (behavior, style, format); RAG is for facts (information)
  • Start with prompting → RAG → finetuning; don’t jump to finetuning prematurely
  • Memory is the primary constraint; PEFT (especially LoRA) makes finetuning accessible with minimal quality loss
  • LoRA works because pre-training compresses intrinsic dimensions; better base = easier PEFT
  • Multi-LoRA serving enables efficient per-customer customization with minimal storage overhead
  • QLoRA enables finetuning large models on a single GPU (65B model on a single 48 GB GPU)
  • Model merging enables multi-task models, on-device deployment, and federated learning without retraining