Chapter 7: Finetuning
Finetuning Overview
Finetuning = adapting a model by further training its weights. Part of transfer learning — transferring knowledge from pre-training to a new task.
Transfer learning improves sample efficiency: training a model from scratch for legal QA might need millions of examples; finetuning a good base model might need only a few hundred.
Types of finetuning:
- Continued pre-training (self-supervised): finetune on raw domain data first (legal docs, Vietnamese text) before expensive instruction finetuning
- Supervised finetuning (SFT): (input, output) pairs — teach the model behavior
- Preference finetuning: (instruction, winning, losing) — align with human preference
- Infilling finetuning: predict missing tokens (for text editing, code debugging)
- Long-context finetuning: extend the context length, which usually requires modifying positional embeddings; harder than the other types; can degrade performance on shorter sequences
When to Finetune
Reasons to Finetune
- Improve task-specific performance (model wasn’t sufficiently trained on your task)
- Common dialect/syntax not well covered (proprietary SQL dialect, DSL)
- Improve instruction-following (structured output formats)
- Bias mitigation (finetune on dataset with female CEOs to reduce gender bias)
- Distillation (small model imitates large model): Grammarly's finetuned Flan-T5 models are 60× smaller than a GPT-3 variant yet outperform it
Reasons Not to Finetune
- May degrade performance on other tasks (catastrophic forgetting risk)
- Expensive data: annotated instruction data is slow and costly
- Requires ML expertise: optimizer choice, learning rate, overfitting diagnosis
- Serving complexity: need to host and optimize finetuned models
- Maintenance burden: new base models may outperform your finetuned model
Rule: try prompting → add examples → RAG → then finetune. Many teams jump to finetuning prematurely, after only unsystematic prompting experiments.
Finetuning vs. RAG
| Failure type | Solution |
|---|---|
| Information-based (factually wrong, outdated) | RAG |
| Behavior-based (wrong format, irrelevant style) | Finetuning |
RAG is for facts; finetuning is for form.
Ovadia et al. (2024): for current-events QA, base model + RAG outperforms finetuned model alone. RAG is generally more impactful per unit of effort.
Combining RAG + finetuning can further boost performance (43% of the time in the experiment).
Recommended workflow:
- Prompting + best practices
- Add few-shot examples
- Connect to data sources (term-based retrieval first)
- If info failures → advanced RAG; if behavior failures → finetuning
- Combine RAG + finetuning for max performance
Memory Bottlenecks
Backpropagation and Trainable Parameters
Training requires two phases:
- Forward pass: compute output from input
- Backward pass: compute gradients, update weights via optimizer
Each trainable parameter needs: its gradient value + optimizer states.
- SGD: 0 optimizer states
- Momentum: 1 optimizer state
- Adam: 2 optimizer states (used by most transformer models)
So: 3 values total (gradient + 2 Adam states) per trainable parameter.
Memory Math
Inference: N × M × 1.2, where N = number of parameters and M = bytes per parameter (the 1.2 factor adds ~20% for activations and KV cache)
Training: weights + activations + gradients + optimizer states
Example: 7B model, FP16 (2 bytes), full finetuning with Adam:
- Weights: 7B × 2 = 14 GB
- Gradients + Adam states: 7B × 3 × 2 = 42 GB
- Total: 56 GB before counting activations (exceeds most consumer GPUs)
Activation memory can dwarf model weight memory — use gradient checkpointing (recompute activations instead of storing) to reduce at the cost of more compute.
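The memory arithmetic above can be reproduced with a short calculator. This is a minimal sketch, assuming 1 GB = 1e9 bytes (as in the example) and ignoring activation memory; the function names and the `trainable_fraction` parameter are illustrative, not from the source.

```python
# Values stored per trainable parameter besides the weight itself:
# 1 gradient + the optimizer's state count.
OPTIMIZER_STATES = {"sgd": 0, "momentum": 1, "adam": 2}


def inference_memory_gb(n_params, bytes_per_param, overhead=1.2):
    """N x M x 1.2: weights plus ~20% for activations and KV cache."""
    return n_params * bytes_per_param * overhead / 1e9


def training_memory_gb(n_params, bytes_per_value, optimizer="adam",
                       trainable_fraction=1.0):
    """Weights + gradients + optimizer states, excluding activations."""
    weights = n_params * bytes_per_value
    values_per_trainable = 1 + OPTIMIZER_STATES[optimizer]
    grads_and_states = (n_params * trainable_fraction
                        * values_per_trainable * bytes_per_value)
    return {
        "weights": weights / 1e9,
        "gradients_and_optimizer": grads_and_states / 1e9,
        "total_excl_activations": (weights + grads_and_states) / 1e9,
    }


# 7B model, 2 bytes per value, full finetuning with Adam.
full = training_memory_gb(7e9, 2, optimizer="adam")
# Same model with only 1% of parameters trainable (a hypothetical PEFT setup).
peft = training_memory_gb(7e9, 2, optimizer="adam", trainable_fraction=0.01)
print(full)
print(peft)
```

Lowering `trainable_fraction` shrinks only the gradient and optimizer-state term, which is the lever PEFT methods pull; the frozen weights still have to fit in memory.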
Numerical Representations
| Format | Bits | Bytes | Notes |
|---|---|---|---|
| FP64 | 64 | 8 | Rarely used in NN |
| FP32 | 32 | 4 | Standard precision |
| TF32 | 19 | 4 (in memory) | NVIDIA-designed, GPU-friendly compute format; 19 significant bits, but values are stored as 32-bit floats |
| BF16 | 16 | 2 | More range than FP16, less precision; Google-designed for TPU |
| FP16 | 16 | 2 | More precision than BF16, less range |
| FP8 | 8 | 1 | Emerging |
| INT8 | 8 | 1 | Integer |
| INT4 | 4 | 0.5 | Integer |
Warning: Llama 2 was released in BF16; loading it in FP16 can noticeably degrade quality.
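The range-versus-precision trade-off between FP16 and BF16 can be seen with the standard library alone. This is a rough sketch: FP16 uses Python's `struct` half-precision format, while BF16 is simulated by truncating an FP32 encoding to its top 16 bits (real hardware typically rounds rather than truncates, so treat the BF16 values as approximate). The helper names are illustrative.

```python
import struct


def to_fp16(x):
    """Round-trip through IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x):
    """Simulate BF16 by keeping only the top 16 bits of the FP32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


for value in (70000.0, 1.001):
    print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

A large value such as 70000.0 exceeds the FP16 range but survives in BF16 with a coarse approximation, whereas a value close to 1 keeps more digits in FP16 than in BF16.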
Quantization
Quantization = converting to lower precision to reduce memory footprint and speed up inference.
Post-training quantization (PTQ): quantize after training; most common; PyTorch/HuggingFace support it out-of-the-box.
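As an illustration of what post-training quantization does to a tensor, here is a minimal sketch of symmetric absmax INT8 quantization on a plain list of floats. It is one simple scheme chosen for clarity, not the method any particular library uses, and the function names are illustrative.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in 1 byte instead of 2 or 4, at the cost of a rounding error bounded by half the scale.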
Training with quantization:
- QAT (Quantization-Aware Training): simulates low-precision during training → better inference quality in low precision, but no speed-up during training
- Mixed precision training: weights kept in higher precision; gradients/activations in lower precision; standard in modern training
1-bit LLMs: BitNet b1.58 (2024) — 1.58 bits per parameter; comparable to 16-bit Llama 2 up to 3.9B parameters.
PEFT (Parameter-Efficient Finetuning)
Full finetuning → all parameters trainable → too much memory.
PEFT → performance close to full finetuning with orders of magnitude fewer trainable parameters.
Houlsby et al. (2019): insert adapter modules into transformer blocks → 0.4% performance difference from full finetuning using only 3% of parameters.
Two PEFT families:
- Adapter-based (additive): add trainable modules to the architecture
- Soft prompt-based: insert trainable continuous tokens into the input
LoRA (Low-Rank Adaptation)
Most popular PEFT technique. No extra inference latency (adapters merge back into base weights).
How it works: For weight matrix W (n×m):
- Create A (n×r) and B (r×m), where r = LoRA rank (much smaller than n or m)
- New weight: W’ = W + (α/r) × A×B
- During finetuning: only A and B are updated; W is frozen
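The update rule above can be sketched with plain Python lists. This is a toy illustration, not a training loop: the dimensions are tiny, and initializing B to zeros (so that W’ equals W before any training) is an assumption borrowed from common LoRA practice rather than something stated in these notes.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def lora_merge(W, A, B, alpha, r):
    """Return W' = W + (alpha / r) * A @ B without modifying W."""
    scale = alpha / r
    delta = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


n, m, r, alpha = 6, 4, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n)]     # frozen
A = [[rng.gauss(0, 0.01) for _ in range(r)] for _ in range(n)]  # trainable
B = [[0.0] * m for _ in range(r)]                               # trainable, zero init

W_merged = lora_merge(W, A, B, alpha, r)
full_params = n * m        # values updated by full finetuning
lora_params = r * (n + m)  # values updated by LoRA
print(full_params, lora_params)
```

At these toy sizes the saving is small; it grows quickly once n and m are in the thousands and r stays small.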
GPT-3 175B: LoRA with 4.7M trainable params (0.0027% of full) → comparable performance.
Why LoRA works: Pre-training implicitly minimizes a model’s intrinsic dimension. Better-trained models tend to have lower intrinsic dimensions, so they can be finetuned with fewer trainable parameters. Counterintuitively, larger models tend to be easier to adapt with PEFT than smaller ones.
LoRA configurations:
- Apply to Q, K, V, O matrices (attention) and/or feedforward layers
- Databricks: biggest boost from feedforward layers; but in memory-constrained settings, attention matrices typically offer better bang-for-buck
- Small rank (r = 4–64) usually sufficient; higher rank rarely improves performance and can overfit
- Common α/r ratio: between 1:8 and 8:1
Multi-LoRA serving: Keep W separate from A, B → share base model across 100 customers with only 100 tiny adapters (~6 MB each vs. 26 GB each)
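The storage saving from sharing one base model can be checked with parameter counts. The sketch below uses hypothetical numbers (a single 4096×4096 weight matrix, rank 8, 100 customers) chosen for illustration; it does not reproduce the 6 MB and 26 GB figures above, which refer to whole models.

```python
def full_finetune_params(n, m, num_models):
    """Each customer gets its own full copy of the n x m matrix."""
    return num_models * n * m


def multi_lora_params(n, m, r, num_adapters):
    """One shared n x m matrix plus an (n x r, r x m) adapter pair per customer."""
    return n * m + num_adapters * r * (n + m)


n = m = 4096
separate = full_finetune_params(n, m, num_models=100)
shared = multi_lora_params(n, m, r=8, num_adapters=100)
print(separate, shared, round(separate / shared, 1))
```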
QLoRA: store model weights in 4-bit (NF4 format) during finetuning; dequantize to BF16 for the forward/backward pass. Enables finetuning a 65B model on a single 48 GB GPU. Guanaco 65B often preferred over ChatGPT in human evaluation.
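A back-of-the-envelope check shows why 4-bit storage makes the 48 GB figure plausible. This sketch counts weight storage only, assuming 1 GB = 1e9 bytes; it ignores quantization constants, the LoRA adapters, optimizer states, and activations, all of which need additional memory.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 65e9
sixteen_bit_gb = weight_memory_gb(n_params, 16)
four_bit_gb = weight_memory_gb(n_params, 4)
print(sixteen_bit_gb, four_bit_gb)
```

In 16-bit the weights alone exceed 48 GB, while in 4-bit they leave room for the remaining training state.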
Soft Prompts
- Hard prompts: human-readable discrete tokens
- Soft prompts: trainable continuous embedding vectors prepended to input
- Prefix tuning: prepend soft tokens at every transformer layer
- Prompt tuning: prepend soft tokens only at the input embedding layer
- Less popular than LoRA but useful when you want more than prompting but less than full finetuning
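Prompt tuning, as described above, amounts to prepending a few trainable vectors to the frozen input embeddings. The sketch below shows only that data flow with plain lists; the sizes, the zero-valued stand-in embeddings, and the function names are hypothetical, and no training is performed.

```python
import random


def init_soft_prompt(num_tokens, dim, seed=0):
    """Create num_tokens trainable vectors of size dim (small random init)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(num_tokens)]


def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Model input = soft prompt vectors followed by the real token embeddings."""
    return soft_prompt + token_embeddings


dim = 8
token_embeddings = [[0.0] * dim for _ in range(5)]  # stand-in for a frozen embedding lookup
soft_prompt = init_soft_prompt(num_tokens=3, dim=dim)
model_input = prepend_soft_prompt(soft_prompt, token_embeddings)
trainable_values = len(soft_prompt) * dim
print(len(model_input), trainable_values)
```

Only the soft prompt vectors would receive gradient updates; prefix tuning would repeat the same idea at every transformer layer rather than only at the input.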
Model Merging
Combining multiple models into one → better performance or reduced memory footprint.
Use cases:
- Combine complementary models (one answers the first 60% of questions, another the last 60%; merged, they might answer more than either alone)
- Multi-task finetuning without catastrophic forgetting (finetune separately then merge)
- On-device deployment (merge multiple small models into one)
- Federated learning (merge copies trained on different user data)
- Model upscaling (create larger model from existing model)
Merging without GPU is possible (no backprop needed) → attractive for indie developers.
Three merging approaches:
1. Summing:
- Linear combination (weighted average): W’ = (wA × A + wB × B) / (wA + wB)
- Task arithmetic (Ilharco et al., 2022): task vector = finetuned model − base model; add task vectors to combine capabilities; subtract to remove capabilities
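- SLERP (spherical linear interpolation): interpolate along the geodesic (shortest arc) between two weight vectors on a sphere; operates on two models at a time, so merging more than two means applying it sequentially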
Pruning before merging:
- TIES (Yadav et al., 2023) and DARE (Yu et al., 2023): prune redundant task-vector parameters before merging (keeping only the top 20% may be sufficient) → improves merged model quality
2. Layer stacking (frankenmerging):
- Take layers from different models and stack them
- Creates unique architectures; needs further finetuning
- Used to create MoE models (Sparse Upcycling)
- Related but distinct: Together AI’s Mixture-of-Agents matched GPT-4o using 6 weaker open source models; it ensembles the models' outputs rather than stacking their layers
- SOLAR 10.7B: created from 7B model by stacking/summing layers
3. Concatenation: Merge LoRA adapters by concatenating (rank = r1 + r2). Not recommended (increases memory like serving two separate models).
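The summing approaches listed under item 1 can be sketched on flattened weight vectors. This is a simplified illustration with toy numbers: the magnitude-based trimming is only loosely inspired by the pruning step and is not a full implementation of TIES or DARE, and all function names are illustrative.

```python
import math


def linear_merge(a, b, w_a=1.0, w_b=1.0):
    """Weighted average of two weight vectors."""
    return [(w_a * x + w_b * y) / (w_a + w_b) for x, y in zip(a, b)]


def task_vector(finetuned, base):
    """Task vector = finetuned weights minus base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def apply_task_vectors(base, vectors, scale=1.0):
    """Add task vectors to the base (use a negative scale to remove a capability)."""
    merged = list(base)
    for vec in vectors:
        merged = [p + scale * d for p, d in zip(merged, vec)]
    return merged


def trim(vector, keep_fraction=0.2):
    """Zero out all but the largest-magnitude entries of a task vector."""
    k = max(1, round(len(vector) * keep_fraction))
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


def slerp(a, b, t):
    """Spherical linear interpolation between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    omega = math.acos(max(-1.0, min(1.0, dot / norms)))
    if omega < 1e-6:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(omega)
    c_a, c_b = math.sin((1 - t) * omega) / s, math.sin(t * omega) / s
    return [c_a * x + c_b * y for x, y in zip(a, b)]


base = [1.0, 1.0]
model_a, model_b = [2.0, 1.0], [1.0, 3.0]
vectors = [task_vector(model_a, base), task_vector(model_b, base)]
print(linear_merge(model_a, model_b))
print(apply_task_vectors(base, vectors))
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

None of these operations needs gradients, which is why merging can run on a CPU.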
Finetuning Tactics
Choosing a base model:
- Start with the strongest model to establish upper bound
- Progression path: cheap model → medium model → strong model → map price/performance frontier
- Distillation path: start with strong model + small data → generate more data → train cheaper model
Finetuning method selection:
- Start with LoRA; attempt full finetuning later
- Full finetuning: needs thousands to millions of examples; best performance
- LoRA: works well with hundreds to thousands of examples; enables efficient multi-model serving
Key hyperparameters:
- Learning rate: 1e-7 to 1e-3; common: last pre-training LR × 0.1 to 1.0; use learning rate schedules
- Batch size: ≥8 for stable training; use gradient accumulation if memory-constrained
- Number of epochs: 1–2 for millions of examples; 4–10 for thousands; watch training vs. validation loss for overfitting
- Prompt loss weight: 0% means model learns only from responses; default ~10%
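Gradient accumulation, mentioned under batch size, can be illustrated on a one-parameter model. The sketch below assumes a toy model y_hat = w * x with mean squared error and made-up data; it shows that accumulating micro-batch gradients, each weighted by its share of the examples, matches the gradient of the full batch.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error with respect to w for y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Sum micro-batch gradients, weighted by each micro-batch's share of the data."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro_batch = data[start:start + micro_batch_size]
        total += mse_grad(w, micro_batch) * len(micro_batch) / len(data)
    return total


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 examples of y = 3x
w = 0.0
full_batch = mse_grad(w, data)                          # needs all 8 examples in memory
accumulated = accumulated_grad(w, data, micro_batch_size=2)  # 2 examples at a time
w_next = w - 0.01 * accumulated                         # one optimizer step after accumulation
print(full_batch, accumulated, w_next)
```

The optimizer step happens once per accumulated batch, so the effective batch size is 8 even though only 2 examples are processed at a time.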
Key Takeaways
- Finetuning is for form (behavior, style, format); RAG is for facts (information)
- Start with prompting → RAG → finetuning; don’t jump to finetuning prematurely
- Memory is the primary constraint; PEFT (especially LoRA) makes finetuning accessible with minimal quality loss
- LoRA works because pre-training compresses intrinsic dimensions; better base = easier PEFT
- Multi-LoRA serving enables efficient per-customer customization with minimal storage overhead
- QLoRA enables finetuning large models on a single GPU (65B model on a single 48 GB GPU)
- Model merging enables multi-task models, on-device deployment, and federated learning without retraining