Chapter 7: Finetuning

Finetuning Overview

Finetuning = adapting a model by further training its weights. Part of transfer learning — transferring knowledge from pre-training to a new task.

Transfer learning improves sample efficiency: training a model from scratch for legal QA might need millions of examples; finetuning a good base model might need only a few hundred.

Types of finetuning:

Continued pre-training (self-supervised): finetune on raw domain data first (legal docs, Vietnamese text) before expensive instruction finetuning
Supervised finetuning (SFT): (input, output) pairs — teach the model behavior
Preference finetuning: (instruction, winning, losing) — align with human preference
Infilling finetuning: predict missing tokens (for text editing, code debugging)
Long-context finetuning: usually requires modifying positional embeddings; harder than the other types, and can degrade performance on short sequences
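
The data shapes for SFT and preference finetuning can be sketched as plain records. This is only an illustration: the class names, field names, and example texts below are hypothetical, not a format required by any particular library.

```python
from dataclasses import dataclass


@dataclass
class SFTExample:
    """One supervised finetuning record: an input and the desired output."""
    prompt: str
    response: str


@dataclass
class PreferenceExample:
    """One preference record: an instruction plus a preferred and a rejected response."""
    instruction: str
    winning_response: str
    losing_response: str


# Hypothetical records, written only to show the shape of the data.
sft = SFTExample(
    prompt="Summarize: The tenant must give 30 days' notice before moving out.",
    response="Tenants need to give 30 days' notice.",
)
pref = PreferenceExample(
    instruction="Explain what a lease is.",
    winning_response="A lease is a contract that lets someone use property "
                     "for a set period in exchange for payment.",
    losing_response="It's a legal thing.",
)
```

Continued pre-training needs no such pairing: it uses raw text only.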

When to Finetune

Reasons to Finetune

Improve task-specific performance (model wasn’t sufficiently trained on your task)
Common dialect/syntax not well covered (proprietary SQL dialect, DSL)
Improve instruction-following (structured output formats)
Bias mitigation (e.g., finetune on a dataset that includes many female CEOs to reduce gender bias)
Distillation (small model imitates large model): Grammarly's finetuned Flan-T5 models are 60× smaller than a GPT-3 variant but outperform it on text-editing tasks

Reasons Not to Finetune

May degrade performance on other tasks (catastrophic forgetting risk)
Expensive data: annotated instruction data is slow and costly
Requires ML expertise: optimizer choice, learning rate, overfitting diagnosis
Serving complexity: need to host and optimize finetuned models
Maintenance burden: new base models may outperform your finetuned model

Rule: try prompting → add examples → RAG → then finetune. Many teams finetune prematurely, after only a few unsystematic prompting experiments.

Finetuning vs. RAG

Failure type	Solution
Information-based (factually wrong, outdated)	RAG
Behavior-based (wrong format, irrelevant style)	Finetuning

RAG is for facts; finetuning is for form.

Ovadia et al. (2024): for current-events QA, base model + RAG outperforms finetuned model alone. RAG is generally more impactful per unit of effort.

Combining RAG + finetuning can further boost performance (adding RAG to a finetuned model helped 43% of the time in the experiment).

Recommended workflow:

Prompting + best practices
Add few-shot examples
Connect to data sources (term-based retrieval first)
If info failures → advanced RAG; if behavior failures → finetuning
Combine RAG + finetuning for max performance

Memory Bottlenecks

Backpropagation and Trainable Parameters

Training requires two phases:

Forward pass: compute output from input
Backward pass: compute gradients, update weights via optimizer

Each trainable parameter needs: its gradient value + optimizer states.

SGD: 0 optimizer states
Momentum: 1 optimizer state
Adam: 2 optimizer states (used by most transformer models)

So: 3 values total (gradient + 2 Adam states) per trainable parameter.
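
To see where Adam's two states come from, here is a minimal sketch of one Adam update for a single scalar parameter. The hyperparameter defaults (lr, beta1, beta2, eps) are common choices assumed for illustration; `m` and `v` are the two per-parameter states counted above.

```python
import math

# Values stored per trainable parameter: the gradient plus the optimizer states.
VALUES_PER_TRAINABLE_PARAM = {"sgd": 1 + 0, "momentum": 1 + 1, "adam": 1 + 2}


def adam_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # state 1: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # state 2: running mean of squared gradients
    m_hat = m / (1 - beta1 ** step)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=0.5, m=m, v=v, step=1)
```

Both `m` and `v` must be kept between steps, which is why each trainable parameter costs three stored values under Adam.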

Memory Math

Inference: N × M × 1.2, where N = number of parameters and M = bytes per parameter (the 1.2 factor adds ~20% for activations and KV cache)
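
The inference rule of thumb as a small helper. A rough sketch only: the function name is hypothetical, 1 GB is taken as 1e9 bytes, and the 20% overhead is the approximation stated above, not a measured value.

```python
def inference_memory_gb(num_params, bytes_per_param, overhead=1.2):
    """Estimate inference memory in GB: parameters x bytes per parameter x overhead."""
    return num_params * bytes_per_param * overhead / 1e9


# A 7B-parameter model in a 16-bit format (2 bytes per parameter).
estimate = inference_memory_gb(7e9, 2)
print(f"{estimate:.1f} GB")
```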

Training: weights + activations + gradients + optimizer states

Example: 7B model, FP16 (2 bytes), full finetuning with Adam:

Weights: 7B × 2 = 14 GB
Gradients + Adam states: 7B × 3 × 2 = 42 GB
Total: 56 GB, not counting activations (exceeds most consumer GPUs)
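
The same arithmetic as a reusable sketch. The function and the `trainable_fraction` parameter are hypothetical helpers; activations are deliberately left out, as in the example above, and 1 GB is taken as 1e9 bytes.

```python
OPTIMIZER_STATES = {"sgd": 0, "momentum": 1, "adam": 2}


def training_memory_gb(num_params, bytes_per_value, optimizer="adam",
                       trainable_fraction=1.0):
    """Estimate training memory (excluding activations) in GB."""
    weights = num_params * bytes_per_value
    # Each trainable parameter stores its gradient plus the optimizer states.
    values_per_trainable = 1 + OPTIMIZER_STATES[optimizer]
    grads_and_states = (num_params * trainable_fraction
                        * values_per_trainable * bytes_per_value)
    return {
        "weights": weights / 1e9,
        "gradients_and_optimizer": grads_and_states / 1e9,
        "total": (weights + grads_and_states) / 1e9,
    }


full = training_memory_gb(7e9, 2)                               # full finetuning
partial = training_memory_gb(7e9, 2, trainable_fraction=0.01)   # 1% trainable
```

With only 1% of parameters trainable, the gradient and optimizer cost drops from 42 GB to 0.42 GB, which is the motivation for PEFT.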

Activation memory can dwarf model weight memory — use gradient checkpointing (recompute activations instead of storing) to reduce at the cost of more compute.

Numerical Representations

Format	Bits	Bytes	Notes
FP64	64	8	Rarely used in NN
FP32	32	4	Standard precision
TF32	19	~2.4	NVIDIA-designed, GPU-friendly
BF16	16	2	More range than FP16, less precision; Google-designed for TPU
FP16	16	2	More precision than BF16, less range
FP8	8	1	Emerging
INT8	8	1	Integer
INT4	4	0.5	Integer

Warning: loading Llama 2 (BF16 format) in FP16 causes quality degradation.
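
The range vs. precision trade-off between FP16 and BF16 can be checked with the standard library. A sketch under one simplifying assumption: BF16 is emulated by truncating an FP32 value to its top 16 bits, whereas real hardware typically rounds.

```python
import struct


def to_fp16(x):
    """Round-trip x through IEEE half precision; raises OverflowError if out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate BF16 by keeping only the top 16 bits of the FP32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_in_fp16(x):
    """Return True if x is within FP16's representable range."""
    try:
        to_fp16(x)
        return True
    except OverflowError:
        return False


# Range: 70000 is beyond FP16's maximum (65504) but fine for BF16, at reduced precision.
print(fits_in_fp16(70000.0), to_bf16(70000.0))
# Precision: near 1.0, FP16 keeps a small difference that BF16 drops.
print(to_fp16(1.001), to_bf16(1.001))
```

A value that is valid in BF16 may overflow in FP16, which is one way a format mismatch can hurt quality.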

Quantization

Quantization = converting to lower precision to reduce memory footprint and speed up inference.

Post-training quantization (PTQ): quantize after training; most common; PyTorch/HuggingFace support it out-of-the-box.
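
A minimal sketch of one simple PTQ scheme: symmetric absmax quantization to INT8. This is an illustrative assumption, not the specific method any framework uses; production schemes usually work per channel or per block and handle outliers more carefully.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight now needs 1 byte instead of 2 or 4, at the cost of a rounding error of at most half the scale per value.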

Training with quantization:

QAT (Quantization-Aware Training): simulates low-precision during training → better inference quality in low precision, but no speed-up during training
Mixed precision training: weights kept in higher precision; gradients/activations in lower precision; standard in modern training

1-bit LLMs: BitNet b1.58 (2024) — 1.58 bits per parameter; comparable to 16-bit Llama 2 up to 3.9B parameters.

PEFT (Parameter-Efficient Finetuning)

Full finetuning → all parameters trainable → too much memory.
PEFT → performance close to full finetuning with orders of magnitude fewer trainable parameters.

Houlsby et al. (2019): insert adapter modules into transformer blocks → performance within 0.4% of full finetuning while training only 3% as many parameters.

Two PEFT families:

Adapter-based (additive): add trainable modules to the architecture
Soft prompt-based: insert trainable continuous tokens into the input

LoRA (Low-Rank Adaptation)

Most popular PEFT technique. No extra inference latency (adapters merge back into base weights).

How it works: For weight matrix W (n×m):

Create A (n×r) and B (r×m), where r = LoRA rank (much smaller than n or m)
New weight: W’ = W + (α/r) × A×B
During finetuning: only A and B are updated; W is frozen
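
The update rule above, sketched in pure Python with tiny matrices. Assumptions for illustration: the helper names are hypothetical, and A starts at zero so that W' equals W before training (one factor is commonly zero-initialized; which one is a choice made here).

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_merge(W, A, B, alpha, r):
    """Return W' = W + (alpha / r) * (A @ B)."""
    scale = alpha / r
    AB = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, AB)]


def lora_param_count(n, m, r):
    """Trainable parameters in A (n x r) plus B (r x m)."""
    return n * r + r * m


random.seed(0)
n, m, r, alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]     # frozen
A = [[0.0] * r for _ in range(n)]                                  # trainable, zero init
B = [[random.gauss(0, 0.01) for _ in range(m)] for _ in range(r)]  # trainable
merged_at_init = lora_merge(W, A, B, alpha, r)   # identical to W before any training
```

For a hypothetical 4096×4096 matrix with r = 8, `lora_param_count` gives 65,536 trainable values against 16,777,216 in W itself.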

GPT-3 175B: LoRA with 4.7M trainable params (0.0027% of full) → comparable performance.

Why LoRA works: Pre-training implicitly compresses a model’s intrinsic dimension. Better-trained models have lower intrinsic dimensions → easier to finetune with fewer parameters. Larger models are actually easier to PEFT than smaller ones.

LoRA configurations:

Apply to Q, K, V, O matrices (attention) and/or feedforward layers
Databricks: biggest boost from feedforward layers; but in memory-constrained settings, attention matrices typically offer better bang-for-buck
Small rank (r = 4–64) usually sufficient; higher rank rarely improves performance and can overfit
Common α/r ratio: between 1:8 and 8:1

Multi-LoRA serving: Keep W separate from A, B → share one base model across 100 customers and store only 100 tiny adapters (~6 MB each) instead of 100 full model copies (~26 GB each)
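
A rough parameter count for the storage saving, using a single hypothetical 4096×4096 weight matrix and rank 8. These dimensions are assumptions for illustration; they are not the model behind the 6 MB and 26 GB figures.

```python
def full_finetune_params(n, m, num_models):
    """Parameters stored when every customer gets a full copy of W (n x m)."""
    return n * m * num_models


def multi_lora_params(n, m, r, num_adapters):
    """Parameters stored with one shared W plus one (A, B) pair per customer."""
    return n * m + num_adapters * (n * r + r * m)


full = full_finetune_params(4096, 4096, num_models=100)
shared = multi_lora_params(4096, 4096, r=8, num_adapters=100)
print(full, shared, round(full / shared, 1))
```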

QLoRA: store the frozen base weights in 4-bit (NF4 format) during finetuning and dequantize them for the forward/backward pass; only the LoRA adapters are trained. Enables finetuning a 65B model on a single 48 GB GPU. Guanaco 65B often preferred over ChatGPT in human evaluation.
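
Why 4-bit storage makes the 48 GB figure plausible, as back-of-the-envelope arithmetic. The helper is hypothetical and counts weights only; adapters, activations, and quantization metadata add to it.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


in_16_bit = weight_memory_gb(65e9, 16)   # far beyond a single 48 GB GPU
in_4_bit = weight_memory_gb(65e9, 4)     # leaves headroom on a 48 GB GPU
print(in_16_bit, in_4_bit)
```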

Soft Prompts

Hard prompts: human-readable discrete tokens
Soft prompts: trainable continuous embedding vectors prepended to input
Prefix tuning: prepend soft tokens at every transformer layer
Prompt tuning: prepend soft tokens only at the input embedding layer
Less popular than LoRA, but useful as a middle ground when prompting is not enough and full finetuning is too much
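
A shape-level sketch of prompt tuning, assuming embeddings are plain lists of floats. The function names and sizes are hypothetical. Only the soft prompt vectors would be trained; the token embeddings and the model stay frozen.

```python
def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Return the sequence the model sees: soft prompt vectors, then real token embeddings."""
    return soft_prompt + token_embeddings


def prompt_tuning_trainable_params(num_soft_tokens, embed_dim):
    """Trainable values when soft tokens are added only at the input embedding layer."""
    return num_soft_tokens * embed_dim


embed_dim = 4
soft_prompt = [[0.0] * embed_dim for _ in range(3)]        # 3 trainable vectors
token_embeddings = [[1.0] * embed_dim for _ in range(5)]   # 5 frozen token embeddings
model_input = prepend_soft_prompt(soft_prompt, token_embeddings)
```

Prefix tuning repeats this idea at every transformer layer, so its trainable parameter count grows with the number of layers.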

Model Merging

Combining multiple models into one → better performance or reduced memory footprint.

Use cases:

Combine complementary models (e.g., one model answers the first 60% of questions correctly and another the last 60%, so together they can cover more than either alone)
Multi-task finetuning without catastrophic forgetting (finetune separately then merge)
On-device deployment (merge multiple small models into one)
Federated learning (merge copies trained on different user data)
Model upscaling (create larger model from existing model)

Merging without GPU is possible (no backprop needed) → attractive for indie developers.

Three merging approaches:

1. Summing:

Linear combination (weighted average): W’ = (wA × A + wB × B) / (wA + wB)
Task arithmetic (Ilharco et al., 2022): task vector = finetuned model − base model; add task vectors to combine capabilities; subtract to remove capabilities
SLERP (spherical linear interpolation): merge along geodesic on parameter sphere; works for exactly 2 models
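
The three summing methods above, sketched on flat parameter vectors (lists of floats). This is a simplification: real merges apply the same operation tensor by tensor, and the function names are hypothetical.

```python
import math


def linear_merge(a, b, w_a=1.0, w_b=1.0):
    """Weighted average of two parameter vectors."""
    return [(w_a * x + w_b * y) / (w_a + w_b) for x, y in zip(a, b)]


def task_vector(finetuned, base):
    """Task vector = finetuned parameters minus base parameters."""
    return [f - b for f, b in zip(finetuned, base)]


def apply_task_vectors(base, vectors, scale=1.0):
    """Add task vectors to the base (use a negative scale to remove a capability)."""
    merged = list(base)
    for vec in vectors:
        merged = [p + scale * d for p, d in zip(merged, vec)]
    return merged


def slerp(a, b, t):
    """Spherical linear interpolation between two vectors, with t in [0, 1]."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if omega < 1e-8:   # nearly parallel vectors: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    coeff_a = math.sin((1 - t) * omega) / math.sin(omega)
    coeff_b = math.sin(t * omega) / math.sin(omega)
    return [coeff_a * x + coeff_b * y for x, y in zip(a, b)]


base = [1.0, 1.0]
vec_a = task_vector([1.5, 1.0], base)
vec_b = task_vector([1.0, 0.0], base)
combined = apply_task_vectors(base, [vec_a, vec_b])
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

SLERP takes exactly two inputs, which is why it merges two models at a time.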

Pruning before merging:

TIES (Yadav et al., 2023) and DARE (Yu et al., 2023): prune redundant task vector params before merging (keeping only the top ~20% may be sufficient) → improves merged model quality
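
A sketch of magnitude-based trimming of a task vector, in the spirit of the pruning step described above. It is not a full implementation of TIES or DARE; the function name and the 20% default are assumptions for illustration.

```python
def trim_task_vector(task_vec, keep_fraction=0.2):
    """Keep the largest-magnitude entries of a task vector and zero out the rest."""
    k = max(1, round(len(task_vec) * keep_fraction))
    threshold = sorted((abs(x) for x in task_vec), reverse=True)[k - 1]
    trimmed, kept = [], 0
    for x in task_vec:
        if abs(x) >= threshold and kept < k:
            trimmed.append(x)
            kept += 1
        else:
            trimmed.append(0.0)
    return trimmed


vec = [0.01, -0.9, 0.02, 0.5, -0.03, 0.0, 0.04, -0.02, 0.01, 0.03]
sparse = trim_task_vector(vec)   # keeps only -0.9 and 0.5
```

Sparser task vectors overlap less, so fewer parameters conflict when several are added to the same base.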

2. Layer stacking (frankenmerging):

Take layers from different models and stack them
Creates unique architectures; needs further finetuning
Used to create MoE models (Sparse Upcycling)
Together AI’s Mixture-of-Agents matched GPT-4o from 6 weaker open source models
SOLAR 10.7B: created from 7B model by stacking/summing layers
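
Layer stacking, sketched with layers represented as plain labels. The layer counts (a 32-layer base, 24 layers kept from each copy) are illustrative assumptions, not details taken from these notes.

```python
def stack_layers(model_a_layers, model_b_layers, take_from_a, take_from_b):
    """Build a new layer list from selected layers of two models, in order."""
    return ([model_a_layers[i] for i in take_from_a]
            + [model_b_layers[i] for i in take_from_b])


base = [f"layer_{i}" for i in range(32)]
# Two copies of the same base: drop the last 8 layers of one and the first 8 of the other.
stacked = stack_layers(base, base, range(0, 24), range(8, 32))
print(len(stacked))
```

The seam between the two copies is where the stacked model is least coherent, which is why further finetuning is needed.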

3. Concatenation: Merge LoRA adapters by concatenating (rank = r1 + r2). Not recommended (increases memory like serving two separate models).
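
Why concatenation gives rank r1 + r2: stacking the A matrices side by side and the B matrices on top of each other yields a product equal to the sum of the two adapter updates. A small check with integer matrices (all names and values are hypothetical):

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def concat_adapters(A1, B1, A2, B2):
    """Concatenate two LoRA adapters into one of rank r1 + r2."""
    A_cat = [row1 + row2 for row1, row2 in zip(A1, A2)]   # n x (r1 + r2)
    B_cat = B1 + B2                                       # (r1 + r2) x m
    return A_cat, B_cat


A1, B1 = [[1, 2], [3, 4], [5, 6]], [[1, 0], [0, 1]]   # rank 2
A2, B2 = [[1], [0], [2]], [[2, 3]]                    # rank 1
A_cat, B_cat = concat_adapters(A1, B1, A2, B2)

sum_of_updates = [[p + q for p, q in zip(r1, r2)]
                  for r1, r2 in zip(matmul(A1, B1), matmul(A2, B2))]
concat_update = matmul(A_cat, B_cat)
```

The concatenated adapter stores as many values as the two originals together, hence no memory saving.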

Finetuning Tactics

Choosing a base model:

Start with the strongest model to establish upper bound
Progression path: cheap model → medium model → strong model → map price/performance frontier
Distillation path: start with strong model + small data → generate more data → train cheaper model

Finetuning method selection:

Start with LoRA; attempt full finetuning later
Full finetuning: needs thousands to millions of examples; best performance
LoRA: works well with hundreds to thousands of examples; enables efficient multi-model serving

Key hyperparameters:

Learning rate: 1e-7 to 1e-3; common: last pre-training LR × 0.1 to 1.0; use learning rate schedules
Batch size: ≥8 for stable training; use gradient accumulation if memory-constrained
Number of epochs: 1–2 for millions of examples; 4–10 for thousands; watch training vs. validation loss for overfitting
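Prompt loss weight: how much prompt tokens contribute to the loss relative to response tokens; 100% means prompts count as much as responses, 0% means model learns only from responses; default ~10%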
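
Gradient accumulation, sketched on a one-parameter model y = w * x with mean squared error. Assumptions: the helper names are hypothetical, the loss is mean-reduced, and micro-batches are equal in size; under those conditions the averaged micro-batch gradients match the full-batch gradient.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for y = w * x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, data, micro_batch_size):
    """Average gradients over equal-size micro-batches before a single weight update."""
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    total = 0.0
    for micro_batch in micro_batches:
        total += mse_gradient(w, micro_batch) / len(micro_batches)
    return total


data = [(float(x), 3.0 * x) for x in range(1, 9)]   # 8 examples, true w = 3
full_batch = mse_gradient(1.0, data)                # needs all 8 examples in memory
accumulated = accumulated_gradient(1.0, data, micro_batch_size=2)   # 2 at a time
```

The effective batch size stays at 8 while only 2 examples need to fit in memory at once.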

Key Takeaways

Finetuning is for form (behavior, style, format); RAG is for facts (information)
Start with prompting → RAG → finetuning; don’t jump to finetuning prematurely
Memory is the primary constraint; PEFT (especially LoRA) makes finetuning accessible with minimal quality loss
LoRA works because pre-training compresses intrinsic dimensions; better base = easier PEFT
Multi-LoRA serving enables efficient per-customer customization with minimal storage overhead
QLoRA enables large model finetuning on a single GPU (65B model on one 48 GB GPU)
Model merging enables multi-task models, on-device deployment, and federated learning without retraining

Study Notes by Niladri & AI
