Chapter 8: Dataset Engineering
Data-Centric vs. Model-Centric AI
- Model-centric: improve performance by improving models (architectures, size, techniques)
- Data-centric: improve performance by improving data (quality, processing, diversity)
Trend: more benchmarks are data-centric — given the same model, improve the dataset.
Andrew Ng (2021): data-centric AI competition — improve dataset, not model.
DataComp (2023): best dataset for CLIP model wins on 38 downstream tasks.
“Three golden goals for training data: Quantity, Quality, Diversity.”
Data Curation
Data Quality
6 characteristics of high-quality data:
- Relevant: training examples match the target task (19th-century law dataset is only relevant if task is historical law)
- Aligned with task requirements: responses match what the task actually needs (factually correct, creative, concise — as required)
- Consistent: same example annotated by different people → similar annotations; clear annotation guidelines are essential
- Correctly formatted: no extraneous HTML tags, trailing whitespace, inconsistent casing; Databricks: removing extraneous Markdown/HTML improved accuracy 20%, reduced input tokens 60%
- Sufficiently unique: no duplicates that introduce bias or training/test contamination
- Compliant: no PII, no copyrighted data, follows all applicable laws/regulations
Less is more: Yi model (2024): 10K carefully crafted instructions > hundreds of thousands of noisy ones.
LIMA (2023): 1K carefully curated prompts → 65B Llama outperforms or ties GPT-4 in 43% of cases.
Data Coverage (Diversity)
Training data should cover the range of problems users will have.
Diversity dimensions (vary by application):
- Task diversity: QA, summarization, classification, translation, etc.
- Topic diversity: technology, fashion, finance, law, etc.
- Linguistic/cultural diversity for multilingual apps
- Format diversity: long/short, with/without typos, different response lengths
- Style diversity: casual, formal, Victorian English, etc.
Llama 3 insight: Performance gains primarily driven by data quality and diversity improvements, not architecture changes.
Llama 3 pre-training: math + code = ~42% of data (far more than internet proportion); high-quality code/math data boosts reasoning.
“Adding heterogeneous data can sometimes hurt performance” (Shen et al., 2024) — data diversity without quality control backfires.
Data Quantity
General rules:
- Start small (50 examples) to validate finetuning viability before scaling
- If no improvement with small data → more data won’t help either (first check hyperparameters, quality, format)
- Clear improvement with small data → more data will likely continue to help
Factors affecting data needed:
| Factor | Low data needed | High data needed |
|---|---|---|
| Finetuning technique | PEFT (LoRA) | Full finetuning |
| Task complexity | Simple (sentiment) | Complex (financial QA) |
| Base model strength | Stronger base | Weaker base |
With 100 examples: advanced models >> weaker models.
With 550K examples: all models perform similarly after finetuning.
Rule: few examples + PEFT on advanced models OR many examples + full finetuning on smaller models.
Performance gain curve: plot performance vs. dataset size; diminishing returns are typical. First 1K examples → +10%; next 1K → +5%.
Staging approach: self-supervised → less relevant supervised → relevant supervised; or synthetic → real data.
Data Acquisition and Annotation
Best data source: your own users — perfectly relevant and aligned, matches real distribution.
Data flywheel: user data → model improvement → better product → more users → more data.
Annotation challenges:
- Need clear guidelines (what makes a response good vs. just correct?)
- Guidelines are the same as evaluation guidelines — invest in evaluation first
- Many teams abandon careful annotation midway → risky shortcut
- Human labeler fatigue: annotations made in second half of session are lower quality (Kern et al., 2024)
Format considerations:
- Single-turn data: simpler, more available
- Multi-turn data: needed for tasks requiring clarification and back-and-forth
Data Augmentation and Synthesis
Augmentation: new data from existing data (flip image of cat → still cat).
Synthesis: generate data mimicking real data properties (simulate rare weather events).
Why Synthesize Data
- Increase quantity: scale data programmatically (AlphaGeometry: 100M synthetic geometry examples)
- Increase coverage: targeted data for specific behaviors; adversarial examples; rare classes; toxic content for detectors
- Increase quality: sometimes AI generates better data than humans (tool use data, complex math problems); more consistent preference ratings
- Privacy: synthetic patient records instead of real ones
- Distillation: small model imitates large model
Traditional Techniques
Rule-based:
- Templates + random generators (Faker): transactions, invoices, contracts
- Word replacement with synonyms (bias mitigation: female CEO examples)
- Perturbation (add noise): BERT trained with 1.5% random token replacement → performance boost
Simulation:
- Self-driving: CARLA, Waymo SimulationCity
- Robotics: simulate joint movements, test in virtual environment
- Sim2Real gap: simulated algorithms may not transfer perfectly to real world
- Tool use: simulate action sequences to find optimal paths
AI-Powered Synthesis
Paraphrasing/translation: MetaMath rewrote 15K examples into 400K → outperforms larger models.
Back-translation for quality verification: translate → translate back → compare to original.
Self-play: AI plays against itself (OpenAI Dota 2: 180 years of games per day).
Instruction data synthesis:
- UltraChat: asked ChatGPT to generate 30 topics → 30-50 subtopics → instructions + responses per subtopic
- Alpaca: 175 seed examples → GPT-3 generates 52K (instruction, response) pairs
- Reverse instruction: take existing high-quality content → AI generates prompts that would elicit this content → avoids AI hallucination in responses
Llama 3 coding data synthesis (2.7M examples):
- AI generates programming problems
- AI generates solutions + unit tests
- Linter/parser checks syntax
- Unit tests check correctness
- AI fixes failures
- Code translation to other languages + back-translation verification
Data Verification for Synthetic Data
- Functional correctness: run code, check test cases
- AI judges: score 1-5 or classify good/bad; factual consistency checkers
- Heuristics: filter too short/long; duplicate responses; output = input; repetitive examples
- Back-translation: quality proxy for translations
Limitations of AI-Generated Data
- Quality control: garbage in, garbage out; hard to evaluate quality automatically
- Superficial imitation (Gudibande et al., 2023): imitation models mimic style but struggle with factual accuracy and generalization; teaches student to hallucinate answers it can’t solve
- Model collapse (Shumailov et al., 2023): recursive AI training → models forget rare events → over-represent probable events; mixing real + synthetic prevents collapse
- Obscured data lineage: AI models may have been trained on copyrighted/contaminated data → downstream risks
Model Distillation
Train small “student” model to mimic large “teacher” model.
- DistilBERT: 40% smaller than BERT, 97% of language comprehension, 60% faster
- Alpaca (Llama-7B): finetuned on 52K examples from GPT-3 (175B) → behaves similarly at 4% the size
- Nemotron-4 340B (NVIDIA): finetuned on data from Mixtral-8x7B (56B params) → outperformed teacher
Rule: not all training with synthetic data = distillation; distillation implies teacher is the gold standard.
License check: many models prohibit using their outputs to train other models (OpenAI, Meta Llama licenses).
Data Processing
Steps
- Inspect: distributions of token lengths, response lengths, topics; plot by annotator/source; manual inspection (15 minutes often saves hours); fact-check responses
- Deduplicate: whole doc duplications, intra-doc duplications, cross-doc duplications
- Methods: pairwise comparison (expensive), MinHash, Bloom filter, dimensionality reduction
- Anthropic: repeating 0.1% of data 100 times → 800M model degrades to 400M performance
- Clean and filter: remove HTML tags, PII, toxic content; low-quality data; Databricks: 20% accuracy boost from removing extraneous formatting
- Format: match model’s expected chat template; if finetuning a base model, training data needn’t include few-shot examples (model learns from examples directly); shorter prompts are possible after finetuning
Processing Tips
- Order steps to save compute (filter first if cheaper than cleaning)
- Always do trial runs before full-scale processing
- Keep a copy of original data
- Prompt format at inference must match training format exactly (even trailing space matters)
Key Takeaways
- Data is the biggest differentiator as models become commodities
- Quality > quantity: 10K carefully crafted >> hundreds of thousands of noisy examples
- Diversity is equally important: LIMA-style small + high-quality data lacks robustness
- Start with a small evaluation dataset and clear annotation guidelines — these become training guidelines
- Synthetic data solves the quantity/coverage problem but requires careful verification and mixing with real data
- Model distillation uses AI-generated data strategically but requires attention to license restrictions and superficial imitation risks
- Manual data inspection is indispensable — “staring at data for 15 minutes usually gives insight that saves hours” (Greg Brockman)