Chapter 8: Dataset Engineering

Data-Centric vs. Model-Centric AI

  • Model-centric: improve performance by improving models (architectures, size, techniques)
  • Data-centric: improve performance by improving data (quality, processing, diversity)

Trend: more benchmarks are data-centric — given the same model, improve the dataset.

Andrew Ng (2021): data-centric AI competition — improve dataset, not model.
DataComp (2023): competition to build the best training dataset for a fixed CLIP model; entries are scored on 38 downstream tasks.

“Three golden goals for training data: Quantity, Quality, Diversity.”


Data Curation

Data Quality

6 characteristics of high-quality data:

  1. Relevant: training examples match the target task (19th-century law dataset is only relevant if task is historical law)
  2. Aligned with task requirements: responses match what the task actually needs (factually correct, creative, concise — as required)
  3. Consistent: same example annotated by different people → similar annotations; clear annotation guidelines are essential
  4. Correctly formatted: no extraneous HTML tags, trailing whitespace, or inconsistent casing; Databricks: removing extraneous Markdown/HTML tokens improved accuracy by 20% and reduced input token length by 60%
  5. Sufficiently unique: no duplicates that introduce bias or training/test contamination
  6. Compliant: no PII, no copyrighted data, follows all applicable laws/regulations
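
The formatting point (characteristic 4) can be sketched as a small cleaning function. This is a minimal illustration using only the Python standard library, not the Databricks pipeline; the regular expressions and the name `clean_text` are assumptions, and real HTML is better handled with a proper parser.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")      # naive HTML tag matcher (assumption: no nested "<" in attributes)
_SPACES = re.compile(r"[ \t]+")    # runs of spaces or tabs

def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, collapse whitespace, drop empty lines."""
    text = _TAG.sub(" ", raw)      # replace tags such as <p> or <br/> with a space
    text = html.unescape(text)     # "&amp;" becomes "&"
    lines = [_SPACES.sub(" ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

print(clean_text("<p>Hello   &amp; welcome </p>\n\n<br/>  Bye  "))
```

Casing is left untouched here on purpose: whether to normalize it depends on the task.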

Less is more: Yi model (2024): 10K carefully crafted instructions > hundreds of thousands of noisy ones.
LIMA (2023): a 65B Llama model finetuned on 1K carefully curated prompts and responses → answers judged equivalent or preferred to GPT-4's in 43% of cases.

Data Coverage (Diversity)

Training data should cover the range of problems users will have.

Diversity dimensions (vary by application):

  • Task diversity: QA, summarization, classification, translation, etc.
  • Topic diversity: technology, fashion, finance, law, etc.
  • Linguistic/cultural diversity for multilingual apps
  • Format diversity: long/short, with/without typos, different response lengths
  • Style diversity: casual, formal, Victorian English, etc.

Llama 3 insight: Performance gains primarily driven by data quality and diversity improvements, not architecture changes.
Llama 3 pre-training: math + code = ~42% of data (far more than internet proportion); high-quality code/math data boosts reasoning.

“Adding heterogeneous data can sometimes hurt performance” (Shen et al., 2024) — data diversity without quality control backfires.

Data Quantity

General rules:

  • Start small (50 examples) to validate finetuning viability before scaling
  • No improvement with small data → more data will rarely help; first rule out bad hyperparameters, low data quality, and formatting problems
  • Clear improvement with small data → more data will likely continue to help

Factors affecting data needed:

Factor                 Low data needed       High data needed
Finetuning technique   PEFT (LoRA)           Full finetuning
Task complexity        Simple (sentiment)    Complex (financial QA)
Base model strength    Stronger base         Weaker base

With 100 examples: advanced models >> weaker models.
With 550K examples: all models perform similarly after finetuning.

Rule: few examples + PEFT on advanced models OR many examples + full finetuning on smaller models.

Performance gain curve: plot performance vs. dataset size; diminishing returns are typical. For example, the first 1K examples might add 10 percentage points and the next 1K only 5.
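
A quick way to read such a curve is to compute the gain per 1,000 added examples between consecutive measurements. The sketch below is illustrative only: the function name `marginal_gains` and the baseline score of 60.0 are assumptions chosen to reproduce the +10 / +5 pattern.

```python
def marginal_gains(points):
    """points: (dataset_size, score) pairs sorted by size.
    Returns (size, gain per 1,000 added examples) for each consecutive pair."""
    gains = []
    for (n0, s0), (n1, s1) in zip(points, points[1:]):
        gains.append((n1, (s1 - s0) * 1000 / (n1 - n0)))
    return gains

# Hypothetical measurements: baseline 60.0, then +10 and +5 points.
curve = [(0, 60.0), (1000, 70.0), (2000, 75.0)]
print(marginal_gains(curve))  # [(1000, 10.0), (2000, 5.0)]
```

When the gain per 1,000 examples falls below the cost of collecting them, more data of the same kind is probably not worth it.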

Staging approach (when relevant data is scarce): finetune on cheaper data first, then on better data, e.g., self-supervised → less relevant supervised → relevant supervised, or synthetic → real data.

Data Acquisition and Annotation

Best data source: your own users — perfectly relevant and aligned, matches real distribution.

Data flywheel: user data → model improvement → better product → more users → more data.

Annotation challenges:

  • Need clear guidelines (what makes a response good vs. just correct?)
  • Guidelines are the same as evaluation guidelines — invest in evaluation first
  • Many teams abandon careful annotation midway → risky shortcut
  • Human labeler fatigue: annotations made in second half of session are lower quality (Kern et al., 2024)

Format considerations:

  • Single-turn data: simpler, more available
  • Multi-turn data: needed for tasks requiring clarification and back-and-forth

Data Augmentation and Synthesis

Augmentation: new data from existing data (flip image of cat → still cat).
Synthesis: generate data mimicking real data properties (simulate rare weather events).

Why Synthesize Data

  1. Increase quantity: scale data programmatically (AlphaGeometry: 100M synthetic geometry examples)
  2. Increase coverage: targeted data for specific behaviors; adversarial examples; rare classes; toxic content for detectors
  3. Increase quality: sometimes AI generates better data than humans (tool use data, complex math problems); more consistent preference ratings
  4. Privacy: synthetic patient records instead of real ones
  5. Distillation: small model imitates large model

Traditional Techniques

Rule-based:

  • Templates + random generators (Faker): transactions, invoices, contracts
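  • Word replacement: swap in synonyms, or swap gendered words to mitigate bias (e.g., add examples that describe a female CEO)
  • Perturbation (add noise): in BERT training, 10% of the 15% of tokens selected for prediction (1.5% of all tokens) were replaced with random tokens → performance boost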
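
The template and perturbation ideas above can be sketched with the standard library alone. This is a hedged illustration: `random` stands in for a generator such as Faker, and the invoice template, customer names, and function names are all hypothetical.

```python
import random

def make_invoice(rng):
    """Fill a fixed template with randomly generated fields."""
    customer = rng.choice(["Acme Corp", "Globex", "Initech"])  # hypothetical names
    amount = rng.randint(100, 9999)
    day = rng.randint(1, 28)
    return f"Invoice for {customer}: ${amount}.00, due 2024-06-{day:02d}"

def perturb(tokens, vocab, rate, rng):
    """Replace each token with a random vocabulary token with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else tok for tok in tokens]

rng = random.Random(0)  # seeded so runs are repeatable
print(make_invoice(rng))
print(perturb("the cat sat on the mat".split(), ["dog", "ran", "hat"], 0.015, rng))
```

At a rate of 0.015 most short sentences come back unchanged; the noise only shows up across a large corpus.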

Simulation:

  • Self-driving: CARLA, Waymo SimulationCity
  • Robotics: simulate joint movements, test in virtual environment
  • Sim2Real gap: behavior learned in simulation may not transfer perfectly to the real world
  • Tool use: simulate action sequences to find optimal paths

AI-Powered Synthesis

Paraphrasing/translation: MetaMath rewrote ~15K math examples (from MATH and GSM-8K) into almost 400K; models trained on it outperformed larger models.

Back-translation for quality verification: translate → translate back → compare to original.
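
A minimal sketch of the round-trip check follows. The two translator callables are placeholders for whatever translation model is in use, and the toy word-for-word dictionaries exist only so the example runs. Character-level similarity from `difflib` is a crude stand-in for a semantic comparison, and the 0.9 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def round_trip_score(text, translate, back_translate):
    """Translate, translate back, and return (translation, similarity in [0, 1])."""
    translated = translate(text)
    restored = back_translate(translated)
    score = SequenceMatcher(None, text.lower(), restored.lower()).ratio()
    return translated, score

# Toy word-for-word "translators", purely illustrative.
EN_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}
FR_EN = {fr: en for en, fr in EN_FR.items()}

def to_fr(s):
    return " ".join(EN_FR.get(w, w) for w in s.split())

def to_en(s):
    return " ".join(FR_EN.get(w, w) for w in s.split())

def keep_translation(text, threshold=0.9):
    """Return the translation if the round trip stays close to the original, else None."""
    translated, score = round_trip_score(text, to_fr, to_en)
    return translated if score >= threshold else None

print(keep_translation("the cat sleeps"))
```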

Self-play: AI plays against itself (OpenAI Dota 2: 180 years of games per day).

Instruction data synthesis:

  • UltraChat: asked ChatGPT to generate 30 topics → 30-50 subtopics → instructions + responses per subtopic
  • Alpaca: 175 seed examples → GPT-3 generates 52K (instruction, response) pairs
  • Reverse instruction: take existing high-quality content → AI generates prompts that would elicit this content → avoids AI hallucination in responses
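
The topic → subtopic → instruction expansion can be sketched as nested loops around a model call. This is not UltraChat's actual code: `ask(prompt, n)` is a hypothetical callable that returns `n` strings from some language model, the prompt wording is invented, and `fake_ask` is a stub so the sketch runs offline.

```python
def synthesize_instructions(ask, n_topics, n_subtopics, n_per_subtopic):
    """Expand topics into subtopics, then into (instruction, response) rows."""
    rows = []
    for topic in ask("List topics about everyday life.", n_topics):
        for sub in ask(f"List subtopics of: {topic}", n_subtopics):
            for instruction in ask(f"Write a user instruction about: {sub}", n_per_subtopic):
                response = ask(instruction, 1)[0]   # the model answers its own instruction
                rows.append({"topic": topic, "subtopic": sub,
                             "instruction": instruction, "response": response})
    return rows

def fake_ask(prompt, n):
    """Stub standing in for a real model call (assumption for illustration)."""
    return [f"{prompt} #{i}" for i in range(n)]

rows = synthesize_instructions(fake_ask, n_topics=2, n_subtopics=3, n_per_subtopic=2)
print(len(rows))  # 12
```

For reverse instruction the direction flips: the response is an existing document, and the model is asked only for the instruction that would elicit it.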

Llama 3 coding data synthesis (2.7M examples):

  1. AI generates programming problems
  2. AI generates solutions + unit tests
  3. Linter/parser checks syntax
  4. Unit tests check correctness
  5. AI fixes failures
  6. Code translation to other languages + back-translation verification
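
Steps 3 to 5 form a verify-and-repair loop, sketched below for Python snippets. This is an illustration under assumptions, not Llama 3's implementation: `repair` is a hypothetical callable standing in for the model that fixes failures, and `exec` runs untrusted code, so a real pipeline would need a sandbox.

```python
import ast

def passes_syntax(code):
    """Step 3: parser check."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def passes_tests(code, tests):
    """Step 4: run the solution, then its unit tests, in a fresh namespace."""
    namespace = {}
    try:
        exec(code, namespace)    # WARNING: unsandboxed; for illustration only
        exec(tests, namespace)
        return True
    except Exception:
        return False

def verify_with_repair(code, tests, repair, max_rounds=2):
    """Step 5: return code that passes, asking `repair` for a fix on failure."""
    for attempt in range(max_rounds + 1):
        if passes_syntax(code) and passes_tests(code, tests):
            return code
        if attempt < max_rounds:
            code = repair(code)
    return None                  # discard examples that never pass

broken = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\n"
fixed = verify_with_repair(broken, tests, lambda c: c.replace("a - b", "a + b"))
print(fixed)
```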

Data Verification for Synthetic Data

  • Functional correctness: run code, check test cases
  • AI judges: score 1-5 or classify good/bad; factual consistency checkers
  • Heuristics: filter too short/long; duplicate responses; output = input; repetitive examples
  • Back-translation: quality proxy for translations
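
The heuristic filters are cheap to implement. The sketch below is one possible version; the function name and every threshold (word limits, repetition ratio) are arbitrary assumptions to tune per dataset.

```python
def keep_example(prompt, response, seen, min_words=3, max_words=500, max_repeat=0.5):
    """Return True if the example survives simple heuristic filters.
    `seen` is a set of normalized responses kept so far (updated in place)."""
    words = response.split()
    if not (min_words <= len(words) <= max_words):
        return False                                  # too short or too long
    normalized = " ".join(words).lower()
    if normalized == " ".join(prompt.split()).lower():
        return False                                  # output merely echoes the input
    if normalized in seen:
        return False                                  # duplicate response
    most_common = max(words.count(w) for w in set(words))
    if most_common / len(words) > max_repeat:
        return False                                  # one word dominates: repetitive
    seen.add(normalized)
    return True

seen = set()
print(keep_example("What is 2+2?", "The answer is four.", seen))  # True
print(keep_example("What is 2+2?", "The answer is four.", seen))  # False (duplicate)
```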

Limitations of AI-Generated Data

  1. Quality control: garbage in, garbage out; hard to evaluate quality automatically
  2. Superficial imitation (Gudibande et al., 2023): imitation models mimic the teacher's style but struggle with factual accuracy and generalization; training on answers beyond the student's ability can teach it to hallucinate
  3. Model collapse (Shumailov et al., 2023): recursive training on AI-generated data → models forget rare events and over-represent probable ones; mixing real with synthetic data can mitigate collapse
  4. Obscured data lineage: AI models may have been trained on copyrighted/contaminated data → downstream risks

Model Distillation

Train small “student” model to mimic large “teacher” model.

  • DistilBERT: 40% smaller than BERT, 97% of language comprehension, 60% faster
  • Alpaca (Llama-7B): finetuned on 52K examples from GPT-3 (175B) → behaves similarly at 4% the size
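  • Nemotron-4 340B (NVIDIA): finetuned on data from Mixtral-8x7B (a mixture-of-experts model, nominally 8×7B = 56B but roughly 47B total parameters) → the student outperformed its much smaller teacher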

Rule: not all training on synthetic data is distillation; distillation implies the teacher's outputs are treated as the gold standard.

License check: many model licenses restrict using outputs to train other models (e.g., OpenAI's terms, some Meta Llama licenses); check the current terms before distilling.


Data Processing

Steps

  1. Inspect: distributions of token lengths, response lengths, topics; plot by annotator/source; manual inspection (15 minutes often saves hours); fact-check responses
  2. Deduplicate: whole doc duplications, intra-doc duplications, cross-doc duplications
    • Methods: pairwise comparison (expensive), MinHash, Bloom filter, dimensionality reduction
    • Anthropic: repeating 0.1% of data 100 times → 800M model degrades to 400M performance
  3. Clean and filter: remove HTML tags, PII, toxic content; low-quality data; Databricks: 20% accuracy boost from removing extraneous formatting
  4. Format: match the model's expected chat template; when finetuning, training prompts generally needn't include task descriptions or few-shot examples (the model learns the task from the training examples themselves), so prompts can be shorter after finetuning
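
For the deduplication step, MinHash estimates how similar two documents are without comparing their full text. The sketch below is a minimal standard-library version under stated assumptions: word 3-grams as shingles, 64 seeded SHA-1 hashes, and a 0.8 threshold, all of which are illustrative choices. It still compares each signature against every kept one; large-scale pipelines typically add locality-sensitive hashing to avoid that.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word sequences."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(texts, threshold=0.8):
    """Keep a text only if it is not too similar to any text already kept."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept

docs = ["a b c d e", "a b c d e", "totally different sentence here now"]
print(len(deduplicate(docs)))  # 2
```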

Processing Tips

  • Order steps to save compute (filter first if cheaper than cleaning)
  • Always do trial runs before full-scale processing
  • Keep a copy of original data
  • Prompt format at inference must match training format exactly (even trailing space matters)
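
One way to keep training and inference formats identical is to build both from a single function. The template below is a hypothetical Alpaca-style layout and the `</s>` end-of-sequence marker is an assumption; use whatever your model's chat template and tokenizer actually specify.

```python
# Hypothetical template; substitute the model's real chat template.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"

def build_prompt(instruction):
    """Prompt used at inference time."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())

def build_training_text(instruction, response, eos="</s>"):
    """Training example: the exact inference prompt, then the response and EOS."""
    return build_prompt(instruction) + response.strip() + eos

example = build_training_text("Summarize: the cat sat. ", "A cat sat.")
assert example.startswith(build_prompt("Summarize: the cat sat."))
print(repr(example))
```

Because the training text is built on top of `build_prompt`, a stray trailing space cannot creep into one path and not the other.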

Key Takeaways

  • Data is the biggest differentiator as models become commodities
  • Quality > quantity: 10K carefully crafted >> hundreds of thousands of noisy examples
  • Diversity matters as much as quality: models finetuned on small, high-quality, LIMA-style datasets can lack robustness
  • Start with a small evaluation dataset and clear annotation guidelines — these become training guidelines
  • Synthetic data solves the quantity/coverage problem but requires careful verification and mixing with real data
  • Model distillation uses AI-generated data strategically but requires attention to license restrictions and superficial imitation risks
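  • Manual data inspection is indispensable: 15 minutes of looking at the data usually gives insight that saves hours. Greg Brockman: “Manual inspection of data has probably the highest value-to-prestige ratio of any activity in machine learning.”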