Chapter 8: Dataset Engineering

Data-Centric vs. Model-Centric AI

Model-centric: improve performance by improving models (architectures, size, techniques)
Data-centric: improve performance by improving data (quality, processing, diversity)

Trend: more benchmarks are data-centric — given the same model, improve the dataset.

Andrew Ng (2021): data-centric AI competition — improve dataset, not model.
DataComp (2023): best dataset for CLIP model wins on 38 downstream tasks.

“Three golden goals for training data: Quantity, Quality, Diversity.”

Data Curation

Data Quality

6 characteristics of high-quality data:

Relevant: training examples match the target task (19th-century law dataset is only relevant if task is historical law)
Aligned with task requirements: responses match what the task actually needs (factually correct, creative, concise — as required)
Consistent: same example annotated by different people → similar annotations; clear annotation guidelines are essential
Correctly formatted: no extraneous HTML tags, trailing whitespace, inconsistent casing; Databricks: removing extraneous Markdown/HTML improved accuracy 20%, reduced input tokens 60%
Sufficiently unique: no duplicates that introduce bias or training/test contamination
Compliant: no PII, no copyrighted data, follows all applicable laws/regulations
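
A minimal sketch of the "correctly formatted" check, assuming plain-string examples. The regexes and the name clean_text are illustrative assumptions, not the Databricks pipeline; a real cleaner would likely use a proper HTML parser.

```python
import html
import re

# Naive patterns (assumptions for illustration only).
TAG_RE = re.compile(r"<[^>]+>")                          # anything shaped like an HTML tag
MD_RE = re.compile(r"\*\*|__|`|^#+[ \t]*", re.MULTILINE)  # a few common Markdown markers


def clean_text(text: str) -> str:
    """Strip simple HTML/Markdown residue and normalize whitespace."""
    text = TAG_RE.sub(" ", text)      # drop tags, keep their inner text
    text = html.unescape(text)        # turn entities such as &amp; into characters
    text = MD_RE.sub("", text)        # drop bold/code/heading markers
    # Collapse whitespace runs inside each line and drop empty lines.
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


print(clean_text("<p>Hello <b>world</b> &amp; all</p>  "))
```

Casing is left untouched here on purpose: whether to normalize it depends on the task.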

Less is more: Yi model (2024): 10K carefully crafted instructions > hundreds of thousands of noisy ones.
LIMA (2023): 1K carefully curated prompts → 65B Llama outperforms or ties GPT-4 in 43% of cases.

Data Coverage (Diversity)

Training data should cover the range of problems users will have.

Diversity dimensions (vary by application):

Task diversity: QA, summarization, classification, translation, etc.
Topic diversity: technology, fashion, finance, law, etc.
Linguistic/cultural diversity for multilingual apps
Format diversity: long/short, with/without typos, different response lengths
Style diversity: casual, formal, Victorian English, etc.

Llama 3 insight: Performance gains primarily driven by data quality and diversity improvements, not architecture changes.
Llama 3 pre-training: math + code = ~42% of data (far more than internet proportion); high-quality code/math data boosts reasoning.

“Adding heterogeneous data can sometimes hurt performance” (Shen et al., 2024) — data diversity without quality control backfires.

Data Quantity

General rules:

Start small (50 examples) to validate finetuning viability before scaling
If no improvement with small data → more data will rarely help (first rule out bad hyperparameters, low data quality, and formatting errors)
Clear improvement with small data → more data will likely continue to help

Factors affecting data needed:

Factor                Low data needed      High data needed
Finetuning technique  PEFT (LoRA)          Full finetuning
Task complexity       Simple (sentiment)   Complex (financial QA)
Base model strength   Stronger base        Weaker base

With 100 examples: advanced models >> weaker models.
With 550K examples: all models perform similarly after finetuning.

Rule: few examples + PEFT on advanced models OR many examples + full finetuning on smaller models.

Performance gain curve: plot performance vs. dataset size; diminishing returns are typical. For example, the first 1K examples might add 10 percentage points while the next 1K add only 5.
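
A small sketch of reading a gain curve numerically: given (dataset size, score) measurements, compute the gain per 1,000 added examples. The function name and the baseline score of 60 are assumptions for illustration; the +10 and +5 steps mirror the illustrative figures above.

```python
def marginal_gains(points):
    """Return [((size_from, size_to), gain per 1,000 added examples), ...]."""
    pts = sorted(points)
    gains = []
    for (n0, s0), (n1, s1) in zip(pts, pts[1:]):
        gains.append(((n0, n1), (s1 - s0) * 1000 / (n1 - n0)))
    return gains


# Hypothetical measurements: (number of training examples, accuracy in %).
curve = [(0, 60), (1000, 70), (2000, 75)]
for span, gain in marginal_gains(curve):
    print(span, gain)
```

If the per-1K gain keeps shrinking, the cost of collecting more data may no longer be worth it.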

Staging approach: self-supervised → less relevant supervised → relevant supervised; or synthetic → real data.

Data Acquisition and Annotation

Best data source: your own users — perfectly relevant and aligned, matches real distribution.

Data flywheel: user data → model improvement → better product → more users → more data.

Annotation challenges:

Need clear guidelines (what makes a response good vs. just correct?)
Annotation guidelines can double as evaluation guidelines, so invest in evaluation first
Many teams abandon careful annotation midway → risky shortcut
Human labeler fatigue: annotations made in second half of session are lower quality (Kern et al., 2024)
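
One way to check whether guidelines are clear enough is to measure agreement between two annotators on the same examples. The sketch below uses Cohen's kappa, which is my choice of metric and not something these notes prescribe; the labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # both annotators always used the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
print(cohens_kappa(annotator_1, annotator_2))
```

A low score suggests the guidelines leave too much room for interpretation.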

Format considerations:

Single-turn data: simpler, more available
Multi-turn data: needed for tasks requiring clarification and back-and-forth

Data Augmentation and Synthesis

Augmentation: new data from existing data (flip image of cat → still cat).
Synthesis: generate data mimicking real data properties (simulate rare weather events).

Why Synthesize Data

Increase quantity: scale data programmatically (AlphaGeometry: 100M synthetic geometry examples)
Increase coverage: targeted data for specific behaviors; adversarial examples; rare classes; toxic content for detectors
Increase quality: sometimes AI generates better data than humans (tool use data, complex math problems); more consistent preference ratings
Privacy: synthetic patient records instead of real ones
Distillation: small model imitates large model

Traditional Techniques

Rule-based:

Templates + random generators (Faker): transactions, invoices, contracts
Word replacement: swap words for synonyms, or swap gendered terms to mitigate bias (e.g., add female CEO examples)
Perturbation (add noise): BERT trained with 1.5% random token replacement → performance boost
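
A minimal sketch of two of the rule-based techniques above: template filling with random generators, and token perturbation. It uses only the standard library so it runs anywhere; Faker could supply richer fields. The template, customer names, and function names are hypothetical.

```python
import random

TEMPLATE = "Invoice {invoice_id}: {customer} paid ${amount:.2f} on 2024-{month:02d}-{day:02d}."
CUSTOMERS = ["Acme Corp", "Globex", "Initech"]  # made-up names


def synth_invoice(rng):
    """Fill the template with randomly generated field values."""
    return TEMPLATE.format(
        invoice_id=rng.randint(10000, 99999),
        customer=rng.choice(CUSTOMERS),
        amount=rng.uniform(5, 5000),
        month=rng.randint(1, 12),
        day=rng.randint(1, 28),
    )


def perturb_tokens(tokens, vocab, rate, rng):
    """Replace each token with a random vocab token with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]


rng = random.Random(0)  # seeded so runs are reproducible
print(synth_invoice(rng))
print(perturb_tokens("the cat sat on the mat".split(), ["dog", "ran"], 0.015, rng))
```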

Simulation:

Self-driving: CARLA, Waymo SimulationCity
Robotics: simulate joint movements, test in virtual environment
Sim2Real gap: behavior learned in simulation may not transfer perfectly to the real world
Tool use: simulate action sequences to find optimal paths

AI-Powered Synthesis

Paraphrasing/translation: MetaMath rewrote 15K examples into 400K → outperforms larger models.

Back-translation for quality verification: translate → translate back → compare to original.
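
A sketch of the round-trip check, assuming the two translators are passed in as callables. The toy dictionary translators and the use of difflib string similarity are stand-ins for illustration; a real setup would call translation models and probably compare meaning with a stronger metric.

```python
from difflib import SequenceMatcher


def back_translation_score(original, translate, back_translate):
    """Translate, translate back, and return similarity to the original (0 to 1)."""
    forward = translate(original)
    round_trip = back_translate(forward)
    return SequenceMatcher(None, original.lower(), round_trip.lower()).ratio()


# Toy word-by-word translators (hypothetical, for demonstration only).
EN_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}
FR_EN = {fr: en for en, fr in EN_FR.items()}


def toy_translate(text):
    return " ".join(EN_FR.get(w, w) for w in text.split())


def toy_back_translate(text):
    return " ".join(FR_EN.get(w, w) for w in text.split())


print(back_translation_score("the cat sleeps", toy_translate, toy_back_translate))
```

A low score flags a translation for review; a high score is only a proxy, not proof of quality.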

Self-play: AI plays against itself (OpenAI Dota 2: 180 years of games per day).

Instruction data synthesis:

UltraChat: asked ChatGPT to generate 30 topics → 30-50 subtopics → instructions + responses per subtopic
Alpaca: 175 seed examples → GPT-3 generates 52K (instruction, response) pairs
Reverse instruction: take existing high-quality content → AI generates prompts that would elicit this content → avoids AI hallucination in responses

Llama 3 coding data synthesis (2.7M examples):

AI generates programming problems
AI generates solutions + unit tests
Linter/parser checks syntax
Unit tests check correctness
AI fixes failures
Code translation to other languages + back-translation verification
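
The verify-and-repair part of the steps above can be sketched as follows. This is not Llama 3's actual pipeline: the parser check uses Python's ast module, the tests run in-process with exec, and fix is a hypothetical callable standing in for the AI repair step. Generated code should run in a sandbox in practice.

```python
import ast


def passes_syntax(code):
    """Parser check: does the code parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def passes_tests(solution, tests):
    """Run the solution, then the assert-based tests, in one namespace."""
    namespace = {}
    try:
        exec(solution, namespace)  # unsafe for untrusted code; sandbox in practice
        exec(tests, namespace)
        return True
    except Exception:
        return False


def verify_with_repair(solution, tests, fix, max_attempts=2):
    """Return a solution that passes both checks, or None if repair fails."""
    for attempt in range(max_attempts + 1):
        if passes_syntax(solution) and passes_tests(solution, tests):
            return solution
        if attempt < max_attempts:
            solution = fix(solution, tests)  # stand-in for "AI fixes failures"
    return None


buggy = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\n"
toy_fix = lambda code, _tests: code.replace("a - b", "a + b")
print(verify_with_repair(buggy, tests, toy_fix))
```

Examples that still fail after the repair budget are dropped instead of kept with unknown correctness.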

Data Verification for Synthetic Data

Functional correctness: run code, check test cases
AI judges: score 1-5 or classify good/bad; factual consistency checkers
Heuristics: filter too short/long; duplicate responses; output = input; repetitive examples
Back-translation: quality proxy for translations
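
The heuristic filters above can be sketched in a few lines. The thresholds and function names are assumptions to tune per dataset, not recommended values.

```python
def keep_example(prompt, response, seen, min_len=10, max_len=2000, max_repeat_ratio=0.5):
    """Return True if the (prompt, response) pair passes all heuristics."""
    text = response.strip()
    if not (min_len <= len(text) <= max_len):   # too short or too long (characters)
        return False
    if text.lower() == prompt.strip().lower():  # output merely copies the input
        return False
    words = text.lower().split()
    # Share of words that repeat an earlier word in the same response.
    if words and 1 - len(set(words)) / len(words) > max_repeat_ratio:
        return False
    if text.lower() in seen:                    # duplicate response
        return False
    seen.add(text.lower())
    return True


def filter_examples(pairs, **thresholds):
    seen = set()
    return [(p, r) for p, r in pairs if keep_example(p, r, seen, **thresholds)]


pairs = [
    ("What is 2+2?", "The answer is 4, since 2 plus 2 equals 4."),
    ("What is two plus two?", "The answer is 4, since 2 plus 2 equals 4."),  # duplicate response
    ("Explain.", "ok"),                                                      # too short
    ("Repeat this sentence please.", "Repeat this sentence please."),        # output = input
    ("Say hi", "hi hi hi hi hi hi hi hi"),                                   # repetitive
]
print(filter_examples(pairs))
```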

Limitations of AI-Generated Data

Quality control: garbage in, garbage out; hard to evaluate quality automatically
Superficial imitation (Gudibande et al., 2023): imitation models mimic the teacher’s style but struggle with factual accuracy and generalization; training a student on answers beyond its ability teaches it to hallucinate
Model collapse (Shumailov et al., 2023): recursively training on AI-generated data → models over-represent probable events and forget rare ones; mixing real + synthetic data helps avoid collapse
Obscured data lineage: AI models may have been trained on copyrighted/contaminated data → downstream risks

Model Distillation

Train small “student” model to mimic large “teacher” model.

DistilBERT: 40% smaller than BERT, 97% of language comprehension, 60% faster
Alpaca (Llama-7B): finetuned on 52K examples from GPT-3 (175B) → behaves similarly at 4% the size
Nemotron-4 340B (NVIDIA): finetuned on data from Mixtral-8x7B (about 47B total params) → outperformed teacher

Rule: not all training with synthetic data = distillation; distillation implies teacher is the gold standard.

License check: many models prohibit using their outputs to train other models (OpenAI, Meta Llama licenses).

Data Processing

Steps

Inspect: distributions of token lengths, response lengths, topics; plot by annotator/source; manual inspection (15 minutes often saves hours); fact-check responses
Deduplicate: whole doc duplications, intra-doc duplications, cross-doc duplications
- Methods: pairwise comparison (expensive), MinHash, Bloom filter, dimensionality reduction
- Anthropic: repeating 0.1% of data 100 times → 800M model degrades to 400M performance
Clean and filter: remove HTML tags, PII, toxic content; low-quality data; Databricks: 20% accuracy boost from removing extraneous formatting
Format: match the model’s expected chat template; when finetuning, training prompts needn’t include few-shot examples (the model learns from the training examples directly), so shorter prompts are possible after finetuning
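
A minimal MinHash sketch for the near-duplicate detection mentioned under Deduplicate. It is an illustration under assumptions: word 3-gram shingles, 64 seeded SHA-1 hashes, and a pairwise scan over kept documents. Large-scale systems typically add locality-sensitive hashing to avoid that scan.

```python
import hashlib


def shingles(text, k=3):
    """Set of lowercase word k-grams; short texts become a single shingle."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=64, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sh = shingles(text, k)
    return [min(_hash(seed, s) for s in sh) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedupe(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any already kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = [
    "The quick brown fox jumps",
    "the  quick brown FOX jumps",  # same text after normalization
    "completely different text about data pipelines",
]
print(dedupe(docs))
```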

Processing Tips

Order steps to save compute (filter first if cheaper than cleaning)
Always do trial runs before full-scale processing
Keep a copy of original data
Prompt format at inference must match training format exactly (even trailing space matters)
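
One way to keep training and inference formats identical is to route both through a single formatting function. The template below is a hypothetical instruction-style layout, not any specific model's chat template; use the template your model expects.

```python
def format_prompt(instruction, user_input=""):
    """Build the prompt text; used for BOTH training data and inference."""
    parts = ["### Instruction:", instruction.strip()]
    if user_input.strip():
        parts += ["### Input:", user_input.strip()]
    parts.append("### Response:")
    return "\n".join(parts) + "\n"  # exactly one trailing newline, no trailing space


def make_training_text(instruction, response, user_input=""):
    """Training example = the inference prompt followed by the target response."""
    return format_prompt(instruction, user_input) + response.strip()


prompt = format_prompt("Summarize the text.", "Data quality matters.")
example = make_training_text("Summarize the text.", "Quality matters.", "Data quality matters.")
print(example.startswith(prompt))
```

Because the training text is built from the same function, stray whitespace cannot creep in on one side only.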

Key Takeaways

Data is the biggest differentiator as models become commodities
Quality > quantity: 10K carefully crafted >> hundreds of thousands of noisy examples
Diversity is equally important: LIMA-style small + high-quality data lacks robustness
Start with a small evaluation dataset and clear annotation guidelines — these become training guidelines
Synthetic data solves the quantity/coverage problem but requires careful verification and mixing with real data
Model distillation uses AI-generated data strategically but requires attention to license restrictions and superficial imitation risks
Manual data inspection is indispensable: staring at data for 15 minutes usually yields insight that saves hours

Study Notes by Niladri & AI
