Chapter 5: Design a Metrics Monitoring and Alerting System
volume2 metrics monitoring alerting time-series
Status: π© Interview ready
Difficulty: Hard
Time to complete: 50 min read + practice
Overview
A metrics monitoring and alerting system collects, stores, queries, and visualizes operational data from infrastructure and applications, then fires alerts when anomalies occur.
Real-world examples: Datadog, Prometheus + Grafana, AWS CloudWatch, New Relic
Why this matters:
- Core infrastructure question (appears at FAANG, Stripe, Datadog)
- Touches time-series databases, streaming ingestion, and distributed systems
- Tests knowledge of write-heavy workloads, data compression, and storage tiering
Problem Statement
Design a metrics monitoring system that:
- Collects infrastructure and application metrics at 100M DAU scale
- Stores metrics with 1-year retention
- Supports PromQL-style queries (groupby, avg, sum, rate)
- Fires alerts via Email, PagerDuty, Slack when thresholds are exceeded
- Renders Grafana-like dashboards for visualization
Step 1: Requirements & Scope (5 min)
Functional Requirements
Clarifying questions:
- What types of metrics? β CPU, memory, disk, custom app metrics, business metrics
- What scale? β 1000 servers Γ 10 metrics each = 10,000 writes/sec
- Query style? β PromQL-style: avg/sum/rate over time ranges, groupby labels
- Alert channels? β Email, PagerDuty, Slack
- Retention? β 1 year
Scope:
- Ingest metrics at high write throughput
- Store with downsampling over time
- Query with label-based filtering and aggregations
- Alert when metric value crosses threshold
- Visualize on dashboards
Non-Functional Requirements
- High write throughput: ~10K writes/sec sustained
- High availability: Monitoring must stay up even when services are down
- Durability: No metric loss; 1-year retention
- Query performance: Dashboard queries return in < 1 second
- Scalability: Scale from 10K to 1M writes/sec horizontally
- Low latency alerts: Alert fires within 1β2 minutes of threshold breach
Scale Estimates
Servers: 1,000
Metrics/server: 10
Write rate: 10,000 metrics/sec
Metric size: ~200 bytes (name + labels + timestamp + value)
Raw data/day: 10,000 Γ 200 B Γ 86,400 sec = ~173 GB/day
Raw data/year: ~63 TB/year (before compression/downsampling)
After 12x compression + downsampling: ~5β7 TB/year
Step 2: High-Level Design (10 min)
Core Components
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Metrics Monitoring System β
β β
β ββββββββββββββββ βββββββββββββββββ ββββββββββββββββββββ β
β β Metrics β β Collection β β Ingestion β β
β β Sources βββββΆβ Layer βββββΆβ Pipeline β β
β β β β (Pull/Push) β β (Kafka buffer) β β
β ββββββββββββββββ βββββββββββββββββ ββββββββββ¬ββββββββββ β
β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββΌββββββββββ β
β β Time-Series Database (TSDB) β β
β β InfluxDB / Prometheus TSDB / OpenTSDB β β
β ββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββ β
β β β
β ββββββββββββββββββββββΌβββββββββββββββββββββ β
β βΌ βΌ βΌ β
β βββββββββββββββ ββββββββββββββββ βββββββββββββββββββββ β
β β Query API β β Alert Managerβ β Visualization β β
β β (PromQL) β β (Rules eval) β β (Grafana-like) β β
β βββββββββββββββ ββββββββββββββββ βββββββββββββββββββββ β
β β β
β ββββββββΌβββββββ β
β βNotification β β
β β Channels β β
β βEmail/PD/Slackβ β
β βββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Data Collection: Pull vs Push
Pull model (Prometheus):
Prometheus Server ββββ scrapes βββββΆ /metrics endpoint on each host
CPU: 0.72
mem_used_bytes: 2147483648
disk_free_bytes: 107374182400
Push model (Datadog, StatsD):
Host Agent ββββββ pushes βββββββΆ Collector / Aggregator βββΆ TSDB
metric + labels
+ timestamp + value
Metrics Data Model
Every metric is a tuple:
(metric_name, {labels}, timestamp, value)
Example:
metric_name: "cpu_usage_percent"
labels: {host="server-01", region="us-east-1", env="prod"}
timestamp: 1712995200 (Unix epoch, seconds)
value: 72.4
Wire format (line protocol - InfluxDB style):
cpu_usage_percent,host=server-01,region=us-east-1,env=prod value=72.4 1712995200
API Design
Write API:
POST /v1/metrics
Content-Type: application/json
{
"metric_name": "cpu_usage_percent",
"labels": {"host": "server-01", "region": "us-east-1"},
"timestamp": 1712995200,
"value": 72.4
}
Response: 204 No Content
Query API:
GET /v1/query_range
?query=avg(cpu_usage_percent{region="us-east-1"}) by (host)
&start=1712908800
&end=1712995200
&step=60
Response:
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {"host": "server-01"},
"values": [[1712908800, "68.2"], [1712908860, "71.1"], ...]
}
]
}
}
Step 3: Deep Dive (25 min)
Pull vs Push Model β Detailed Comparison
Pull model (Prometheus-style):
Prometheus server
β
βββ every 15s β GET http://server-01:9090/metrics
βββ every 15s β GET http://server-02:9090/metrics
βββ every 15s β GET http://server-03:9090/metrics
| Aspect | Pull | Push |
|---|---|---|
| Who initiates | Collector scrapes target | Agent pushes to collector |
| Health check | Easy (no scrape = target down) | Hard (silent failure) |
| Firewall | Collector needs access to targets | Works behind NAT/firewall |
| Scalability | Collector becomes bottleneck | Scales with agents |
| Data loss on restart | Rescrapes | Can lose buffered data |
| Examples | Prometheus, Nagios | Datadog, StatsD, CloudWatch |
Recommendation: Use hybrid β agents push to Kafka, Prometheus-style query layer reads from TSDB.
Hybrid ingestion pipeline:
Host Agent ββpushβββΆ Kafka Topic ββconsumeβββΆ TSDB Writer βββΆ TSDB
metric "metrics-raw" (Flink or (InfluxDB/
events partitioned consumer OpenTSDB)
by metric_name group)
Why Kafka as buffer?:
- Decouples producers (agents) from consumers (TSDB writers)
- Absorbs traffic spikes (TSDB writes can be batched)
- Replay capability if TSDB is temporarily unavailable
- Multiple consumers: TSDB writer + alerting consumer + archiving
Time-Series Database Design
Why not a relational database?
| Concern | Relational DB | Time-Series DB |
|---|---|---|
| Write throughput | ~10K rows/sec per instance | Millions of points/sec |
| Time-range queries | Full table scan or index scan | Optimized for range reads |
| Compression | Row-based, poor for floats | Column-based + delta encoding |
| Retention/TTL | Manual partitioning | Built-in downsampling & TTL |
| Schema | Fixed schema | Dynamic label key-value pairs |
TSDB storage layout (columnar):
Chunk for metric "cpu_usage_percent" / host="server-01":
Timestamps (delta-encoded):
[1712908800, +60, +60, +60, +60, ...] β store base + deltas (small ints)
Values (XOR / Gorilla encoding):
[72.4, 72.6, 71.9, 73.1, ...] β XOR consecutive floats β many zeros β compress
Block structure:
ββββββββββββββββββββββββββββββββββββββββββββ
β Block header (time range, metric names) β
ββββββββββββββββββββββββββββββββββββββββββββ€
β Index: label β chunk offset β
ββββββββββββββββββββββββββββββββββββββββββββ€
β Chunk 1: timestamps + values (Gorilla) β
β Chunk 2: timestamps + values (Gorilla) β
β ... β
ββββββββββββββββββββββββββββββββββββββββββββ
Gorilla compression (Facebook, 2015):
Step 1 β Delta-of-delta encoding for timestamps:
Raw: [100, 160, 220, 280]
Delta: [60, 60, 60]
DOD: [0, 0, 0] β All zeros! 1-2 bits to store each
Step 2 β XOR encoding for float values:
72.4 β 0 10000000101 0111001100110011... (IEEE 754 double)
72.6 β 0 10000000101 0111001100110100...
XOR β 0 00000000000 0000000000000111... (mostly zeros β compresses well)
Result: ~12x compression ratio over raw float64 storage
Downsampling Strategy
Raw metrics at 10-second granularity are expensive to store for 1 year. Use downsampling:
Storage tier: Raw (10s) β 1-min avg β 1-hour avg β 1-day avg
Retention: 7 days 30 days 6 months 1 year
Size factor: 1x 1/6x 1/360x 1/8640x
Query use: Recent debug Recent trend Weekly trend Long-term
Aggregation functions stored per interval:
- avg (most common for dashboards)
- min (detect spikes)
- max (detect spikes)
- sum (counters, e.g., request count)
- count (number of raw samples aggregated)
- p95 (percentile β stored as approximation via T-Digest or HDR Histogram)
Downsampling pipeline:
TSDB Raw (10s data)
β
βββ Background job every 1 min β Write 1-min aggregates to TSDB
βββ Background job every 1 hour β Write 1-hour aggregates to TSDB
βββ Background job every 1 day β Write 1-day aggregates to TSDB
Delete raw data after 7 days, 1-min data after 30 days, etc.
Hot / Warm / Cold Storage Tiering
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Storage Tiering β
β β
β HOT (0β7 days) WARM (7β30 days) COLD (30dβ1 year) β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β TSDB local β β β Replicated β β β Object β β
β β NVMe SSD β β TSDB / SSD β β Storage β β
β β (fastest) β β (fast) β β (S3/GCS) β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β 10s granularity 1-min granularity 1-hour/1-day β
β Full resolution Medium resolution Low resolution β
β Sub-second query < 1s query 2β5s query β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Query routing:
def route_query(start_time, end_time):
age_days = (now() - start_time) / 86400
if age_days <= 7:
return query_hot_storage(start_time, end_time)
elif age_days <= 30:
return query_warm_storage(start_time, end_time)
else:
return query_cold_storage(start_time, end_time)Alerting System Design
Alert rule definition:
alert: HighCPUUsage
expr: avg(cpu_usage_percent{env="prod"}) by (host) > 85
for: 5m # Must be above threshold for 5 minutes (avoid flapping)
labels:
severity: warning
annotations:
summary: "High CPU on {{ $labels.host }}"
description: "CPU is {{ $value }}% for 5 minutes"
notify:
- channel: pagerduty
- channel: slack
webhook: https://hooks.slack.com/services/...Alerting pipeline:
TSDB / Kafka
β
βΌ
Alert Rule Engine (runs every 1 min)
β Evaluates all rules against current metric values
β Compares result against threshold
β
βΌ
Alert State Machine
βββββββββββ threshold crossed βββββββββββ "for" duration met βββββββββββ
βINACTIVE β ββββββββββββββββββββββΆβ PENDING β βββββββββββββββββββββββΆβ FIRING β
βββββββββββ βββββββββββ βββββββββββ
β β
β threshold recovered β threshold recovered
βΌ βΌ
βββββββββββ βββββββββββ
βINACTIVE β βINACTIVE β
βββββββββββ βββββββββββ
Alert Manager
β Receives FIRING alerts
β Deduplicates (same alert β send once per interval)
β Groups related alerts (host=server-01 fires 3 alerts β 1 notification)
β Silences (maintenance windows)
β Routes to correct channel based on severity
βΌ
Notification Channels
βββ Email (SMTP)
βββ PagerDuty (REST API + escalation policies)
βββ Slack (webhook)
Alert deduplication:
def should_send_alert(alert_id, alert_state):
last_sent = redis.get(f"alert_last_sent:{alert_id}")
if alert_state == "FIRING" and last_sent is None:
redis.setex(f"alert_last_sent:{alert_id}", 3600, now()) # resend every 1 hour
return True
elif alert_state == "RESOLVED":
redis.delete(f"alert_last_sent:{alert_id}")
return True # Always send resolved
return FalseQuery Layer: PromQL-Style Execution
PromQL example queries:
# Instant vector: current CPU for all hosts in prod
cpu_usage_percent{env="prod"}
# Range vector: rate of HTTP requests per second over last 5 min
rate(http_requests_total{job="api"}[5m])
# Aggregation: avg CPU by region
avg by (region) (cpu_usage_percent{env="prod"})
# Alerting expression: 95th percentile latency > 500ms
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
Query execution flow:
PromQL query string
β
βΌ Parse
AST (abstract syntax tree)
β
βΌ Plan
Query plan (which series, what time range, what aggregations)
β
βΌ Execute
TSDB: fetch matching series from index, load chunks for time range
β
βΌ Aggregate
Apply sum/avg/rate functions in memory
β
βΌ Return
JSON result to visualization layer
Visualization Layer (Grafana-style)
Dashboard structure:
Dashboard: "Production Infrastructure"
βββ Row: "Compute"
β βββ Panel: CPU Usage (line chart) β query: avg(cpu_usage_percent) by (host)
β βββ Panel: Memory Used % (gauge) β query: mem_used / mem_total * 100
β βββ Panel: CPU Heatmap β query: cpu_usage_percent grouped by host
βββ Row: "Network"
β βββ Panel: Network In/Out (bytes/sec)
β βββ Panel: Packet Loss %
βββ Row: "Application"
βββ Panel: Request Rate (req/sec)
βββ Panel: Error Rate %
βββ Panel: p99 Latency (ms)
Pre-computed dashboard caching:
Dashboard load request
β
βββ Check Redis cache (TTL: 30s) β Cache hit: return immediately
β
βββ Cache miss:
β
βββ Run PromQL queries against TSDB
βββ Store result in Redis (TTL: 30s)
βββ Return to user
Design Summary
Full Observability Stack
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Complete Architecture β
β β
β Data Sources Collection Ingestion β
β ββββββββββββ βββββββββββ βββββββββββββββββββββ β
β βApp ServerβββpushβββΆ β Agent ββββββββΆβ Kafka β β
β βDatabase β β(Datadog β β "metrics-raw" β β
β βContainer β β style) β β partitioned by β β
β ββββββββββββ βββββββββββ β metric_name hash β β
β ββββββββββ¬βββββββββββ β
β Pull targets Pull model β β
β ββββββββββββ βββββββββββ βΌ β
β β/metrics ββββscrapeβββPrometheusβ ββββββββββββββββββββ β
β βendpoint β βScraper ββββββββΆβ TSDB Writer β β
β ββββββββββββ βββββββββββ β (batch writes) β β
β ββββββββββ¬ββββββββββ β
β βΌ β
β βββββββββββββββββββββββββββββ β
β β Time-Series Database β β
β β (InfluxDB / OpenTSDB) β β
β β β β
β β Hot: 7d (10s raw) β β
β β Warm: 30d (1m agg) β β
β β Cold: 1yr (1h/1d agg) β β
β βββββββββββββ¬ββββββββββββββββ β
β β β
β βββββββββββββββββββββββββΌβββββββββββββββ β
β βΌ βΌ βΌ β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββ
β β Query API β βAlert Manager β βGrafana ββ
β β (PromQL) β β(Rule engine) β βDashboardββ
β ββββββββββββββββ ββββββββ¬ββββββββ ββββββββββββ
β βΌ β
β βββββββββββββββββββ β
β β Notifications β β
β β Email/PD/Slack β β
β βββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Key Decisions Summary
| Decision | Choice | Reasoning |
|---|---|---|
| Collection model | Hybrid (push + Kafka buffer) | Works behind firewall, absorbs spikes |
| Storage | Time-series DB (InfluxDB/OpenTSDB) | High write throughput, time-range optimized |
| Compression | Gorilla encoding | 12x compression, industry proven |
| Downsampling | 10s β 1m β 1h β 1d | Balance storage cost vs query granularity |
| Storage tiering | Hot/warm/cold (SSD β S3) | Cost-efficient 1-year retention |
| Alerting | Rule engine + deduplication | Avoid alert storms, PagerDuty escalation |
| Query | PromQL-style | Label-based, aggregation-first |
Interview Questions & Answers
Q: Why use a time-series database instead of MySQL or Postgres for metrics?
A: TSDBs are optimized for time-stamped data: columnar storage + Gorilla/delta encoding achieves 12x compression vs row-based storage. They support built-in downsampling, TTL, and time-range scans in O(1) via chunk index. MySQL would require manual partitioning and canβt sustain 10K+ writes/sec with the same memory footprint.
Q: Explain the difference between pull model (Prometheus) and push model (Datadog).
A: Pull β the collector scrapes targets on a schedule. Easy to detect down hosts (no scrape = problem), but needs network access to all targets. Push β agents on each host push metrics to a collector. Works behind firewalls and NAT, but silent failures are harder to detect. Hybrid (push to Kafka, then pull from Kafka into TSDB) gets the best of both.
Q: What is downsampling and why is it critical for a 1-year retention system?
A: Downsampling replaces high-resolution raw data with aggregated summaries (avg/min/max) over longer intervals. 10s raw data for 1 year would require ~63 TB. After downsampling to 1-min for 30 days, 1-hour for 6 months, and 1-day for the rest, storage drops to 5β7 TB. We accept lower resolution for older data because minute-by-minute variation from 6 months ago is rarely needed.
Q: How does the alerting system avoid alert storms (thousands of alerts firing at once)?
A: Three mechanisms: (1) The for clause requires a threshold to be breached for N minutes before firing (prevents flapping). (2) Deduplication β Alert Manager tracks last_sent in Redis and suppresses re-notification until cooldown expires. (3) Alert grouping β multiple related alerts (all from same host) are bundled into a single notification.
Q: How does Gorilla compression work?
A: Two tricks. For timestamps: delta-of-delta encoding β store the difference between consecutive deltas. If scrape interval is constant (every 60s), deltas are all 60 and the delta-of-delta is 0, requiring only 1β2 bits each. For float values: XOR consecutive readings β adjacent metric values (e.g., CPU) change little, so XOR produces mostly-zero bit patterns that compress aggressively. Together these achieve ~12x compression over raw float64 storage.
Key Takeaways
- Use a hybrid collection model β push to Kafka, then write to TSDB. Kafka buffers spikes and enables replay.
- Time-series DB is mandatory at this scale β relational DBs cannot handle the write throughput or time-range query patterns.
- Gorilla compression (XOR for values, delta-of-delta for timestamps) achieves ~12x compression. Essential for cost-efficient storage.
- Downsampling is non-negotiable for 1-year retention β raw 10s data would cost 63 TB; downsampled data costs ~5 TB.
- Hot/warm/cold storage tiering matches query latency needs to storage cost: SSD for recent data, S3 for historical.
- Alert deduplication +
forclause prevents alert storms. Alert Manager groups related alerts before notifying on-call. - Pre-compute dashboards using Redis cache (TTL 30s) to avoid hammering the TSDB on every page load.
Related Resources
- distributed-system-components - Kafka, Redis, storage tiers
- key-patterns - Write-heavy systems, time-series patterns
- ch06-ad-click-aggregation - Related: streaming aggregation, Kafka, Flink
- ch04-rate-limiter - Rate limiting ingestion API
Last Updated: 2026-04-13
Status: Interview ready β Hard question, appears at Datadog, Meta, Google SRE