Chapter 5: Design a Metrics Monitoring and Alerting System

volume2 metrics monitoring alerting time-series

Status: 🟩 Interview ready
Difficulty: Hard
Time to complete: 50 min read + practice

Overview

A metrics monitoring and alerting system collects, stores, queries, and visualizes operational data from infrastructure and applications, then fires alerts when anomalies occur.

Real-world examples: Datadog, Prometheus + Grafana, AWS CloudWatch, New Relic

Why this matters:

Core infrastructure question (appears at FAANG, Stripe, Datadog)
Touches time-series databases, streaming ingestion, and distributed systems
Tests knowledge of write-heavy workloads, data compression, and storage tiering

Problem Statement

Design a metrics monitoring system that:

Collects infrastructure and application metrics at 100M DAU scale
Stores metrics with 1-year retention
Supports PromQL-style queries (groupby, avg, sum, rate)
Fires alerts via Email, PagerDuty, Slack when thresholds are exceeded
Renders Grafana-like dashboards for visualization

Step 1: Requirements & Scope (5 min)

Functional Requirements

Clarifying questions:

What types of metrics? → CPU, memory, disk, custom app metrics, business metrics
What scale? → 1,000 servers × 10 metrics each = 10,000 time series; budget for up to ~10,000 writes/sec (every series reporting once per second)
Query style? → PromQL-style: avg/sum/rate over time ranges, groupby labels
Alert channels? → Email, PagerDuty, Slack
Retention? → 1 year

Scope:

Ingest metrics at high write throughput
Store with downsampling over time
Query with label-based filtering and aggregations
Alert when metric value crosses threshold
Visualize on dashboards

Non-Functional Requirements

High write throughput: ~10K writes/sec sustained
High availability: Monitoring must stay up even when services are down
Durability: No metric loss; 1-year retention
Query performance: Dashboard queries return in < 1 second
Scalability: Scale from 10K to 1M writes/sec horizontally
Low-latency alerts: Alert fires within 1–2 minutes of a rule's condition being met (threshold breached for the rule's full "for" duration)

Scale Estimates

Servers:         1,000
Metrics/server:  10
Write rate:      10,000 metrics/sec
Metric size:     ~200 bytes (name + labels + timestamp + value)
Raw data/day:    10,000 × 200 B × 86,400 sec = ~173 GB/day
Raw data/year:   ~63 TB/year (before compression/downsampling)
After 12x compression + downsampling: ~5–7 TB/year
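
The arithmetic above can be reproduced with a short script. This is a sketch that takes the stated inputs (10,000 writes/sec, 200 bytes per point, 12x compression) at face value; the function name and the use of decimal units (1 TB = 10^12 bytes) are choices made here for illustration.

```python
SECONDS_PER_DAY = 86_400

def storage_estimate(writes_per_sec=10_000, bytes_per_point=200,
                     compression_ratio=12, days=365):
    """Return (raw GB/day, raw TB/year, compressed TB/year) in decimal units."""
    raw_bytes_per_day = writes_per_sec * bytes_per_point * SECONDS_PER_DAY
    raw_bytes_per_year = raw_bytes_per_day * days
    return (
        raw_bytes_per_day / 1e9,
        raw_bytes_per_year / 1e12,
        raw_bytes_per_year / compression_ratio / 1e12,
    )

gb_day, tb_year, tb_year_compressed = storage_estimate()
print(f"{gb_day:.1f} GB/day raw, {tb_year:.1f} TB/year raw, "
      f"{tb_year_compressed:.2f} TB/year at 12x compression")
```

Under these inputs, compression alone accounts for most of the drop to roughly 5 TB; downsampling older data reduces the footprint further.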

Step 2: High-Level Design (10 min)

Core Components

┌──────────────────────────────────────────────────────────────────┐
│                     Metrics Monitoring System                    │
│                                                                  │
│  ┌──────────────┐    ┌───────────────┐    ┌──────────────────┐  │
│  │   Metrics    │    │  Collection   │    │   Ingestion      │  │
│  │   Sources    │───▶│    Layer      │───▶│   Pipeline       │  │
│  │              │    │  (Pull/Push)  │    │  (Kafka buffer)  │  │
│  └──────────────┘    └───────────────┘    └────────┬─────────┘  │
│                                                    │            │
│  ┌─────────────────────────────────────────────────▼─────────┐  │
│  │              Time-Series Database (TSDB)                  │  │
│  │           InfluxDB / Prometheus TSDB / OpenTSDB           │  │
│  └────────────────────────────┬──────────────────────────────┘  │
│                               │                                  │
│          ┌────────────────────┼────────────────────┐            │
│          ▼                    ▼                    ▼            │
│   ┌─────────────┐    ┌──────────────┐    ┌───────────────────┐  │
│   │  Query API  │    │ Alert Manager│    │   Visualization   │  │
│   │  (PromQL)   │    │ (Rules eval) │    │   (Grafana-like)  │  │
│   └─────────────┘    └──────────────┘    └───────────────────┘  │
│                             │                                    │
│                      ┌──────▼──────┐                            │
│                      │Notification │                            │
│                      │  Channels   │                            │
│                      │Email/PD/Slack│                           │
│                      └─────────────┘                            │
└──────────────────────────────────────────────────────────────────┘

Data Collection: Pull vs Push

Pull model (Prometheus):

Prometheus Server ──── scrapes ────▶ /metrics endpoint on each host
                                     cpu_usage 0.72
                                     mem_used_bytes 2147483648
                                     disk_free_bytes 107374182400

Push model (Datadog, StatsD):

Host Agent ────── pushes ──────▶ Collector / Aggregator ──▶ TSDB
              metric + labels
              + timestamp + value

Metrics Data Model

Every metric is a tuple:

(metric_name, {labels}, timestamp, value)

Example:
  metric_name:  "cpu_usage_percent"
  labels:       {host="server-01", region="us-east-1", env="prod"}
  timestamp:    1712995200   (Unix epoch, seconds)
  value:        72.4

Wire format (line protocol - InfluxDB style):
  cpu_usage_percent,host=server-01,region=us-east-1,env=prod value=72.4 1712995200
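
As a minimal sketch of how the tuple maps to the wire format, the helpers below serialize and parse one sample. The function names are hypothetical, and the sketch makes simplifying assumptions: it does not escape spaces or commas in names and labels, and it uses second-precision timestamps as in the example (InfluxDB's own default precision is nanoseconds).

```python
def to_line_protocol(metric_name, labels, timestamp, value):
    """Serialize one sample as: name,k1=v1,k2=v2 value=<v> <timestamp>."""
    # Sort labels so the same series always produces the same key.
    label_part = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    series_key = f"{metric_name},{label_part}" if label_part else metric_name
    return f"{series_key} value={value} {timestamp}"

def from_line_protocol(line):
    """Parse a line produced by to_line_protocol back into a tuple."""
    series_key, field, timestamp = line.split(" ")
    metric_name, *pairs = series_key.split(",")
    labels = dict(pair.split("=", 1) for pair in pairs)
    value = float(field.split("=", 1)[1])
    return metric_name, labels, int(timestamp), value

line = to_line_protocol(
    "cpu_usage_percent",
    {"host": "server-01", "region": "us-east-1", "env": "prod"},
    1712995200,
    72.4,
)
print(line)
print(from_line_protocol(line))
```

Sorting the labels means two agents that emit the same labels in a different order still produce the same series key.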

API Design

Write API:

POST /v1/metrics
Content-Type: application/json

{
  "metric_name": "cpu_usage_percent",
  "labels": {"host": "server-01", "region": "us-east-1"},
  "timestamp": 1712995200,
  "value": 72.4
}

Response: 204 No Content

Query API:

GET /v1/query_range
  ?query=avg(cpu_usage_percent{region="us-east-1"}) by (host)
  &start=1712908800
  &end=1712995200
  &step=60

Response:
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {"host": "server-01"},
        "values": [[1712908800, "68.2"], [1712908860, "71.1"], ...]
      }
    ]
  }
}
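
The query string contains braces, quotes, and spaces, so a client has to URL-encode it. The sketch below builds the request URL with the standard library and computes how many points per series the range and step imply; the host name and function names are hypothetical.

```python
from urllib.parse import urlencode

def build_query_range_url(base_url, query, start, end, step):
    """Build a GET URL for the query_range endpoint with the query URL-encoded."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/v1/query_range?{params}"

def points_per_series(start, end, step):
    """Number of evaluation timestamps: start, start+step, ... up to end inclusive."""
    return (end - start) // step + 1

url = build_query_range_url(
    "https://metrics.example.com",  # hypothetical host
    'avg(cpu_usage_percent{region="us-east-1"}) by (host)',
    start=1712908800, end=1712995200, step=60,
)
print(url)
print(points_per_series(1712908800, 1712995200, 60))  # 1441
```

A 24-hour range at a 60-second step returns 1,441 points per series, which is one reason dashboards usually widen the step as the time range grows.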

Step 3: Deep Dive (25 min)

Pull vs Push Model — Detailed Comparison

Pull model (Prometheus-style):

Prometheus server
    │
    ├── every 15s → GET http://server-01:9090/metrics
    ├── every 15s → GET http://server-02:9090/metrics
    └── every 15s → GET http://server-03:9090/metrics

Aspect                Pull                               Push
--------------------  ---------------------------------  ---------------------------
Who initiates         Collector scrapes target           Agent pushes to collector
Health check          Easy (no scrape = target down)     Hard (silent failure)
Firewall              Collector needs access to targets  Works behind NAT/firewall
Scalability           Collector becomes bottleneck       Scales with agents
Data loss on restart  Rescrapes                          Can lose buffered data
Examples              Prometheus, Nagios                 Datadog, StatsD, CloudWatch

Recommendation: Use a hybrid model. Agents push to Kafka, TSDB writers consume from Kafka, and a PromQL-style query layer reads from the TSDB.

Hybrid ingestion pipeline:

Host Agent  ──push──▶  Kafka Topic     ──consume──▶  TSDB Writer  ──▶  TSDB
             metric      "metrics-raw"    (Flink or                (InfluxDB/
             events       partitioned       consumer                OpenTSDB)
                         by metric_name    group)

Why Kafka as a buffer?

Decouples producers (agents) from consumers (TSDB writers)
Absorbs traffic spikes (TSDB writes can be batched)
Replay capability if TSDB is temporarily unavailable
Multiple consumers: TSDB writer + alerting consumer + archiving
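
How a producer maps a sample to a partition can be sketched as below. This is an illustration, not Kafka's own partitioner: it uses an MD5-based hash as a stand-in, and the function names are hypothetical. Keying by the full series (metric name plus labels) instead of the metric name alone is an assumption made here to spread popular metrics across partitions while keeping each series ordered within one partition.

```python
import hashlib

def series_key(metric_name, labels):
    """Stable identifier for one time series: name plus sorted labels."""
    label_part = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"{metric_name}{{{label_part}}}"

def pick_partition(key, num_partitions):
    """Map a key to a partition with a hash that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

key = series_key("cpu_usage_percent", {"host": "server-01", "env": "prod"})
print(key, "->", pick_partition(key, num_partitions=12))
```

Python's built-in hash() is deliberately avoided because it is randomized per process for strings, which would send the same series to different partitions after a restart.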

Time-Series Database Design

Why not a relational database?

Concern             Relational DB                           Time-Series DB
------------------  --------------------------------------  --------------------------------
Write throughput    Per-row index and transaction overhead  Append-optimized, batched ingest
Time-range queries  Full table scan or index scan           Optimized for range reads
Compression         Row-based, poor for floats              Column-based + delta encoding
Retention/TTL       Manual partitioning                     Built-in downsampling & TTL
Schema              Fixed schema                            Dynamic label key-value pairs

TSDB storage layout (columnar):

Chunk for metric "cpu_usage_percent" / host="server-01":

Timestamps (delta-encoded):
  [1712908800, +60, +60, +60, +60, ...]   → store base + deltas (small ints)

Values (XOR / Gorilla encoding):
  [72.4, 72.6, 71.9, 73.1, ...]          → XOR consecutive floats → many zeros → compress

Block structure:
  ┌──────────────────────────────────────────┐
  │  Block header (time range, metric names) │
  ├──────────────────────────────────────────┤
  │  Index: label → chunk offset             │
  ├──────────────────────────────────────────┤
  │  Chunk 1: timestamps + values (Gorilla)  │
  │  Chunk 2: timestamps + values (Gorilla)  │
  │  ...                                     │
  └──────────────────────────────────────────┘

Gorilla compression (Facebook, 2015):

Step 1 — Delta-of-delta encoding for timestamps:
  Raw:     [100, 160, 220, 280]
  Delta:   [60, 60, 60]
  DOD:     [0, 0]                 ← All zeros: each is stored as a single 0 bit

Step 2 — XOR encoding for float values:
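  72.0 → 0x4052000000000000                             (IEEE 754 double)
  72.0 → 0x4052000000000000   XOR = 0x0000000000000000  (unchanged value → a single 0 bit)
  72.5 → 0x4052200000000000   XOR = 0x0000200000000000  (1 meaningful bit + short header)
  Repeated or slowly changing values compress best. Arbitrary decimals share fewer bits:
  72.4 XOR 72.6 = 0x00003FFFFFFFFFFC, which leaves 44 meaningful bits to store.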

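Result: ~12x compression on Facebook's reported workload (16-byte timestamp/value pairs reduced to ~1.37 bytes on average); actual ratios depend on how regular the data is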
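
The two transforms can be tried out directly. The sketch below computes delta-of-delta values for timestamps and the XOR of consecutive float bit patterns; it stops short of the actual bit-packing, and the helper names are hypothetical.

```python
import struct

def delta_of_delta(timestamps):
    """Return (first timestamp, first delta, list of delta-of-deltas)."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def float_bits(x):
    """Raw 64-bit pattern of an IEEE 754 double, as an int."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    """XOR each value's bit pattern with the previous value's pattern."""
    bits = [float_bits(v) for v in values]
    return [b ^ a for a, b in zip(bits, bits[1:])]

def meaningful_bits(x):
    """Width of the span between the first and last set bit (0 if x == 0)."""
    if x == 0:
        return 0
    trailing_zeros = (x & -x).bit_length() - 1
    return x.bit_length() - trailing_zeros

print(delta_of_delta([100, 160, 220, 280]))  # (100, 60, [0, 0])
for x in xor_stream([72.0, 72.0, 72.5]):
    print(f"{x:016x}", meaningful_bits(x))
x = float_bits(72.4) ^ float_bits(72.6)
print(f"{x:016x}", meaningful_bits(x))
```

Repeated values and values with short binary fractions leave almost nothing to store, while arbitrary decimal readings such as 72.4 and 72.6 leave a much wider span of meaningful bits. This is why the achievable ratio depends on the workload.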

Downsampling Strategy

Raw metrics at 10-second granularity are expensive to store for 1 year. Use downsampling:

Storage tier:    Raw (10s)     →  1-min avg  →  1-hour avg  →  1-day avg
Retention:       7 days            30 days        6 months      1 year
Size factor:     1x                1/6x           1/360x        1/8640x
Query use:       Recent debug      Recent trend   Weekly trend  Long-term

Aggregation functions stored per interval:
  - avg   (most common for dashboards)
  - min   (detect spikes)
  - max   (detect spikes)
  - sum   (counters, e.g., request count)
  - count (number of raw samples aggregated)
  - p95   (percentile — stored as approximation via T-Digest or HDR Histogram)

Downsampling pipeline:

TSDB Raw (10s data)
    │
    ├── Background job every 1 min  →  Write 1-min aggregates to TSDB
    ├── Background job every 1 hour →  Write 1-hour aggregates to TSDB
    └── Background job every 1 day  →  Write 1-day aggregates to TSDB

Delete raw data after 7 days, 1-min data after 30 days, etc.
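
One rollup step of this pipeline might look like the sketch below, which groups raw points into fixed-width buckets and emits the aggregates listed above (percentiles are left out because they need a sketch structure such as T-Digest). The function name and record layout are assumptions for illustration.

```python
from collections import defaultdict

def downsample(points, bucket_seconds=60):
    """Aggregate (timestamp, value) pairs into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each point to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)

    rollups = []
    for bucket_start in sorted(buckets):
        values = buckets[bucket_start]
        rollups.append({
            "timestamp": bucket_start,
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "count": len(values),
            "avg": sum(values) / len(values),
        })
    return rollups

raw = [(0, 1.0), (10, 2.0), (20, 3.0), (60, 10.0), (70, 20.0)]
for row in downsample(raw):
    print(row)
```

Storing sum and count alongside avg lets the hourly job compute a correct average from the 1-minute rollups even when buckets hold different numbers of samples; averaging the averages would not.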

Hot / Warm / Cold Storage Tiering

┌──────────────────────────────────────────────────────────────────┐
│                      Storage Tiering                             │
│                                                                  │
│   HOT (0–7 days)         WARM (7–30 days)     COLD (30d–1 year) │
│   ┌──────────────┐       ┌──────────────┐     ┌──────────────┐  │
│   │  TSDB local  │  →    │  Replicated  │  →  │  Object      │  │
│   │  NVMe SSD    │       │  TSDB / SSD  │     │  Storage     │  │
│   │  (fastest)   │       │  (fast)      │     │  (S3/GCS)    │  │
│   └──────────────┘       └──────────────┘     └──────────────┘  │
│   10s granularity        1-min granularity     1-hour/1-day      │
│   Full resolution        Medium resolution     Low resolution    │
│   Sub-second query       < 1s query            2–5s query        │
└──────────────────────────────────────────────────────────────────┘

Query routing:

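import time

def route_query(start_time, end_time, now=None):
    """Route by the age of the oldest requested point (Unix seconds)."""
    now = time.time() if now is None else now
    age_days = (now - start_time) / 86400

    # query_*_storage are the tier-specific readers, defined elsewhere.
    if age_days <= 7:
        return query_hot_storage(start_time, end_time)
    elif age_days <= 30:
        return query_warm_storage(start_time, end_time)
    else:
        # The whole range is served from the coarsest tier it touches.
        return query_cold_storage(start_time, end_time)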
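
A query that spans tier boundaries (for example, the last 10 days) could instead be split so each tier serves only its own slice. The sketch below is one possible approach, assuming the 7-day, 30-day, and 1-year boundaries from the tiering diagram; the names are hypothetical and merging the per-tier results is left out.

```python
DAY = 86_400
# (tier name, maximum age in seconds), youngest tier first
TIERS = [("hot", 7 * DAY), ("warm", 30 * DAY), ("cold", 365 * DAY)]

def split_by_tier(start_time, end_time, now):
    """Return [(tier, sub_start, sub_end), ...] covering [start_time, end_time)."""
    plan = []
    newer_boundary = now  # upper time bound of the tier being considered
    for tier, max_age in TIERS:
        older_boundary = now - max_age
        sub_start = max(start_time, older_boundary)
        sub_end = min(end_time, newer_boundary)
        if sub_start < sub_end:
            plan.append((tier, sub_start, sub_end))
        newer_boundary = older_boundary
    # Oldest slice first, so results can be concatenated in time order.
    return sorted(plan, key=lambda part: part[1])

now = 100 * DAY
print(split_by_tier(now - 10 * DAY, now, now))
```

Anything older than the last tier's boundary falls outside retention and is simply not included in the plan.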

Alerting System Design

Alert rule definition:

alert: HighCPUUsage
expr: avg(cpu_usage_percent{env="prod"}) by (host) > 85
for: 5m           # Must be above threshold for 5 minutes (avoid flapping)
labels:
  severity: warning
annotations:
  summary: "High CPU on {{ $labels.host }}"
  description: "CPU is {{ $value }}% for 5 minutes"
notify:
  - channel: pagerduty
  - channel: slack
    webhook: https://hooks.slack.com/services/...

Alerting pipeline:

TSDB / Kafka
    │
    ▼
Alert Rule Engine (runs every 1 min)
    │   Evaluates all rules against current metric values
    │   Compares result against threshold
    │
    ▼
Alert State Machine
    ┌─────────┐   threshold crossed   ┌─────────┐   "for" duration met   ┌─────────┐
    │INACTIVE │ ─────────────────────▶│ PENDING │ ──────────────────────▶│  FIRING │
    └─────────┘                       └─────────┘                         └─────────┘
                                           │                                    │
                                           │ threshold recovered                │ threshold recovered
                                           ▼                                    ▼
                                      ┌─────────┐                         ┌─────────┐
                                      │INACTIVE │                         │INACTIVE │
                                      └─────────┘                         └─────────┘

Alert Manager
    │   Receives FIRING alerts
    │   Deduplicates (same alert → send once per interval)
    │   Groups related alerts (host=server-01 fires 3 alerts → 1 notification)
    │   Silences (maintenance windows)
    │   Routes to correct channel based on severity
    ▼
Notification Channels
    ├── Email (SMTP)
    ├── PagerDuty (REST API + escalation policies)
    └── Slack (webhook)
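
The state machine in the pipeline above can be sketched as a small function that the rule engine calls once per evaluation tick. The class and function names are hypothetical, and the sketch handles a single "greater than" threshold only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlertState:
    state: str = "INACTIVE"
    pending_since: Optional[float] = None

def evaluate(alert: AlertState, value: float, threshold: float,
             for_seconds: float, now: float) -> str:
    """Advance one alert by one evaluation tick and return the new state."""
    if value <= threshold:
        # Recovery from PENDING or FIRING returns to INACTIVE.
        alert.state, alert.pending_since = "INACTIVE", None
    elif alert.state == "INACTIVE":
        alert.state, alert.pending_since = "PENDING", now
    if alert.state == "PENDING" and now - alert.pending_since >= for_seconds:
        alert.state = "FIRING"
    return alert.state

alert = AlertState()
for t, cpu in [(0, 90), (60, 91), (240, 92), (300, 93), (360, 80)]:
    print(t, cpu, evaluate(alert, cpu, threshold=85, for_seconds=300, now=t))
```

With a 5-minute "for" duration, the alert stays PENDING until the breach has lasted 300 seconds, then moves to FIRING, and drops back to INACTIVE on the first reading at or below the threshold.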

Alert deduplication:

import time
import redis  # redis-py client

r = redis.Redis()  # connection settings omitted
RESEND_INTERVAL_S = 3600  # re-notify at most once per hour while firing

def should_send_alert(alert_id, alert_state):
    key = f"alert_last_sent:{alert_id}"
    if alert_state == "FIRING":
        # SET with NX + EX is atomic, so concurrent evaluators send only once.
        return bool(r.set(key, int(time.time()), nx=True, ex=RESEND_INTERVAL_S))
    if alert_state == "RESOLVED":
        r.delete(key)
        return True  # Always send resolved
    return False
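
Grouping, the other storm-control step listed for the Alert Manager, can be sketched in the same spirit. The alert dictionary shape and the function name below are assumptions for illustration; a real alert manager would also apply a grouping wait window before sending.

```python
from collections import defaultdict

def group_alerts(firing_alerts, group_by=("host",)):
    """Bundle firing alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in firing_alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert["name"])
    # One notification per group, listing every alert that belongs to it.
    return [
        {"group": dict(zip(group_by, key)), "alerts": sorted(names)}
        for key, names in sorted(groups.items())
    ]

firing = [
    {"name": "HighCPUUsage", "labels": {"host": "server-01"}},
    {"name": "HighMemory", "labels": {"host": "server-01"}},
    {"name": "DiskAlmostFull", "labels": {"host": "server-01"}},
    {"name": "HighCPUUsage", "labels": {"host": "server-02"}},
]
for notification in group_alerts(firing):
    print(notification)
```

Four firing alerts collapse into two notifications here, one per host.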

Query Layer: PromQL-Style Execution

PromQL example queries:

# Instant vector: current CPU for all hosts in prod
cpu_usage_percent{env="prod"}

# Range vector: rate of HTTP requests per second over last 5 min
rate(http_requests_total{job="api"}[5m])

# Aggregation: avg CPU by region
avg by (region) (cpu_usage_percent{env="prod"})

# Alerting expression: 95th percentile latency > 500ms
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
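
To make the semantics of rate() concrete, here is a simplified sketch of a per-second rate over one window of counter samples. It handles counter resets but, unlike Prometheus, does not extrapolate to the window boundaries, so results can differ slightly from the real function. The function name is hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window of (timestamp, value) samples."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count curr in full.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None

print(simple_rate([(0, 0), (60, 120), (120, 240)]))    # 2.0 requests/sec
print(simple_rate([(0, 100), (60, 130), (120, 20)]))   # reset between the last two samples
```

Because counters only ever increase between restarts, treating any decrease as a reset keeps the rate from going negative when a process restarts.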

Query execution flow:

PromQL query string
    │
    ▼ Parse
AST (abstract syntax tree)
    │
    ▼ Plan
Query plan (which series, what time range, what aggregations)
    │
    ▼ Execute
TSDB: fetch matching series from index, load chunks for time range
    │
    ▼ Aggregate
Apply sum/avg/rate functions in memory
    │
    ▼ Return
JSON result to visualization layer
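
The Aggregate step for a query such as avg by (region) can be sketched as an in-memory group-by over the fetched series. The input shape (a labels dict plus a timestamp-to-value mapping per series) and the function name are assumptions for illustration, and the sketch assumes samples are already aligned to the same step timestamps.

```python
from collections import defaultdict

def avg_by(series, label):
    """Average values per timestamp across all series sharing a label value.

    series: list of (labels dict, {timestamp: value}) pairs.
    Returns {label value: {timestamp: average}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for labels, points in series:
        group = labels.get(label, "")
        for ts, value in points.items():
            sums[group][ts] += value
            counts[group][ts] += 1
    return {
        group: {ts: sums[group][ts] / counts[group][ts] for ts in sorted(sums[group])}
        for group in sums
    }

fetched = [
    ({"host": "server-01", "region": "us-east-1"}, {0: 60.0, 60: 70.0}),
    ({"host": "server-02", "region": "us-east-1"}, {0: 80.0, 60: 90.0}),
    ({"host": "server-03", "region": "eu-west-1"}, {0: 40.0, 60: 50.0}),
]
print(avg_by(fetched, "region"))
```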

Visualization Layer (Grafana-style)

Dashboard structure:

Dashboard: "Production Infrastructure"
├── Row: "Compute"
│   ├── Panel: CPU Usage (line chart) — query: avg(cpu_usage_percent) by (host)
│   ├── Panel: Memory Used % (gauge) — query: mem_used / mem_total * 100
│   └── Panel: CPU Heatmap — query: cpu_usage_percent grouped by host
├── Row: "Network"
│   ├── Panel: Network In/Out (bytes/sec)
│   └── Panel: Packet Loss %
└── Row: "Application"
    ├── Panel: Request Rate (req/sec)
    ├── Panel: Error Rate %
    └── Panel: p99 Latency (ms)

Dashboard result caching (cache-aside):

Dashboard load request
    │
    ├── Check Redis cache (TTL: 30s)  →  Cache hit: return immediately
    │
    └── Cache miss:
         │
         ├── Run PromQL queries against TSDB
         ├── Store result in Redis (TTL: 30s)
         └── Return to user
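
The flow above is a cache-aside pattern. The sketch below shows it with a tiny in-process TTL cache standing in for Redis so that it runs without external services; the class, the function names, and the run_query callback are all hypothetical.

```python
import time

class TTLCache:
    """Tiny in-process stand-in for Redis GET/SETEX, for illustration only."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] <= self._clock():
            self._store.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (self._clock() + ttl_seconds, value)

def load_panel(cache, query, run_query, ttl_seconds=30):
    """Return a cached result if still fresh, else run the query and cache it."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    result = run_query(query)
    cache.setex(query, ttl_seconds, result)
    return result

cache = TTLCache()
fake_tsdb = lambda q: {"query": q, "values": [[0, "68.2"]]}
print(load_panel(cache, "avg(cpu_usage_percent) by (host)", fake_tsdb))
```

With a 30-second TTL, many viewers of the same dashboard trigger at most one TSDB query per panel every 30 seconds, at the cost of data that can be up to 30 seconds stale.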

Design Summary

Full Observability Stack

┌─────────────────────────────────────────────────────────────────────┐
│                     Complete Architecture                           │
│                                                                     │
│  Data Sources           Collection        Ingestion                 │
│  ┌──────────┐           ┌─────────┐       ┌───────────────────┐    │
│  │App Server│──push──▶  │  Agent  │──────▶│      Kafka        │    │
│  │Database  │           │(Datadog │       │  "metrics-raw"    │    │
│  │Container │           │ style)  │       │  partitioned by   │    │
│  └──────────┘           └─────────┘       │  metric_name hash │    │
│                                           └────────┬──────────┘    │
│  Pull targets           Pull model                 │               │
│  ┌──────────┐           ┌─────────┐                ▼               │
│  │/metrics  │◀──scrape──│Prometheus│      ┌──────────────────┐     │
│  │endpoint  │           │Scraper  │──────▶│   TSDB Writer    │     │
│  └──────────┘           └─────────┘       │  (batch writes)  │     │
│                                           └────────┬─────────┘     │
│                                                    ▼               │
│                                    ┌───────────────────────────┐   │
│                                    │  Time-Series Database     │   │
│                                    │  (InfluxDB / OpenTSDB)    │   │
│                                    │                           │   │
│                                    │  Hot:  7d   (10s raw)     │   │
│                                    │  Warm: 30d  (1m agg)      │   │
│                                    │  Cold: 1yr  (1h/1d agg)   │   │
│                                    └───────────┬───────────────┘   │
│                                                │                   │
│                        ┌───────────────────────┼──────────────┐   │
│                        ▼                       ▼              ▼   │
│               ┌──────────────┐       ┌──────────────┐  ┌─────────┐│
│               │  Query API   │       │Alert Manager │  │Grafana  ││
│               │  (PromQL)    │       │(Rule engine) │  │Dashboard││
│               └──────────────┘       └──────┬───────┘  └─────────┘│
│                                             ▼                      │
│                                    ┌─────────────────┐             │
│                                    │ Notifications   │             │
│                                    │ Email/PD/Slack  │             │
│                                    └─────────────────┘             │
└─────────────────────────────────────────────────────────────────────┘

Key Decisions Summary

Decision          Choice                              Reasoning
----------------  ----------------------------------  -------------------------------------------
Collection model  Hybrid (push + Kafka buffer)        Works behind firewall, absorbs spikes
Storage           Time-series DB (InfluxDB/OpenTSDB)  High write throughput, time-range optimized
Compression       Gorilla encoding                    12x compression, industry proven
Downsampling      10s → 1m → 1h → 1d                  Balance storage cost vs query granularity
Storage tiering   Hot/warm/cold (SSD → S3)            Cost-efficient 1-year retention
Alerting          Rule engine + deduplication         Avoid alert storms, PagerDuty escalation
Query             PromQL-style                        Label-based, aggregation-first

Interview Questions & Answers

Q: Why use a time-series database instead of MySQL or Postgres for metrics?
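A: TSDBs are optimized for time-stamped data: columnar chunks with delta-of-delta and XOR (Gorilla) encoding can reach roughly 12x compression on typical metrics, versus row-based storage with per-row overhead. They provide built-in downsampling and TTL, and they locate the chunks for a time range through a per-series index instead of scanning rows. MySQL or Postgres can often ingest 10K writes/sec, but they need manual partitioning for retention and use far more storage and memory per point, which becomes the limiting factor as the system scales toward 1M writes/sec.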

Q: Explain the difference between pull model (Prometheus) and push model (Datadog).
A: Pull — the collector scrapes targets on a schedule. Easy to detect down hosts (no scrape = problem), but needs network access to all targets. Push — agents on each host push metrics to a collector. Works behind firewalls and NAT, but silent failures are harder to detect. Hybrid (push to Kafka, then pull from Kafka into TSDB) gets the best of both.

Q: What is downsampling and why is it critical for a 1-year retention system?
A: Downsampling replaces high-resolution raw data with aggregated summaries (avg/min/max) over longer intervals. Keeping every raw point at 10K writes/sec for 1 year would require ~63 TB before compression. Keeping raw data for 7 days, 1-min aggregates for 30 days, 1-hour aggregates for 6 months, and 1-day aggregates for 1 year, combined with compression, brings storage down to roughly 5–7 TB or less. We accept lower resolution for older data because minute-by-minute variation from 6 months ago is rarely needed.

Q: How does the alerting system avoid alert storms (thousands of alerts firing at once)?
A: Three mechanisms: (1) The for clause requires a threshold to be breached for N minutes before firing (prevents flapping). (2) Deduplication — Alert Manager tracks last_sent in Redis and suppresses re-notification until cooldown expires. (3) Alert grouping — multiple related alerts (all from same host) are bundled into a single notification.

Q: How does Gorilla compression work?
A: Two tricks. For timestamps: delta-of-delta encoding stores the difference between consecutive deltas. If the scrape interval is constant (every 60s), the deltas are all 60 and each delta-of-delta is 0, which Gorilla stores as a single bit. For float values: each reading is XORed with the previous one. When a value repeats, the XOR is zero (a single bit), and when it changes only slightly, the XOR has few meaningful bits, so only those are stored. On Facebook's production data this reduced 16-byte timestamp/value pairs to ~1.37 bytes on average, about 12x.

Key Takeaways

Use a hybrid collection model — push to Kafka, then write to TSDB. Kafka buffers spikes and enables replay.
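A time-series DB is the right fit at this scale: a relational DB may absorb 10K writes/sec, but it lacks the compression, built-in retention, and time-range scan optimizations needed as the system grows toward 1M writes/sec.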
Gorilla compression (XOR for values, delta-of-delta for timestamps) achieves ~12x compression. Essential for cost-efficient storage.
Downsampling is non-negotiable for 1-year retention: raw data at 10K writes/sec would take ~63 TB/year; compressed and downsampled data takes ~5 TB.
Hot/warm/cold storage tiering matches query latency needs to storage cost: SSD for recent data, S3 for historical.
Alert deduplication + for clause prevents alert storms. Alert Manager groups related alerts before notifying on-call.
Cache dashboard query results in Redis (TTL 30s) to avoid hammering the TSDB on every page load.

Related Resources

distributed-system-components - Kafka, Redis, storage tiers
key-patterns - Write-heavy systems, time-series patterns
ch06-ad-click-aggregation - Related: streaming aggregation, Kafka, Flink
ch04-rate-limiter - Rate limiting ingestion API

Last Updated: 2026-04-13
Status: Interview ready — Hard question, appears at Datadog, Meta, Google SRE

Study Notes by Niladri & AI
