Chapter 5: Design a Metrics Monitoring and Alerting System

volume2 metrics monitoring alerting time-series

Status: 🟩 Interview ready
Difficulty: Hard
Time to complete: 50 min read + practice


Overview

A metrics monitoring and alerting system collects, stores, queries, and visualizes operational data from infrastructure and applications, then fires alerts when anomalies occur.

Real-world examples: Datadog, Prometheus + Grafana, AWS CloudWatch, New Relic

Why this matters:

  • Core infrastructure question (appears at FAANG, Stripe, Datadog)
  • Touches time-series databases, streaming ingestion, and distributed systems
  • Tests knowledge of write-heavy workloads, data compression, and storage tiering

Problem Statement

Design a metrics monitoring system that:

  • Collects infrastructure and application metrics at 100M DAU scale
  • Stores metrics with 1-year retention
  • Supports PromQL-style queries (groupby, avg, sum, rate)
  • Fires alerts via Email, PagerDuty, Slack when thresholds are exceeded
  • Renders Grafana-like dashboards for visualization

Step 1: Requirements & Scope (5 min)

Functional Requirements

Clarifying questions:

  • What types of metrics? β†’ CPU, memory, disk, custom app metrics, business metrics
  • What scale? β†’ 1000 servers Γ— 10 metrics each = 10,000 writes/sec
  • Query style? β†’ PromQL-style: avg/sum/rate over time ranges, groupby labels
  • Alert channels? β†’ Email, PagerDuty, Slack
  • Retention? β†’ 1 year

Scope:

  • Ingest metrics at high write throughput
  • Store with downsampling over time
  • Query with label-based filtering and aggregations
  • Alert when metric value crosses threshold
  • Visualize on dashboards

Non-Functional Requirements

  • High write throughput: ~10K writes/sec sustained
  • High availability: Monitoring must stay up even when services are down
  • Durability: No metric loss; 1-year retention
  • Query performance: Dashboard queries return in < 1 second
  • Scalability: Scale from 10K to 1M writes/sec horizontally
  • Low latency alerts: Alert fires within 1–2 minutes of threshold breach

Scale Estimates

Servers:         1,000
Metrics/server:  10
Write rate:      10,000 metrics/sec
Metric size:     ~200 bytes (name + labels + timestamp + value)
Raw data/day:    10,000 Γ— 200 B Γ— 86,400 sec = ~173 GB/day
Raw data/year:   ~63 TB/year (before compression/downsampling)
After 12x compression + downsampling: ~5–7 TB/year

Step 2: High-Level Design (10 min)

Core Components

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Metrics Monitoring System                    β”‚
β”‚                                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚   Metrics    β”‚    β”‚  Collection   β”‚    β”‚   Ingestion      β”‚  β”‚
β”‚  β”‚   Sources    │───▢│    Layer      │───▢│   Pipeline       β”‚  β”‚
β”‚  β”‚              β”‚    β”‚  (Pull/Push)  β”‚    β”‚  (Kafka buffer)  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                                    β”‚            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚              Time-Series Database (TSDB)                  β”‚  β”‚
β”‚  β”‚           InfluxDB / Prometheus TSDB / OpenTSDB           β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                               β”‚                                  β”‚
β”‚          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚          β–Ό                    β–Ό                    β–Ό            β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚   β”‚  Query API  β”‚    β”‚ Alert Managerβ”‚    β”‚   Visualization   β”‚  β”‚
β”‚   β”‚  (PromQL)   β”‚    β”‚ (Rules eval) β”‚    β”‚   (Grafana-like)  β”‚  β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                             β”‚                                    β”‚
β”‚                      β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”                            β”‚
β”‚                      β”‚Notification β”‚                            β”‚
β”‚                      β”‚  Channels   β”‚                            β”‚
β”‚                      β”‚Email/PD/Slackβ”‚                           β”‚
β”‚                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Collection: Pull vs Push

Pull model (Prometheus):

Prometheus Server ──── scrapes ────▢ /metrics endpoint on each host
                                     CPU: 0.72
                                     mem_used_bytes: 2147483648
                                     disk_free_bytes: 107374182400

Push model (Datadog, StatsD):

Host Agent ────── pushes ──────▢ Collector / Aggregator ──▢ TSDB
              metric + labels
              + timestamp + value

Metrics Data Model

Every metric is a tuple:

(metric_name, {labels}, timestamp, value)

Example:
  metric_name:  "cpu_usage_percent"
  labels:       {host="server-01", region="us-east-1", env="prod"}
  timestamp:    1712995200   (Unix epoch, seconds)
  value:        72.4

Wire format (line protocol - InfluxDB style):
  cpu_usage_percent,host=server-01,region=us-east-1,env=prod value=72.4 1712995200

API Design

Write API:

POST /v1/metrics
Content-Type: application/json

{
  "metric_name": "cpu_usage_percent",
  "labels": {"host": "server-01", "region": "us-east-1"},
  "timestamp": 1712995200,
  "value": 72.4
}

Response: 204 No Content

Query API:

GET /v1/query_range
  ?query=avg(cpu_usage_percent{region="us-east-1"}) by (host)
  &start=1712908800
  &end=1712995200
  &step=60

Response:
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {"host": "server-01"},
        "values": [[1712908800, "68.2"], [1712908860, "71.1"], ...]
      }
    ]
  }
}

Step 3: Deep Dive (25 min)

Pull vs Push Model β€” Detailed Comparison

Pull model (Prometheus-style):

Prometheus server
    β”‚
    β”œβ”€β”€ every 15s β†’ GET http://server-01:9090/metrics
    β”œβ”€β”€ every 15s β†’ GET http://server-02:9090/metrics
    └── every 15s β†’ GET http://server-03:9090/metrics
AspectPullPush
Who initiatesCollector scrapes targetAgent pushes to collector
Health checkEasy (no scrape = target down)Hard (silent failure)
FirewallCollector needs access to targetsWorks behind NAT/firewall
ScalabilityCollector becomes bottleneckScales with agents
Data loss on restartRescrapesCan lose buffered data
ExamplesPrometheus, NagiosDatadog, StatsD, CloudWatch

Recommendation: Use hybrid β€” agents push to Kafka, Prometheus-style query layer reads from TSDB.

Hybrid ingestion pipeline:

Host Agent  ──push──▢  Kafka Topic     ──consume──▢  TSDB Writer  ──▢  TSDB
             metric      "metrics-raw"    (Flink or                (InfluxDB/
             events       partitioned       consumer                OpenTSDB)
                         by metric_name    group)

Why Kafka as buffer?:

  • Decouples producers (agents) from consumers (TSDB writers)
  • Absorbs traffic spikes (TSDB writes can be batched)
  • Replay capability if TSDB is temporarily unavailable
  • Multiple consumers: TSDB writer + alerting consumer + archiving

Time-Series Database Design

Why not a relational database?

ConcernRelational DBTime-Series DB
Write throughput~10K rows/sec per instanceMillions of points/sec
Time-range queriesFull table scan or index scanOptimized for range reads
CompressionRow-based, poor for floatsColumn-based + delta encoding
Retention/TTLManual partitioningBuilt-in downsampling & TTL
SchemaFixed schemaDynamic label key-value pairs

TSDB storage layout (columnar):

Chunk for metric "cpu_usage_percent" / host="server-01":

Timestamps (delta-encoded):
  [1712908800, +60, +60, +60, +60, ...]   β†’ store base + deltas (small ints)

Values (XOR / Gorilla encoding):
  [72.4, 72.6, 71.9, 73.1, ...]          β†’ XOR consecutive floats β†’ many zeros β†’ compress

Block structure:
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  Block header (time range, metric names) β”‚
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚  Index: label β†’ chunk offset             β”‚
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚  Chunk 1: timestamps + values (Gorilla)  β”‚
  β”‚  Chunk 2: timestamps + values (Gorilla)  β”‚
  β”‚  ...                                     β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Gorilla compression (Facebook, 2015):

Step 1 β€” Delta-of-delta encoding for timestamps:
  Raw:     [100, 160, 220, 280]
  Delta:   [60, 60, 60]
  DOD:     [0, 0, 0]              ← All zeros! 1-2 bits to store each

Step 2 β€” XOR encoding for float values:
  72.4 β†’ 0 10000000101 0111001100110011...  (IEEE 754 double)
  72.6 β†’ 0 10000000101 0111001100110100...
  XOR  β†’ 0 00000000000 0000000000000111...  (mostly zeros β†’ compresses well)

Result: ~12x compression ratio over raw float64 storage

Downsampling Strategy

Raw metrics at 10-second granularity are expensive to store for 1 year. Use downsampling:

Storage tier:    Raw (10s)     β†’  1-min avg  β†’  1-hour avg  β†’  1-day avg
Retention:       7 days            30 days        6 months      1 year
Size factor:     1x                1/6x           1/360x        1/8640x
Query use:       Recent debug      Recent trend   Weekly trend  Long-term

Aggregation functions stored per interval:
  - avg   (most common for dashboards)
  - min   (detect spikes)
  - max   (detect spikes)
  - sum   (counters, e.g., request count)
  - count (number of raw samples aggregated)
  - p95   (percentile β€” stored as approximation via T-Digest or HDR Histogram)

Downsampling pipeline:

TSDB Raw (10s data)
    β”‚
    β”œβ”€β”€ Background job every 1 min  β†’  Write 1-min aggregates to TSDB
    β”œβ”€β”€ Background job every 1 hour β†’  Write 1-hour aggregates to TSDB
    └── Background job every 1 day  β†’  Write 1-day aggregates to TSDB

Delete raw data after 7 days, 1-min data after 30 days, etc.

Hot / Warm / Cold Storage Tiering

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Storage Tiering                             β”‚
β”‚                                                                  β”‚
β”‚   HOT (0–7 days)         WARM (7–30 days)     COLD (30d–1 year) β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚   β”‚  TSDB local  β”‚  β†’    β”‚  Replicated  β”‚  β†’  β”‚  Object      β”‚  β”‚
β”‚   β”‚  NVMe SSD    β”‚       β”‚  TSDB / SSD  β”‚     β”‚  Storage     β”‚  β”‚
β”‚   β”‚  (fastest)   β”‚       β”‚  (fast)      β”‚     β”‚  (S3/GCS)    β”‚  β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚   10s granularity        1-min granularity     1-hour/1-day      β”‚
β”‚   Full resolution        Medium resolution     Low resolution    β”‚
β”‚   Sub-second query       < 1s query            2–5s query        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Query routing:

def route_query(start_time, end_time):
    age_days = (now() - start_time) / 86400
 
    if age_days <= 7:
        return query_hot_storage(start_time, end_time)
    elif age_days <= 30:
        return query_warm_storage(start_time, end_time)
    else:
        return query_cold_storage(start_time, end_time)

Alerting System Design

Alert rule definition:

alert: HighCPUUsage
expr: avg(cpu_usage_percent{env="prod"}) by (host) > 85
for: 5m           # Must be above threshold for 5 minutes (avoid flapping)
labels:
  severity: warning
annotations:
  summary: "High CPU on {{ $labels.host }}"
  description: "CPU is {{ $value }}% for 5 minutes"
notify:
  - channel: pagerduty
  - channel: slack
    webhook: https://hooks.slack.com/services/...

Alerting pipeline:

TSDB / Kafka
    β”‚
    β–Ό
Alert Rule Engine (runs every 1 min)
    β”‚   Evaluates all rules against current metric values
    β”‚   Compares result against threshold
    β”‚
    β–Ό
Alert State Machine
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   threshold crossed   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   "for" duration met   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚INACTIVE β”‚ ─────────────────────▢│ PENDING β”‚ ──────────────────────▢│  FIRING β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                           β”‚                                    β”‚
                                           β”‚ threshold recovered                β”‚ threshold recovered
                                           β–Ό                                    β–Ό
                                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                      β”‚INACTIVE β”‚                         β”‚INACTIVE β”‚
                                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Alert Manager
    β”‚   Receives FIRING alerts
    β”‚   Deduplicates (same alert β†’ send once per interval)
    β”‚   Groups related alerts (host=server-01 fires 3 alerts β†’ 1 notification)
    β”‚   Silences (maintenance windows)
    β”‚   Routes to correct channel based on severity
    β–Ό
Notification Channels
    β”œβ”€β”€ Email (SMTP)
    β”œβ”€β”€ PagerDuty (REST API + escalation policies)
    └── Slack (webhook)

Alert deduplication:

def should_send_alert(alert_id, alert_state):
    last_sent = redis.get(f"alert_last_sent:{alert_id}")
 
    if alert_state == "FIRING" and last_sent is None:
        redis.setex(f"alert_last_sent:{alert_id}", 3600, now())  # resend every 1 hour
        return True
    elif alert_state == "RESOLVED":
        redis.delete(f"alert_last_sent:{alert_id}")
        return True  # Always send resolved
    return False

Query Layer: PromQL-Style Execution

PromQL example queries:

# Instant vector: current CPU for all hosts in prod
cpu_usage_percent{env="prod"}

# Range vector: rate of HTTP requests per second over last 5 min
rate(http_requests_total{job="api"}[5m])

# Aggregation: avg CPU by region
avg by (region) (cpu_usage_percent{env="prod"})

# Alerting expression: 95th percentile latency > 500ms
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5

Query execution flow:

PromQL query string
    β”‚
    β–Ό Parse
AST (abstract syntax tree)
    β”‚
    β–Ό Plan
Query plan (which series, what time range, what aggregations)
    β”‚
    β–Ό Execute
TSDB: fetch matching series from index, load chunks for time range
    β”‚
    β–Ό Aggregate
Apply sum/avg/rate functions in memory
    β”‚
    β–Ό Return
JSON result to visualization layer

Visualization Layer (Grafana-style)

Dashboard structure:

Dashboard: "Production Infrastructure"
β”œβ”€β”€ Row: "Compute"
β”‚   β”œβ”€β”€ Panel: CPU Usage (line chart) β€” query: avg(cpu_usage_percent) by (host)
β”‚   β”œβ”€β”€ Panel: Memory Used % (gauge) β€” query: mem_used / mem_total * 100
β”‚   └── Panel: CPU Heatmap β€” query: cpu_usage_percent grouped by host
β”œβ”€β”€ Row: "Network"
β”‚   β”œβ”€β”€ Panel: Network In/Out (bytes/sec)
β”‚   └── Panel: Packet Loss %
└── Row: "Application"
    β”œβ”€β”€ Panel: Request Rate (req/sec)
    β”œβ”€β”€ Panel: Error Rate %
    └── Panel: p99 Latency (ms)

Pre-computed dashboard caching:

Dashboard load request
    β”‚
    β”œβ”€β”€ Check Redis cache (TTL: 30s)  β†’  Cache hit: return immediately
    β”‚
    └── Cache miss:
         β”‚
         β”œβ”€β”€ Run PromQL queries against TSDB
         β”œβ”€β”€ Store result in Redis (TTL: 30s)
         └── Return to user

Design Summary

Full Observability Stack

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Complete Architecture                           β”‚
β”‚                                                                     β”‚
β”‚  Data Sources           Collection        Ingestion                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚App Server│──push──▢  β”‚  Agent  │──────▢│      Kafka        β”‚    β”‚
β”‚  β”‚Database  β”‚           β”‚(Datadog β”‚       β”‚  "metrics-raw"    β”‚    β”‚
β”‚  β”‚Container β”‚           β”‚ style)  β”‚       β”‚  partitioned by   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚  metric_name hash β”‚    β”‚
β”‚                                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚  Pull targets           Pull model                 β”‚               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β–Ό               β”‚
β”‚  β”‚/metrics  │◀──scrape──│Prometheusβ”‚      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚endpoint  β”‚           β”‚Scraper  │──────▢│   TSDB Writer    β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚  (batch writes)  β”‚     β”‚
β”‚                                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚                                                    β–Ό               β”‚
β”‚                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚                                    β”‚  Time-Series Database     β”‚   β”‚
β”‚                                    β”‚  (InfluxDB / OpenTSDB)    β”‚   β”‚
β”‚                                    β”‚                           β”‚   β”‚
β”‚                                    β”‚  Hot:  7d   (10s raw)     β”‚   β”‚
β”‚                                    β”‚  Warm: 30d  (1m agg)      β”‚   β”‚
β”‚                                    β”‚  Cold: 1yr  (1h/1d agg)   β”‚   β”‚
β”‚                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                β”‚                   β”‚
β”‚                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚                        β–Ό                       β–Ό              β–Ό   β”‚
β”‚               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚               β”‚  Query API   β”‚       β”‚Alert Manager β”‚  β”‚Grafana  β”‚β”‚
β”‚               β”‚  (PromQL)    β”‚       β”‚(Rule engine) β”‚  β”‚Dashboardβ”‚β”‚
β”‚               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚                                             β–Ό                      β”‚
β”‚                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
β”‚                                    β”‚ Notifications   β”‚             β”‚
β”‚                                    β”‚ Email/PD/Slack  β”‚             β”‚
β”‚                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Decisions Summary

DecisionChoiceReasoning
Collection modelHybrid (push + Kafka buffer)Works behind firewall, absorbs spikes
StorageTime-series DB (InfluxDB/OpenTSDB)High write throughput, time-range optimized
CompressionGorilla encoding12x compression, industry proven
Downsampling10s β†’ 1m β†’ 1h β†’ 1dBalance storage cost vs query granularity
Storage tieringHot/warm/cold (SSD β†’ S3)Cost-efficient 1-year retention
AlertingRule engine + deduplicationAvoid alert storms, PagerDuty escalation
QueryPromQL-styleLabel-based, aggregation-first

Interview Questions & Answers

Q: Why use a time-series database instead of MySQL or Postgres for metrics?
A: TSDBs are optimized for time-stamped data: columnar storage + Gorilla/delta encoding achieves 12x compression vs row-based storage. They support built-in downsampling, TTL, and time-range scans in O(1) via chunk index. MySQL would require manual partitioning and can’t sustain 10K+ writes/sec with the same memory footprint.

Q: Explain the difference between pull model (Prometheus) and push model (Datadog).
A: Pull β€” the collector scrapes targets on a schedule. Easy to detect down hosts (no scrape = problem), but needs network access to all targets. Push β€” agents on each host push metrics to a collector. Works behind firewalls and NAT, but silent failures are harder to detect. Hybrid (push to Kafka, then pull from Kafka into TSDB) gets the best of both.

Q: What is downsampling and why is it critical for a 1-year retention system?
A: Downsampling replaces high-resolution raw data with aggregated summaries (avg/min/max) over longer intervals. 10s raw data for 1 year would require ~63 TB. After downsampling to 1-min for 30 days, 1-hour for 6 months, and 1-day for the rest, storage drops to 5–7 TB. We accept lower resolution for older data because minute-by-minute variation from 6 months ago is rarely needed.

Q: How does the alerting system avoid alert storms (thousands of alerts firing at once)?
A: Three mechanisms: (1) The for clause requires a threshold to be breached for N minutes before firing (prevents flapping). (2) Deduplication β€” Alert Manager tracks last_sent in Redis and suppresses re-notification until cooldown expires. (3) Alert grouping β€” multiple related alerts (all from same host) are bundled into a single notification.

Q: How does Gorilla compression work?
A: Two tricks. For timestamps: delta-of-delta encoding β€” store the difference between consecutive deltas. If scrape interval is constant (every 60s), deltas are all 60 and the delta-of-delta is 0, requiring only 1–2 bits each. For float values: XOR consecutive readings β€” adjacent metric values (e.g., CPU) change little, so XOR produces mostly-zero bit patterns that compress aggressively. Together these achieve ~12x compression over raw float64 storage.


Key Takeaways

  1. Use a hybrid collection model β€” push to Kafka, then write to TSDB. Kafka buffers spikes and enables replay.
  2. Time-series DB is mandatory at this scale β€” relational DBs cannot handle the write throughput or time-range query patterns.
  3. Gorilla compression (XOR for values, delta-of-delta for timestamps) achieves ~12x compression. Essential for cost-efficient storage.
  4. Downsampling is non-negotiable for 1-year retention β€” raw 10s data would cost 63 TB; downsampled data costs ~5 TB.
  5. Hot/warm/cold storage tiering matches query latency needs to storage cost: SSD for recent data, S3 for historical.
  6. Alert deduplication + for clause prevents alert storms. Alert Manager groups related alerts before notifying on-call.
  7. Pre-compute dashboards using Redis cache (TTL 30s) to avoid hammering the TSDB on every page load.


Last Updated: 2026-04-13
Status: Interview ready β€” Hard question, appears at Datadog, Meta, Google SRE