Chapter 5 Flashcards - Metrics Monitoring and Alerting System

flashcards volume2 metrics monitoring alerting time-series

What is the metrics data model used in systems like Prometheus and Datadog?
?
Every metric is a 4-tuple: (metric_name, {labels}, timestamp, value). Example: metric_name="cpu_usage_percent", labels={host="server-01", region="us-east-1", env="prod"}, timestamp=1712995200, value=72.4. Labels are key-value pairs used for filtering and grouping. This is the foundation of all PromQL-style queries.
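
A minimal sketch of the 4-tuple and of label-based filtering, assuming an in-memory representation. The names `Sample`, `make_sample`, and `matches` are hypothetical and do not come from Prometheus or Datadog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    metric_name: str
    labels: tuple      # sorted (key, value) pairs, so the sample stays hashable
    timestamp: int     # Unix seconds, as in the card's example
    value: float


def make_sample(name, labels, timestamp, value):
    return Sample(name, tuple(sorted(labels.items())), timestamp, value)


def matches(sample, name, **selector):
    """True if the sample has this metric name and every selector label."""
    labels = dict(sample.labels)
    return sample.metric_name == name and all(
        labels.get(k) == v for k, v in selector.items()
    )


s = make_sample(
    "cpu_usage_percent",
    {"host": "server-01", "region": "us-east-1", "env": "prod"},
    1712995200,
    72.4,
)
is_prod = matches(s, "cpu_usage_percent", env="prod")   # label filter, like {env="prod"}
```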

What is the Pull model for metrics collection? Name pros and cons.
?
The collector (e.g., Prometheus) scrapes a /metrics endpoint on each target on a schedule. Pros: Easy to detect down hosts (no scrape = target is down), collector controls rate, simple debugging. Cons: Collector needs network access to all targets (firewall/NAT problems), collector can become bottleneck at large scale. Used by Prometheus, Nagios.

What is the Push model for metrics collection? Name pros and cons.
?
Each host runs an agent that pushes metrics to a central collector (e.g., Datadog Agent, StatsD). Pros: Works behind NAT/firewall, scales with number of agents, agents can buffer locally if collector is unavailable. Cons: Silent failure is hard to detect (agent stops pushing with no error), potential to overload collector with traffic spikes. Used by Datadog, CloudWatch, StatsD.

What is the hybrid collection model and why is it preferred at scale?
?
Agents push metrics to Kafka, and a consumer service writes from Kafka into the TSDB. Best of both: push model works behind firewalls, Kafka buffers spikes so TSDB isn’t overwhelmed, Kafka replay handles TSDB downtime, multiple consumers can read same data (TSDB writer + alerting consumer + archiving). This is the production-grade choice for systems like Datadog at scale.

Why can’t a relational database (MySQL/Postgres) be used for metrics storage at scale?
?
Four reasons: (1) Write throughput: a TSDB is built to ingest millions of points/sec, versus roughly 10K rows/sec for a typical single-node RDBMS. (2) Time-range queries: a TSDB indexes chunks by time range; an RDBMS needs a B-tree index on the timestamp column (costly to maintain at this write rate) or falls back to full table scans. (3) Compression: a TSDB uses columnar storage with Gorilla encoding (~12x); an RDBMS uses row-based storage, which compresses floats poorly. (4) Schema flexibility: a TSDB supports dynamic label key-value pairs; an RDBMS needs a fixed schema or wide sparse tables.

What is Gorilla compression? Explain the two core encoding techniques.
?
Facebook’s 2015 time-series compression algorithm, achieving ~12x compression. Two techniques: (1) Delta-of-delta for timestamps: store the difference between consecutive deltas. If the scrape interval is constant (60s), every delta is 60 and every delta-of-delta is 0, which Gorilla encodes as a single bit. (2) XOR for float values: XOR each float with the previous one bit-by-bit. Consecutive samples of the same series (e.g., CPU at 72.4 then 72.6) share the sign, exponent, and leading mantissa bits, so the XOR starts with a run of zeros that a variable-length encoding stores compactly; an unchanged value XORs to 0 and also costs a single bit.
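
The two transforms can be sketched as follows. This shows only the delta-of-delta and XOR steps, not Gorilla's variable-length bit packing; the helper names are hypothetical.

```python
import struct


def delta_of_deltas(timestamps):
    """Second-order differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(x):
    """IEEE 754 double reinterpreted as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """XOR of each value's bit pattern with the previous value's."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


ts = [1712995200 + 60 * i for i in range(5)]   # constant 60s scrape interval
dod = delta_of_deltas(ts)                      # all zeros

xors = xor_stream([72.4, 72.4, 72.6])
# xors[0] is 0 because the value did not change.
# xors[1] has its top 12 bits (sign + exponent) clear because both values
# lie between 64 and 128 and therefore share the same exponent.
```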

What is downsampling and what is the standard retention ladder?
?
Downsampling replaces high-resolution raw data with aggregated summaries over longer intervals, reducing storage dramatically. Standard ladder: Raw 10s data → retained 7 days, 1-min aggregates → retained 30 days, 1-hour aggregates → retained 6 months, 1-day aggregates → retained 1 year. For each interval, store: avg, min, max, sum, count (and optionally p95 via T-Digest). Reduces 63 TB/year raw storage to ~5-7 TB.

What aggregation functions should be stored per downsampling interval, and why?
?
Store avg (for dashboards and trending), min and max (for detecting spikes that avg would hide), sum (for counters like request counts), count (to know how many raw samples were aggregated). Optionally store p95/p99 via an approximate data structure such as T-Digest or HDR Histogram. Storing only avg would lose spike information: a CPU that sits at 70% and spikes to 100% for one 10s sample appears as 75% in the 1-min avg, but max=100% catches it.
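
A small sketch of per-interval aggregation, assuming samples arrive as (timestamp, value) pairs and buckets are aligned to multiples of the interval. The sample data is made up: a 70% baseline with one 10s sample at 100%.

```python
def aggregate(values):
    """The five aggregates stored per downsampling interval."""
    return {
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "count": len(values),
    }


def downsample(samples, interval):
    """Group (timestamp, value) pairs into interval-aligned buckets."""
    buckets = {}
    for ts, v in samples:
        buckets.setdefault(ts - ts % interval, []).append(v)
    return {start: aggregate(vals) for start, vals in sorted(buckets.items())}


raw = [(0, 70.0), (10, 70.0), (20, 100.0), (30, 70.0), (40, 70.0), (50, 70.0)]
minute = downsample(raw, 60)[0]
# avg is 75.0 and hides the spike; max is 100.0 and preserves it.
```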

Describe the hot/warm/cold storage tiering strategy for a 1-year metrics system.
?
Three tiers: Hot (0-7 days) — local NVMe SSD, full 10s resolution, sub-second queries, most expensive. Warm (7-30 days) — replicated TSDB on SSD, 1-min resolution, < 1s queries, medium cost. Cold (30 days-1 year) — object storage (S3/GCS), 1-hour or 1-day resolution, 2-5s queries, cheapest. Query router inspects start_time and routes to appropriate tier. Data is migrated automatically by a background job as it ages.
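
The query router's decision can be sketched as a lookup on the age of `start_time`. The tier boundaries and resolutions come from the card; the `route` function and the choice of 1-hour resolution for the cold tier are assumptions for illustration.

```python
DAY = 86_400

# (maximum age in seconds, tier name, resolution in seconds)
TIERS = [
    (7 * DAY, "hot", 10),
    (30 * DAY, "warm", 60),
    (365 * DAY, "cold", 3600),
]


def route(start_time, now):
    """Pick the tier that still holds data as old as start_time."""
    age = now - start_time
    for max_age, tier, resolution in TIERS:
        if age <= max_age:
            return tier, resolution
    raise ValueError("start_time is older than the 1-year retention window")


now = 1712995200
recent = route(now - 1 * DAY, now)      # served from the hot tier
older = route(now - 10 * DAY, now)      # served from the warm tier
oldest = route(now - 100 * DAY, now)    # served from the cold tier
```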

How does the alerting state machine work? Name the states and transitions.
?
Three states: INACTIVE (metric below threshold), PENDING (threshold crossed but “for” duration not yet met), FIRING (threshold exceeded for required duration → notification sent). Transitions: INACTIVE → PENDING when threshold first crossed; PENDING → FIRING after “for” duration passes (e.g., 5 minutes); FIRING or PENDING → INACTIVE when metric recovers. The PENDING state prevents flapping (brief spikes don’t page on-call).
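
The transitions above can be sketched as a small class. This assumes a simple "value above threshold" rule and explicit timestamps passed in by the caller; the `Alert` class is hypothetical, not Prometheus code.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"


class Alert:
    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.state = State.INACTIVE
        self.pending_since = None

    def evaluate(self, value, now):
        if value <= self.threshold:
            # Recovery resets the alert from either PENDING or FIRING.
            self.state, self.pending_since = State.INACTIVE, None
        elif self.state is State.INACTIVE:
            # First breach: start the "for" timer.
            self.state, self.pending_since = State.PENDING, now
        if (self.state is State.PENDING
                and now - self.pending_since >= self.for_seconds):
            self.state = State.FIRING
        return self.state


alert = Alert(threshold=85, for_seconds=300)
history = [
    alert.evaluate(90, now=0),     # breach begins
    alert.evaluate(92, now=240),   # still within the "for" window
    alert.evaluate(91, now=300),   # breached for 5 minutes
    alert.evaluate(50, now=360),   # metric recovers
]
```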

What is the “for” clause in an alert rule and why is it critical?
?
The “for” clause requires a threshold to be continuously breached for a specified duration before the alert fires. Example: “for: 5m” means CPU > 85% must be true for 5 consecutive minutes before PagerDuty is called. Without it, a 10-second CPU spike at 3am would wake up an engineer. The for clause prevents alert flapping on transient anomalies, drastically reducing alert fatigue.

How does alert deduplication work? Why is it needed?
?
Alert Manager tracks the last time each alert was sent (stored in Redis with TTL). If an alert is already FIRING, it suppresses re-notification until the cooldown expires (e.g., resend every 1 hour for persistent issues). It always sends a RESOLVED notification when the metric recovers. Without deduplication, a threshold breach lasting 2 hours would send 120 PagerDuty pages (one per minute). With deduplication and no reminders, the same incident sends 2: one for FIRING, one for RESOLVED. With a 1-hour resend interval it sends 3, because one reminder goes out at the 1-hour mark.
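
A sketch of the cooldown logic, assuming an in-memory dict in place of the Redis store and a hypothetical `Deduplicator` class. It replays a 2-hour breach evaluated once per minute with a 1-hour resend interval.

```python
class Deduplicator:
    def __init__(self, repeat_interval):
        self.repeat_interval = repeat_interval
        self.last_sent = {}   # alert key -> time of last notification

    def on_firing(self, key, now):
        last = self.last_sent.get(key)
        if last is not None and now - last < self.repeat_interval:
            return None       # suppressed: still inside the cooldown
        self.last_sent[key] = now
        return "FIRING"

    def on_resolved(self, key):
        if self.last_sent.pop(key, None) is None:
            return None       # nothing was ever sent for this alert
        return "RESOLVED"


d = Deduplicator(repeat_interval=3600)
sent = [d.on_firing("cpu:server-01", minute * 60) for minute in range(120)]
sent.append(d.on_resolved("cpu:server-01"))
notifications = [n for n in sent if n]
# 120 evaluations collapse to: initial FIRING, one reminder at 1 hour, RESOLVED.
```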

What is alert grouping and how does it prevent alert storms?
?
Alert grouping bundles multiple related alerts into a single notification. Example: If disk is full on server-01, it might trigger 10 different alerts (high disk %, write errors, log rotation failures, etc.). Without grouping, on-call gets 10 pages. With grouping, Alert Manager identifies that they share the label host="server-01" and sends a single grouped notification: “10 alerts firing for server-01”. Configured by a grouping key: {host} produces the bundle above, whereas {alertname, host} would split it into one group per alert name on each host.
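
A sketch of grouping by label key, assuming alerts are plain dicts with a `labels` mapping. The function names and alert names are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label) for label in group_by)
        groups[key].append(alert)
    return groups


def summarize(groups, group_by):
    """One notification line per group."""
    lines = []
    for key, members in groups.items():
        scope = ", ".join(f"{k}={v}" for k, v in zip(group_by, key))
        lines.append(f"{len(members)} alerts firing for {scope}")
    return lines


alerts = [
    {"labels": {"alertname": f"disk_alert_{i}", "host": "server-01"}}
    for i in range(10)
]
alerts.append({"labels": {"alertname": "high_cpu", "host": "server-02"}})

by_host = summarize(group_alerts(alerts, ["host"]), ["host"])
by_name_and_host = group_alerts(alerts, ["alertname", "host"])
```

Grouping by `host` yields two notifications for the eleven alerts, while grouping by `alertname` and `host` leaves every alert in its own group, so the choice of key decides how much is bundled.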

What is a PromQL range query? Give an example.
?
A range query evaluates an expression at regular steps over a time range. Syntax (Prometheus HTTP API): /api/v1/query_range?query=EXPR&start=START&end=END&step=STEP. Example: avg(cpu_usage_percent{env="prod"}) by (host) with step=60s returns a matrix: one time series per host, one data point per minute. Common functions: rate() for counters, avg_over_time() for gauges, histogram_quantile() for latency percentiles.

What is the difference between an instant vector and a range vector in PromQL?
?
Instant vector: a single snapshot of all matching time series at one moment in time. Example: cpu_usage_percent{env="prod"} returns the current CPU for every matching host. Range vector: all values within a lookback window for each series. Example: cpu_usage_percent{env="prod"}[5m] returns 5 minutes of samples for each host. Range vectors are used with rate() and avg_over_time() to compute rates and smoothed averages.
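
The difference can be sketched over an in-memory store that maps a series name to time-sorted (timestamp, value) pairs. This is a simplification: it ignores staleness handling, and the half-open window boundary is an assumption rather than a statement about any particular Prometheus version.

```python
def instant_vector(series, at):
    """Latest sample at or before `at`, one per series."""
    out = {}
    for labels, samples in series.items():
        eligible = [(t, v) for t, v in samples if t <= at]
        if eligible:
            out[labels] = eligible[-1]
    return out


def range_vector(series, at, window):
    """All samples in the window (at - window, at], per series."""
    return {
        labels: [(t, v) for t, v in samples if at - window < t <= at]
        for labels, samples in series.items()
    }


def avg_over_time(rv):
    """Average each series of a range vector down to a single value."""
    return {k: sum(v for _, v in s) / len(s) for k, s in rv.items() if s}


series = {"server-01": [(0, 70.0), (60, 72.0), (120, 74.0)]}
latest = instant_vector(series, at=120)
window = range_vector(series, at=120, window=120)
smoothed = avg_over_time(window)
```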

How is the TSDB chunk/block structure organized?
?
Data is organized in time-bounded blocks (e.g., 2-hour blocks in Prometheus). Each block contains: (1) Block header: time range covered, creation time. (2) Index: an inverted index that maps label sets to series and their chunk file offsets, so a series lookup does not scan chunk data. (3) Chunks: columnar data, one chunk per series, containing encoded timestamps and values (Gorilla compressed). Blocks are immutable once written; the most recent data lives in an in-memory head block, backed by an on-disk WAL (write-ahead log), until it is flushed to block files.

Why is Kafka used as a buffer between push agents and the TSDB writer?
?
Four reasons: (1) Decoupling — TSDB writer can be restarted without losing data; agents keep pushing. (2) Spike absorption — agents may burst at peak hours; Kafka queues the excess and TSDB writer consumes at a steady rate. (3) Replay — if TSDB is down for 30 minutes, Kafka retains messages and writer replays on recovery. (4) Multiple consumers — same metrics stream feeds the TSDB writer, the alerting evaluator, and an archiving consumer simultaneously.

How do you estimate write throughput for a metrics system?
?
Formula: Writes/sec = (Number of hosts) × (Metrics per host) × (1 / scrape interval). Example: 1,000 servers × 10 metrics × (1 / 10s scrape interval) = 1,000 writes/sec for a 1,000-server cluster. At 100M DAU with ~10K servers: 10,000 × 10 × (1/10) = 10,000 writes/sec. Raw data per day: 10,000 writes/sec × 200 bytes/point × 86,400 sec ≈ 173 GB/day uncompressed (about 63 TB/year).
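
The same arithmetic as a small calculator, using the card's inputs (10K servers, 10 metrics each, 10s interval, 200 bytes per point). The function names are hypothetical.

```python
def writes_per_sec(hosts, metrics_per_host, scrape_interval_s):
    return hosts * metrics_per_host / scrape_interval_s


def raw_bytes_per_day(writes, bytes_per_point=200):
    return writes * bytes_per_point * 86_400


w = writes_per_sec(10_000, 10, 10)            # 10,000 writes/sec
gb_per_day = raw_bytes_per_day(w) / 1e9       # about 173 GB/day uncompressed
tb_per_year = gb_per_day * 365 / 1000         # about 63 TB/year
```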

How do you handle TSDB query performance for dashboard loads?
?
Two strategies: (1) Pre-compute + cache — run common dashboard queries every 30 seconds, store results in Redis with TTL=30s. Dashboard load hits Redis cache instead of TSDB. (2) Query routing by time range — route queries for recent data to hot SSD-backed TSDB (fast), older queries to warm/cold storage (slower but acceptable for historical trends). For very wide queries (30-day range), use downsampled 1-hour aggregates instead of raw 10s data.

What data does a single metric write contain on the wire?
?
In InfluxDB line protocol format: "measurement,tag_key=tag_value field_key=field_value timestamp". Example: cpu_usage_percent,host=server-01,region=us-east-1,env=prod value=72.4 1712995200. Components: measurement name (string), tags (indexed label key=value pairs, stored in an inverted index for fast filtering), fields (the measured values, not indexed), Unix timestamp (nanoseconds by default in InfluxDB, milliseconds in Prometheus; the example uses seconds for readability). ~200 bytes per metric point uncompressed.
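
A sketch of writing and reading one line in this format. It assumes float-only fields and no escaping of spaces or commas, both of which the real line protocol handles; `to_line` and `parse_line` are hypothetical helpers, not an InfluxDB client API.

```python
def to_line(measurement, tags, fields, timestamp):
    """Serialize one point; tag and field order follows dict insertion order."""
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {timestamp}"


def parse_line(line):
    """Split a line into (measurement, tags, fields, timestamp)."""
    head, field_str, timestamp = line.split(" ")
    measurement, *tag_parts = head.split(",")
    tags = dict(p.split("=", 1) for p in tag_parts)
    fields = {
        k: float(v)
        for k, v in (p.split("=", 1) for p in field_str.split(","))
    }
    return measurement, tags, fields, int(timestamp)


line = ("cpu_usage_percent,host=server-01,region=us-east-1,env=prod "
        "value=72.4 1712995200")
parsed = parse_line(line)
round_trip = to_line(*parsed)
size_bytes = len(line.encode("utf-8"))   # wire size of this one point
```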

How does the alert rule engine evaluate rules continuously?
?
Rule engine runs on a tick (e.g., every 60 seconds). On each tick: (1) Load all alert rules from database (cached in memory, refreshed every 5 min). (2) For each rule, execute the PromQL expression against the TSDB. (3) Compare result to threshold. (4) Update alert state machine (INACTIVE → PENDING → FIRING). (5) Pass FIRING alerts to Alert Manager. Multiple rules are evaluated in parallel. The evaluator itself keeps no state between ticks: alert state, including when each alert entered PENDING, is persisted in a state store (Redis or DB).

How does the visualization layer (Grafana-style) avoid hammering the TSDB on every user interaction?
?
Three techniques: (1) Query caching — cache query results in Redis with short TTL (15–60 seconds). Same panel viewed by 100 users triggers only 1 TSDB query per TTL window. (2) Pre-computed snapshots — for shared/public dashboards, compute on schedule and serve static results. (3) Query downsampling — Grafana automatically adjusts step resolution to match panel pixel width (e.g., a 1000px wide panel over 7 days uses 10-min steps, not 10s raw data). Avoids returning millions of points the browser can’t render.
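
The step adjustment in technique (3) can be approximated as "at most one point per pixel". This is a simplified sketch, not Grafana's actual interval algorithm; `choose_step` and its `min_step` floor (the raw 10s resolution) are assumptions.

```python
import math


def choose_step(range_seconds, panel_width_px, min_step=10):
    """Smallest step in seconds that yields at most one point per pixel."""
    return max(min_step, math.ceil(range_seconds / panel_width_px))


week_step = choose_step(7 * 86_400, 1000)   # 605 seconds, roughly 10 minutes
short_step = choose_step(300, 1000)         # clamps to the raw 10s resolution
points = (7 * 86_400) // week_step          # points returned for the 7-day panel
```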

What retention and storage policy should you define for a 1-year metrics system?
?
Four-tier policy: (1) Raw 10s data — hot SSD, 7-day retention, full resolution. (2) 1-min aggregates — warm storage, 30-day retention, computed by background downsampling job. (3) 1-hour aggregates — warm/cold storage, 6-month retention. (4) 1-day aggregates — cold object storage (S3), 1-year retention. Background compaction job runs nightly to produce next tier aggregates and delete expired raw data. Total storage drops from ~63 TB/year (raw) to ~5-7 TB/year (tiered).

What is the role of a Write-Ahead Log (WAL) in a time-series database?
?
The WAL is an append-only log on disk that records every incoming metric write before it is added to the in-memory index and chunk store. Purpose: crash recovery — if the TSDB process crashes with recent data in memory not yet flushed to block files, the WAL replays those writes on restart. The WAL is sequential disk I/O (fast), while block file writing is batched and happens periodically. Prometheus and InfluxDB both use WALs. WAL is truncated after a successful block flush.
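
A toy sketch of the append, replay, and truncate cycle, assuming one JSON object per line as the log format. `TinyTSDB` is hypothetical; real TSDBs use binary, segmented WALs and batch their fsync calls rather than syncing on every write.

```python
import json
import os
import tempfile


class TinyTSDB:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.head = []                        # in-memory samples not yet flushed
        if os.path.exists(wal_path):          # crash recovery: replay the log
            with open(wal_path) as f:
                self.head = [json.loads(line) for line in f if line.strip()]
        self.wal = open(wal_path, "a")

    def write(self, sample):
        self.wal.write(json.dumps(sample) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())           # durable before updating memory
        self.head.append(sample)

    def flush_block(self):
        block, self.head = self.head, []      # a real TSDB writes block files here
        self.wal.close()
        self.wal = open(self.wal_path, "w")   # truncate after a successful flush
        return block

    def close(self):
        self.wal.close()


path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyTSDB(path)
db.write({"name": "cpu_usage_percent", "ts": 1712995200, "value": 72.4})
db.close()                                    # simulate a crash before any flush

recovered = TinyTSDB(path)                    # replays the WAL into memory
replayed = list(recovered.head)
block = recovered.flush_block()
recovered.close()
```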

What are the top 3 single points of failure in a metrics system and how do you mitigate them?
?
(1) TSDB — mitigation: run 2-3 replicas (InfluxDB clustering or Thanos/Cortex for Prometheus HA). Writes fan out to all replicas; reads use any healthy replica. (2) Kafka — mitigation: Kafka is already distributed with replication factor 3; use multiple brokers across availability zones. (3) Alert Manager — mitigation: run in active-active cluster; use consistent hashing to assign alert rules to nodes; deduplication layer prevents double notifications if multiple nodes fire same alert.

Total Cards: 25
Review Time: 20-25 minutes
Priority: HIGH — Hard interview question, Datadog/Prometheus internals tested at FAANG
Last Updated: 2026-04-13

Study Notes by Niladri & AI
