Chapter 6 Flashcards - Ad Click Event Aggregation System

flashcards volume2 ad-tech click-aggregation streaming batch

What is Lambda architecture and what are its two layers?
?
Lambda architecture runs two parallel data pipelines. Speed layer (streaming): Flink/Spark Streaming processes events as they arrive, produces results within minutes, may be approximate due to watermarks and late events. Batch layer: Spark batch job runs every few hours over the immutable raw event log, produces exact results but with hours of latency. Serving layer: merges both views — serves batch results for historical windows (accurate), streaming results for recent windows (fast). Use when you need both low latency AND accuracy.

What is the Speed layer in Lambda architecture? What are its trade-offs?
?
The speed layer processes raw events in a streaming framework (Flink/Kafka Streams) as they arrive, producing aggregated results within 1-5 minutes. Trade-offs: Fast (near-real-time) but approximate — events arriving after the watermark closes the window may be missed, causing under-counting. Bugs in the streaming job can produce wrong results that must be corrected later. Suitable for dashboards where slight inaccuracy is acceptable but not for final billing.

What is the Batch layer in Lambda architecture? What problem does it solve?
?
The batch layer reads the immutable raw event log (S3/HDFS) periodically (every 3-6 hours) and recomputes exact aggregates from scratch using Spark. It solves two streaming layer problems: (1) Late-arriving events — clicks that arrived after the watermark closed the window are captured in the batch run. (2) Bug corrections — if the Flink job had a bug, reprocessing the raw log with the fixed job corrects historical results. Batch results overwrite streaming results for past time windows in the serving layer.

What is the scale estimate for a system processing 1 billion ad clicks per day?
?
1B clicks/day ÷ 86,400 sec/day ≈ 11,574 avg clicks/sec. Peak = 50,000 clicks/sec (roughly 4× average). Raw event size: ~50 bytes × 1B/day = 50 GB/day. 7-day retention: 350 GB of raw click logs. Aggregated results (1-min windows, 1M ads): ~1M × 1440 × 8 bytes ≈ 11 GB/day. 90-day aggregated storage: ~1 TB. Kafka must handle 50K events/sec at peak — requires ~50 partitions at 1K events/sec/partition.

What is a tumbling window in stream processing? When do you use it?
?
A tumbling window divides time into fixed, non-overlapping intervals. Each event belongs to exactly one window. Example: 1-minute tumbling windows — [12:00, 12:01), [12:01, 12:02), etc. Properties: simple to implement, low memory (no overlap), deterministic. Use case: “How many clicks did ad X get in the minute from 12:00-12:01?” — for billing where each click is counted once per interval. Every Flink window emits its result exactly once when the watermark passes the window end time.

What is a sliding window in stream processing? When do you use it?
?
A sliding window has fixed size but slides by a step smaller than its size, so windows overlap. Example: 1-minute window sliding every 10 seconds — events appear in multiple windows. Properties: smooth/continuous view of recent data, higher memory (events live in multiple windows simultaneously). Use case: “Top 100 most-clicked ads in the last 1 minute” updated every 10 seconds. Memory cost: (window_size / slide_step) × events_per_window. More expensive than tumbling but needed for continuously-updated rankings.

What is event time vs processing time in stream processing? Why does it matter for ad click billing?
?
Event time: when the click physically occurred on the user’s device. Processing time: when Kafka/Flink received and processed the event (may be minutes later due to mobile network delay, battery saving modes, etc.). Matters for billing: a click at 12:00:00 must be billed in the 12:00 window regardless of when it arrived. If you use processing time, a click delayed by 3 minutes is incorrectly billed in the 12:03 window. Always use event time for financial calculations; watermarks tell Flink when it’s safe to close past windows.

What is a watermark in stream processing and how does it handle late events?
?
A watermark is a signal to the stream processor that says “I believe no more events with timestamp < T will arrive.” Formula: watermark = max_event_time_seen - allowed_lateness. Example: max event time = 12:05:00, allowed lateness = 2 minutes → watermark = 12:03:00. When watermark passes a window’s end time, Flink closes and emits that window. Events arriving after the window is closed are “late events” — options: drop them, emit to a side output (for batch correction), or update the window and re-emit. Watermarks balance latency vs completeness.

What is the naive top-K approach (min-heap) and why does it fail at scale?
?
For each time window, maintain a min-heap of size K containing the top K ads by click count. When a new ad count arrives, push it if it beats the heap minimum. Query: return heap contents sorted descending. Fails at scale because: with 1M ads across 100 Flink nodes, each node sees a subset of ads. To compute global top-K, you need to merge all 1M ad counts from all nodes — a full network shuffle of O(total_ads × nodes) data every window. At 100 nodes × 1M ads × 8 bytes = 800 MB shuffled per window interval. Too expensive.

What is the two-level aggregation approach for top-K?
?
Level 1 (local aggregation per node): Each Flink node maintains a local top-K heap for the ads it processes. Every 10 seconds, each node emits its local top-K (e.g., top 100). Level 2 (global merge): A single aggregator node receives local top-K lists from all nodes, merges and re-ranks them, emits the global top-K. Why correct: The global #1 ad must appear in at least one node’s local top-K list (it can only be globally top if it was locally top on some node). Data shuffled: O(K × num_nodes) instead of O(total_ads × num_nodes). Massive reduction at large scale.

How does Count-Min Sketch work? What is it used for in ad click aggregation?
?
Count-Min Sketch is a probabilistic data structure with d rows and w counters per row. Update for item X: increment counter at position hash_i(X) % w in each row i. Query for item X: return min across all rows of counter at hash_i(X) % w. The min guards against hash collisions inflating counts. Used for: approximate click counts per ad when tracking millions of distinct ads — fixed memory (d × w integers = ~40 KB for typical params) vs O(distinct ads) for exact hash map. Always over-estimates due to collisions. Use for top-K rankings; use exact counts from DB for billing.

What are the error guarantees of Count-Min Sketch?
?
With d rows of width w, and total N insertions: the estimated count is at most (true_count + ε × N) with probability at least (1 - δ), where ε = e/w (e ≈ 2.718) and δ = 1/e^d. Common config: d=5, w=2000 gives ε ≈ 0.0014, δ ≈ 0.0067 — i.e., 99.3% of the time, error is at most 0.14% of total insertions. Memory: 5 × 2000 × 4 bytes = 40 KB regardless of how many distinct ads exist. Trade-off: always over-estimates (never under-estimates), making it safe for “at least this many clicks” guarantees.

How do you achieve exactly-once semantics in a Kafka → Flink → Cassandra pipeline?
?
Three-part solution: (1) Flink checkpointing — every minute, Flink snapshots Kafka consumer offsets and all operator state (window accumulators) to S3. On crash, restore from checkpoint and replay Kafka from saved offset. (2) Two-phase commit — Flink’s exactly-once mode uses 2PC: pre-commit writes to Cassandra sink, then commits only when checkpoint succeeds. (3) Idempotent writes — Cassandra UPSERT uses (ad_id, window_start) as primary key. If a batch is replayed after crash, the UPSERT overwrites with identical values — no double counting. All three layers required together.

What is Kafka’s role in a fault-tolerant ad click system?
?
Kafka serves as the durable source of truth for raw click events. Key roles: (1) Buffer — absorbs bursts (50K clicks/sec peak) so downstream processors consume at a steady rate. (2) Replay buffer — 7-day retention means Flink can replay up to 7 days of clicks after a failure. (3) Fan-out — same topic feeds multiple consumers: Flink stream processor, S3 raw log writer, batch job (via S3), and audit consumers. (4) Decoupling — ad servers write to Kafka; TSDB/Cassandra writers are independent. Kafka retention period should match raw click log retention (7 days).

How does data reconciliation work between the streaming and batch layers?
?
The serving layer applies a priority rule: if a batch result exists for a requested time window, return the batch result (accurate). If not (recent data, batch job hasn’t run yet), return the streaming result. When the batch job completes for a past window, it overwrites the Batch View in Cassandra with the corrected count. Example: Streaming said ad_12345 had 14,823 clicks in hour 12:00-13:00. Batch job (reading full raw log including late arrivals) computed 14,910. The 14,910 overwrites the 14,823. Any future queries for that window return 14,910. Batch is the source of truth for billing.

What is Flink checkpointing and what does it checkpoint?
?
Flink takes a consistent distributed snapshot of the entire job state every N seconds (typically 1-5 minutes). Snapshot includes: (1) Kafka consumer offsets — exactly which messages were last processed per partition. (2) Operator state — window accumulators (partial click counts), timers, watermarks. (3) Sink state — any pre-committed but not yet committed transactions. Checkpoint stored in distributed storage (S3/HDFS). On failure: Flink restores all operators to last checkpoint state, sets Kafka consumers back to saved offsets, and replays messages from that point. Recovery time = time since last checkpoint.

Why is Cassandra the right database for storing pre-aggregated ad click counts?
?
Four reasons: (1) Write throughput — Cassandra handles millions of writes/sec; MySQL tops out at ~10K/sec. (2) Time-range queries — Cassandra’s clustering key on window_start supports efficient range scans for “ad X clicks from 12:00 to 13:00”. (3) Horizontal scale — Cassandra’s ring topology scales by adding nodes with no downtime. (4) Built-in TTL — set default_time_to_live = 7776000 (90 days) per table; Cassandra auto-deletes old data. Bonus: Cassandra COUNTER column type supports atomic increment — useful for streaming writes where multiple Flink nodes write to same (ad_id, window) key.

What is the raw click event data model? What fields does it contain?
?
A raw click event contains: ad_id (which ad was clicked), click_timestamp (event time, Unix epoch — critical for billing correctness), user_id (for deduplication and frequency capping), ip (for fraud detection and geo-filtering), country/region (for regional billing and filtering), user_agent (for device-type filtering, e.g., mobile vs desktop). Size: ~50 bytes per event. Stored in: Kafka (7-day retention), S3/HDFS raw log (7-day retention). The raw event log is immutable — never modified, only appended. Batch layer reads from S3, not Kafka (Kafka TTL might expire before batch runs).

How do you handle deduplication of click events (click fraud prevention)?
?
At the Kafka producer level: enable idempotent producer (enable.idempotence=true) to deduplicate retried messages within a session. At the stream processor level: maintain a bloom filter or sliding-window set of seen (user_id, ad_id) pairs within a short window (e.g., 1 minute) — if same user clicks same ad twice in 1 minute, count only once. At the batch layer: Spark job can deduplicate by (user_id, ad_id, click_timestamp) before aggregating — exact deduplication on full raw log. Trade-off: in-stream dedup is fast but approximate; batch dedup is exact but delayed.

What query API should an ad click aggregation system expose?
?
Two main endpoints: (1) Count query: GET /v1/ads/{ad_id}/aggregated_count?from=&to=&filter_region=US — returns total click count for an ad over a time range with optional filters. (2) Top-K query: GET /v1/ads/top_k?window_minutes=1&k=100 — returns the top K ads by click count in the last N minutes. Both should support pagination (top-K) and multiple filter dimensions (region, user_agent type, country). Response should include the time range covered and whether the result comes from the speed layer (approximate) or batch layer (accurate).

How does the system scale to handle a traffic spike from 10K to 50K clicks/sec?
?
Four scaling levers: (1) Kafka partitions — increase partitions on the “ad-clicks” topic; each partition handled by one Flink task manager. Add partitions proportionally to peak load. (2) Flink task managers — add more Flink workers; Kafka partitions are redistributed. (3) Cassandra nodes — add nodes to Cassandra ring; data auto-rebalances via consistent hashing. (4) Ad server load balancing — ensure click events are load-balanced across Kafka brokers. Kafka and Cassandra both scale horizontally with no downtime. Flink scales with rolling restart of the job.

What is the data retention policy for the ad click aggregation system?
?
Three-tier retention: (1) Raw click events — Kafka: 7-day retention (for Flink replay and batch reprocessing). S3/HDFS raw log: 7-day retention (immutable, source for batch layer). (2) Aggregated results (1-min windows) — Cassandra: 90-day retention via TTL. (3) Long-term reporting — optional: further aggregate to 1-hour or 1-day granularity and store in cold storage (S3 Parquet) indefinitely for year-over-year reporting. Note: Batch layer must run before raw data expires (within 7 days), otherwise late-event corrections are impossible.

What is the serving layer and how does it decide which view to return?
?
The serving layer is the query gateway that merges the speed view (streaming results, Cassandra “speed” table) and batch view (batch-corrected results, Cassandra “batch” table). Decision logic: For a given (ad_id, time_range) query — check if batch view has a result for this time range. If yes (batch job has processed it), return batch result (accurate). If no (recent data, batch hasn’t run), return speed view result (streaming, near-real-time but approximate). This ensures recent dashboards load instantly with streaming data while historical billing is based on accurate batch data.

What are the main fault tolerance failure modes in this system and how are they mitigated?
?
Four failure scenarios: (1) Kafka broker failure — Kafka replication factor 3 across AZs. Producer retries (acks=all) ensure no data loss. (2) Flink task manager crash — Flink checkpointing restores state; Kafka replay covers data since last checkpoint. At most 1-5 minutes of re-processing, not data loss. (3) Cassandra node failure — Cassandra replication factor 3 (LOCAL_QUORUM reads/writes). One node failure is invisible to queries. (4) Batch job failure — Spark batch is idempotent (can rerun); raw logs in S3 are immutable so rerun produces identical results. Failed batch job can simply be resubmitted.

What are the key differences between Lambda architecture and Kappa architecture?
?
Lambda architecture: two separate pipelines (streaming + batch), complex to maintain (same logic implemented twice), but clear separation of concerns and proven at scale. Kappa architecture: single streaming pipeline only — the streaming layer is also capable of batch reprocessing by replaying from Kafka with a new job version. Simpler to operate (one codebase), but requires Kafka to retain data long enough for reprocessing (weeks/months vs 7 days). Kappa is preferred when streaming logic can be made exactly-once and deterministic. Lambda preferred when batch and streaming have fundamentally different logic (e.g., complex SQL aggregations easier in Spark).

Total Cards: 25
Review Time: 20-25 minutes
Priority: HIGH — Hard interview question, common at Meta/Google/Amazon data infra rounds
Last Updated: 2026-04-13