Chapter 4 Flashcards - Distributed Message Queue (Vol 2)
flashcards volume2 message-queue kafka distributed-systems streaming
What are the core building blocks of a Kafka-style message queue?
?
Topic: Named channel for a category of messages (e.g., “orders”). Partition: Ordered, append-only log within a topic — unit of parallelism and ordering. Broker: Server that stores and serves partitions. Producer: Writes messages to topics. Consumer: Reads messages from topics. Consumer Group: Set of consumers sharing partition assignments for parallel processing. Offset: Monotonically increasing integer identifying a message’s position in a partition.
How do topics and partitions relate, and why does the partition count matter?
?
A topic is divided into N partitions. Each partition is an independent, ordered, immutable log stored on one broker (with replicas elsewhere). Partition count determines maximum parallelism: a consumer group can have at most one active consumer per partition. More partitions = more parallelism for reads. To increase throughput: add partitions (but note you can only increase, never decrease). Rule of thumb: target_throughput / throughput_per_partition, then add a buffer.
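The rule of thumb above can be turned into a small calculation. This is a minimal sketch: the function name `estimate_partitions` and the 25% default buffer are illustrative assumptions, not values from Kafka or from the card.

```python
import math


def estimate_partitions(target_mb_s: float, per_partition_mb_s: float,
                        buffer: float = 0.25) -> int:
    """Partitions needed = target / per-partition throughput, padded by a buffer.

    `buffer` is a fraction (0.25 = 25% headroom); the default is an assumption.
    """
    if target_mb_s <= 0 or per_partition_mb_s <= 0:
        raise ValueError("throughput values must be positive")
    base = target_mb_s / per_partition_mb_s
    return math.ceil(base * (1 + buffer))


if __name__ == "__main__":
    # Hypothetical numbers: 1000 MB/s target, 10 MB/s measured per partition.
    print(estimate_partitions(1000, 10))
```

Because partitions can be added but not removed, rounding up and keeping headroom is the safer direction to err in.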
How does a consumer group enable parallel processing?
?
Each partition is assigned to exactly one consumer within the group. With 3 partitions and 3 consumers: Consumer 1→P0, Consumer 2→P1, Consumer 3→P2 — true parallelism. Multiple consumer groups can independently read the same topic (each group tracks its own offsets). Adding consumers beyond the partition count leaves extras idle. Different groups see all messages — pub/sub fan-out. Same group shares messages — competing consumers pattern.
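The one-consumer-per-partition rule can be illustrated with a toy assignor. This is a sketch of a simple round-robin strategy written for this card, not Kafka's actual assignor code; `assign_partitions` is a hypothetical name.

```python
def assign_partitions(partitions, consumers):
    """Round-robin: every partition goes to exactly one consumer in the group.

    Consumers beyond the partition count end up with an empty list (idle).
    """
    assignment = {c: [] for c in consumers}
    if not consumers:
        return assignment
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    print(assign_partitions([0, 1, 2], ["c1", "c2", "c3"]))
    print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```

Running the assignor once per group models fan-out: each group gets its own full assignment over the same partitions.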
How does Kafka track consumer offsets?
?
The consumer (not the broker) is responsible for tracking its offset, i.e., which message to read next. Offsets are stored in an internal Kafka topic: __consumer_offsets, keyed by (consumer_group, topic, partition). The consumer commits offsets after processing. On restart or rebalance, the consumer reads its last committed offset and resumes from there. This enables at-least-once delivery: if the consumer crashes after processing a message but before committing its offset, it re-reads and reprocesses that message.
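The commit-after-processing behavior can be simulated in memory. In this sketch, `OffsetStore` is a hypothetical stand-in for the __consumer_offsets topic and the "crash" is simulated by returning early; nothing here calls a real Kafka client.

```python
class OffsetStore:
    """Toy stand-in for __consumer_offsets: (group, topic, partition) -> next offset."""

    def __init__(self):
        self._committed = {}

    def commit(self, group, topic, partition, next_offset):
        self._committed[(group, topic, partition)] = next_offset

    def fetch(self, group, topic, partition):
        # A group with no committed offset starts from the beginning here.
        return self._committed.get((group, topic, partition), 0)


def consume(log, store, group, topic, partition, crash_before_commit_at=None):
    """Process from the last committed offset, committing after each message."""
    processed = []
    offset = store.fetch(group, topic, partition)
    while offset < len(log):
        processed.append(log[offset])
        if offset == crash_before_commit_at:
            return processed  # simulated crash: processed but never committed
        store.commit(group, topic, partition, offset + 1)
        offset += 1
    return processed


if __name__ == "__main__":
    store, log = OffsetStore(), ["a", "b", "c"]
    print(consume(log, store, "g", "orders", 0, crash_before_commit_at=1))
    print(consume(log, store, "g", "orders", 0))  # resumes at the uncommitted message
```

The message at the crash point is processed twice across the two runs, which is why at-least-once consumers need idempotent processing.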
What happens during consumer group rebalancing?
?
Rebalancing is triggered when a consumer joins, leaves, or crashes (detected by heartbeat timeout). The group coordinator (a broker) orchestrates: (1) All consumers stop consuming (eager protocol) or only affected partitions move (cooperative protocol). (2) Partitions are reassigned. (3) Consumers resume from their last committed offset for newly assigned partitions. Cooperative rebalancing (Kafka 2.4+) minimizes disruption by only moving partitions that need to change, rather than stopping everyone.
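The difference between the two protocols comes down to which partitions each consumer must give up. The sketch below is a simplified model (the function name and the dict-based assignment format are assumptions made for illustration): eager revokes everything, cooperative revokes only partitions whose owner changes.

```python
def revoked_partitions(old, new, protocol):
    """Return, per consumer, the partitions it must release during a rebalance.

    old/new map consumer id -> list of partitions before/after the rebalance.
    """
    if protocol == "eager":
        # Everyone stops and gives up everything before reassignment.
        return {c: sorted(parts) for c, parts in old.items()}
    if protocol == "cooperative":
        # Only partitions that move to a different owner are released.
        return {c: sorted(set(parts) - set(new.get(c, [])))
                for c, parts in old.items()}
    raise ValueError(f"unknown protocol: {protocol!r}")


if __name__ == "__main__":
    old = {"c1": [0, 1], "c2": [2, 3]}
    new = {"c1": [0, 1], "c2": [2], "c3": [3]}  # c3 joins the group
    print(revoked_partitions(old, new, "eager"))
    print(revoked_partitions(old, new, "cooperative"))
```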
What is the Leader-Follower replication model in Kafka?
?
Each partition has one leader (handles all reads and writes from producers) and N-1 followers (replicate the leader). Followers pull data from the leader and keep their log in sync. All producer writes go to the leader first. Consumers read from the leader (or any replica in newer Kafka versions). If the leader broker crashes, the controller elects a new leader from the ISR (In-Sync Replicas). Replication factor = number of copies total (leader + followers). Recommended: replication factor of 3.
What is ISR (In-Sync Replicas) and why is it critical?
?
ISR is the set of replicas that are caught up with the leader within replica.lag.time.max.ms (default 30 seconds since Kafka 2.5; 10 seconds in earlier versions). A replica that falls further behind is removed from the ISR. When acks=all, the leader waits for ALL ISR replicas to acknowledge before confirming to the producer. Key guarantee: with unclean leader election disabled (the default), a new leader is ALWAYS elected from the ISR, so no committed message (one acked by all ISR) is lost during failover. Small ISR = faster writes but less durability. Large ISR = slower but stronger guarantee.
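ISR membership can be modeled as a time-window check. This is a simplified sketch: `compute_isr` and its inputs are hypothetical, and the real broker tracks more state than a single "last caught up" timestamp per follower.

```python
def compute_isr(leader, last_caught_up_ms, now_ms, max_lag_ms):
    """Return the set of in-sync replicas.

    last_caught_up_ms maps follower id -> last time it was fully caught up.
    max_lag_ms plays the role of replica.lag.time.max.ms.
    """
    isr = {leader}  # the leader is always in sync with itself
    for replica, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= max_lag_ms:
            isr.add(replica)
    return isr


if __name__ == "__main__":
    followers = {"broker-2": 99_000, "broker-3": 60_000}
    # broker-3 was last caught up 40 s ago, so a 30 s window drops it.
    print(compute_isr("broker-1", followers, now_ms=100_000, max_lag_ms=30_000))
```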
What are the three ack modes in Kafka and when do you use each?
?
acks=0: No acknowledgment. Fire and forget. Highest throughput, data loss possible. Use for metrics/logs where occasional loss is OK. acks=1: Leader acknowledges after appending to its own log, without waiting for followers. Fast. Risk: leader crashes before replication = loss. This was the producer default before Kafka 3.0. acks=all: All ISR replicas must acknowledge. Highest durability: survives a leader crash. Lower throughput. The producer default since Kafka 3.0. Use with min.insync.replicas=2 for financial/critical data. Together these settings form the durability-throughput dial.
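The interaction between acks and min.insync.replicas can be sketched as a decision function. This is a toy model with hypothetical names and return strings; a real broker signals the rejected case with a not-enough-replicas error rather than a string.

```python
def write_outcome(acks: str, isr_size: int, min_insync_replicas: int = 1) -> str:
    """Toy model of what the producer learns about a single write."""
    if acks == "0":
        return "not acknowledged (fire and forget)"
    if acks == "1":
        return "acknowledged by leader only"
    if acks == "all":
        # With acks=all the write is refused if the ISR has shrunk too far.
        if isr_size < min_insync_replicas:
            return "rejected (not enough in-sync replicas)"
        return f"acknowledged by all {isr_size} in-sync replicas"
    raise ValueError(f"unknown acks value: {acks!r}")


if __name__ == "__main__":
    print(write_outcome("all", isr_size=3, min_insync_replicas=2))
    print(write_outcome("all", isr_size=1, min_insync_replicas=2))
```

The second call shows why min.insync.replicas matters: without it, acks=all with an ISR of one behaves like acks=1.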
What is at-least-once vs exactly-once vs at-most-once delivery semantics?
?
At-most-once: Commit offset BEFORE processing. May lose messages (crash after commit, before processing). Highest throughput. At-least-once (Kafka default): Commit offset AFTER processing. Producer retries on failure. Messages may be processed multiple times. Consumer must be idempotent. Exactly-once: Idempotent producer (PID + sequence number deduplication at broker) + transactional API (atomically commit output + input offset). No loss, no duplication. ~20% throughput cost. Use for financial transactions.
Why is sequential disk I/O the key to Kafka’s performance?
?
Random HDD I/O: ~1 MB/s (physical seek time ~10ms per seek). Sequential HDD I/O: ~500 MB/s. Sequential SSD I/O: ~3,000 MB/s. Random versus sequential HDD access is therefore roughly a 500x difference (treat these as order-of-magnitude figures; actual numbers vary by hardware). Kafka’s append-only, write-ahead log (WAL) ensures ALL writes are sequential, with no seeking. Combined with the OS page cache, hot partitions are often served entirely from RAM. This design enables 1 GB/s+ throughput on commodity hardware without expensive in-memory storage. Sequential disk can actually outperform random-access RAM structures for sustained throughput.
What is the Write-Ahead Log (commit log) structure in Kafka?
?
Each partition is stored as a series of segment files: {base_offset}.log (message data) and {base_offset}.index (offset → byte position mapping). New messages always appended to the active segment. When a segment reaches max size (default 1 GB) or age (default 1 week), it’s closed and a new one opens. Old segments deleted when retention period expires. To find offset N: binary search the .index file for the byte position, then seek .log file to that position and read sequentially.
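The two-step lookup described above (pick the segment, then binary search its index) can be sketched with `bisect`. The in-memory lists below are hypothetical stand-ins for the segment file names and the .index contents, and the index is assumed to be sparse, so the result is a byte position to start scanning from rather than the exact record position.

```python
import bisect


def find_segment(base_offsets, target):
    """Return the base offset of the segment that contains `target`.

    base_offsets is the sorted list of segment base offsets for one partition.
    """
    i = bisect.bisect_right(base_offsets, target) - 1
    if i < 0:
        raise KeyError(f"offset {target} is below the earliest retained offset")
    return base_offsets[i]


def find_scan_start(index_entries, target):
    """Return the byte position to start a sequential scan from.

    index_entries is a sorted list of (offset, byte_position) pairs.
    """
    offsets = [offset for offset, _ in index_entries]
    i = bisect.bisect_right(offsets, target) - 1
    return index_entries[i][1] if i >= 0 else 0


if __name__ == "__main__":
    bases = [0, 1000, 2000]
    index_1000 = [(1000, 0), (1200, 4096), (1400, 8192)]
    print(find_segment(bases, 1350), find_scan_start(index_1000, 1350))
```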
What is the message format stored on disk in Kafka?
?
Each message record contains: Offset (8 bytes), Size (4 bytes), CRC32 checksum (4 bytes — integrity check), Magic byte (version), Attributes, Timestamp (8 bytes), Key length + Key (variable — used for partition routing), Value length + Value (variable — actual payload). Messages are grouped into batches with a single CRC for the batch. Key insight: the broker stores raw bytes and does NOT enforce schema — schema validation (Avro, Protobuf) is done by producers/consumers via Schema Registry.
How does Kafka use zero-copy optimization to improve throughput?
?
Without zero-copy: Disk → kernel buffer (copy 1) → application memory (copy 2) → socket buffer (copy 3) → NIC (copy 4). 4 copies, 4 context switches (read() and write() each cross the user/kernel boundary twice), high CPU. With zero-copy (sendfile() syscall): Disk → kernel buffer (copy 1) → NIC via DMA (copy 2). 2 copies, 2 context switches (a single syscall), and the data never enters application memory. Result: 2–4x better read throughput for same hardware. Kafka uses FileChannel.transferTo() in Java, which triggers sendfile(). Applies to sending stored messages to consumers. Does NOT apply when the broker must touch the bytes, e.g., broker-level transformation or TLS encryption.
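The same idea is available from Python through `socket.sendfile()`, which uses the OS sendfile call where available and falls back to ordinary sends elsewhere. The sketch below is an illustration, not Kafka code: `send_file` and `demo` are hypothetical helpers, and a local socket pair stands in for a consumer connection.

```python
import os
import socket
import tempfile


def send_file(path: str, sock: socket.socket) -> int:
    """Send a whole file over a connected stream socket; returns bytes sent."""
    with open(path, "rb") as f:
        # The file contents are not read into a Python buffer on the fast path.
        return sock.sendfile(f)


def demo(payload: bytes = b"x" * 4096) -> bytes:
    """Round-trip `payload` through a local socket pair and return what arrived."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    left, right = socket.socketpair()
    try:
        send_file(tmp.name, left)
        left.close()  # closing the sender lets the receiver see end-of-stream
        chunks = []
        while True:
            chunk = right.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        left.close()
        right.close()
        os.unlink(tmp.name)


if __name__ == "__main__":
    print(len(demo()))
```

The payload is kept small so it fits in the socket buffer and the single-threaded demo cannot block on a full pipe.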
Why does Kafka use a pull-based consumption model instead of push?
?
Push: Broker must track each consumer’s state and rate. Risk of overwhelming slow consumers (different consumers have wildly different processing speeds: 100 msg/sec vs 10K msg/sec). Broker logic becomes complex. Pull: Consumer calls poll() at its own pace. Naturally handles varied consumer speeds. Consumer controls batch size and frequency. Long polling (wait up to N ms for messages) avoids busy-wait. Simpler broker design. The main downside: slight additional latency vs push (consumer must poll to discover new messages), acceptable for most use cases.
What is a Kafka segment file and what is log rotation?
?
A segment file is a fixed-size (default 1 GB) or fixed-time (default 1 week) chunk of a partition’s log. The partition is split into multiple segment files rather than one huge file. Only the last (active) segment is written to. Older segments are read-only. Log rotation = when the active segment reaches the size/time limit, close it and open a new active segment. Old segments are deleted when their data exceeds the retention period (default 7 days). Benefits: efficient deletion (drop whole segment files), bounded memory for index structures.
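Rotation and retention are both simple threshold checks. In this sketch the `Segment` dataclass and both function names are hypothetical, and retention is assumed to be judged by the newest record timestamp in each closed segment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    newest_timestamp_ms: int


def should_roll(size_bytes, age_ms, max_bytes=1 << 30,
                max_age_ms=7 * 24 * 3600 * 1000):
    """True when the active segment hits the size limit or the age limit."""
    return size_bytes >= max_bytes or age_ms >= max_age_ms


def expired_segments(segments, now_ms, retention_ms):
    """Closed segments past retention; the last (active) segment is never dropped."""
    closed = segments[:-1]
    return [s for s in closed if now_ms - s.newest_timestamp_ms > retention_ms]


if __name__ == "__main__":
    segs = [Segment(0, 100, 1_000), Segment(100, 100, 5_000), Segment(200, 10, 9_000)]
    print(expired_segments(segs, now_ms=10_000, retention_ms=6_000))
```

Deleting whole segments means retention never rewrites a file: expiring data is a file unlink.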
How does key-based partitioning ensure message ordering?
?
partition = Hash(key) % num_partitions. As long as the partition count does not change, all messages with the same key map to the same partition. Since each partition is an ordered append-only log, all messages with a given key are delivered to the consumer in the order they were produced. Example: all events for user_id=123 go to the same partition → consumer processes them in chronological order. Without a key, messages are spread across partitions (round-robin in older clients; since Kafka 2.4 the default partitioner fills a batch for one partition before moving to the next), which balances load but gives no ordering guarantee across messages. Only guarantee: ordering WITHIN a partition, never across partitions.
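The hash-modulo rule can be demonstrated in a few lines. Note the assumption: Kafka's default partitioner uses a murmur2 hash, and this sketch substitutes `zlib.crc32` only because it is in the standard library and deterministic, so the partition numbers here will not match a real cluster.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with hash(key) % num_partitions.

    crc32 is a stand-in for Kafka's murmur2 hash (assumption for illustration).
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions


if __name__ == "__main__":
    for event in ("login", "add_to_cart", "checkout"):
        # Every event for the same user lands on the same partition.
        print(event, partition_for(b"user_id=123", 6))
```

Because the result depends on `num_partitions`, adding partitions to a topic can send an existing key to a different partition from then on.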
What is ZooKeeper’s role in legacy Kafka and why was it replaced by KRaft?
?
ZooKeeper stored Kafka metadata: broker list, topic/partition info, ISR lists, controller election. Problems: Required operating a separate ZooKeeper cluster (complexity). ZooKeeper bottlenecked at ~10K partitions (slow metadata propagation). Seconds of lag for metadata updates. Harder to scale and manage two systems. KRaft (Kafka Raft): Metadata stored in internal __cluster_metadata topic using Raft consensus. Controller nodes (3 or 5) form a quorum. Benefits: no external dependency, sub-millisecond metadata propagation, scales to millions of partitions, faster failover, simpler operations.
How does KRaft (Kafka Raft) work for metadata management?
?
KRaft designates a subset of nodes as controllers (typically 3 or 5). These form a Raft quorum: one leader controller handles writes, followers replicate. Metadata (broker list, partition assignments, ISR, configs) is stored in an internal partition, __cluster_metadata. All regular brokers follow the controller quorum: they pull metadata updates. On leader controller failure, Raft elects a new leader in milliseconds (vs seconds for ZooKeeper). No external system required: Kafka manages its own consensus. KRaft has been production-ready since Kafka 3.3, and Kafka 4.0 removed ZooKeeper mode entirely.
What throughput optimizations does Kafka use besides sequential I/O?
?
1. Batching: Producer accumulates messages (up to batch.size=16KB) and waits (linger.ms=5ms) before sending, so there are fewer network round-trips. 2. Compression: LZ4/Snappy/GZIP compress batches (2–4x size reduction), so less bandwidth is used. 3. Zero-copy (sendfile()): Consumer reads bypass application memory, giving 2–4x better read throughput. 4. Page cache: OS caches hot partition data in RAM, so reads often don’t hit disk. 5. Batch consumer reads: consumer.poll(max_records=500) amortizes per-request overhead. Combined, these can give on the order of 100x the throughput of a naive one-message-per-request design.
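The batching item can be modeled as an accumulator that flushes on either a size limit or a time limit. This is a toy sketch, not the Kafka producer's internals: `BatchAccumulator` is a hypothetical class, and time is passed in explicitly so the behavior is deterministic.

```python
class BatchAccumulator:
    """Collects records until batch_size_bytes is reached or linger_ms elapses."""

    def __init__(self, batch_size_bytes=16_384, linger_ms=5):
        self.batch_size_bytes = batch_size_bytes
        self.linger_ms = linger_ms
        self._buffer = []
        self._bytes = 0
        self._first_ms = None

    def append(self, record: bytes, now_ms: int):
        """Add a record; return the full batch if the size limit is hit, else None."""
        if self._first_ms is None:
            self._first_ms = now_ms
        self._buffer.append(record)
        self._bytes += len(record)
        if self._bytes >= self.batch_size_bytes:
            return self._drain()
        return None

    def poll(self, now_ms: int):
        """Return the pending batch once it has lingered long enough, else None."""
        if self._buffer and now_ms - self._first_ms >= self.linger_ms:
            return self._drain()
        return None

    def _drain(self):
        batch, self._buffer = self._buffer, []
        self._bytes = 0
        self._first_ms = None
        return batch


if __name__ == "__main__":
    acc = BatchAccumulator(batch_size_bytes=10, linger_ms=5)
    print(acc.append(b"aaaa", 0), acc.append(b"bbbb", 1), acc.append(b"cc", 2))
```

A larger linger trades a few milliseconds of latency for fewer, fuller requests.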
How does the idempotent producer work for deduplication?
?
Each producer is assigned a Producer ID (PID) by the broker. The producer maintains a sequence number per partition (starts at 0, increments per message). Each message includes (PID, partition, sequence_number). Broker tracks the last seen sequence per (PID, partition). On receiving a message: if sequence == last_seen + 1 → accept and advance. If sequence <= last_seen → reject as duplicate (idempotent retry). If sequence > last_seen + 1 → out-of-order error. Enable with enable.idempotence=true. This handles producer retry duplicates without application-level dedup logic.
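The three-way sequence check can be written directly from the rules above. This is a minimal in-memory sketch; `SequenceTracker` and its string results are hypothetical, and sequences are tracked per message here for simplicity.

```python
class SequenceTracker:
    """Broker-side view: last accepted sequence number per (PID, partition)."""

    def __init__(self):
        self._last_seq = {}

    def receive(self, pid: int, partition: int, seq: int) -> str:
        last = self._last_seq.get((pid, partition), -1)  # -1: nothing seen yet
        if seq == last + 1:
            self._last_seq[(pid, partition)] = seq
            return "accepted"
        if seq <= last:
            return "duplicate"  # a retry of something already written
        return "out_of_order"   # a gap: an earlier message is missing


if __name__ == "__main__":
    tracker = SequenceTracker()
    for seq in (0, 1, 1, 3, 2):
        print(seq, tracker.receive(pid=7, partition=0, seq=seq))
```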
What is Kafka’s transactional API and when do you use it?
?
Transactions atomically commit: (1) output messages to one or more topics + (2) consumed input offsets. Either all are committed or none. Use case: read-process-write pipeline where exactly-once is needed. Config: transactional.id=unique-string. Flow: beginTransaction() → produce output → sendOffsetsToTransaction(offsets, group) → commitTransaction() (or abortTransaction()). Consumers read only committed messages with isolation.level=read_committed. Cost: ~20% throughput reduction, more complex code. Use only for financial transactions or critical pipelines.
What is the scale target for a distributed message queue and key storage estimates?
?
Target: 1M messages/sec, 1 KB average size = 1 GB/sec write throughput. With 3x replication: 3 GB/sec disk writes cluster-wide. Sequential disk can handle 500 MB/s (HDD) to 3 GB/s (SSD). Daily storage: 1 GB/sec × 86,400 sec = ~86 TB/day. With 7-day retention: ~600 TB for a single copy of the data. Compression (50% ratio) reduces that to ~300 TB; with 3x replication the cluster stores roughly 900 TB on disk. This drives the need for a cluster of brokers rather than a single machine. Partition distribution across brokers ensures no single machine is the bottleneck.
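The estimate above is plain arithmetic and easy to re-run with different inputs. In this sketch the function name and result keys are illustrative; 1 KB is treated as 1,000 bytes and 1 TB as 10^12 bytes, and both the single-copy and replicated totals are reported.

```python
def storage_estimate(msgs_per_sec, avg_msg_bytes, retention_days,
                     replication_factor=3, compression_ratio=0.5):
    """Back-of-envelope capacity numbers for a message queue cluster."""
    write_bytes_per_sec = msgs_per_sec * avg_msg_bytes
    daily_tb = write_bytes_per_sec * 86_400 / 1e12
    retained_tb = daily_tb * retention_days           # one uncompressed copy
    compressed_tb = retained_tb * compression_ratio   # one compressed copy
    return {
        "write_gb_per_sec": write_bytes_per_sec / 1e9,
        "replicated_write_gb_per_sec": write_bytes_per_sec * replication_factor / 1e9,
        "daily_tb": daily_tb,
        "retained_tb": retained_tb,
        "compressed_tb": compressed_tb,
        "compressed_replicated_tb": compressed_tb * replication_factor,
    }


if __name__ == "__main__":
    for name, value in storage_estimate(1_000_000, 1_000, 7).items():
        print(f"{name}: {value:,.1f}")
```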
How does Kafka handle broker failure from end to end?
?
1. Broker X (leader for P0) crashes. 2. Other brokers detect failure via ZooKeeper/KRaft session timeout (~10–30 sec). 3. Active controller selects a new leader for P0 from the ISR (every ISR member already has all committed messages; Kafka takes the first live in-sync replica in the partition’s replica list). 4. Controller updates partition metadata cluster-wide. 5. Producers and consumers reconnect: they receive new metadata and redirect to the new leader. 6. Broker X restarts: fetches log from new leader, catches up, re-joins ISR. 7. Partition leadership may or may not move back to Broker X (preferred leader election optional). Total failover: ~5–30 seconds typical.
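The leader-selection step of the failover can be sketched as a small function. It assumes the controller takes the first replica in the partition's assignment order that is both alive and in the ISR; the function name and the use of `None` for "partition offline" are illustrative choices.

```python
def elect_leader(assigned_replicas, isr, live_brokers):
    """Pick a new partition leader after a failure.

    Returns the first assigned replica that is alive and in sync, or None
    when no such replica exists (the partition stays offline).
    """
    for replica in assigned_replicas:
        if replica in isr and replica in live_brokers:
            return replica
    return None


if __name__ == "__main__":
    assigned = ["broker-x", "broker-y", "broker-z"]
    isr = {"broker-x", "broker-z"}        # broker-y had fallen behind
    live = {"broker-y", "broker-z"}       # broker-x just crashed
    print(elect_leader(assigned, isr, live))
```

Restricting candidates to the ISR is what keeps acknowledged writes from being lost: a lagging replica is skipped even when it is alive.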
What is the difference between a traditional message queue (SQS) and a streaming log (Kafka)?
?
Traditional MQ (SQS/RabbitMQ): Message deleted after consumed. Each message consumed by one consumer (point-to-point) or fan-out (topic). No replay. Message ordering often not guaranteed. Good for task queues (email sending, job dispatch). Streaming Log (Kafka): Messages retained after consumed (configurable, default 7 days). Consumers track their own offset — can replay from any point. Multiple consumer groups independently read the same stream. Ordered within partition. Good for event streaming, audit logs, data pipelines, CQRS event stores. Kafka supports both models.
What are the key trade-offs to know for a distributed message queue interview?
?
Throughput vs Durability: acks=0/1 (fast, risk loss) vs acks=all (slow, no loss). Parallelism vs Ordering: More partitions = more parallelism, but ordering only within partition. Fewer partitions = global ordering possible (1 partition), but no parallelism. Delivery semantics: At-least-once (simple, duplicates) vs exactly-once (complex, ~20% overhead). Push vs Pull: Push = lower latency, harder to manage speeds. Pull = natural backpressure, slight latency. Retention: Long retention = replay capability but high storage cost. Short retention = less storage, no replay.
Total Cards: 25
Review Time: 20–25 minutes
Priority: HIGH — Very Hard difficulty, Kafka internals appear in many system design interviews
Last Updated: 2026-04-13