Stream processor’s estimate of “events before time T have likely arrived”
Event time
When event actually occurred (use for business logic)
Processing time
When event arrives at processor (use for operational SLAs)
Stream-table join
Enrich stream events with lookup table data; no expiry; for enrichment
Stream-stream join
Correlate two event streams within a time window; both sides expire
Exactly-once
Each event processed exactly once; requires idempotent sinks + checkpoints
Log compaction
Retain only latest value per key in Kafka; enables CDC state bootstrapping
Log-Based vs Message Queue
MESSAGE QUEUE (RabbitMQ, SQS):
┌──────────┐ push ┌──────────────────────────────┐
│ Queue │ ─────────► │ Consumer A (gets message, │
│ │ │ deletes it) │
└──────────┘ └──────────────────────────────┘
After Consumer A processes: message GONE
Consumer B: cannot read it (already deleted)
On crash: message lost unless acknowledged before crash
Good for: task queues, job distribution
LOG-BASED BROKER (Kafka, Kinesis):
┌─────────────────────────────────────────────────────┐
│ LOG: [offset 0][offset 1][offset 2]...[offset N] │
└──┬────────────────────────────────────────────────┘
│ │ │
│ │ │ (each group reads independently)
Consumer Consumer Consumer
Group A Group B Group C
(at off.7) (at off.3) (at off.N-1)
Message retained until log expiry policy (not deleted on consumption)
Any consumer group can replay from any offset
Good for: event streaming, CDC, audit, replay
Producer Broker Processor
────────────────────────────────────────────────────────
Event occurs at: Received at: Processed at:
event_time = 14:00:05 14:00:07 14:00:35
↑ USE THIS for business logic ↑ this is processing time
Why they differ:
Mobile app buffered offline 4 hours → event_time=10:00, processing_time=14:00
Network delay
Consumer paused/slow
Clock skew
Rule:
event_time → use for: revenue per hour, session duration, user activity
processing_time → use for: job throughput metrics, "is job keeping up?"
Watermark Strategies
FIXED-LAG WATERMARK:
watermark(t) = max_event_time_seen - lag
Example: lag=2min → if latest event=14:10, watermark=14:08
Windows before 14:08 are closed and emitted
✅ Simple, predictable
❌ Events > 2min late are silently dropped (or go to side output)
HEURISTIC WATERMARK:
Observe actual P99 late arrival delay → set lag accordingly
Adapts to data patterns; more accurate
✅ Better completeness for variable delay patterns
❌ Complex; still a heuristic
GRACE PERIOD:
Window stays open for extra N seconds after watermark passes
Late events within grace period included; after grace → dropped/side-output
✅ Catches most late arrivals without unbounded waiting
Flink: .withIdleness() + .allowedLateness(Duration.ofSeconds(30))
TRADE-OFF:
Larger lag → more completeness, more latency
Smaller lag → less latency, more data dropped
Stream Join Types
1. STREAM-TABLE JOIN (enrichment):
─────────────────────────────────────────────────────────
Stream: [order_id=1, user_id=42, amount=99] ──────────►
Table: {user_id=42 → tier="gold", country="US"} (loaded in state store)
Output: [order_id=1, user_id=42, amount=99, tier="gold", country="US"]
State: only lookup table (no stream state)
Expiry: table never expires (persistent ref data)
Keep table fresh: CDC feed updates state store
2. STREAM-STREAM JOIN (event correlation):
─────────────────────────────────────────────────────────
Stream A (impressions): [user=42, ad=9, t=14:00] ──────►
Stream B (clicks): [user=42, ad=9, t=14:07] ──────►
Join: same user+ad, click within 30min of impression
State: buffer both streams; expire after 30min
Output: [user=42, ad=9, impression=14:00, click=14:07, delay=7min]
3. TABLE-TABLE JOIN (incremental materialized view):
─────────────────────────────────────────────────────────
Table A (CDC): customer updates
Table B (CDC): order updates
Materialized view = JOIN(customers, orders), updated on each CDC event
Used in: ksqlDB, Kafka Streams, Flink Table API
Exactly-Once Delivery: Three Approaches
APPROACH 1: Idempotent writes
Process event → write with unique event_id → sink deduplicates
If event re-processed: same event_id → overwrite (no duplicate)
Requirement: sink must support deduplication by ID
Overhead: low
APPROACH 2: Kafka Transactions
producer.beginTransaction()
producer.send(output_topic, result)
consumer.commitSync(offsets_to_commit) // commit input offset
producer.commitTransaction() // atomic: output + offset
Consumer must use: isolation.level=read_committed
Overhead: moderate (~5-10% throughput reduction)
APPROACH 3: Flink Chandy-Lamport Checkpoints
Step 1: Inject BARRIER into all input streams
Step 2: Each operator snapshots state to S3/HDFS on BARRIER
Step 3: Record input offset at checkpoint
On failure: rollback all operators to last checkpoint; replay from offset
Overhead: checkpoint interval (e.g., every 5 minutes of extra state I/O)
Requirement: deterministic processing; idempotent or transactional sinks
Kafka vs Kinesis vs Pulsar vs RabbitMQ
Kafka
Kinesis
Pulsar
RabbitMQ
Type
Log (pull)
Log (pull)
Log + queue
Queue (push)
Replay
Yes
Yes
Yes
No
Retention
Configurable
1–365 days
Tiered (S3)
Until acked
Throughput
Very high
High
High
Medium
Ops burden
Medium
Low (managed)
Medium
Low
Ecosystem
Richest
AWS-native
Growing
Enterprise
Choose when
Open source, rich ecosystem
AWS-native stack
Cost-sensitive retention
Simple task queue
Delivery Guarantees
AT-MOST-ONCE:
Fire and forget; don't retry
❌ Events lost on failure
✅ Zero latency overhead
Use: non-critical logging, sampling
AT-LEAST-ONCE:
Retry on failure; may duplicate
✅ No data loss
❌ Downstream must be idempotent
Use: most production use cases with idempotent sinks
EXACTLY-ONCE:
Each event processed exactly once
✅ Correct for financial/billing use cases
❌ Overhead: transactions, checkpoints
Use: financial transactions, inventory, billing
Log Compaction for CDC State
Normal Kafka topic (time-based retention):
[user=1: A][user=2: B][user=1: B][user=1: C] → [EXPIRY]→
↑ all retained until TTL
Compacted Kafka topic:
[user=1: A][user=2: B][user=1: B][user=1: C]
↓ compaction ↓
[user=2: B][user=1: C] ← only LATEST value per key retained
Reading compacted topic from offset 0 = current state of all keys
Tombstone: user=1 with null value = DELETE user=1 from state
CDC bootstrapping:
New Elasticsearch index consumer:
1. Read compacted Kafka topic from offset 0 → get current state
2. Continue reading new events → stay up to date
No need to replay all history!
Red Flags
❌ Dual writes to database + Kafka (race conditions → divergence)
❌ Windowing by processing time for business metrics (use event time!)
❌ No late data handling (silently losing mobile/offline events)
❌ Using RabbitMQ when you need replay (messages deleted on consumption)
❌ Exactly-once without idempotent sinks (Flink checkpoints don't help if Elasticsearch double-writes)
❌ Stream-stream join with unbounded window (state grows forever → OOM)
❌ Kafka partition count < consumer count (consumers idle, no work to do)
Green Flags
✅ Kafka with retention policy for replay capability
✅ Debezium CDC: single source of truth stays in DB
✅ Event time windowing + watermark with grace period
✅ Broadcast hash join for large stream × small table (no shuffle)
✅ Log compaction for CDC topics (bootstrapping new consumers)
✅ Flink checkpointing + idempotent sink for exactly-once
✅ Side outputs for late events (don't discard; handle separately)
✅ Consumer group per independent downstream system (fan-out)