Chapter 13: Design a Stock Exchange
volume2 stock-exchange trading matching-engine low-latency
Status: π© Interview ready - The most complex chapter!
Difficulty: Extreme
Time to complete: 60 min read + practice
Overview
A stock exchange is an electronic marketplace where buyers and sellers trade financial securities (stocks, options, futures). Major exchanges include:
- NYSE (New York Stock Exchange) β largest by market cap
- NASDAQ β tech-heavy, fully electronic
- CBOE (Chicago Board Options Exchange) β options-focused
Core job: Match buy and sell orders, execute trades, and report prices in real-time.
Why this matters:
- Most technically demanding system design question β tests everything
- Extreme non-functional requirements (sub-millisecond latency, strict ordering)
- Covers event sourcing, low-latency architectures, and financial systems
- Demonstrates mastery of distributed systems under ultra-strict guarantees
Problem Statement
Design a stock exchange system that:
- Matches buy and sell orders for stocks with microsecond-to-millisecond latency
- Guarantees strict price-time priority for fair trade execution
- Provides real-time market data (price feeds) to all participants
- Maintains a complete, immutable audit trail for regulatory compliance
- Handles circuit breakers and risk management to ensure market stability
Step 1: Requirements & Scope (5 min)
Functional Requirements
Clarifying questions:
- What order types should we support? β Market, Limit, Stop orders
- What is the trading window? β 9:30 AM β 4:00 PM EST (market hours)
- Do we need pre-market / after-hours? β Out of scope for now
- What is the matching priority? β Price-time priority (FIFO at same price level)
- Do we need audit trail? β Yes, regulatory requirement (SEC, FINRA)
- What instruments? β Equities (stocks) only for now
Scope:
- Accept and validate incoming orders (market, limit, stop)
- Match orders using price-time priority
- Publish trade executions and market data in real-time
- Maintain complete audit trail for all orders and trades
- Risk management: circuit breakers, position limits, pre-trade risk checks
Non-Functional Requirements
- Ultra-low latency: Order matching P99 < 1 millisecond (microseconds in practice)
- High throughput: 1 billion orders/day (~100,000 orders/second peak)
- Strict ordering: Every order has a unique, monotonically increasing sequence number
- Fault tolerance: No lost orders; recovery must replay exactly same sequence
- Fairness: Same price β older order executes first (no favoritism)
- Availability: 99.99% during market hours
- Regulatory compliance: Full audit trail, immutable event log
Step 2: High-Level Design (10 min)
Order Lifecycle
[Trading Client]
|
| (FIX Protocol / WebSocket)
v
[Client Gateway] β Validates format, authenticates client
|
v
[Order Manager] β Validates risk (pre-trade), routes to engine
|
v
[Sequencer] β Assigns monotonically increasing sequence number
|
v
[Matching Engine] β Core: Matches buy/sell orders in order book
| |
v v
[Trade Reporter] [Order Book Updates]
| |
v v
[Market Data [Client Gateway] β Send fills/rejects back to client
Publisher]
|
v
[Market Data Feed] β Broadcasts prices/trades to all subscribers
Core Components
| Component | Role | Key Concern |
|---|---|---|
| Client Gateway | Order ingestion, FIX protocol parsing, client auth | Protocol handling |
| Order Manager | Order validation, pre-trade risk checks, routing | Risk management |
| Sequencer | Assign strict monotonic sequence numbers to all messages | Strict ordering |
| Matching Engine | Maintains order book, matches buy/sell orders | Correctness + latency |
| Market Data Publisher | Broadcasts price/volume data to market participants | Fan-out throughput |
| Trade Reporter | Sends execution reports back to clients | Reliable delivery |
| Risk Engine | Position limits, circuit breakers, margin calculations | Market safety |
Protocol: FIX (Financial Information eXchange)
FIX Message Example (New Order):
8=FIX.4.4 (Protocol version)
35=D (Message type: New Order Single)
49=CLIENT1 (Sender ID)
56=EXCHANGE (Target ID)
11=ORDER-001 (Client Order ID)
55=AAPL (Symbol)
54=1 (Side: 1=Buy, 2=Sell)
38=100 (Quantity)
40=2 (Order type: 2=Limit)
44=150.00 (Limit price)
FIX is the universal protocol for institutional trading β standardized, battle-tested, used by every major exchange.
Step 3: Deep Dive (30 min)
Order Book
The order book is the central data structure of the matching engine. It holds all outstanding (unmatched) orders for a given symbol.
Structure:
Symbol: AAPL
BIDS (Buy side) ASKS (Sell side)
Price Qty Orders | Price Qty Orders
------ ---- ------ | ------ ---- ------
$150.05 300 [O1, O2] | $150.10 200 [O5]
$150.00 500 [O3, O4] | $150.15 400 [O6, O7]
$149.95 200 [O8] | $150.20 100 [O9]
Best Bid: $150.05 Best Ask: $150.10
Spread = $150.10 - $150.05 = $0.05
Sorting rules:
- Bid side (buy orders): Sorted by price descending (highest price first β most eager buyer)
- Ask side (sell orders): Sorted by price ascending (lowest price first β most eager seller)
- At same price level: FIFO β earlier order (lower sequence number) executes first
Data structure choice:
Order Book Implementation:
- bids: TreeMap<Long, LinkedList<Order>> (price β queue of orders, descending)
- asks: TreeMap<Long, LinkedList<Order>> (price β queue of orders, ascending)
- orderIndex: HashMap<String, Order> (orderId β Order, for O(1) cancel/modify)
Key operations:
- Add order: O(log N) to find price level, O(1) to append to queue
- Cancel order: O(1) using orderIndex, O(1) to remove from doubly-linked list
- Match: O(1) to get best bid/ask from TreeMap head, O(1) to dequeue
Why doubly-linked list at each price level?
- O(1) insertion at tail (new order)
- O(1) removal from any position (cancel order β using pointer from orderIndex)
- O(1) removal from head (trade execution β FIFO)
Matching Algorithm: Price-Time Priority
Algorithm (runs on every new order):
function matchOrder(incomingOrder):
if incomingOrder.side == BUY:
while incomingOrder.qty > 0 and asks is not empty:
bestAsk = asks.firstEntry() // Lowest ask price
if incomingOrder.type == MARKET or
incomingOrder.limitPrice >= bestAsk.price:
fill = min(incomingOrder.qty, bestAsk.orders.peek().qty)
executeTrade(incomingOrder, bestAsk.orders.peek(), fill)
incomingOrder.qty -= fill
bestAsk.orders.peek().qty -= fill
if bestAsk.orders.peek().qty == 0:
bestAsk.orders.poll() // Remove fully filled order
if bestAsk.orders.isEmpty():
asks.remove(bestAsk.price) // Remove empty price level
else:
break // No match possible
if incomingOrder.qty > 0 and incomingOrder.type == LIMIT:
addToOrderBook(incomingOrder) // Rest on book
// Mirror logic for SELL side
function executeTrade(buyer, seller, qty):
price = seller.price // Resting order's price wins
trade = Trade(buyer.id, seller.id, price, qty, timestamp)
publishTrade(trade)
sendExecutionReport(buyer, FILL)
sendExecutionReport(seller, FILL)
Order Types
| Type | Description | Execution | Risk |
|---|---|---|---|
| Market | Execute immediately at best available price | Immediate, guaranteed fill | Slippage β may get terrible price in thin market |
| Limit | Execute only at specified price or better | May rest on book if no match | No slippage, but may not fill |
| Stop | Trigger market/limit order when price hits trigger level | Converts to market/limit on trigger | Gap risk on open |
Limit order resting on book (most common):
Limit Buy 100 AAPL @ $150.00
β Check if any asks <= $150.00
β No match found
β Order rests on bid side at $150.00
β Will execute when seller hits $150.00 or below
Sequencer: The Ordering Heart of the Exchange
The sequencer is arguably the most critical architectural component. It solves the problem: βHow do you guarantee strict ordering when multiple gateways receive orders concurrently?β
Problem without sequencer:
Gateway 1 receives Order A at t=10:00:00.000001
Gateway 2 receives Order B at t=10:00:00.000002
Both sent to matching engine simultaneously
β Race condition: which order is "first"?
β Non-deterministic matching = unfair market
Sequencer solution:
All incoming messages β Sequencer (single point)
|
Assigns sequence number
(monotonically increasing: 1, 2, 3, ...)
Writes to persistent event log
|
Broadcasts to matching engine
(matching engine processes in strict seq order)
Properties:
- Single sequencer = single source of truth for ordering
- Sequence number persisted to durable log before processing
- Matching engine consumes messages in strict sequence order
- On recovery: replay from last checkpoint by re-reading event log
Implementation: Ring buffer (like LMAX Disruptor) in front of matching engine β lock-free, cache-friendly, microsecond-latency message passing.
Sequencer Event Log Entry:
{
seqNum: 1000042,
timestamp: 1713000000000001, // Nanoseconds
type: "NEW_ORDER",
orderId: "CLIENT1-00001",
symbol: "AAPL",
side: "BUY",
type: "LIMIT",
price: 15000, // Integer cents to avoid floating point
qty: 100
}
Low-Latency Architecture Techniques
The matching engine must operate at microsecond latency. Standard software approaches are too slow.
1. LMAX Disruptor (Ring Buffer)
Traditional queue (with locks):
Producer β [Lock] β Queue β [Lock] β Consumer
Latency: ~1ms (lock contention, GC pauses)
LMAX Disruptor (lock-free ring buffer):
Producer β [Ring Buffer] β Consumer
Latency: ~100 nanoseconds
Ring Buffer:
- Fixed-size array (power of 2, e.g., 1 million slots)
- Producer writes to next slot using atomic increment of sequence number
- Consumer reads from current sequence position
- No locks, no GC pressure (pre-allocated), cache-friendly (sequential access)
Why it works: CPU caches love sequential memory access. Ring buffer = no pointer chasing, no heap allocation, no GC.
2. Memory-Mapped Files (mmap)
Traditional file I/O:
Write to file β Kernel buffer β Disk
Read from file β Kernel buffer β User space copy
Overhead: 2 memory copies, syscall
Memory-mapped file:
File mapped directly into process address space
Write to memory = write to file (OS handles persistence)
No syscall overhead, no data copy
Perfect for event log (sequencer persistence)
3. Kernel Bypass Networking (DPDK / RDMA)
Normal networking:
NIC β Kernel (interrupt) β Socket buffer β User space copy β Application
Latency: ~50 microseconds
Kernel bypass (DPDK):
NIC β User space directly (polling mode driver)
Application polls NIC directly, no kernel involvement
Latency: ~1-5 microseconds
Used by HFT firms and some exchanges for the βlast mileβ of latency.
4. Co-location
Trading firms pay exchange data centers to place their servers physically inside the exchange data center β sometimes in the same rack as matching engines. Speed-of-light matters at sub-microsecond scale.
5. CPU and Memory Optimizations
- CPU pinning: Bind matching engine thread to dedicated CPU core
(no context switches, no OS scheduler interference)
- NUMA awareness: Matching engine memory on same NUMA node as CPU
(cross-NUMA access adds ~100ns)
- Huge pages: Reduce TLB misses (2MB pages vs 4KB pages)
- Disable hyper-threading: Reduces cache contention on matching engine core
- Avoid GC: Use C++/Java with GC paused, or off-heap memory allocation
Latency Budget Breakdown
Component Target Latency
βββββββββββββββββββββββββββββββββββββββββ
NIC β Application 1-5 ΞΌs (with DPDK)
Order validation ~1 ΞΌs
Sequencer write ~5 ΞΌs (mmap to SSD)
Ring buffer enqueue ~100 ns
Matching engine ~1-10 ΞΌs
Execution report ~5 ΞΌs
Total P99 latency: < 100 ΞΌs (top exchanges achieve 10-50 ΞΌs)
Event Sourcing for Audit Trail
Why event sourcing?
Regulatory requirement: every order, modification, cancellation, and trade must be preserved permanently and exactly.
Event Sourcing Model:
- Every state change = an immutable event appended to the log
- Current state = replay of all past events
- Events are NEVER deleted or modified (append-only)
Events:
1. ORDER_RECEIVED { orderId, symbol, side, type, price, qty, timestamp }
2. ORDER_VALIDATED { orderId, timestamp }
3. ORDER_SEQUENCED { orderId, seqNum, timestamp }
4. ORDER_MATCHED { orderId, tradeId, fillPrice, fillQty, timestamp }
5. ORDER_CANCELED { orderId, timestamp, reason }
6. TRADE_EXECUTED { tradeId, buyOrderId, sellOrderId, price, qty, timestamp }
Benefits:
- Audit trail: Complete history of every order β required by SEC/FINRA
- Replay: Can reconstruct exact order book state at any past point in time
- Debugging: Reproduce any bug by replaying events to that moment
- Recovery: Restart matching engine by replaying events from last checkpoint
Implementation:
Event store options:
- Custom append-only log (exchange-grade, used in practice)
- Apache Kafka (distributed, durable, replayable)
- PostgreSQL with append-only inserts (for smaller scale)
Checkpoint strategy:
- Take order book snapshot every N events (e.g., every 100,000 events)
- On recovery: load latest snapshot + replay events since checkpoint
- Recovery time = time to load snapshot + time to replay delta
Risk Management
Pre-Trade Risk Checks (Before Matching)
Order received β Risk Engine β Approve/Reject β Sequencer β Matching Engine
Checks:
1. Position limit: Does this trade exceed max position for this client?
e.g., "Client cannot hold more than 10,000 shares of AAPL"
2. Credit limit: Does client have enough buying power?
e.g., "Client has $500K credit; this order costs $600K" β REJECT
3. Order size limit: Is this order suspiciously large?
e.g., "Single order > 10% of ADV (average daily volume)" β flag
4. Symbol validation: Is AAPL a valid, tradable symbol right now?
5. Price validation: Is $1,500,000 for AAPL a typo or fat finger?
Fat finger check: price within X% of current market price
Circuit Breaker (Market-Wide Stability)
Level 1: S&P 500 drops 7% β Halt trading for 15 minutes
Level 2: S&P 500 drops 13% β Halt trading for 15 minutes
Level 3: S&P 500 drops 20% β Halt trading for remainder of day
Single stock circuit breaker:
- Price moves > 5% in 5 minutes β Limit Up-Limit Down (LULD) pause
- LULD: No trades outside price band (ref price Β± 5%)
Implementation:
Circuit breaker monitors:
- Price movement: Last trade price vs reference price (5-minute average)
- Volume: Unusual volume spike (> 10x normal)
- Order book imbalance: All orders one-sided
On trigger:
- Stop accepting new orders (but allow cancels)
- Broadcast halt message to all participants
- After cool-down period, reopen with call auction
Post-Trade Risk
After each trade execution:
- Update client's real-time position and P&L
- Update margin requirements
- Alert risk team if position nears limits
- Report large trades (> threshold) to regulators in real-time
Market Data Feed
Two types of data published:
1. Full Order Book (L2 data):
- All price levels with total quantity at each level
- Incremental updates (add order, remove order, execute trade)
- Subscribers: Professional traders, algorithmic trading systems
2. Top of Book (L1 data):
- Best bid price, best bid qty
- Best ask price, best ask qty
- Last trade price and qty
- Subscribers: Retail trading apps, price display boards
Publication mechanism:
- Multicast UDP: Single packet β all subscribers simultaneously
(Lower latency than TCP unicast to each subscriber)
- Sequence numbers on market data messages (detect gaps)
- Gap recovery: Request retransmission or use snapshot feed
Market data publishing pipeline:
Matching Engine
|
| (trade executed / order book changed)
v
Market Data Handler
|
+β Snapshot feed (periodic full book snapshot)
+β Incremental feed (real-time updates via multicast UDP)
|
v
[Broker A] [Broker B] [HFT Firm] [Data Vendor (Bloomberg)]
Recovery Strategy
Failure scenario: Matching engine crashes mid-session
Recovery steps:
1. Load latest order book snapshot (taken every N events)
2. Read event log from snapshot sequence number forward
3. Replay all events in strict sequence order
4. Order book is reconstructed to exact pre-crash state
5. Resume processing new orders from where we left off
Critical guarantee:
- Same events replayed in same order β identical order book state
- This is why strict sequencing is so important
- Non-determinism would make recovery impossible
Recovery time target: < 30 seconds (exchange rule: halt < 30 sec)
Handling Market Open (Call Auction)
At 9:30 AM, there are accumulated overnight orders. The exchange doesnβt just start matching β it runs a call auction:
Pre-open (9:00 - 9:30 AM):
- Accept limit orders but don't match yet
- Show indicative opening price (what price maximizes executed shares)
Opening auction (9:30 AM):
- Calculate equilibrium price: price that maximizes matched quantity
- Execute all matching orders simultaneously at that single price
- Publish opening trade
Example:
Buy orders: 100 @ $150, 200 @ $149, 300 @ $148
Sell orders: 150 @ $148, 200 @ $149, 100 @ $150
Equilibrium: $149 (matches 400 shares on each side)
β All orders at $149 or better matched at $149
Design Summary
Complete Exchange Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β EXCHANGE PERIMETER β
β β
β [FIX Client]βββ[Client Gateway]βββ[Order Manager] β
β [Web Client]βββ[Client Gateway] (risk checks) β
β β β β
β (execution reports) β β
β βΌ β
β βββββ[Sequencer]βββββ β
β β (event log/mmap) β β
β βββββββββββ¬ββββββββββ β
β β β
β ββββββββββββΌβββββββββββ β
β β Matching Engine β β
β β (LMAX Disruptor β β
β β ring buffer, β β
β β order book β β
β β TreeMap+LinkedList)β β
β ββββββββββ¬βββββββββββββ β
β β β
β βββββββββββββββββββββββββΌββββββββββββββββββββ β
β β β β β
β βΌ βΌ βΌ β
β [Trade Reporter] [Market Data Publisher] [Risk Engine] β
β (execution fills) (multicast UDP feed) (circuit breaker) β
β β β
β [Bloomberg/Reuters] β
β [Broker Data Feeds] β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
[Audit / Compliance]
(immutable event log,
trade surveillance)
Key Decisions Summary
| Decision | Choice | Reasoning |
|---|---|---|
| Ordering guarantee | Single sequencer + monotonic seq numbers | Eliminates race conditions in matching |
| Order book data structure | TreeMap + doubly-linked list per price level | O(1) match/cancel, O(log N) add |
| Matching algorithm | Price-time priority (FIFO) | Industry standard, fair |
| Inter-component messaging | LMAX Disruptor ring buffer | Lock-free, < 1ΞΌs latency |
| Persistence | Memory-mapped files (mmap) | Near-zero overhead persistence |
| Networking | DPDK kernel bypass | 1-5ΞΌs NIC-to-app latency |
| Audit trail | Event sourcing (append-only log) | Regulatory compliance + replay |
| Recovery | Snapshot + event replay | Deterministic state reconstruction |
| Risk | Pre-trade + circuit breakers | Market stability |
| Market data | Multicast UDP + sequence numbers | Low-latency fan-out |
Interview Questions & Answers
Q: Why do you need a sequencer? Why not just let the matching engine handle ordering?
A: Without a sequencer, orders arriving at multiple gateways simultaneously create a race condition β no single βtruthβ about which order came first. The sequencer is a single point that assigns monotonically increasing sequence numbers before any order reaches the matching engine. This means all matching engine processing is deterministic and strictly ordered. It also enables recovery: replay the event log in sequence order and you get exactly the same order book state.
Q: Walk me through the order book data structure and why you chose it.
A: The order book uses a TreeMap<price, LinkedList<Order>> separately for bids and asks. TreeMap gives O(log N) insertion but O(1) access to best bid/ask (head of the map). Each price level is a doubly-linked list to support O(1) FIFO: new orders append to tail, executions remove from head, cancels use a pointer from a HashMap to remove from any position in O(1). This combination gives us O(1) matching, O(1) cancel, and O(log N) add β which is the industry-standard approach.
Q: What is the LMAX Disruptor and why is it used in trading systems?
A: LMAX Disruptor is a lock-free inter-thread communication library using a ring buffer. Traditional queues use locks, which cause thread contention and unpredictable latency spikes. The Disruptor uses a pre-allocated fixed-size ring buffer where producers and consumers communicate via atomic sequence number updates β no locks, no heap allocations, no GC. Sequential memory access maximizes CPU cache utilization. This gives < 1 microsecond latency compared to > 1 millisecond for locked queues. LMAX Exchange processes 6 million transactions/second with 1-microsecond latency using this approach.
Q: How do you handle a matching engine crash? How do you guarantee no orders are lost?
A: Event sourcing plus checkpointing. Every message is written to the sequencerβs durable event log (memory-mapped file) with a sequence number before processing. Periodically, a snapshot of the full order book state is taken. On crash: load the latest snapshot, then replay all events from the snapshotβs sequence number forward in strict order. Since matching is deterministic (same inputs β same outputs), the order book is reconstructed identically. Recovery target is under 30 seconds.
Q: What is a circuit breaker in a stock exchange context? How does it work?
A: A circuit breaker halts trading when price movement exceeds a threshold, preventing panic-driven crashes. Market-wide circuit breakers (NYSE rules): S&P 500 drops 7% β 15-minute halt; 13% β another 15-minute halt; 20% β halt for the day. Single-stock circuit breakers (LULD): if a stock price moves more than 5% in 5 minutes, a short pause allows the market to stabilize. Implementation: the risk engine monitors the rolling price change; on trigger, it sends a halt message to the sequencer, which stops accepting new orders (but allows cancels). After the cooling-off period, trading resumes with a brief auction.
Q: Why do you use integers for prices instead of floating point?
A: Floating-point arithmetic is non-deterministic across hardware and introduces rounding errors β catastrophic in financial systems. A price of 150.04999999β¦ or 150.05 = 15005 cents. Integer comparison is exact, consistent, and faster. All exchanges use integer price representation internally.
Key Takeaways
- Sequencer is the architectural centerpiece β single source of ordering truth, enables deterministic matching and crash recovery
- Order book = TreeMap + doubly-linked list β gives O(1) match/cancel, O(log N) add; price levels are linked lists for FIFO
- Price-time priority β best price wins; ties broken by arrival time (sequence number); this is the law of fair markets
- LMAX Disruptor β lock-free ring buffer; removes all locks and GC pressure from the hot path; enables sub-microsecond inter-thread messaging
- Event sourcing is not optional β regulatory requirement AND the mechanism for crash recovery; immutable append-only event log
- Two types of risk checks β pre-trade (stop bad orders before they match) and circuit breakers (halt trading to prevent market crashes)
- Use integers for prices β floating-point is non-deterministic; store price as integer cents/basis points
- Low-latency is a holistic effort β kernel bypass (DPDK), memory-mapped files, CPU pinning, NUMA awareness, co-location; no single trick is enough
Related Resources
- ch11-payment-system - Another transactional system with strict consistency
- ch12-digital-wallet - Similar event-sourcing and idempotency patterns
- key-patterns > Event Sourcing - Pattern used for audit trail and recovery
- distributed-system-components - Ring buffers, event logs, sequencers
Practice this design! The stock exchange is the hardest system design question. Be ready to:
- Draw the full architecture with all components and data flows
- Explain the order book data structure from scratch (TreeMap + linked list)
- Justify why the sequencer is needed (race condition elimination)
- Walk through price-time priority matching with a worked example
- Explain LMAX Disruptor vs traditional queues
Last Updated: 2026-04-13
Status: The most complex chapter β know it cold for FAANG and fintech interviews!