Chapter 7 Flashcards - Unique ID Generator

flashcards volume1 unique-id distributed-systems snowflake

Why can’t we use auto-increment IDs in a distributed database?
?
Auto-increment works per-database-instance. With multiple shards, each shard independently generates ID=1, ID=2, etc. causing collisions when data is queried across shards or merged. You’d need a central sequencer (single point of failure) or a step-based scheme (hard to scale). Example: Shard 1 and Shard 2 both generate user_id=42 for different users — chaos in joins or cross-shard queries.

What are the 4 main approaches to generating unique IDs in distributed systems?
?

Multi-Master Replication: each DB uses auto-increment with step=N (not sortable, hard to scale). 2. UUID: 128-bit random, no coordination (not sortable, not 64-bit, bad for indexes). 3. Ticket Server: centralized ID service with DB auto-increment (single point of failure, bottleneck). 4. Twitter Snowflake: 64-bit, timestamp-prefixed, no coordination (BEST — meets all requirements). Snowflake wins on uniqueness + sortability + 64-bit + high throughput.

What is the Twitter Snowflake ID structure? (Memorize this!)
?
64-bit integer, structured as: [1 sign bit][41-bit timestamp (ms since epoch)][5-bit datacenter ID][5-bit machine ID][12-bit sequence number]. Sign bit = always 0 (positive). Timestamp = ms since custom epoch (~69.7 years before overflow). DC ID = 32 datacenters. Machine ID = 32 machines per DC. Sequence = 4096 IDs per ms per machine. Most significant bits are timestamp → sorting by value = sorting by time.

Why is the sign bit in Snowflake always 0?
?
To ensure the 64-bit ID is always a positive number. If sign bit were 1, the BIGINT would be negative (in two’s complement). Databases and most programming languages treat BIGINT as signed. Having a positive ID avoids confusion, negative index keys, and compatibility issues. It “wastes” one bit but is necessary for portability and correctness across all systems.

How many years does the 41-bit timestamp in Snowflake support?
?
2^41 milliseconds = 2,199,023,255,551 ms ≈ 69.7 years. Twitter used a custom epoch of November 4, 2010 (their launch date). So their IDs won’t overflow until approximately 2079. Using Unix epoch (1970) instead would give overflow around 2039. Custom epoch = maximize useful lifetime. Key interview point: you can extract the creation time from any Snowflake ID by right-shifting 22 bits and adding the epoch.

How many datacenters and machines does Snowflake support?
?
Datacenter IDs: 5 bits → 2^5 = 32 possible datacenters (IDs 0-31). Machine IDs: 5 bits → 2^5 = 32 machines per datacenter. Total unique machines: 32 × 32 = 1,024 machines globally. Each machine has a globally unique (datacenter_id, machine_id) pair assigned at startup. The pair is immutable — never changes during the machine’s lifetime.

How many IDs per second can one Snowflake machine generate?
?
Sequence field is 12 bits → 2^12 = 4,096 IDs per millisecond. Per second: 4,096 × 1,000 ms = 4,096,000 ≈ 4 million IDs/sec per machine. Across all 1,024 machines: 4 million × 1,024 ≈ 4 billion IDs/sec total. The 10,000 IDs/sec requirement is met by a single machine alone. If 4,096 sequences exhaust within 1 ms, the generator waits for the next millisecond before generating more.

Why are Snowflake IDs sortable by time?
?
The most significant bits (bits 22-62) hold the 41-bit timestamp. When comparing two 64-bit IDs numerically, the higher-order bits dominate. Since an older ID has a smaller timestamp, it will have a smaller numeric value. Sorting IDs by their integer value = sorting by creation time. This makes range queries (e.g., “get all tweets in the last hour”) efficient — just query ID ranges rather than maintaining a separate timestamp column.

What is the clock skew problem in Snowflake and why is it dangerous?
?
If NTP corrects the system clock backward, the current timestamp is less than the last-used timestamp. The generator might produce an ID with timestamp T-2 after already producing IDs with timestamp T, creating a (timestamp, dc, machine, sequence) tuple that was already used → ID collision. Even 1ms backward can cause duplicates. Unlike most distributed systems where stale data is merely inconvenient, duplicate IDs cause data corruption.

What are 4 solutions to the clock skew problem in Snowflake?
?

Refuse to generate (default): Throw exception when clock goes backward. Caller retries later. 2. Wait for clock to catch up: Spin-wait until current_ms > last_timestamp. Works for small skews (< 10ms). 3. Monotonic clock: OS-provided clock that never goes backward (Java: System.nanoTime(), Python: time.monotonic()). Clock correction pauses the monotonic clock but never reverses it. 4. Hybrid Logical Clock (HLC): Combines physical + logical time, never goes backward, used by CockroachDB/YugabyteDB.

How do you assign machine IDs to Snowflake generators?
?
Option 1 (ZooKeeper): Each machine registers on startup, gets unique ID atomically from ZooKeeper’s sequential nodes. ZooKeeper guarantees no duplicates even with concurrent registrations. Used by original Twitter Snowflake. Option 2 (Config management): Assign IDs via environment variables, Kubernetes labels, or config files before deployment. Simpler, no ZooKeeper dependency. Option 3 (IP-based): Hash IP → machine ID. Risky: IP reuse after decommission causes collisions. Not recommended.

What are the pros and cons of UUID (v4)?
?
Pros: No coordination needed (independent generation), simple one-liner, very low collision probability (~1 in 10^36), works across all languages and systems. Cons: 128-bit (2x storage of Snowflake), not sortable (random, not time-ordered), not numeric (alphanumeric with dashes, can’t use as BIGINT), poor database index performance (random inserts cause B-tree fragmentation and page splits). Use for: session tokens, trace IDs, where sortability and storage are not concerns.

What are the pros and cons of a Ticket Server approach?
?
Pros: Generates numeric IDs, simple to implement, centralized control, auto-increment is well understood. Cons: Single point of failure (ticket server down = no IDs generated), network bottleneck (every ID request needs a round trip), serialized throughput (all servers wait for the one ticket server), scaling ticket servers introduces synchronization complexity. Used by: Flickr (historically). Avoid for new high-scale systems — use Snowflake instead.

What are the pros and cons of Multi-Master Replication (step-based auto-increment)?
?
Pros: No coordination needed, uses existing DB auto-increment, numeric IDs. Cons: Not sortable by time (interleaved IDs from different servers have no time order), not scalable (adding/removing servers requires changing step on all servers — risky migration), doesn’t work across datacenters easily. Example: 3 servers, step=3: server 1 → 1,4,7; server 2 → 2,5,8; server 3 → 3,6,9. Only use for legacy systems; prefer Snowflake for new systems.

How do you extract the creation timestamp from a Snowflake ID?
?
Right-shift the ID by 22 bits to get the timestamp component, then add the custom epoch: timestamp_ms = (id >> 22) + EPOCH. Example: id = 123456789012345 → (id >> 22) + 1288834974657 = creation time in milliseconds. Convert to datetime. This is because bits 22-62 hold the timestamp. Useful for: debugging (when was this ID created?), data partitioning by time range, TTL calculations.

Compare Snowflake, UUID v4, ULID, and UUID v7.
?
Snowflake: 64-bit, ms-sortable, no coordination at runtime, clock-dependent. Best for: high throughput, 64-bit integer requirement. UUID v4: 128-bit, NOT sortable, no coordination, no clock dependency. Best for: session tokens, trace IDs. ULID: 128-bit, ms-sortable, no coordination, no machine ID setup. Best for: general purpose when 128-bit OK and no machine ID setup desired. UUID v7: 128-bit, ms-sortable, RFC standard, no coordination. Best for: drop-in sortable replacement for UUID v4.

Why is UUID a poor choice for database primary keys?
?

Storage: 128 bits = 16 bytes vs 64-bit = 8 bytes. Larger primary key = larger index = more memory. 2. Random inserts: UUID v4 is random, not sequential. New rows insert at random B-tree positions → page splits, index fragmentation, poor cache locality. 3. Not sortable by time: UUID v4 has no time component, so you can’t do time-range queries efficiently by ID. 4. Not 64-bit: Can’t use as BIGINT in all databases. UUID is best for distributed unique identifiers, not as DB primary keys. Use Snowflake or UUID v7 instead.

What is the total bit breakdown of Snowflake and what is the total = ?
?
1 (sign) + 41 (timestamp) + 5 (datacenter) + 5 (machine) + 12 (sequence) = 64 bits. Memorize as: “1-41-5-5-12”. Sign always 0. Timestamp = ms since epoch. Datacenter = 0-31. Machine = 0-31. Sequence = 0-4095, resets each ms. The 64-bit total fits in a BIGINT in all major databases (MySQL, PostgreSQL, Oracle, SQL Server). This is half the size of UUID (128-bit) and sorts naturally as an integer.

What happens when the sequence number exhausts (hits 4095) within a millisecond?
?
The generator waits for the next millisecond before generating the next ID. This is implemented as a spin-wait: while current_ms() <= last_timestamp, keep checking. Once the ms ticks forward, reset sequence to 0 and continue. This guarantees no collisions within a machine. In practice, exhausting 4,096 IDs per millisecond requires generating IDs at >4 million/sec on a single machine — rare in practice. The brief wait (< 1ms) is acceptable.

Why does Snowflake use a custom epoch instead of Unix epoch (1970)?
?
Custom epoch maximizes the usable lifetime of the 41-bit timestamp field. Unix epoch (Jan 1, 1970) means the 41-bit counter started counting from 1970. By 2010 (when Twitter launched Snowflake), about 1.27 trillion milliseconds had already elapsed, consuming 40 years of the counter. Using a custom epoch starting at the system’s launch date means all 69.7 years of timestamp range are available from the start. Example: Twitter’s epoch (Nov 4, 2010) gives IDs until ~2079.

How does Snowflake ensure uniqueness across machines without coordination?
?
Each machine has a globally unique (datacenter_id, machine_id) pair assigned once at startup. Within a machine, the sequence number prevents collisions within the same millisecond. Combining the three fields (timestamp, datacenter_id, machine_id, sequence_number) guarantees uniqueness: no two machines share the same (DC, Machine) pair, and within one machine, the sequence is never repeated for the same millisecond. Result: global uniqueness with zero runtime coordination — all local.

What is the decision tree for choosing a unique ID algorithm?
?
Need 64-bit integers? YES + need time-sortable + high throughput → Snowflake (best choice). Need 64-bit + simple + small scale → Ticket Server. Don’t need 64-bit (128-bit OK) + need sortable → ULID or UUID v7. Don’t need sortable + want simplest → UUID v4. Need to avoid coordination completely at all times → Snowflake, ULID, UUID v4/v7 (all work without runtime coordination). Ticket server is the only option requiring a centralized service.

What makes Snowflake IDs “opaque” and what useful information can you extract?
?
Snowflake IDs look like random 64-bit numbers but are not opaque — they encode structured information. You can extract: (1) Creation timestamp: id >> 22 + epoch = creation time in ms. (2) Datacenter ID: (id >> 17) & 0x1F = which datacenter. (3) Machine ID: (id >> 12) & 0x1F = which machine in that DC. (4) Sequence: id & 0xFFF = sequence within that millisecond. This is powerful for debugging (trace any ID back to its origin machine and creation time).

Why is Snowflake a better choice than a centralized Redis INCR for distributed ID generation?
?
Redis INCR is atomic and generates sequential IDs, but: (1) Single point of failure — Redis down means no IDs. (2) Network round trip for every ID — adds latency (microseconds with local Redis to 1-5ms cross-region). (3) Scalability ceiling — one Redis instance maxes out at ~100K operations/sec. Snowflake: generates locally in nanoseconds, no network dependency, 4M IDs/sec per machine, no SPOF. Redis INCR is fine for small scale or when sequential IDs matter more than distributed throughput.

What are the three systems that use Snowflake or Snowflake-like IDs in production?
?
Twitter/X: Original Snowflake — tweet IDs and user IDs. Discord: Message IDs use Twitter Snowflake format exactly. Instagram: Photo IDs — modified Snowflake with custom epoch, 13-bit shard ID, 10-bit sequence (instead of DC+machine+sequence). MongoDB: ObjectID — 12-byte (96-bit): 4-byte Unix timestamp + 5-byte machine ID + 3-byte process ID + 3-byte random counter. Not exactly Snowflake but same concept: timestamp prefix + machine identifier + sequence.

Total Cards: 25
Review Time: 20-25 minutes
Priority: MEDIUM-HIGH - Common interview question, memorize Snowflake bit structure!
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

vol1-ch07-unique-id-generator

Chapter 7 Flashcards - Unique ID Generator

Graph View