Chapter 9 Flashcards - S3-like Object Storage

flashcards volume2 s3 object-storage distributed-storage blob

What is object storage and how does it differ from block storage and file storage?
?
Object storage: Store files as key-value pairs (key = path string, value = binary blob) in flat buckets via HTTP API. Immutable after write (only PUT/DELETE). Scales to exabytes. Examples: S3, GCS, Azure Blob. Block storage: Raw disk blocks, used by databases/OS (e.g., EBS, SAN). File storage: Hierarchical filesystem with directories (e.g., NFS, EFS). Object storage wins at scale and simplicity for backups, media, logs.

What are the two storage planes in an object storage system?
?
Metadata store: Stores object attributes — bucket, key, size, content-type, checksum, storage node locations, version ID. Needs strong consistency and fast point lookups. Uses RocksDB/LevelDB cluster. Data store: Stores raw binary content of each object. Needs throughput and durability. Uses erasure coding or replication across storage nodes. Separating them allows independent scaling and tuning.

What is erasure coding and what does the k+m notation mean?
?
Erasure coding splits an object into k data chunks and computes m parity chunks using Reed-Solomon coding. The full (k+m) chunks are written to (k+m) different nodes. The original object can be reconstructed from any k chunks — meaning you can tolerate the loss of any m nodes. Example (8+4): 8 data + 4 parity = 12 chunks. Lose any 4 → still reconstructable. Storage overhead = m/k = 4/8 = 50%.

Compare erasure coding (8+4) vs 3-way replication in terms of storage overhead and fault tolerance.
?
3-way replication: Store 3 full copies. Storage overhead = 200%. Fault tolerance = 2 node failures. Simple, fast reads, easy recovery. Erasure coding (8+4): 50% overhead. Fault tolerance = 4 node failures. Complex (parity computation), higher read latency if node down (must reconstruct from 8 chunks). At 100 PB scale, erasure coding saves ~150 PB of storage vs replication. Use erasure coding for cold/warm data, replication for hot data.

When should you use erasure coding vs replication?
?
Erasure coding: Cold/warm data, large objects, when storage cost matters most. Used by S3 Standard storage class. Higher CPU cost on read/write. Recovery is expensive (read k chunks from k nodes). Replication: Hot data, small objects, when read latency matters most. Also used for cross-region redundancy. Simple recovery (copy from replica). Rule of thumb: erasure coding for bulk storage at scale, replication for low-latency frequently-accessed data.

What is the multipart upload flow in S3-like object storage?
?

  1. Initiate: POST /{bucket}/{key}?uploads → server returns upload_id. 2. Upload parts: PUT /{bucket}/{key}?partNumber=N&uploadId=X for each part → server returns ETag (checksum of that part). Parts can be uploaded in parallel. 3. Complete: POST /{bucket}/{key}?uploadId=X with list of (partNumber, ETag) pairs → server assembles final object. 4. Abort (if needed): DELETE /{bucket}/{key}?uploadId=X → server garbage-collects all parts.

When should multipart upload be used and what are its benefits?
?
Use for objects > 100 MB (required for > 5 GB in S3, which is the stream limit). Benefits: Parallel uploads (upload all parts simultaneously → much faster), Resumability (retry only the failed part, not the whole file; each part has its own ETag), Memory efficiency (client doesn’t need to buffer whole file). S3 allows parts of 5 MB to 5 GB each. The upload_id tracks state for in-progress uploads.

How does metadata storage work — what is stored and how is it sharded?
?
Stored per object: bucket_id, object_key, object_id, version_id, size_bytes, content_type, checksum, storage_node_locations, created_at, deleted_at. Storage: RocksDB cluster per shard (LSM-tree for fast writes, range scan for LIST). Sharding: Shard by hash(bucket_id) so all objects in a bucket land in one shard (enables efficient LIST with range scan). Large buckets can be sub-sharded by key prefix. Each shard has a leader (writes) + followers (reads) for strong consistency.

How does consistent hashing distribute objects across storage nodes?
?
Nodes are placed on a hash ring (0 to 2^32). Each object key is hashed to a position on the ring; the object is stored on the first node clockwise from that position. Adding a node: Only objects between the new node and its predecessor need to migrate (~1/N of total objects). Removing a node: Its objects move to the next node clockwise. Without consistent hashing, modulo-based sharding would require remapping nearly all objects when cluster size changes. Virtual nodes (vnodes) improve balance.

How does object versioning work in S3-like object storage?
?
Versioning is enabled per bucket. Each PUT generates a new version_id (UUID). All versions co-exist — old versions are not overwritten. GET /{bucket}/{key} returns the latest version. GET /{bucket}/{key}?versionId=X returns a specific version. DELETE /{bucket}/{key} inserts a delete marker (new version) — does not remove data. DELETE /{bucket}/{key}?versionId=X permanently removes a specific version. Metadata table: (bucket_id, object_key, version_id) → object data.

What is a DELETE marker in object versioning?
?
A DELETE marker is a special version entry (with its own version_id) that acts as a tombstone. When you DELETE an object without specifying a version, a DELETE marker is inserted as the latest version. Subsequent GET requests see the DELETE marker and return 404. The actual data of previous versions is still present. To truly delete all data, you must delete each version explicitly. DELETE markers enable “soft delete” behavior while preserving full version history.

How are checksums used to ensure data integrity in object storage?
?
On upload: Client computes SHA256 (or MD5) of the object and sends it in a header. Server computes checksum of received bytes and compares. Mismatch → reject (data corrupted in transit). On storage: Checksum stored in metadata alongside node locations. On read: Server reads bytes from storage node, recomputes checksum, compares with metadata. Mismatch → data corrupted on disk, try another replica or erasure-reconstruct. Periodic scrubbing: Background job reads every object weekly and re-verifies checksum to detect silent bit rot.

What is garbage collection in object storage and why is soft delete used?
?
Deleting an object marks it as deleted in metadata (soft delete) but does not immediately remove bytes from storage nodes. A background GC job runs periodically: Phase 1 (Mark): Enumerate all live object_ids from metadata. Phase 2 (Sweep): Scan storage nodes for all chunks. Any chunk not in live set is orphaned. Phase 3 (Delete): Remove orphaned chunks, free disk space. Why soft delete: Crash safety (no partial deletes), versioning correctness (previous versions stay accessible), and safe multipart abort coordination.

What is the Placement Service and why is it important?
?
The Placement Service tracks the health, capacity, and location of every storage node. Storage nodes send heartbeats every few seconds. If no heartbeat for ~30 seconds, the node is marked dead. The Data Routing Service queries the Placement Service to get a list of healthy nodes before routing reads/writes. This ensures the system automatically routes around failed nodes without operator intervention. It also tracks disk capacity to avoid routing to nearly-full nodes.

What is the data routing service and what does it do?
?
A stateless service that sits between the API layer and the storage node cluster. For PUT: receives object bytes, queries Placement Service for healthy nodes, applies consistent hashing to select nodes, writes bytes in parallel to k+m nodes (erasure) or N nodes (replication). For GET: queries metadata service to get node list, reads from any healthy node (or reads k chunks for erasure decoding). Stateless = horizontally scalable. Placement Service holds all the state.

How does the “write data first, commit metadata second” pattern ensure consistency?
?
On PUT: 1) Write data chunks to storage nodes. 2) If data write fails → abort, no metadata written → no dangling pointer. 3) If data write succeeds → write metadata record pointing to those nodes. 4) If metadata write fails after data is written → data chunks become orphaned → GC will clean them up in the sweep phase. Result: Metadata only points to data that definitely exists. The reverse (write metadata first) would create valid-looking pointers to missing data.

What are the scale targets for an S3-like object storage system?
?
Total data: 100 PB. Durability: 99.9999% (six 9s) — at most 1 object lost per million per year. Max object size: 5 TB. Throughput reference (S3): 3,500 PUT/sec per bucket, 5,500 GET/sec per bucket. Storage efficiency with erasure coding (8+4): 100 PB of user data requires ~150 PB of raw storage (50% overhead). Contrast: 3-way replication would require 300 PB of raw storage for the same 100 PB of user data.

How is data striped and stored using erasure coding during a PUT operation?
?

  1. Object bytes are split into k equal data chunks. 2. m parity chunks are computed using Reed-Solomon over the k data chunks. 3. All (k+m) chunks are sent to (k+m) different storage nodes in parallel. 4. Chunk positions are recorded in metadata (which node holds which chunk). On GET: read any k of the (k+m) chunks, reconstruct original. If all k original nodes are available, no computation needed (read directly). If up to m nodes are unavailable, reconstruct from whichever k healthy chunks are available.

What is periodic scrubbing and why is it necessary?
?
Scrubbing is a background job that reads every stored object, recomputes its checksum, and compares against the checksum stored in metadata. Runs approximately weekly. Why needed: Silent data corruption (bit rot) can occur on disk even without a hardware failure — bits flip due to cosmic rays, capacitor leakage, firmware bugs. Without scrubbing, corruption might not be detected until a read request hits that data. Scrubbing detects corruption early while other replicas/chunks are still intact for repair.

What storage engine is used for object metadata and why?
?
RocksDB (or LevelDB) — an embedded LSM-tree (Log-Structured Merge-tree) key-value store. Why: Fast sequential writes (LSM appends to a log, no random I/O), good read performance for point lookups (bucket_id + object_key), supports range scans (needed for LIST operations scanning all keys in a bucket). Deployed as a cluster with leader election for strong consistency. Each metadata shard runs its own RocksDB instance. Alternatives considered: MySQL (too slow at billions of rows), Redis (insufficient durability).

How are orphaned chunks created and how are they cleaned up?
?
Orphaned chunks are data bytes on storage nodes with no corresponding metadata entry. Created by: 1) Failed PUT where data wrote but metadata commit failed. 2) Aborted multipart upload where some parts wrote. 3) In-progress uploads that were never completed. Cleanup: GC mark-and-sweep: build set of all live object_ids from metadata, scan all storage node chunks, delete any chunk_id not in live set. Scheduled as a periodic background job (hourly or daily) to avoid immediate disk space waste.

What are the differences between object storage and block storage, and when would you use each?
?
Object storage (S3): HTTP API, flat key-value, immutable after write, exabyte scale, high latency (ms to seconds), great for backups/media/logs/static assets. Block storage (EBS): Block device, random read/write, very low latency (sub-ms), used for OS volumes and databases that need mutable random access. Rule of thumb: If you need to attach it to a VM as a disk or a database needs it, use block storage. If you’re storing files by name via API at internet scale, use object storage.

How does LIST work for objects in a bucket and what makes it expensive?
?
LIST returns objects in a bucket matching a prefix (e.g., “photos/2024/”). Implementation: Range scan on the metadata shard for that bucket_id, scanning all keys with the given prefix, sorted lexicographically. Expensive because: 1) A bucket can have billions of objects — full scan is slow. 2) Must paginate (marker-based or cursor-based pagination). 3) If sub-sharded across multiple metadata nodes, must merge results. Optimization: Shard by bucket_id so LIST is contained to one shard; store keys sorted in RocksDB for efficient range scan.

What happens when a storage node fails — how does the system recover?
?

  1. Node stops sending heartbeat → Placement Service marks it dead after ~30s timeout. 2. Data Routing Service gets updated node list from Placement Service → stops routing to dead node. 3. For replicated data: reads served from other replicas immediately. 4. For erasure-coded data: reads require fetching from remaining k healthy chunks and reconstructing. 5. Background replication monitor detects under-replicated chunks → re-replicates to a new healthy node to restore target redundancy. 6. When dead node comes back, Placement Service may re-add it; data sync needed.

What is the ETag returned by S3 and what does it represent?
?
An ETag (Entity Tag) is an identifier for a specific version of an object, returned in every GET and PUT response. For simple (non-multipart) uploads, ETag is the MD5 hash of the object bytes. For multipart uploads, ETag is an MD5 of the concatenated MD5s of each part, followed by a dash and part count (e.g., “etag-3”). Clients use ETags for: conditional reads (If-None-Match header → avoid re-downloading unchanged objects), integrity verification (compare downloaded bytes), and cache invalidation.


Total Cards: 25
Review Time: 20-25 minutes
Priority: HIGH - Senior/Staff level question at Google, Amazon, Meta
Last Updated: 2026-04-13