Distributed System Components - Complete Guide

A comprehensive reference for all major components in distributed systems, their roles, when to use them, and how they work together.

🏗️ System Architecture Layers

┌─────────────────────────────────────────────────────────┐
│                    Client Layer                         │
│         (Web Browser, Mobile App, Desktop App)          │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                    Edge Layer                           │
│              (DNS, CDN, DDoS Protection)                │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                  Gateway Layer                          │
│         (API Gateway, Load Balancer, WAF)               │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                 Application Layer                       │
│         (Web Servers, App Servers, Services)            │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                 Communication Layer                     │
│         (Message Queue, Service Mesh, RPC)              │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                    Data Layer                           │
│    (Database, Cache, Search Engine, Object Storage)     │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                 Observability Layer                     │
│         (Logging, Metrics, Tracing, Alerting)           │
└─────────────────────────────────────────────────────────┘

📡 Edge & Gateway Layer

1. DNS (Domain Name System)

What it is: Translates domain names (www.example.com) to IP addresses (e.g., 203.0.113.10)

Why you need it:

Users type domain names, not IP addresses
Can route to different servers based on location (GeoDNS)
Enables failover (change IP if server fails)

Types:

A Record: Domain → IPv4 address
AAAA Record: Domain → IPv6 address
CNAME: Alias to another domain
MX Record: Mail server routing

GeoDNS (Geographic DNS):

Routes users to nearest datacenter
Example: US users → US servers, EU users → EU servers
Lower latency for global users

Interview relevance: “How do users reach your system?” → DNS!

Popular providers: Route 53 (AWS), Cloud DNS (GCP), Azure DNS

2. CDN (Content Delivery Network)

What it is: Geographically distributed servers that cache static content close to users

Why you need it:

Reduce latency (content closer to users)
Reduce origin server load (a well-tuned CDN can serve most static requests, often 80-90%)
Improve availability (highly redundant)
DDoS protection (distributed infrastructure)

What to cache:

Images, videos, audio files
JavaScript, CSS, fonts
Static HTML pages
Downloadable files (PDFs, binaries)

Push vs Pull:

Push CDN: Upload content to CDN manually
- Good for: Rarely changing content, full control
Pull CDN: CDN fetches from origin on miss
- Good for: Frequently changing content, automatic

Cache invalidation:

Set TTL (Time-To-Live)
Purge/invalidate API calls
URL versioning (style.v2.css)
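
URL versioning can be automated by deriving the version from the file content, so a changed file gets a new URL and old cached copies are simply never requested again. A minimal sketch, assuming a hypothetical helper name (`versioned_name`) and an 8-character SHA-256 prefix as the version tag:

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the file extension.

    Example shape: css/style.css -> css/style.<hash>.css
    (helper name and hash length are illustrative assumptions).
    """
    p = PurePosixPath(path)
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


old_url = versioned_name("css/style.css", b"body { color: black; }")
new_url = versioned_name("css/style.css", b"body { color: navy; }")
# Different content produces a different URL, so no purge call is needed.
```

With this scheme the asset itself can carry a very long TTL, because any edit changes its URL.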

When to use:

Bandwidth > 1 GB/s
Global users
Static content > 30% of traffic

Popular CDNs: Cloudflare, AWS CloudFront, Akamai, Fastly, Cloudinary (images)

3. Load Balancer

What it is: Distributes incoming traffic across multiple servers

Why you need it:

No single point of failure among app servers (run the load balancer itself as a redundant pair)
Scale horizontally (add more servers)
Health checks (detect dead servers)
Zero-downtime deployments

Types:

Layer 4 (Transport): Routes based on IP/port (TCP/UDP)
- Fast, simple
- Can’t see HTTP content
Layer 7 (Application): Routes based on HTTP headers/URL/cookies
- Slower, more intelligent
- Can route /api → API servers, /static → static servers

Load Balancing Algorithms:

Round Robin: Distribute requests sequentially
- Simple, fair
- Ignores server load
Weighted Round Robin: More powerful servers get more traffic
- Good for heterogeneous servers
Least Connections: Send to server with fewest active connections
- Good for long-lived connections (WebSockets)
Least Response Time: Send to fastest server
- Best performance
- More complex
IP Hash: Hash client IP to always hit same server
- Sticky sessions (session affinity)
- Uneven distribution
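
Three of the algorithms above can be sketched in a few lines each. This is an illustrative in-process model, not how NGINX or HAProxy implement them; the class names and the `pick`/`release` methods are assumptions made for the example.

```python
import hashlib
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotating order, ignoring load."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self, client_ip=None):
        return next(self._cycle)


class LeastConnections:
    """Send each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self, client_ip=None):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the connection closes.
        self.active[server] -= 1


class IPHash:
    """Map a client IP to the same server every time (session affinity)."""

    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self, client_ip):
        digest = hashlib.sha256(client_ip.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.servers)
        return self.servers[index]
```

Note that the plain modulo in `IPHash` remaps most clients when the server list changes; consistent hashing is the usual way to reduce that churn.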

Health Checks:

Periodic HTTP/TCP requests to servers
If server fails health check → remove from pool
When recovers → add back to pool

When to use:

More than 1 server
Need high availability
Traffic > 1K QPS

Popular load balancers: NGINX, HAProxy, AWS ELB/ALB, GCP Load Balancer

4. API Gateway

What it is: Single entry point for all client requests to microservices

Why you need it:

Authentication & Authorization: Verify JWT tokens, check permissions
Rate Limiting: Prevent abuse (100 requests/min per user)
Request Routing: Route /users → User Service, /posts → Post Service
Request Aggregation: Combine multiple microservice calls into one
Protocol Translation: REST → gRPC, HTTP → WebSocket
Caching: Cache common responses
Monitoring: Log all requests, track metrics

Problems it solves:

Clients don’t need to know about all microservices
Centralized authentication (don’t repeat in every service)
Consistent error handling and logging
Version management (/v1/users, /v2/users)

API Gateway vs Load Balancer:

| Feature | Load Balancer | API Gateway |
|---|---|---|
| Purpose | Distribute traffic | Route & manage APIs |
| Layer | L4/L7 | L7 only |
| Intelligence | Basic | High |
| Auth | No | Yes |
| Rate limiting | Basic | Advanced |
| Protocol translation | No | Yes |

When to use:

Microservices architecture (the standard entry point for external clients)
Need centralized auth/rate limiting
Multiple client types (web, mobile, IoT)
API versioning

Features in interviews:

“How do clients discover services?” → API Gateway!
“How do you handle authentication?” → API Gateway!
“How do you prevent abuse?” → API Gateway rate limiting!

Popular API Gateways: Kong, AWS API Gateway, Azure API Management, Apigee, Envoy

Example:

Client request: GET /api/v1/users/123/posts

API Gateway:
1. Verify JWT token (authentication)
2. Check rate limit (100 req/min)
3. Route to appropriate microservice
4. Aggregate User Service + Post Service responses
5. Return combined response to client
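
Step 2 (the rate-limit check) can be sketched as a sliding-window counter per user. This is a single-process illustration under stated assumptions: the class name, the injectable `clock` parameter, and the in-memory storage are all hypothetical, and a real gateway would typically keep the counters in a shared store so every gateway instance sees the same counts.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user in any `window_seconds` span."""

    def __init__(self, limit=100, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._hits = defaultdict(deque)  # user_id -> timestamps of recent requests

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # the gateway would answer HTTP 429 here
        hits.append(now)
        return True
```

Usage: call `limiter.allow(user_id)` before routing; a `False` result maps to an HTTP 429 (Too Many Requests) response.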

5. WAF (Web Application Firewall)

What it is: Filters and monitors HTTP traffic to protect web applications

Why you need it:

SQL Injection protection
XSS (Cross-Site Scripting) protection
DDoS mitigation
Bot detection
Geo-blocking (block certain countries)

When to use:

Public-facing web applications
Handling sensitive data
Compliance requirements (PCI DSS, HIPAA)

Popular WAFs: Cloudflare WAF, AWS WAF, Imperva, Akamai

🖥️ Application Layer

6. Web Servers

What it is: Handles HTTP requests and returns responses

Why you need it:

Serve web pages (HTML, CSS, JavaScript)
Execute application logic
Call databases and other services
Generate dynamic content

Stateless vs Stateful:

Stateless: No session data stored in server memory
- Can scale horizontally easily
- Any server can handle any request
- Store session in Redis or database
Stateful: Session data in server memory
- Sticky sessions required (same user → same server)
- Harder to scale
- Avoid if possible!

When to use: Every web application needs web servers!

Popular web servers: NGINX, Apache, Node.js, Tomcat, IIS

7. Application Servers

What it is: Runs business logic, separate from web presentation layer

Why you need it:

Separate concerns (presentation vs logic)
Can scale independently
Reusable business logic for multiple clients (web, mobile, API)

When to use:

Complex business logic
Enterprise applications
Need to serve multiple client types

Popular app frameworks (commonly used to build application servers): Spring Boot, Express.js, Django, Flask, FastAPI

8. Microservices

What it is: Architecture where application is split into small, independent services

Why you need it (at scale):

Independent deployment: Update User Service without touching Post Service
Independent scaling: Scale payment service 10x without scaling others
Technology flexibility: Use Python for ML service, Go for performance
Team ownership: Each team owns specific services
Fault isolation: If one service fails, others continue

Trade-offs:

| Pros | Cons |
|---|---|
| Independent scaling | Complex deployment |
| Technology flexibility | Network latency between services |
| Fault isolation | Distributed tracing needed |
| Team autonomy | Data consistency challenges |
| Easier to understand individual services | Harder to understand system as whole |

When NOT to use:

Small team (< 10 people)
Simple application
MVP/prototype
No clear service boundaries

When to use:

Large team (> 50 people)
Need independent scaling
Long-term project (years)
Clear service boundaries (User, Post, Payment, etc.)

Communication between microservices:

Synchronous: REST, gRPC
Asynchronous: Message queue, Pub/Sub

Popular frameworks: Spring Cloud, Istio (service mesh), Kubernetes

💬 Communication Layer

9. Message Queue

What it is: Stores messages between producers and consumers

Why you need it:

Decouple components (producer doesn’t wait for consumer)
Async processing (email, notifications, video encoding)
Buffer traffic spikes (queue absorbs bursts)
Retry failed operations
Guaranteed delivery (at-least-once)

Use cases:

Send email after user signup
Process video uploads
Generate reports
Resize images
Send push notifications

Patterns:

Point-to-Point: One message → one consumer
Work Queue: Multiple workers compete for messages (load distribution)
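
The work-queue pattern can be illustrated with the standard library alone: several workers pull from one queue, and each message is handled by exactly one of them. A minimal sketch, assuming a hypothetical `run_workers` helper; it has no retries or persistence, which a broker such as RabbitMQ or SQS would add.

```python
import queue
import threading


def run_workers(tasks, handler, num_workers=3):
    """Process every task exactly once using competing worker threads."""
    q = queue.Queue()
    for task in tasks:
        q.put(task)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()  # each message goes to one worker only
            except queue.Empty:
                return  # queue drained, worker exits
            try:
                out = handler(task)
                with lock:
                    results.append(out)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results  # completion order is not guaranteed
```

Adding workers increases throughput without any change to the producer, which is the load-distribution property the pattern is named for.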

When to use:

Task takes > 1 second (offload from web server)
Can be processed asynchronously
Need retry logic
Traffic is spiky

Popular message queues: RabbitMQ, AWS SQS, Azure Service Bus

10. Pub/Sub (Publish-Subscribe)

What it is: Broadcast messages to multiple subscribers

Why you need it:

Event broadcasting: One event → many listeners
Decouple producers from consumers
Scalability: Add subscribers without changing producer

Use cases:

User signup → Send welcome email, Analytics service, CRM update
Order placed → Inventory service, Shipping service, Analytics
Real-time updates → All connected WebSocket clients

Pub/Sub vs Message Queue:

| Feature | Message Queue | Pub/Sub |
|---|---|---|
| Receivers | One consumer | Multiple subscribers |
| Message delivery | Once | Multiple times (one per subscriber) |
| Use case | Work distribution | Event broadcasting |
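
The difference in the "Receivers" row is easiest to see in code: a pub/sub broker delivers each published message to every subscriber of the topic. A minimal in-process sketch, assuming a hypothetical `Broker` class with synchronous callbacks (real systems deliver over the network and asynchronously):

```python
from collections import defaultdict


class Broker:
    """Tiny in-memory pub/sub: every subscriber of a topic gets every message."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver to all subscribers; return how many received the message."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
emails, analytics = [], []
broker.subscribe("user-signup", emails.append)     # welcome-email service
broker.subscribe("user-signup", analytics.append)  # analytics service
delivered = broker.publish("user-signup", {"user_id": 42})
# Both lists now hold the same event; the publisher knows neither subscriber.
```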

When to use:

Multiple services need same event
Event-driven architecture
Real-time notifications

Popular pub/sub: Apache Kafka, AWS SNS, Google Pub/Sub, Redis Pub/Sub

11. Apache Kafka

What it is: Distributed event streaming platform (hybrid message queue + pub/sub)

Why you need it:

High throughput: Millions of messages per second
Durability: Messages persisted to disk
Replay: Can re-read old messages
Real-time streaming: Process events as they arrive
Scalability: Horizontally scalable

Use cases:

Activity tracking (clicks, page views)
Log aggregation
Stream processing (real-time analytics)
Event sourcing (store all state changes)
Metrics collection

Key concepts:

Topic: Category of messages (e.g., “user-signups”)
Partition: Subdivide topic for parallelism
Consumer Group: Multiple consumers share load
Offset: Position in the message stream
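
The four concepts fit together as follows: a topic is a set of append-only partitions, a key decides the partition, and each consumer group tracks its own offset per partition. The sketch below is an in-memory model for intuition only. It is not the Kafka client API, and the CRC32 partitioner is an assumption chosen for determinism (Kafka's default partitioner uses a different hash).

```python
import zlib
from collections import defaultdict


class Topic:
    """In-memory model of a partitioned log with per-group offsets."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]
        self.offsets = defaultdict(int)  # (group, partition) -> next offset to read

    def partition_for(self, key: str) -> int:
        # Same key -> same partition, which preserves per-key ordering.
        return zlib.crc32(key.encode()) % len(self.partitions)

    def produce(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def poll(self, group, partition, max_records=10):
        start = self.offsets[(group, partition)]
        records = self.partitions[partition][start:start + max_records]
        self.offsets[(group, partition)] = start + len(records)
        return records

    def seek(self, group, partition, offset):
        # Replay: messages stay in the log, so a group can rewind.
        self.offsets[(group, partition)] = offset
```

Because reading only advances an offset and never deletes data, two consumer groups can read the same partition independently, and either can rewind with `seek`.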

When to use:

Need high throughput (> 100K messages/sec)
Need to replay messages
Real-time analytics
Event-driven architecture at scale

Kafka vs RabbitMQ:

| Feature | Kafka | RabbitMQ |
|---|---|---|
| Throughput | Very high | Medium |
| Message persistence | Always | Optional |
| Message replay | Yes | No (classic queues) |
| Complexity | High | Medium |
| Use case | Event streaming | Task queues |

12. Service Mesh

What it is: Infrastructure layer for managing service-to-service communication in microservices

Why you need it (at scale):

Service discovery: Services find each other automatically
Load balancing: Between service instances
Retry logic: Automatic retries on failure
Circuit breaking: Stop calling failing services
Observability: Trace requests across services
mTLS: Secure service-to-service communication
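
Circuit breaking is the least self-explanatory item in this list, so here is a sketch of the state machine. In a mesh the sidecar proxy does this outside the application; the class below is only an in-process illustration, and its names, thresholds, and injectable `clock` are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of calling the unhealthy service.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast protects the caller's threads and gives the failing service time to recover instead of being hit with retries.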

Components:

Data plane: Sidecar proxies (Envoy) next to each service
Control plane: Manages configuration (Istio, Linkerd)

When to use:

Many microservices (> 20)
Need advanced traffic management
Security critical (mTLS for all services)
Complex observability requirements

When NOT to use:

Few microservices (< 10)
Adds significant complexity
Monolith architecture

Popular service meshes: Istio, Linkerd, Consul Connect

💾 Data Layer

13. Databases

See key-patterns > 4. SQL vs NoSQL for detailed comparison.

SQL Databases (PostgreSQL, MySQL):

Structured data with relationships
ACID transactions
Complex queries and joins
Vertical scaling (mostly)

NoSQL Databases:

Key-Value (Redis, DynamoDB): Simple lookups
Document (MongoDB, Couchbase): Flexible schema, JSON documents
Column-Family (Cassandra, HBase): Wide tables, high write throughput
Graph (Neo4j): Relationships and connections

When to use SQL: Default choice, structured data, need transactions

When to use NoSQL: Flexible schema, horizontal scaling, simple queries

14. Cache

What it is: In-memory data store for fast reads

Why you need it:

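Speed: Memory access is orders of magnitude faster than disk (roughly 100 ns for RAM vs. ~100 µs for an SSD read and ~10 ms for an HDD seek)
Reduce DB load: A 90% cache hit rate means 10x fewer DB reads (80% means 5x fewer)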
Scalability: Handle more reads without scaling database

What to cache:

User sessions (ephemeral data)
Database query results (hot data)
Computed results (expensive calculations)
API responses

Cache strategies: See key-patterns > 1. Caching Strategies

Cache invalidation (hardest problem!):

TTL: Expire after X seconds
Write invalidation: Delete on update
Write-through: Update cache on write
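
TTL expiry and write invalidation can be combined in a cache-aside read path. A minimal sketch, assuming a plain dict stands in for the database and that `TTLCache`, `read_user`, and `update_user` are hypothetical names; Redis would replace `TTLCache` in practice.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # TTL: expire after X seconds
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)

    def delete(self, key):
        self._data.pop(key, None)


def read_user(user_id, cache, db):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    value = db[user_id]
    cache.set(user_id, value)
    return value


def update_user(user_id, value, cache, db):
    """Write invalidation: update the DB, then delete the cached copy."""
    db[user_id] = value
    cache.delete(user_id)
```

The TTL bounds how long a stale entry can live if an invalidation is ever missed, which is why the two techniques are often used together.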

Cache eviction policies:

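LRU (Least Recently Used): Evict the entry that has gone longest without being accessed
LFU (Least Frequently Used): Evict the entry with the fewest accesses
FIFO (First In, First Out): Evict the entry that was inserted earliest, regardless of access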
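
LRU is the policy most often asked about, and `collections.OrderedDict` makes it short to sketch. The class name and method names below are illustrative assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest access first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used
cache.put("c", 3)  # evicts "b", the least recently used
```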

When to use:

Database is bottleneck
Read-heavy workload (read:write ratio of roughly 10:1 or higher)
Hot data accessed frequently

Popular caches: Redis, Memcached

Redis vs Memcached:

| Feature | Redis | Memcached |
|---|---|---|
| Data structures | Rich (lists, sets, sorted sets) | Simple (key-value) |
| Persistence | Yes | No |
| Replication | Yes | No |
| Use case | Complex caching, session store | Simple cache |

15. Search Engine

What it is: Specialized database optimized for text search and analytics

Why you need it:

Full-text search: Search within text fields
Fuzzy matching: Handle typos and near-matches
Relevance ranking: Most relevant results first
Faceted search: Filter by category, price, date
Analytics: Aggregate and analyze large datasets

Use cases:

Product search (e-commerce)
Log analysis
Autocomplete/typeahead
Content search (documents, emails)
Real-time analytics

When to use:

Need full-text search
SQL LIKE queries too slow
Complex search requirements (filters, ranking)

When NOT to use:

Simple exact match queries (use database)
Primary data store (use as secondary index)

Popular search engines: Elasticsearch, OpenSearch, Solr, Algolia

Architecture:

Database (source of truth)
    ↓ (sync)
Search Index (optimized for search)
    ↓
Users query search index, not database
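
What makes the search index "optimized for search" is the inverted index: a map from each term to the documents containing it. The sketch below shows the idea under simplifying assumptions (lowercase tokens, AND queries, no ranking); `SearchIndex` is a hypothetical class, not the Elasticsearch API.

```python
import re
from collections import defaultdict


class SearchIndex:
    """Minimal inverted index: term -> set of document IDs."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def tokenize(text: str):
        return re.findall(r"[a-z0-9]+", text.lower())

    def index(self, doc_id, text: str):
        """Called by the sync step whenever the source-of-truth row changes."""
        for term in self.tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query: str):
        """Return IDs of documents containing every query term."""
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


idx = SearchIndex()
idx.index(1, "Red running shoes")
idx.index(2, "Blue running jacket")
idx.index(3, "Red wool jacket")
```

A lookup touches only the posting sets for the query terms, whereas a SQL `LIKE '%term%'` has to scan every row.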

16. Object Storage

What it is: Storage for large unstructured objects (files, images, videos)

Why you need it:

Scalable: Store petabytes
Durable: Replicated across multiple datacenters
Cost-effective: Cheaper than database storage
HTTP access: Access files via URL

Use cases:

User uploads (images, videos, documents)
Backups and archives
Static website hosting
Data lake (big data analytics)

When to use:

Storing files > 1 MB
Don’t need database features (queries, transactions)
Need high durability (99.999999999% - 11 nines!)

When NOT to use:

Structured data (use database)
Need fast queries (use database + search)
Files < 100 KB (database blob is fine)

Popular object stores: AWS S3, Google Cloud Storage, Azure Blob Storage

Features:

Versioning: Keep old versions
Lifecycle policies: Auto-delete after X days
Access control: Fine-grained permissions
Event notifications: Trigger on upload

17. Data Warehouse

What it is: Centralized repository for analytical queries (OLAP)

Why you need it:

Analytics: Complex queries on historical data
Reporting: Business intelligence, dashboards
Don’t impact production DB: Run expensive queries separately

OLTP vs OLAP:

| Aspect | OLTP (Database) | OLAP (Data Warehouse) |
|---|---|---|
| Purpose | Transactional | Analytical |
| Queries | Simple, fast | Complex, slow |
| Data | Current | Historical |
| Volume | GB to TB | TB to PB |
| Users | Many | Few (analysts) |

When to use:

Need complex analytics
Historical data analysis
Business intelligence
Don’t want to impact production database

Popular data warehouses: Snowflake, BigQuery, Redshift, ClickHouse

📊 Observability Layer

18. Logging

What it is: Recording events that happen in the system

Why you need it:

Debugging: Trace through execution
Audit: Who did what when
Compliance: Regulatory requirements
Incident response: Understand what went wrong

Log levels:

TRACE: Very detailed (development only)
DEBUG: Debug information
INFO: General information
WARN: Warning, not critical
ERROR: Error occurred
FATAL: System crash

Centralized logging:

Aggregate logs from all servers
Searchable (grep across 1000 servers)
Correlation (trace request across services)

When to use: Always! Every service needs logging.

Popular logging: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Datadog, CloudWatch

19. Metrics

What it is: Numerical measurements of system behavior over time

Why you need it:

Monitor health: Is system working?
Detect issues: Spike in error rate?
Capacity planning: When to add servers?
Alerting: Page on-call when down

Key metrics (The Four Golden Signals):

Latency: How long requests take (p50, p95, p99)
Traffic: Requests per second (QPS)
Errors: Error rate (5xx errors)
Saturation: Resource utilization (CPU, memory, disk)
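
The p50/p95/p99 notation means "the latency that 50%/95%/99% of requests were at or below". A minimal sketch using the nearest-rank definition, which is one of several percentile definitions in use; monitoring systems usually estimate percentiles from histograms rather than sorting raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


latencies_ms = [10, 20, 30, 1000]
# The mean (265 ms) hides the outlier's impact; percentile(latencies_ms, 99)
# surfaces it, which is why tail percentiles are tracked alongside p50.
```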

Additional metrics:

Cache hit rate
Database connection pool utilization
Queue depth
Active users

When to use: Always! Monitor everything.

Popular metrics: Prometheus + Grafana, Datadog, New Relic, CloudWatch

20. Distributed Tracing

What it is: Tracks requests as they flow through microservices

Why you need it:

Find bottlenecks: Which service is slow?
Debug failures: Where did request fail?
Understand dependencies: Service call graph

How it works:

Generate unique trace ID
Pass through all services (in HTTP header)
Each service logs with trace ID
Visualize the complete request path
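
The four steps above can be sketched with plain functions standing in for services. The header name `X-Trace-Id` and all function names are assumptions for illustration; production systems typically rely on a tracing library and a standard header such as the W3C Trace Context `traceparent`.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch


def with_trace_id(headers):
    """Step 1: reuse the incoming trace ID, or generate one at the edge."""
    out = dict(headers)
    out.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return out


def log(service, headers, message, sink):
    """Step 3: every log record carries the trace ID."""
    sink.append({
        "trace_id": headers[TRACE_HEADER],
        "service": service,
        "message": message,
    })


def user_service(headers, sink):
    log("user-service", headers, "loaded user", sink)


def post_service(headers, sink):
    log("post-service", headers, "loaded posts", sink)


def api_gateway(incoming_headers, sink):
    headers = with_trace_id(incoming_headers)
    log("api-gateway", headers, "request received", sink)
    # Step 2: forward the same headers on every downstream call.
    user_service(headers, sink)
    post_service(headers, sink)
    return headers[TRACE_HEADER]
```

Step 4 then amounts to filtering the centralized logs by `trace_id` and ordering the records, which is what tracing UIs render as a request timeline.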

When to use:

Microservices architecture
Requests span multiple services
Need to debug slow requests

Popular tracing: Jaeger, Zipkin, AWS X-Ray, Datadog APM

Example trace:

Request ID: abc123
1. API Gateway (5ms)
2. User Service (10ms)
   3. Database query (8ms)
4. Post Service (50ms) ← Bottleneck!
   5. Database query (45ms) ← Slow query
6. Response (total 65ms)

🔐 Security Components

21. Authentication Service

What it is: Verifies user identity (who are you?)

Why you need it:

Verify users are who they claim to be
Issue authentication tokens
SSO (Single Sign-On)

Methods:

Username + Password
OAuth 2.0 with OpenID Connect (Google, Facebook login)
SAML (enterprise SSO)
Multi-factor authentication (MFA)

JWT (JSON Web Token):

Self-contained token with user info
Stateless (no server-side session)
Signed to prevent tampering

When to use: Every application with users!

Popular auth services: Auth0, Okta, Firebase Auth, AWS Cognito

22. Authorization Service

What it is: Determines what user can do (what are you allowed to do?)

Why you need it:

Permission management
Role-based access control (RBAC)
Attribute-based access control (ABAC)

Patterns:

RBAC: User → Role → Permissions
- Example: Admin role can delete users
ABAC: Based on attributes
- Example: Owner of resource can edit
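
The two patterns differ in what the decision looks at: RBAC consults a role-to-permission table, while ABAC compares attributes of the user and the resource. A minimal sketch; the role names, permission strings, and dict shapes are assumptions for the example.

```python
# RBAC: User -> Role -> Permissions (table contents are illustrative).
ROLE_PERMISSIONS = {
    "admin": {"user:delete", "post:edit"},
    "member": {"post:create"},
}


def rbac_allows(user_roles, permission: str) -> bool:
    """True if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user_roles)


def abac_is_owner(user: dict, resource: dict) -> bool:
    """ABAC rule: the owner of a resource may edit it."""
    return resource["owner_id"] == user["id"]


def can_edit_post(user: dict, post: dict) -> bool:
    # Combine both: admins can edit anything, others only their own posts.
    return rbac_allows(user["roles"], "post:edit") or abac_is_owner(user, post)


admin = {"id": 1, "roles": ["admin"]}
alice = {"id": 2, "roles": ["member"]}
post = {"id": 10, "owner_id": 2}
```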

When to use: When different users have different permissions

🔄 How Everything Ties Together

Example: Complete System (Twitter-like)

                          [Users]
                             ↓
                          [DNS] ← "What's twitter.com's IP?"
                             ↓
                          [CDN] ← Serves images, JS, CSS
                             ↓
                    [Load Balancer] ← Distributes traffic
                             ↓
                     [API Gateway] ← Auth, rate limiting, routing
                             ↓
            ┌────────────────┼────────────────┐
            ↓                ↓                ↓
    [User Service]   [Post Service]   [Timeline Service]
            ↓                ↓                ↓
         [Redis           [Kafka]          [Redis
          Cache]      ← Events →        Timeline Cache]
            ↓                                 ↓
    [PostgreSQL        Message Queue    [Cassandra
     User DB]     ← Background jobs →   Post DB]
                             ↓
                      [Worker Service]
                             ↓
                    [S3 Object Storage]
                    ← Media files →

                    [Elasticsearch]
                    ← Search index ←

Request flow: “Post a tweet”

DNS → resolves twitter.com to IP
Client → sends POST request to that IP
Load Balancer → picks healthy web server
API Gateway:
- Verifies JWT token (authentication)
- Checks rate limit (100 tweets/hour)
- Routes to Post Service
Post Service:
- Validates tweet (length, content)
- Writes to Cassandra (post database)
- Publishes event to Kafka (“tweet-posted”)
- Returns success to client
Kafka consumers (async):
- Timeline Service: Updates followers’ timeline caches
- Search Service: Indexes tweet in Elasticsearch
- Analytics Service: Tracks metrics
- Notification Service: Notifies mentioned users
S3: If media attached, uploaded to object storage
CDN: Media cached at edge locations
Observability:
- Logs: Request logged with trace ID
- Metrics: QPS, latency tracked
- Tracing: Request path across services

Read flow: “View timeline”

Client → API Gateway → Timeline Service
Check Redis cache: HIT (90% of time)
Return cached timeline immediately
If MISS: Query Cassandra → Build timeline → Cache → Return

🎯 Component Selection Guide

By Scale

< 10K users (Startup):

Load balancer (1)
Web servers (2-3)
Database (1 primary + 1 replica)
Redis cache (1)

10K - 100K users:

Add: CDN, API Gateway
Scale: More web servers (5-10)
Add: Message queue (RabbitMQ)

100K - 1M users:

Add: Microservices (split by domain)
Scale: Database sharding
Add: Search engine (Elasticsearch)
Add: Object storage (S3)

1M - 10M users:

Add: Kafka for event streaming
Add: Service mesh (Istio)
Add: Multi-region setup
Add: Data warehouse (Snowflake)

10M+ users (FAANG scale):

Everything above
Add: Custom CDN
Add: Custom message protocols
Add: Advanced ML systems

By Use Case

E-commerce:

Must have: Database, Cache, API Gateway, Object Storage (product images)
Nice to have: Search engine (product search), Message queue (order processing)

Social Media:

Must have: Database, Cache, CDN (images), Message queue (notifications)
Nice to have: Kafka (activity tracking), Search (content search)

Video Streaming:

Must have: CDN (mandatory!), Object Storage (videos), Adaptive bitrate streaming
Nice to have: Analytics, Recommendation engine

Real-time Chat:

Must have: WebSocket servers, Message queue, Cache (presence)
Nice to have: Push notifications, Read receipts

🔗 Related Resources

key-patterns: Design patterns using these components
ch01-scale-from-zero-to-millions: When to add each component
estimation-cheatsheet: Calculate requirements for each component

Remember: Start simple, add complexity only when needed. Not every system needs every component!

Last Updated: 2026-04-08

Study Notes by Niladri & AI
