Distributed System Components - Complete Guide
A comprehensive reference for all major components in distributed systems, their roles, when to use them, and how they work together.
System Architecture Layers
┌─────────────────────────────────────────────────────────┐
│                      Client Layer                       │
│         (Web Browser, Mobile App, Desktop App)          │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                       Edge Layer                        │
│               (DNS, CDN, DDoS Protection)               │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                      Gateway Layer                      │
│            (API Gateway, Load Balancer, WAF)            │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                    Application Layer                    │
│          (Web Servers, App Servers, Services)           │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                   Communication Layer                   │
│           (Message Queue, Service Mesh, RPC)            │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                       Data Layer                        │
│    (Database, Cache, Search Engine, Object Storage)     │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                   Observability Layer                   │
│          (Logging, Metrics, Tracing, Alerting)          │
└─────────────────────────────────────────────────────────┘
Edge & Gateway Layer
1. DNS (Domain Name System)
What it is: Translates domain names (www.example.com) to IP addresses (e.g. 203.0.113.10)
Why you need it:
- Users type domain names, not IP addresses
- Can route to different servers based on location (GeoDNS)
- Enables failover (change IP if server fails)
Types:
- A Record: Domain → IPv4 address
- AAAA Record: Domain → IPv6 address
- CNAME: Alias to another domain
- MX Record: Mail server routing
GeoDNS (Geographic DNS):
- Routes users to nearest datacenter
- Example: US users → US servers, EU users → EU servers
- Lower latency for global users
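The GeoDNS idea can be sketched as a lookup from client region to the nearest datacenter. This is only an illustration: the region table, the `resolve` function, and the addresses (taken from reserved documentation ranges) are all assumptions, and real providers typically infer the region from the resolver or client subnet.

```python
# Hypothetical mapping from client region to the nearest datacenter's IP.
# The addresses come from reserved documentation ranges, not real servers.
DATACENTERS = {
    "us": "192.0.2.10",
    "eu": "198.51.100.10",
}
DEFAULT_REGION = "us"


def resolve(domain: str, client_region: str) -> str:
    """Return the IP a GeoDNS server might answer with for this client.

    The sketch serves a single domain, so `domain` is accepted but not used.
    """
    # Regions without a local datacenter fall back to the default one.
    return DATACENTERS.get(client_region, DATACENTERS[DEFAULT_REGION])
```

Failover fits the same shape: removing an unhealthy datacenter from the table makes its clients fall back to the default.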
Interview relevance: "How do users reach your system?" → DNS!
Popular providers: Route 53 (AWS), Cloud DNS (GCP), Azure DNS
2. CDN (Content Delivery Network)
What it is: Geographically distributed servers that cache static content close to users
Why you need it:
- Reduce latency (content closer to users)
- Reduce origin server load (a well-tuned CDN can serve the bulk of static requests, often cited as 80-90%)
- Improve availability (highly redundant)
- DDoS protection (distributed infrastructure)
What to cache:
- Images, videos, audio files
- JavaScript, CSS, fonts
- Static HTML pages
- Downloadable files (PDFs, binaries)
Push vs Pull:
- Push CDN: Upload content to CDN manually
  - Good for: Rarely changing content, full control
- Pull CDN: CDN fetches from origin on miss
  - Good for: Frequently changing content, automatic
Cache invalidation:
- Set TTL (Time-To-Live)
- Purge/invalidate API calls
- URL versioning (style.v2.css)
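URL versioning is often automated by putting a content hash in the file name, so a changed file gets a new URL and stale cached copies are never requested again. A minimal sketch, assuming a hypothetical `versioned_name` helper (build tools usually do this for you):

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    p = PurePosixPath(path)
    # "css/style.css" becomes "css/style.<hash>.css"
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))
```

Because the name depends only on the bytes, unchanged files keep their URL and stay cached, while the long TTL on each versioned URL never serves outdated content.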
When to use:
- Bandwidth > 1 GB/s
- Global users
- Static content > 30% of traffic
Popular CDNs: Cloudflare, AWS CloudFront, Akamai, Fastly, Cloudinary (images)
3. Load Balancer
What it is: Distributes incoming traffic across multiple servers
Why you need it:
- No single point of failure (redundancy)
- Scale horizontally (add more servers)
- Health checks (detect dead servers)
- Zero-downtime deployments
Types:
- Layer 4 (Transport): Routes based on IP/port (TCP/UDP)
  - Fast, simple
  - Can't see HTTP content
- Layer 7 (Application): Routes based on HTTP headers/URL/cookies
  - Slower, more intelligent
  - Can route /api → API servers, /static → static servers
Load Balancing Algorithms:
- Round Robin: Distribute requests sequentially
  - Simple, fair
  - Ignores server load
- Weighted Round Robin: More powerful servers get more traffic
  - Good for heterogeneous servers
- Least Connections: Send to server with fewest active connections
  - Good for long-lived connections (WebSockets)
- Least Response Time: Send to fastest server
  - Best performance
  - More complex
- IP Hash: Hash client IP to always hit same server
  - Sticky sessions (session affinity)
  - Uneven distribution
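Three of the algorithms above can be sketched in a few lines each. The class names and the `pick`/`release` interface are assumptions made for this illustration, not the API of any particular load balancer.

```python
import hashlib
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation, ignoring their load."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self, client_ip=None):
        return next(self._cycle)


class LeastConnections:
    """Track active connections and pick the least busy server."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self, client_ip=None):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the connection closes.
        self.active[server] -= 1


class IPHash:
    """Map each client IP to a fixed server (session affinity)."""

    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self, client_ip):
        digest = hashlib.sha256(client_ip.encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.servers)
        return self.servers[index]
```

Note that with plain modulo hashing, adding or removing a server remaps most clients; consistent hashing is the usual way to limit that.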
Health Checks:
- Periodic HTTP/TCP requests to servers
- If server fails health check → remove from pool
- When recovers → add back to pool
When to use:
- More than 1 server
- Need high availability
- Traffic > 1K QPS
Popular load balancers: NGINX, HAProxy, AWS ELB/ALB, GCP Load Balancer
4. API Gateway
What it is: Single entry point for all client requests to microservices
Why you need it:
- Authentication & Authorization: Verify JWT tokens, check permissions
- Rate Limiting: Prevent abuse (100 requests/min per user)
- Request Routing: Route /users → User Service, /posts → Post Service
- Request Aggregation: Combine multiple microservice calls into one
- Protocol Translation: REST ↔ gRPC, HTTP ↔ WebSocket
- Caching: Cache common responses
- Monitoring: Log all requests, track metrics
Problems it solves:
- Clients don't need to know about all microservices
- Centralized authentication (don't repeat in every service)
- Consistent error handling and logging
- Version management (/v1/users, /v2/users)
API Gateway vs Load Balancer:
| Feature | Load Balancer | API Gateway |
|---|---|---|
| Purpose | Distribute traffic | Route & manage APIs |
| Layer | L4/L7 | L7 only |
| Intelligence | Basic | High |
| Auth | No | Yes |
| Rate limiting | Basic | Advanced |
| Protocol translation | No | Yes |
When to use:
- Microservices architecture (strongly recommended once clients would otherwise call many services directly)
- Need centralized auth/rate limiting
- Multiple client types (web, mobile, IoT)
- API versioning
Features in interviews:
- "How do clients discover services?" → API Gateway!
- "How do you handle authentication?" → API Gateway!
- "How do you prevent abuse?" → API Gateway rate limiting!
Popular API Gateways: Kong, AWS API Gateway, Azure API Management, Apigee, Envoy
Example:
Client request: GET /api/v1/users/123/posts
API Gateway:
1. Verify JWT token (authentication)
2. Check rate limit (100 req/min)
3. Route to appropriate microservice
4. Aggregate User Service + Post Service responses
5. Return combined response to client
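Step 2 of the example (the 100 req/min check) could be implemented with a token bucket, one common rate-limiting technique. The sketch below is an assumption about how a gateway might do it, with hypothetical names (`TokenBucket`, `check_rate_limit`) and in-process state; a real gateway would usually keep the counters in a shared store such as Redis.

```python
import time


class TokenBucket:
    """Allow `rate` requests per `per` seconds, with bursts up to `rate`."""

    def __init__(self, rate: int, per: float, clock=time.monotonic):
        self.capacity = rate
        self.refill_per_sec = rate / per
        self.clock = clock            # injectable so tests can control time
        self.tokens = float(rate)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        refill = (now - self.updated) * self.refill_per_sec
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per user, created on first request (100 requests per minute).
buckets = {}


def check_rate_limit(user_id: str) -> bool:
    if user_id not in buckets:
        buckets[user_id] = TokenBucket(rate=100, per=60)
    return buckets[user_id].allow()
```

When `check_rate_limit` returns `False`, the gateway would typically answer with HTTP 429 instead of routing the request.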
5. WAF (Web Application Firewall)
What it is: Filters and monitors HTTP traffic to protect web applications
Why you need it:
- SQL Injection protection
- XSS (Cross-Site Scripting) protection
- DDoS mitigation
- Bot detection
- Geo-blocking (block certain countries)
When to use:
- Public-facing web applications
- Handling sensitive data
- Compliance requirements (PCI DSS, HIPAA)
Popular WAFs: Cloudflare WAF, AWS WAF, Imperva, Akamai
Application Layer
6. Web Servers
What it is: Handles HTTP requests and returns responses
Why you need it:
- Serve web pages (HTML, CSS, JavaScript)
- Execute application logic
- Call databases and other services
- Generate dynamic content
Stateless vs Stateful:
- Stateless: No session data stored in server memory
  - Can scale horizontally easily
  - Any server can handle any request
  - Store session in Redis or database
- Stateful: Session data in server memory
  - Sticky sessions required (same user → same server)
  - Harder to scale
  - Avoid if possible!
When to use: Every web application needs web servers!
Popular web servers: NGINX, Apache, Node.js, Tomcat, IIS
7. Application Servers
What it is: Runs business logic, separate from web presentation layer
Why you need it:
- Separate concerns (presentation vs logic)
- Can scale independently
- Reusable business logic for multiple clients (web, mobile, API)
When to use:
- Complex business logic
- Enterprise applications
- Need to serve multiple client types
Popular app servers: Spring Boot, Express.js, Django, Flask, FastAPI
8. Microservices
What it is: Architecture where application is split into small, independent services
Why you need it (at scale):
- Independent deployment: Update User Service without touching Post Service
- Independent scaling: Scale payment service 10x without scaling others
- Technology flexibility: Use Python for ML service, Go for performance
- Team ownership: Each team owns specific services
- Fault isolation: If one service fails, others continue
Trade-offs:
| Pros | Cons |
|---|---|
| Independent scaling | Complex deployment |
| Technology flexibility | Network latency between services |
| Fault isolation | Distributed tracing needed |
| Team autonomy | Data consistency challenges |
| Easier to understand individual services | Harder to understand system as whole |
When NOT to use:
- Small team (< 10 people)
- Simple application
- MVP/prototype
- No clear service boundaries
When to use:
- Large team (> 50 people)
- Need independent scaling
- Long-term project (years)
- Clear service boundaries (User, Post, Payment, etc.)
Communication between microservices:
- Synchronous: REST, gRPC
- Asynchronous: Message queue, Pub/Sub
Popular frameworks: Spring Cloud, Istio (service mesh), Kubernetes
Communication Layer
9. Message Queue
What it is: Stores messages between producers and consumers
Why you need it:
- Decouple components (producer doesn't wait for consumer)
- Async processing (email, notifications, video encoding)
- Buffer traffic spikes (queue absorbs bursts)
- Retry failed operations
- Guaranteed delivery (at-least-once)
Use cases:
- Send email after user signup
- Process video uploads
- Generate reports
- Resize images
- Send push notifications
Patterns:
- Point-to-Point: One message → one consumer
- Work Queue: Multiple workers compete for messages (load distribution)
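The work-queue pattern can be illustrated in-process with Python's standard `queue` module, where several worker threads compete for messages and each message goes to exactly one of them. This is only a sketch: `run_work_queue` is a hypothetical helper, and a real broker such as RabbitMQ or SQS adds persistence, acknowledgements, and redelivery on top of the same idea.

```python
import queue
import threading


def run_work_queue(jobs, num_workers=3):
    """Process each job once using competing worker threads."""
    tasks = queue.Queue()
    processed = []                      # (worker_id, job) pairs
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            job = tasks.get()           # blocks until a message is available
            if job is None:             # sentinel: shut this worker down
                return
            with lock:
                processed.append((worker_id, job))

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for job in jobs:                    # producer enqueues and moves on
        tasks.put(job)
    for _ in threads:                   # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return processed
```

Which worker handles which job is not deterministic; only the guarantee that every job is handled by one worker matters here.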
When to use:
- Task takes > 1 second (offload from web server)
- Can be processed asynchronously
- Need retry logic
- Traffic is spiky
Popular message queues: RabbitMQ, AWS SQS, Azure Service Bus
10. Pub/Sub (Publish-Subscribe)
What it is: Broadcast messages to multiple subscribers
Why you need it:
- Event broadcasting: One event β many listeners
- Decouple producers from consumers
- Scalability: Add subscribers without changing producer
Use cases:
- User signup → Send welcome email, Analytics service, CRM update
- Order placed → Inventory service, Shipping service, Analytics
- Real-time updates → All connected WebSocket clients
Pub/Sub vs Message Queue:
| Feature | Message Queue | Pub/Sub |
|---|---|---|
| Receivers | One consumer | Multiple subscribers |
| Message delivery | To one consumer | To every subscriber (one copy each) |
| Use case | Work distribution | Event broadcasting |
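The difference from a work queue is that every subscriber receives its own copy of each event. A minimal in-memory sketch, assuming a hypothetical `PubSub` class (real systems deliver asynchronously and over the network):

```python
from collections import defaultdict


class PubSub:
    """In-memory broadcast: each event goes to every subscriber of its topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)            # number of subscribers notified
```

Adding a new subscriber is one `subscribe` call; the publisher's code does not change, which is the decoupling the section describes.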
When to use:
- Multiple services need same event
- Event-driven architecture
- Real-time notifications
Popular pub/sub: Apache Kafka, AWS SNS, Google Cloud Pub/Sub, Redis Pub/Sub
11. Apache Kafka
What it is: Distributed event streaming platform (hybrid message queue + pub/sub)
Why you need it:
- High throughput: Millions of messages per second
- Durability: Messages persisted to disk
- Replay: Can re-read old messages
- Real-time streaming: Process events as they arrive
- Scalability: Horizontally scalable
Use cases:
- Activity tracking (clicks, page views)
- Log aggregation
- Stream processing (real-time analytics)
- Event sourcing (store all state changes)
- Metrics collection
Key concepts:
- Topic: Category of messages (e.g., "user-signups")
- Partition: Subdivide topic for parallelism
- Consumer Group: Multiple consumers share load
- Offset: Position in the message stream
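How topics, partitions, and offsets relate can be shown with a toy in-memory model. This is not the Kafka client API: the `Topic` class is hypothetical, and CRC32 stands in for the partitioner's hash function purely for illustration.

```python
import zlib


class Topic:
    """Toy model of a partitioned, append-only log."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # Stable hash: the same key always maps to the same partition,
        # which preserves per-key ordering.
        return zlib.crc32(key.encode()) % len(self.partitions)

    def produce(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1     # (partition, offset)

    def read(self, partition, offset):
        """Return all messages at or after `offset` (messages are not removed)."""
        return self.partitions[partition][offset:]
```

Because reading does not delete anything, a consumer can replay history by starting again from offset 0, and a consumer group resumes after a restart by remembering the last offset it committed per partition.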
When to use:
- Need high throughput (> 100K messages/sec)
- Need to replay messages
- Real-time analytics
- Event-driven architecture at scale
Kafka vs RabbitMQ:
| Feature | Kafka | RabbitMQ |
|---|---|---|
| Throughput | Very high | Medium |
| Message persistence | Always | Optional |
| Message replay | Yes | No (classic queues) |
| Complexity | High | Medium |
| Use case | Event streaming | Task queues |
12. Service Mesh
What it is: Infrastructure layer for managing service-to-service communication in microservices
Why you need it (at scale):
- Service discovery: Services find each other automatically
- Load balancing: Between service instances
- Retry logic: Automatic retries on failure
- Circuit breaking: Stop calling failing services
- Observability: Trace requests across services
- mTLS: Secure service-to-service communication
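Circuit breaking is the least self-explanatory item in the list above, so here is a minimal sketch of the idea. The `CircuitBreaker` class, its thresholds, and the injectable clock are assumptions for illustration; in a service mesh this logic lives in the sidecar proxy rather than in application code.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a failing service."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after    # seconds to wait before retrying
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast protects both sides: callers stop waiting on timeouts, and the struggling service gets room to recover.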
Components:
- Data plane: Sidecar proxies (Envoy) next to each service
- Control plane: Manages configuration (Istio, Linkerd)
When to use:
- Many microservices (> 20)
- Need advanced traffic management
- Security critical (mTLS for all services)
- Complex observability requirements
When NOT to use:
- Few microservices (< 10)
- Adds significant complexity
- Monolith architecture
Popular service meshes: Istio, Linkerd, Consul Connect
Data Layer
13. Databases
See key-patterns > 4. SQL vs NoSQL for detailed comparison.
SQL Databases (PostgreSQL, MySQL):
- Structured data with relationships
- ACID transactions
- Complex queries and joins
- Vertical scaling (mostly)
NoSQL Databases:
- Key-Value (Redis, DynamoDB): Simple lookups
- Document (MongoDB, Couchbase): Flexible schema, JSON documents
- Column-Family (Cassandra, HBase): Wide tables, high write throughput
- Graph (Neo4j): Relationships and connections
When to use SQL: Default choice, structured data, need transactions
When to use NoSQL: Flexible schema, horizontal scaling, simple queries
14. Cache
What it is: In-memory data store for fast reads
Why you need it:
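- Speed: Memory access is orders of magnitude faster than disk (roughly 100x or more versus SSD, far more versus spinning disks)
- Reduce DB load: An 80-90% cache hit rate means 5-10x less read traffic reaching the database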
- Scalability: Handle more reads without scaling database
What to cache:
- User sessions (ephemeral data)
- Database query results (hot data)
- Computed results (expensive calculations)
- API responses
Cache strategies: See key-patterns > 1. Caching Strategies
Cache invalidation (hardest problem!):
- TTL: Expire after X seconds
- Write invalidation: Delete on update
- Write-through: Update cache on write
Cache eviction policies:
- LRU (Least Recently Used): Evict the entry that was accessed longest ago
- LFU (Least Frequently Used): Evict the entry with the fewest accesses
- FIFO: First in, first out
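LRU is the policy most often asked about, and it fits in a few lines on top of `collections.OrderedDict`. The `LRUCache` class below is a sketch for illustration, not how Redis or Memcached implement eviction internally.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()       # oldest access first, newest last

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)      # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict least recently used
```

Both operations are O(1), which is why this structure (a hash map plus an ordering) is the standard answer.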
When to use:
- Database is bottleneck
- Read-heavy workload (e.g. 10:1 read-to-write ratio)
- Hot data accessed frequently
Popular caches: Redis, Memcached
Redis vs Memcached:
| Feature | Redis | Memcached |
|---|---|---|
| Data structures | Rich (lists, sets, sorted sets) | Simple (key-value) |
| Persistence | Yes | No |
| Replication | Yes | No |
| Use case | Complex caching, session store | Simple cache |
15. Search Engine
What it is: Specialized database optimized for text search and analytics
Why you need it:
- Full-text search: Search within text fields
- Fuzzy matching: Handle typos (autocomplete)
- Relevance ranking: Most relevant results first
- Faceted search: Filter by category, price, date
- Analytics: Aggregate and analyze large datasets
Use cases:
- Product search (e-commerce)
- Log analysis
- Autocomplete/typeahead
- Content search (documents, emails)
- Real-time analytics
When to use:
- Need full-text search
- SQL LIKE queries too slow
- Complex search requirements (filters, ranking)
When NOT to use:
- Simple exact match queries (use database)
- Primary data store (use as secondary index)
Popular search engines: Elasticsearch, OpenSearch, Solr, Algolia
Architecture:
Database (source of truth)
↓ (sync)
Search Index (optimized for search)
↓
Users query search index, not database
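What "optimized for search" means in practice is an inverted index: a map from each term to the documents containing it, built from the source-of-truth records. The sketch below uses a hypothetical `SearchIndex` class with naive tokenization and no ranking; engines like Elasticsearch add analyzers, scoring, and sharding on top.

```python
import re
from collections import defaultdict


class SearchIndex:
    def __init__(self):
        self._postings = defaultdict(set)   # term -> ids of documents containing it

    @staticmethod
    def _tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def index(self, doc_id, text):
        """Called by the sync step whenever a database row changes."""
        for term in self._tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        terms = self._tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

A lookup touches only the posting sets for the query terms, instead of scanning every row the way a SQL `LIKE '%term%'` query does.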
16. Object Storage
What it is: Storage for large unstructured objects (files, images, videos)
Why you need it:
- Scalable: Store petabytes
- Durable: Replicated across multiple datacenters
- Cost-effective: Cheaper than database storage
- HTTP access: Access files via URL
Use cases:
- User uploads (images, videos, documents)
- Backups and archives
- Static website hosting
- Data lake (big data analytics)
When to use:
- Storing files > 1 MB
- Don't need database features (queries, transactions)
- Need high durability (99.999999999% - 11 nines!)
When NOT to use:
- Structured data (use database)
- Need fast queries (use database + search)
- Files < 100 KB (database blob is fine)
Popular object stores: AWS S3, Google Cloud Storage, Azure Blob Storage
Features:
- Versioning: Keep old versions
- Lifecycle policies: Auto-delete after X days
- Access control: Fine-grained permissions
- Event notifications: Trigger on upload
17. Data Warehouse
What it is: Centralized repository for analytical queries (OLAP)
Why you need it:
- Analytics: Complex queries on historical data
- Reporting: Business intelligence, dashboards
- Don't impact production DB: Run expensive queries separately
OLTP vs OLAP:
| Aspect | OLTP (Database) | OLAP (Data Warehouse) |
|---|---|---|
| Purpose | Transactional | Analytical |
| Queries | Simple, fast | Complex, slow |
| Data | Current | Historical |
| Volume | GB to TB | TB to PB |
| Users | Many | Few (analysts) |
When to use:
- Need complex analytics
- Historical data analysis
- Business intelligence
- Don't want to impact production database
Popular data warehouses: Snowflake, BigQuery, Redshift, ClickHouse
Observability Layer
18. Logging
What it is: Recording events that happen in the system
Why you need it:
- Debugging: Trace through execution
- Audit: Who did what when
- Compliance: Regulatory requirements
- Incident response: Understand what went wrong
Log levels:
- TRACE: Very detailed (development only)
- DEBUG: Debug information
- INFO: General information
- WARN: Warning, not critical
- ERROR: Error occurred
- FATAL: System crash
Centralized logging:
- Aggregate logs from all servers
- Searchable (grep across 1000 servers)
- Correlation (trace request across services)
When to use: Always! Every service needs logging.
Popular logging: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Datadog, CloudWatch
19. Metrics
What it is: Numerical measurements of system behavior over time
Why you need it:
- Monitor health: Is system working?
- Detect issues: Spike in error rate?
- Capacity planning: When to add servers?
- Alerting: Page on-call when down
Key metrics (The Four Golden Signals):
- Latency: How long requests take (p50, p95, p99)
- Traffic: Requests per second (QPS)
- Errors: Error rate (5xx errors)
- Saturation: Resource utilization (CPU, memory, disk)
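Latency percentiles and error rate are simple to compute from raw samples. The sketch below uses the nearest-rank definition of a percentile, which is one of several conventions; the function names are illustrative, and metrics systems such as Prometheus typically estimate percentiles from histograms instead of storing every sample.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)
```

Reporting p95 and p99 alongside p50 matters because averages hide the slow tail that users actually notice.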
Additional metrics:
- Cache hit rate
- Database connection pool utilization
- Queue depth
- Active users
When to use: Always! Monitor everything.
Popular metrics: Prometheus + Grafana, Datadog, New Relic, CloudWatch
20. Distributed Tracing
What it is: Tracks requests as they flow through microservices
Why you need it:
- Find bottlenecks: Which service is slow?
- Debug failures: Where did request fail?
- Understand dependencies: Service call graph
How it works:
- Generate unique trace ID
- Pass through all services (in HTTP header)
- Each service logs with trace ID
- Visualize the complete request path
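The propagation steps above can be sketched as follows. The header name `X-Trace-Id` and all function names are assumptions for this example; the W3C Trace Context standard defines a `traceparent` header for the same purpose, and tracing libraries normally handle the propagation for you.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"   # hypothetical header name for this sketch


def ensure_trace_id(headers):
    """Reuse the incoming trace ID, or start a new trace at the edge."""
    headers = dict(headers)
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return headers


def log(service, headers, message, sink):
    # Every log line carries the trace ID so lines can be correlated later.
    sink.append(f"trace={headers[TRACE_HEADER]} service={service} {message}")


def call_user_service(headers, sink):
    headers = ensure_trace_id(headers)
    log("user-service", headers, "loaded user", sink)


def handle_request(incoming_headers, sink):
    headers = ensure_trace_id(incoming_headers)     # API gateway
    log("api-gateway", headers, "received request", sink)
    call_user_service(headers, sink)                # same headers forwarded downstream
    return headers[TRACE_HEADER]
```

Searching the centralized logs for one trace ID then yields the complete path of that request across services.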
When to use:
- Microservices architecture
- Requests span multiple services
- Need to debug slow requests
Popular tracing: Jaeger, Zipkin, AWS X-Ray, Datadog APM
Example trace (database queries are nested inside the service that issues them):
Request ID: abc123
1. API Gateway (5ms)
2. User Service (10ms)
   3. Database query (8ms)
4. Post Service (50ms) ← Bottleneck!
   5. Database query (45ms) ← Slow query
6. Response (total 65ms)
Security Components
21. Authentication Service
What it is: Verifies user identity (who are you?)
Why you need it:
- Verify users are who they claim to be
- Issue authentication tokens
- SSO (Single Sign-On)
Methods:
- Username + Password
- OAuth 2.0 / OpenID Connect (Google, Facebook login)
- SAML (enterprise SSO)
- Multi-factor authentication (MFA)
JWT (JSON Web Token):
- Self-contained token with user info
- Stateless (no server-side session)
- Signed to prevent tampering
When to use: Every application with users!
Popular auth services: Auth0, Okta, Firebase Auth, AWS Cognito
22. Authorization Service
What it is: Determines what user can do (what are you allowed to do?)
Why you need it:
- Permission management
- Role-based access control (RBAC)
- Attribute-based access control (ABAC)
Patterns:
- RBAC: User → Role → Permissions
  - Example: Admin role can delete users
- ABAC: Based on attributes
  - Example: Owner of resource can edit
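Both patterns reduce to small checks. In this sketch the role table, permission strings, and dictionary shapes for `user` and `post` are all hypothetical; real systems usually load them from a policy store.

```python
# RBAC: permissions hang off roles, and users are assigned roles.
ROLE_PERMISSIONS = {
    "admin": {"user:delete", "post:edit", "post:delete"},
    "member": {"post:create"},
}


def rbac_allows(user_roles, permission):
    """True if any of the user's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set()) for role in user_roles
    )


def can_edit_post(user, post):
    """ABAC rule (owner may edit) combined with an RBAC fallback."""
    is_owner = post["owner_id"] == user["id"]       # attribute comparison
    return is_owner or rbac_allows(user["roles"], "post:edit")
```

RBAC answers "what can this kind of user do?", while ABAC answers "can this user do it to this particular resource?"; most applications need some of each.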
When to use: When different users have different permissions
How Everything Ties Together
Example: Complete System (Twitter-like)
                     [Users]
                        ↓
                      [DNS] ← "What's twitter.com's IP?"
                        ↓
                      [CDN] ← Serves images, JS, CSS
                        ↓
                 [Load Balancer] ← Distributes traffic
                        ↓
                  [API Gateway] ← Auth, rate limiting, routing
                        ↓
       ┌────────────────┼────────────────┐
       ↓                ↓                ↓
 [User Service]   [Post Service]  [Timeline Service]
       ↓                ↓                ↓
    [Redis           [Kafka]          [Redis
     Cache]        ← Events →     Timeline Cache]
       ↓                                 ↓
  [PostgreSQL     Message Queue      [Cassandra
    User DB]   ← Background jobs →    Post DB]
                        ↓
                [Worker Service]
                        ↓
              [S3 Object Storage]
                ← Media files →
                 [Elasticsearch]
                ← Search index →
Request flow: "Post a tweet"
- Client → sends POST request
- DNS → resolves twitter.com to IP
- Load Balancer → picks healthy web server
- API Gateway:
  - Verifies JWT token (authentication)
  - Checks rate limit (100 tweets/hour)
  - Routes to Post Service
- Post Service:
  - Validates tweet (length, content)
  - Writes to Cassandra (post database)
  - Publishes event to Kafka ("tweet-posted")
  - Returns success to client
- Kafka consumers (async):
  - Timeline Service: Updates followers' timeline caches
  - Search Service: Indexes tweet in Elasticsearch
  - Analytics Service: Tracks metrics
  - Notification Service: Notifies mentioned users
- S3: If media attached, uploaded to object storage
- CDN: Media cached at edge locations
- Observability:
  - Logs: Request logged with trace ID
  - Metrics: QPS, latency tracked
  - Tracing: Request path across services
Read flow: "View timeline"
- Client → API Gateway → Timeline Service
- Check Redis cache: HIT (90% of time)
- Return cached timeline immediately
- If MISS: Query Cassandra → Build timeline → Cache → Return
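The read flow above is the cache-aside pattern. In the sketch below a dictionary stands in for Redis and a callback stands in for the Cassandra query; `TimelineReader`, its TTL, and the hit/miss counters are assumptions added for illustration.

```python
import time


class TimelineReader:
    def __init__(self, load_from_db, ttl_seconds=60, clock=time.monotonic):
        self.load_from_db = load_from_db   # stands in for the Cassandra query
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}                    # user_id -> (expires_at, timeline); stands in for Redis
        self.hits = 0
        self.misses = 0

    def get_timeline(self, user_id):
        entry = self.cache.get(user_id)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1                 # HIT: return the cached timeline immediately
            return entry[1]
        self.misses += 1                   # MISS: query, build, cache, return
        timeline = self.load_from_db(user_id)
        self.cache[user_id] = (self.clock() + self.ttl, timeline)
        return timeline
```

Tracking `hits / (hits + misses)` gives the cache hit rate, which is the number to watch when tuning the TTL.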
Component Selection Guide
By Scale
< 10K users (Startup):
- Load balancer (1)
- Web servers (2-3)
- Database (1 primary + 1 replica)
- Redis cache (1)
10K - 100K users:
- Add: CDN, API Gateway
- Scale: More web servers (5-10)
- Add: Message queue (RabbitMQ)
100K - 1M users:
- Add: Microservices (split by domain)
- Scale: Database sharding
- Add: Search engine (Elasticsearch)
- Add: Object storage (S3)
1M - 10M users:
- Add: Kafka for event streaming
- Add: Service mesh (Istio)
- Add: Multi-region setup
- Add: Data warehouse (Snowflake)
10M+ users (FAANG scale):
- Everything above
- Add: Custom CDN
- Add: Custom message protocols
- Add: Advanced ML systems
By Use Case
E-commerce:
- Must have: Database, Cache, API Gateway, Object Storage (product images)
- Nice to have: Search engine (product search), Message queue (order processing)
Social Media:
- Must have: Database, Cache, CDN (images), Message queue (notifications)
- Nice to have: Kafka (activity tracking), Search (content search)
Video Streaming:
- Must have: CDN (mandatory!), Object Storage (videos), Adaptive bitrate
- Nice to have: Analytics, Recommendation engine
Real-time Chat:
- Must have: WebSocket servers, Message queue, Cache (presence)
- Nice to have: Push notifications, Read receipts
Related Resources
- key-patterns: Design patterns using these components
- ch01-scale-from-zero-to-millions: When to add each component
- estimation-cheatsheet: Calculate requirements for each component
Remember: Start simple, add complexity only when needed. Not every system needs every component!
Last Updated: 2026-04-08