Distributed System Components - Complete Guide

A comprehensive reference for all major components in distributed systems, their roles, when to use them, and how they work together.

🏗️ System Architecture Layers

┌─────────────────────────────────────────────────────────┐
│                    Client Layer                         │
│         (Web Browser, Mobile App, Desktop App)          │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                    Edge Layer                           │
│              (DNS, CDN, DDoS Protection)                │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                  Gateway Layer                          │
│         (API Gateway, Load Balancer, WAF)               │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                 Application Layer                       │
│         (Web Servers, App Servers, Services)            │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                 Communication Layer                     │
│         (Message Queue, Service Mesh, RPC)              │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                    Data Layer                           │
│    (Database, Cache, Search Engine, Object Storage)     │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                 Observability Layer                     │
│         (Logging, Metrics, Tracing, Alerting)           │
└─────────────────────────────────────────────────────────┘

📡 Edge & Gateway Layer

1. DNS (Domain Name System)

What it is: Translates domain names (www.example.com) into IP addresses (e.g., 203.0.113.10)

Why you need it:

  • Users type domain names, not IP addresses
  • Can route to different servers based on location (GeoDNS)
  • Enables failover (change IP if server fails)

Types:

  • A Record: Domain → IPv4 address
  • AAAA Record: Domain → IPv6 address
  • CNAME: Alias to another domain
  • MX Record: Mail server routing

GeoDNS (Geographic DNS):

  • Routes users to nearest datacenter
  • Example: US users → US servers, EU users → EU servers
  • Lower latency for global users
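
A minimal sketch of the GeoDNS idea, assuming the server already knows the client's region. The zone table, region codes, and addresses (taken from the reserved documentation ranges) are all hypothetical; a real GeoDNS service derives the region from the resolver's IP.

```python
# Hypothetical zone data: domain -> region -> datacenter address.
# The addresses come from reserved documentation ranges, not real servers.
ZONE = {
    "www.example.com": {
        "us": "203.0.113.10",
        "eu": "198.51.100.20",
    },
}
DEFAULT_REGION = "us"


def resolve(domain: str, client_region: str) -> str:
    """Return the A-record answer a GeoDNS server might give this client."""
    regions = ZONE[domain]
    # Regions without their own datacenter fall back to the default one.
    return regions.get(client_region, regions[DEFAULT_REGION])


print(resolve("www.example.com", "eu"))    # EU users get the EU address
print(resolve("www.example.com", "apac"))  # unknown region uses the default
```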

Interview relevance: “How do users reach your system?” → DNS!

Popular providers: Route 53 (AWS), Cloud DNS (GCP), Azure DNS


2. CDN (Content Delivery Network)

What it is: Geographically distributed servers that cache static content close to users

Why you need it:

  • Reduce latency (content closer to users)
  • Reduce origin server load (a well-tuned CDN can serve most static requests; 80-90% is a commonly cited range)
  • Improve availability (highly redundant)
  • DDoS protection (distributed infrastructure)

What to cache:

  • Images, videos, audio files
  • JavaScript, CSS, fonts
  • Static HTML pages
  • Downloadable files (PDFs, binaries)

Push vs Pull:

  • Push CDN: Upload content to CDN manually
    • Good for: Rarely changing content, full control
  • Pull CDN: CDN fetches from origin on miss
    • Good for: Frequently changing content, automatic

Cache invalidation:

  • Set TTL (Time-To-Live)
  • Purge/invalidate API calls
  • URL versioning (style.v2.css)
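
URL versioning is often automated by embedding a content hash in the file name, so a changed file gets a new URL and stale CDN copies are simply never requested again. A sketch under that assumption; the function name `versioned_name` and the 8-character hash length are illustrative choices, not a standard.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(filename: str, content: bytes) -> str:
    """Embed a short content hash in the name, e.g. style.<hash>.css."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    path = PurePosixPath(filename)
    # Keep any directory part and the extension; only the stem changes.
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


old = versioned_name("css/style.css", b"body { color: black; }")
new = versioned_name("css/style.css", b"body { color: navy; }")
print(old, new)  # different content produces different URLs
```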

When to use:

  • Bandwidth > 1 GB/s
  • Global users
  • Static content > 30% of traffic

Popular CDNs: Cloudflare, AWS CloudFront, Akamai, Fastly, Cloudinary (images)


3. Load Balancer

What it is: Distributes incoming traffic across multiple servers

Why you need it:

  • Removes any single backend server as a point of failure (the load balancer itself should also run redundantly)
  • Scale horizontally (add more servers)
  • Health checks (detect dead servers)
  • Zero-downtime deployments

Types:

  • Layer 4 (Transport): Routes based on IP/port (TCP/UDP)
    • Fast, simple
    • Can’t see HTTP content
  • Layer 7 (Application): Routes based on HTTP headers/URL/cookies
    • Slower, more intelligent
    • Can route /api → API servers, /static → static servers

Load Balancing Algorithms:

  1. Round Robin: Distribute requests sequentially

    • Simple, fair
    • Ignores server load
  2. Weighted Round Robin: More powerful servers get more traffic

    • Good for heterogeneous servers
  3. Least Connections: Send to server with fewest active connections

    • Good for long-lived connections (WebSockets)
  4. Least Response Time: Send to fastest server

    • Best performance
    • More complex
  5. IP Hash: Hash client IP to always hit same server

    • Sticky sessions (session affinity)
    • Uneven distribution
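
The selection rules above can be sketched in a few lines each. This is an illustration with hypothetical class names, not how NGINX or HAProxy implement them; Least Response Time is omitted because it needs live latency measurements.

```python
import hashlib
import itertools


class RoundRobin:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class WeightedRoundRobin:
    def __init__(self, weights):
        # Naive expansion: a server with weight 3 appears 3 times per cycle.
        expanded = [s for s, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1  # caller must release() when done
        return server

    def release(self, server):
        self.active[server] -= 1


def ip_hash(client_ip, servers):
    # A stable hash (not Python's per-process randomized hash()) keeps the
    # client-to-server mapping identical across restarts.
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

Note that plain modulo hashing remaps most clients whenever the server list changes; consistent hashing is the usual remedy.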

Health Checks:

  • Periodic HTTP/TCP requests to servers
  • If a server fails its health check → remove it from the pool
  • When it recovers → add it back to the pool
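
A sketch of that remove/re-add cycle. The `probe` callable is an assumption standing in for a real HTTP or TCP check, and production balancers usually require several consecutive failures before evicting a server.

```python
from typing import Callable, Iterable, List


class ServerPool:
    def __init__(self, servers: Iterable[str], probe: Callable[[str], bool]):
        self.servers = list(servers)
        self.probe = probe            # returns True when the server is healthy
        self.healthy = set(self.servers)

    def run_health_checks(self) -> None:
        """Call periodically; updates which servers may receive traffic."""
        for server in self.servers:
            if self.probe(server):
                self.healthy.add(server)      # recovered: back in the pool
            else:
                self.healthy.discard(server)  # failed: out of the pool

    def available(self) -> List[str]:
        return [s for s in self.servers if s in self.healthy]
```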

When to use:

  • More than 1 server
  • Need high availability
  • Traffic > 1K QPS

Popular load balancers: NGINX, HAProxy, AWS ELB/ALB, GCP Load Balancer


4. API Gateway

What it is: Single entry point for all client requests to microservices

Why you need it:

  • Authentication & Authorization: Verify JWT tokens, check permissions
  • Rate Limiting: Prevent abuse (100 requests/min per user)
  • Request Routing: Route /users → User Service, /posts → Post Service
  • Request Aggregation: Combine multiple microservice calls into one
  • Protocol Translation: REST → gRPC, HTTP → WebSocket
  • Caching: Cache common responses
  • Monitoring: Log all requests, track metrics

Problems it solves:

  • Clients don’t need to know about all microservices
  • Centralized authentication (don’t repeat in every service)
  • Consistent error handling and logging
  • Version management (/v1/users, /v2/users)

API Gateway vs Load Balancer:

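Feature              | Load Balancer      | API Gateway
---------------------|--------------------|---------------------
Purpose              | Distribute traffic | Route & manage APIs
Layer                | L4/L7              | L7 only
Intelligence         | Basic              | High
Auth                 | No                 | Yes
Rate limiting        | Basic              | Advanced
Protocol translation | No                 | Yes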

When to use:

  • Microservices architecture (close to mandatory once clients would otherwise call many services directly)
  • Need centralized auth/rate limiting
  • Multiple client types (web, mobile, IoT)
  • API versioning

Features in interviews:

  • “How do clients discover services?” → API Gateway!
  • “How do you handle authentication?” → API Gateway!
  • “How do you prevent abuse?” → API Gateway rate limiting!

Popular API Gateways: Kong, AWS API Gateway, Azure API Management, Apigee, Envoy

Example:

Client request: GET /api/v1/users/123/posts

API Gateway:
1. Verify JWT token (authentication)
2. Check rate limit (100 req/min)
3. Route to appropriate microservice
4. Aggregate User Service + Post Service responses
5. Return combined response to client
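
Steps 1 to 3 of the example can be sketched as a small pipeline. Everything here is a simplification under stated assumptions: `VALID_TOKENS` stands in for real JWT verification, the route table is hypothetical, the limiter uses a fixed window (one of several possible algorithms), and response aggregation is left out.

```python
import time

ROUTES = {"/users": "user-service", "/posts": "post-service"}  # hypothetical
VALID_TOKENS = {"token-alice": "alice"}  # stand-in for JWT verification


class FixedWindowLimiter:
    def __init__(self, limit=100, window_seconds=60, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counters = {}  # user -> (window index, requests seen)

    def allow(self, user):
        window_index = int(self.clock() // self.window)
        index, count = self.counters.get(user, (window_index, 0))
        if index != window_index:
            count = 0  # a new window started, so the counter resets
        if count >= self.limit:
            return False
        self.counters[user] = (window_index, count + 1)
        return True


def handle(path, token, limiter):
    user = VALID_TOKENS.get(token)
    if user is None:
        return 401, "invalid token"            # step 1: authentication
    if not limiter.allow(user):
        return 429, "rate limit exceeded"      # step 2: rate limiting
    for prefix, service in ROUTES.items():
        if path.startswith(prefix):
            return 200, f"forwarded to {service}"  # step 3: routing
    return 404, "no route"
```

A fixed window allows short bursts at window boundaries; token-bucket or sliding-window limiters smooth that out at the cost of more state.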

5. WAF (Web Application Firewall)

What it is: Filters and monitors HTTP traffic to protect web applications

Why you need it:

  • SQL Injection protection
  • XSS (Cross-Site Scripting) protection
  • DDoS mitigation
  • Bot detection
  • Geo-blocking (block certain countries)

When to use:

  • Public-facing web applications
  • Handling sensitive data
  • Compliance requirements (PCI DSS, HIPAA)

Popular WAFs: Cloudflare WAF, AWS WAF, Imperva, Akamai


🖥️ Application Layer

6. Web Servers

What it is: Handles HTTP requests and returns responses

Why you need it:

  • Serve web pages (HTML, CSS, JavaScript)
  • Execute application logic
  • Call databases and other services
  • Generate dynamic content

Stateless vs Stateful:

  • Stateless: No session data stored in server memory
    • Can scale horizontally easily
    • Any server can handle any request
    • Store session in Redis or database
  • Stateful: Session data in server memory
    • Sticky sessions required (same user β†’ same server)
    • Harder to scale
    • Avoid if possible!

When to use: Every web application needs web servers!

Popular web servers: NGINX, Apache, Node.js, Tomcat, IIS


7. Application Servers

What it is: Runs business logic, separate from web presentation layer

Why you need it:

  • Separate concerns (presentation vs logic)
  • Can scale independently
  • Reusable business logic for multiple clients (web, mobile, API)

When to use:

  • Complex business logic
  • Enterprise applications
  • Need to serve multiple client types

Popular frameworks for building app servers: Spring Boot, Express.js, Django, Flask, FastAPI


8. Microservices

What it is: Architecture where application is split into small, independent services

Why you need it (at scale):

  • Independent deployment: Update User Service without touching Post Service
  • Independent scaling: Scale payment service 10x without scaling others
  • Technology flexibility: Use Python for ML service, Go for performance
  • Team ownership: Each team owns specific services
  • Fault isolation: If one service fails, others continue

Trade-offs:

Pros                                     | Cons
-----------------------------------------|--------------------------------------
Independent scaling                      | Complex deployment
Technology flexibility                   | Network latency between services
Fault isolation                          | Distributed tracing needed
Team autonomy                            | Data consistency challenges
Easier to understand individual services | Harder to understand system as whole

When NOT to use:

  • Small team (< 10 people)
  • Simple application
  • MVP/prototype
  • No clear service boundaries

When to use:

  • Large team (> 50 people)
  • Need independent scaling
  • Long-term project (years)
  • Clear service boundaries (User, Post, Payment, etc.)

Communication between microservices:

  • Synchronous: REST, gRPC
  • Asynchronous: Message queue, Pub/Sub

Popular tooling: Spring Cloud (framework), Istio (service mesh), Kubernetes (container orchestration)


💬 Communication Layer

9. Message Queue

What it is: Stores messages between producers and consumers

Why you need it:

  • Decouple components (producer doesn’t wait for consumer)
  • Async processing (email, notifications, video encoding)
  • Buffer traffic spikes (queue absorbs bursts)
  • Retry failed operations
  • Reliable delivery (typically at-least-once, so a message may arrive twice and consumers should be idempotent)

Use cases:

  • Send email after user signup
  • Process video uploads
  • Generate reports
  • Resize images
  • Send push notifications

Patterns:

  • Point-to-Point: One message → one consumer
  • Work Queue: Multiple workers compete for messages (load distribution)
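
The work-queue pattern can be illustrated in-process with the standard library. This is only a sketch: `queue.Queue` stands in for a broker such as RabbitMQ or SQS, and `run_workers` is a hypothetical helper name.

```python
import queue
import threading


def run_workers(tasks, handler, worker_count=3):
    """Process tasks with competing workers; each task goes to exactly one."""
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            outcome = handler(task)
            with lock:
                results.append((task, outcome))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Completion order varies between runs, so sort before comparing.
print(sorted(run_workers(range(5), lambda n: n * n)))
```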

When to use:

  • Task takes > 1 second (offload from web server)
  • Can be processed asynchronously
  • Need retry logic
  • Traffic is spiky

Popular message queues: RabbitMQ, AWS SQS, Azure Service Bus


10. Pub/Sub (Publish-Subscribe)

What it is: Broadcast messages to multiple subscribers

Why you need it:

  • Event broadcasting: One event → many listeners
  • Decouple producers from consumers
  • Scalability: Add subscribers without changing producer

Use cases:

  • User signup → Send welcome email, Analytics service, CRM update
  • Order placed → Inventory service, Shipping service, Analytics
  • Real-time updates → All connected WebSocket clients
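
A minimal in-memory sketch of the fan-out these use cases rely on. The `Broker` class and topic name are hypothetical, and delivery here is synchronous, whereas real pub/sub systems deliver asynchronously over the network.

```python
from collections import defaultdict


class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Deliver the event to every subscriber; return how many got it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(event)
        return len(callbacks)


broker = Broker()
emails, analytics = [], []
broker.subscribe("user-signup", emails.append)
broker.subscribe("user-signup", analytics.append)
broker.publish("user-signup", {"user": "alice"})  # both lists receive it
```

The producer never names its consumers, so adding a third subscriber requires no change on the publishing side.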

Pub/Sub vs Message Queue:

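Feature          | Message Queue            | Pub/Sub
-----------------|--------------------------|-------------------------------------
Receivers        | One consumer per message | Multiple subscribers
Message delivery | Once                     | Multiple times (one per subscriber)
Use case         | Work distribution        | Event broadcasting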

When to use:

  • Multiple services need same event
  • Event-driven architecture
  • Real-time notifications

Popular pub/sub: Apache Kafka, AWS SNS, Google Pub/Sub, Redis Pub/Sub


11. Apache Kafka

What it is: Distributed event streaming platform (hybrid message queue + pub/sub)

Why you need it:

  • High throughput: Millions of messages per second
  • Durability: Messages persisted to disk
  • Replay: Can re-read old messages
  • Real-time streaming: Process events as they arrive
  • Scalability: Horizontally scalable

Use cases:

  • Activity tracking (clicks, page views)
  • Log aggregation
  • Stream processing (real-time analytics)
  • Event sourcing (store all state changes)
  • Metrics collection

Key concepts:

  • Topic: Category of messages (e.g., “user-signups”)
  • Partition: Subdivide topic for parallelism
  • Consumer Group: Multiple consumers share load
  • Offset: Position in the message stream
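
These concepts can be modeled with a toy in-memory log. This is not the Kafka client API: `Topic` and `ConsumerGroup` are hypothetical classes, CRC32 stands in for Kafka's own partitioner, and the group tracks only its offsets (assigning partitions to group members is omitted).

```python
import zlib
from collections import defaultdict


class Topic:
    def __init__(self, name, partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(partitions)]  # append-only logs

    def append(self, key, value):
        # Same key -> same partition, which preserves per-key ordering.
        index = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[index].append(value)
        return index, len(self.partitions[index]) - 1  # (partition, offset)


class ConsumerGroup:
    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition):
        log = self.topic.partitions[partition]
        start = self.offsets[partition]
        self.offsets[partition] = len(log)  # "commit" everything returned
        return log[start:]

    def seek(self, partition, offset):
        self.offsets[partition] = offset  # rewind to replay old messages
```

Because messages stay in the log after being read, a second group, or the same group after `seek`, can read them again. That is the replay property listed above.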

When to use:

  • Need high throughput (> 100K messages/sec)
  • Need to replay messages
  • Real-time analytics
  • Event-driven architecture at scale

Kafka vs RabbitMQ:

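Feature             | Kafka           | RabbitMQ
--------------------|-----------------|------------------------------------
Throughput          | Very high       | Medium
Message persistence | Always          | Optional
Message replay      | Yes             | No (classic queues drop acked messages)
Complexity          | High            | Medium
Use case            | Event streaming | Task queues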

12. Service Mesh

What it is: Infrastructure layer for managing service-to-service communication in microservices

Why you need it (at scale):

  • Service discovery: Services find each other automatically
  • Load balancing: Between service instances
  • Retry logic: Automatic retries on failure
  • Circuit breaking: Stop calling failing services
  • Observability: Trace requests across services
  • mTLS: Secure service-to-service communication
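
Circuit breaking is the least obvious item in this list, so here is a sketch of the state machine. In a mesh the sidecar proxy does this outside the application; the class below, its thresholds, and its names are illustrative assumptions only.

```python
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success fully closes the circuit
        return result
```

While the circuit is open, callers fail fast instead of piling more load onto a service that is already struggling.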

Components:

  • Data plane: Sidecar proxies (Envoy) next to each service
  • Control plane: Manages configuration (Istio, Linkerd)

When to use:

  • Many microservices (> 20)
  • Need advanced traffic management
  • Security critical (mTLS for all services)
  • Complex observability requirements

When NOT to use:

  • Few microservices (< 10)
  • Team cannot absorb the significant operational complexity it adds
  • Monolith architecture

Popular service meshes: Istio, Linkerd, Consul Connect


💾 Data Layer

13. Databases

See key-patterns > 4. SQL vs NoSQL for detailed comparison.

SQL Databases (PostgreSQL, MySQL):

  • Structured data with relationships
  • ACID transactions
  • Complex queries and joins
  • Vertical scaling (mostly)

NoSQL Databases:

  • Key-Value (Redis, DynamoDB): Simple lookups
  • Document (MongoDB, Couchbase): Flexible schema, JSON documents
  • Column-Family (Cassandra, HBase): Wide tables, high write throughput
  • Graph (Neo4j): Relationships and connections

When to use SQL: Default choice, structured data, need transactions

When to use NoSQL: Flexible schema, horizontal scaling, simple queries


14. Cache

What it is: In-memory data store for fast reads

Why you need it:

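  • Speed: Memory reads are orders of magnitude faster than disk reads (roughly 100 ns for RAM versus tens of microseconds or more for SSD and milliseconds for spinning disks)
  • Reduce DB load: A 90% cache hit rate means 10x fewer DB reads (80% means 5x)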
  • Scalability: Handle more reads without scaling database
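
As a worked check on the hit-rate figures above, let R be the read rate and h the cache hit rate. Only misses reach the database:

```latex
\[
  R_{\text{db}} = (1 - h)\,R,
  \qquad
  \text{reduction factor} = \frac{R}{R_{\text{db}}} = \frac{1}{1 - h}
\]
\[
  h = 0.8 \;\Rightarrow\; \frac{1}{0.2} = 5\times,
  \qquad
  h = 0.9 \;\Rightarrow\; \frac{1}{0.1} = 10\times
\]
```

The factor grows quickly near h = 1, which is why the last few points of hit rate matter so much.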

What to cache:

  • User sessions (ephemeral data)
  • Database query results (hot data)
  • Computed results (expensive calculations)
  • API responses

Cache strategies: See key-patterns > 1. Caching Strategies

Cache invalidation (hardest problem!):

  • TTL: Expire after X seconds
  • Write invalidation: Delete on update
  • Write-through: Update cache on write
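
The three approaches can be compared in one sketch. The `TTLCache` class and the `update_user` helper are hypothetical, and plain dicts stand in for Redis and the database.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (value, expiry time)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # TTL: the entry has expired
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)

    def invalidate(self, key):
        self._data.pop(key, None)


def update_user(db, cache, user_id, name, write_through=False):
    db[user_id] = name  # the database stays the source of truth
    if write_through:
        cache.set(user_id, name)   # write-through: refresh the cached copy
    else:
        cache.invalidate(user_id)  # write invalidation: drop the stale copy
```

TTL bounds how long stale data can survive even if an invalidation is missed, so it is commonly combined with one of the other two.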

Cache eviction policies:

  • LRU (Least Recently Used): Evict the entry that was accessed longest ago
  • LFU (Least Frequently Used): Evict the entry that is accessed least often
  • FIFO: First in, first out
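
LRU is the policy most often asked about, and Python's `OrderedDict` makes a compact sketch possible. The `LRUCache` class below is illustrative; Redis, for instance, uses an approximated LRU rather than an exact one.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()  # ordered oldest -> most recently used

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used entry
cache.put("c", 3)  # capacity exceeded, so "b" is evicted
```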

When to use:

  • Database is bottleneck
  • Read-heavy workload (10:1 ratio)
  • Hot data accessed frequently

Popular caches: Redis, Memcached

Redis vs Memcached:

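Feature         | Redis                           | Memcached
----------------|---------------------------------|--------------------
Data structures | Rich (lists, sets, sorted sets) | Simple (key-value)
Persistence     | Yes                             | No
Replication     | Yes                             | No
Use case        | Complex caching, session store  | Simple cache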

15. Search Engine

What it is: Specialized database optimized for text search and analytics

Why you need it:

  • Full-text search: Search within text fields
  • Fuzzy matching: Tolerate typos and near-matches
  • Relevance ranking: Most relevant results first
  • Faceted search: Filter by category, price, date
  • Analytics: Aggregate and analyze large datasets

Use cases:

  • Product search (e-commerce)
  • Log analysis
  • Autocomplete/typeahead
  • Content search (documents, emails)
  • Real-time analytics

When to use:

  • Need full-text search
  • SQL LIKE queries too slow
  • Complex search requirements (filters, ranking)

When NOT to use:

  • Simple exact match queries (use database)
  • As the primary data store (keep it as a secondary index fed from the source-of-truth database)

Popular search engines: Elasticsearch, OpenSearch, Solr, Algolia

Architecture:

Database (source of truth)
    ↓ (sync)
Search Index (optimized for search)
    ↓
Users query search index, not database
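
What makes the search index “optimized for search” is mainly the inverted index: a map from each term to the documents containing it. A toy sketch follows, with a hypothetical `SearchIndex` class; it does no ranking, stemming, or fuzzy matching, which real engines layer on top.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class SearchIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # term -> ids of docs containing it

    def index(self, doc_id, text):
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term (AND)."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

A lookup touches only the posting sets for the query terms, whereas a SQL `LIKE '%term%'` generally has to scan every row.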

16. Object Storage

What it is: Storage for large unstructured objects (files, images, videos)

Why you need it:

  • Scalable: Store petabytes
  • Durable: Replicated across multiple datacenters
  • Cost-effective: Cheaper than database storage
  • HTTP access: Access files via URL

Use cases:

  • User uploads (images, videos, documents)
  • Backups and archives
  • Static website hosting
  • Data lake (big data analytics)

When to use:

  • Storing files > 1 MB
  • Don’t need database features (queries, transactions)
  • Need high durability (99.999999999% - 11 nines!)

When NOT to use:

  • Structured data (use database)
  • Need fast queries (use database + search)
  • Files < 100 KB (database blob is fine)

Popular object stores: AWS S3, Google Cloud Storage, Azure Blob Storage

Features:

  • Versioning: Keep old versions
  • Lifecycle policies: Auto-delete after X days
  • Access control: Fine-grained permissions
  • Event notifications: Trigger on upload

17. Data Warehouse

What it is: Centralized repository for analytical queries (OLAP)

Why you need it:

  • Analytics: Complex queries on historical data
  • Reporting: Business intelligence, dashboards
  • Don’t impact production DB: Run expensive queries separately

OLTP vs OLAP:

Aspect  | OLTP (Database) | OLAP (Data Warehouse)
--------|-----------------|-----------------------
Purpose | Transactional   | Analytical
Queries | Simple, fast    | Complex, slow
Data    | Current         | Historical
Volume  | GB to TB        | TB to PB
Users   | Many            | Few (analysts)

When to use:

  • Need complex analytics
  • Historical data analysis
  • Business intelligence
  • Don’t want to impact production database

Popular data warehouses: Snowflake, BigQuery, Redshift, ClickHouse


📊 Observability Layer

18. Logging

What it is: Recording events that happen in the system

Why you need it:

  • Debugging: Trace through execution
  • Audit: Who did what when
  • Compliance: Regulatory requirements
  • Incident response: Understand what went wrong

Log levels:

  • TRACE: Very detailed (development only)
  • DEBUG: Debug information
  • INFO: General information
  • WARN: Warning, not critical
  • ERROR: Error occurred
  • FATAL: System crash

Centralized logging:

  • Aggregate logs from all servers
  • Searchable (grep across 1000 servers)
  • Correlation (trace request across services)
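
Centralized search and correlation work best when every service emits structured logs. A sketch using the standard `logging` module; the JSON field names and the `trace_id` attribute are assumptions for illustration, not a required schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # trace_id is a hypothetical field supplied via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        })


def make_logger(service):
    """Create a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

Usage would look like `make_logger("post-service").info("tweet stored", extra={"trace_id": "abc123"})`; the log aggregator can then filter on `trace_id` across all services.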

When to use: Always! Every service needs logging.

Popular logging: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Datadog, CloudWatch


19. Metrics

What it is: Numerical measurements of system behavior over time

Why you need it:

  • Monitor health: Is system working?
  • Detect issues: Spike in error rate?
  • Capacity planning: When to add servers?
  • Alerting: Page on-call when down

Key metrics (The Four Golden Signals):

  1. Latency: How long requests take (p50, p95, p99)
  2. Traffic: Requests per second (QPS)
  3. Errors: Error rate (5xx errors)
  4. Saturation: Resource utilization (CPU, memory, disk)
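
Latency percentiles and error rate are simple to compute from raw samples. The sketch below uses the nearest-rank definition of a percentile, which is one of several conventions; monitoring systems such as Prometheus usually estimate percentiles from histograms instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 900]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))
```

The median of that sample looks healthy while p99 exposes the slow outlier, which is why tail percentiles are tracked alongside averages.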

Additional metrics:

  • Cache hit rate
  • Database connection pool utilization
  • Queue depth
  • Active users

When to use: Always! Monitor everything.

Popular metrics: Prometheus + Grafana, Datadog, New Relic, CloudWatch


20. Distributed Tracing

What it is: Tracks requests as they flow through microservices

Why you need it:

  • Find bottlenecks: Which service is slow?
  • Debug failures: Where did request fail?
  • Understand dependencies: Service call graph

How it works:

  • Generate unique trace ID
  • Pass through all services (in HTTP header)
  • Each service logs with trace ID
  • Visualize the complete request path
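
The propagation steps above can be sketched as follows. The header name `X-Trace-Id` and both helper functions are hypothetical; standardized setups typically use the W3C Trace Context `traceparent` header through a tracing library rather than hand-rolled code.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch


def ensure_trace_id(headers):
    """Reuse the incoming trace ID, or start a new trace at the edge."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {**headers, TRACE_HEADER: trace_id}


def call_service(name, headers, log):
    headers = ensure_trace_id(headers)
    # Every service tags its log lines with the same trace ID.
    log.append(f"trace={headers[TRACE_HEADER]} service={name}")
    return headers  # forward these headers on the next downstream call


log = []
headers = call_service("api-gateway", {}, log)
headers = call_service("user-service", headers, log)
headers = call_service("post-service", headers, log)
```

Grouping log lines by the `trace=` value reconstructs the path of a single request across all three services.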

When to use:

  • Microservices architecture
  • Requests span multiple services
  • Need to debug slow requests

Popular tracing: Jaeger, Zipkin, AWS X-Ray, Datadog APM

Example trace:

Trace ID: abc123
1. API Gateway (5ms)
2. User Service (10ms)
   3. Database query (8ms)
4. Post Service (50ms) ← Bottleneck!
   5. Database query (45ms) ← Slow query
6. Response (total 65ms)

🔐 Security Components

21. Authentication Service

What it is: Verifies user identity (who are you?)

Why you need it:

  • Verify users are who they claim to be
  • Issue authentication tokens
  • SSO (Single Sign-On)

Methods:

  • Username + Password
  • OAuth 2.0 / OpenID Connect (Google, Facebook login)
  • SAML (enterprise SSO)
  • Multi-factor authentication (MFA)

JWT (JSON Web Token):

  • Self-contained token with user info
  • Stateless (no server-side session)
  • Signed to prevent tampering

When to use: Every application with users!

Popular auth services: Auth0, Okta, Firebase Auth, AWS Cognito


22. Authorization Service

What it is: Determines what user can do (what are you allowed to do?)

Why you need it:

  • Permission management
  • Role-based access control (RBAC)
  • Attribute-based access control (ABAC)

Patterns:

  • RBAC: User → Role → Permissions
    • Example: Admin role can delete users
  • ABAC: Based on attributes
    • Example: Owner of resource can edit
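
Both patterns reduce to small predicate functions. The role table, permission strings, and dictionary shapes below are hypothetical, chosen to mirror the two examples above.

```python
ROLE_PERMISSIONS = {  # hypothetical roles and permission names
    "admin": {"user:delete", "post:edit"},
    "member": set(),
}


def rbac_allows(user, permission):
    """RBAC: user -> roles -> permissions."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in user["roles"]
    )


def abac_allows_edit(user, resource):
    """ABAC: decide from attributes; here, the owner may edit."""
    return resource["owner_id"] == user["id"]


def can_edit(user, resource):
    # Combining both: admins may edit anything, owners their own resources.
    return rbac_allows(user, "post:edit") or abac_allows_edit(user, resource)


admin = {"id": 1, "roles": ["admin"]}
alice = {"id": 2, "roles": ["member"]}
post = {"id": 10, "owner_id": 2}
print(rbac_allows(admin, "user:delete"), can_edit(alice, post))
```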

When to use: When different users have different permissions


🔄 How Everything Ties Together

Example: Complete System (Twitter-like)

                          [Users]
                             ↓
                          [DNS] ← "What's twitter.com's IP?"
                             ↓
                          [CDN] ← Serves images, JS, CSS
                             ↓
                    [Load Balancer] ← Distributes traffic
                             ↓
                     [API Gateway] ← Auth, rate limiting, routing
                             ↓
            ┌────────────────┼────────────────┐
            ↓                ↓                ↓
    [User Service]   [Post Service]   [Timeline Service]
            ↓                ↓                ↓
         [Redis           [Kafka]          [Redis
          Cache]      ← Events →        Timeline Cache]
            ↓                                 ↓
    [PostgreSQL        Message Queue    [Cassandra
     User DB]     ← Background jobs →   Post DB]
                             ↓
                      [Worker Service]
                             ↓
                    [S3 Object Storage]
                    ← Media files β†’

                    [Elasticsearch]
                    ← Search index ←

Request flow: “Post a tweet”

  1. Client → sends POST request
  2. DNS → resolves twitter.com to IP
  3. Load Balancer → picks healthy web server
  4. API Gateway:
    • Verifies JWT token (authentication)
    • Checks rate limit (100 tweets/hour)
    • Routes to Post Service
  5. Post Service:
    • Validates tweet (length, content)
    • Writes to Cassandra (post database)
    • Publishes event to Kafka (“tweet-posted”)
    • Returns success to client
  6. Kafka consumers (async):
    • Timeline Service: Updates followers’ timeline caches
    • Search Service: Indexes tweet in Elasticsearch
    • Analytics Service: Tracks metrics
    • Notification Service: Notifies mentioned users
  7. S3: If media attached, uploaded to object storage
  8. CDN: Media cached at edge locations
  9. Observability:
    • Logs: Request logged with trace ID
    • Metrics: QPS, latency tracked
    • Tracing: Request path across services

Read flow: “View timeline”

  1. Client → API Gateway → Timeline Service
  2. Check Redis cache: HIT (90% of time)
  3. Return cached timeline immediately
  4. If MISS: Query Cassandra → Build timeline → Cache → Return
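
The write and read flows above can be sketched together. This is an in-memory simplification under stated assumptions: a list stands in for Cassandra, a dict for Redis, the fan-out runs inline instead of in a Kafka consumer, and `TIMELINE_LIMIT` is a hypothetical cap.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 100  # hypothetical cap on cached entries per user

posts_db = []                 # stand-in for Cassandra: (author, text) rows
followers = defaultdict(set)  # author -> users following them
following = defaultdict(set)  # user -> authors they follow
timeline_cache = {}           # stand-in for Redis: user -> deque of posts


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def post_tweet(author, text):
    posts_db.append((author, text))
    # Fan-out on write: push into each follower's timeline, if it is cached.
    for user in followers[author]:
        if user in timeline_cache:
            timeline_cache[user].appendleft((author, text))


def view_timeline(user):
    cached = timeline_cache.get(user)
    if cached is not None:
        return list(cached), "HIT"
    # MISS: rebuild from the post store (newest first), then cache it.
    rebuilt = [p for p in reversed(posts_db) if p[0] in following[user]]
    timeline_cache[user] = deque(rebuilt[:TIMELINE_LIMIT], maxlen=TIMELINE_LIMIT)
    return list(timeline_cache[user]), "MISS"
```

Fan-out on write keeps reads cheap but makes posting expensive for accounts with millions of followers, so large systems often mix it with fan-out on read for those accounts.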

🎯 Component Selection Guide

By Scale

< 10K users (Startup):

  • Load balancer (1)
  • Web servers (2-3)
  • Database (1 primary + 1 replica)
  • Redis cache (1)

10K - 100K users:

  • Add: CDN, API Gateway
  • Scale: More web servers (5-10)
  • Add: Message queue (RabbitMQ)

100K - 1M users:

  • Add: Microservices (split by domain)
  • Scale: Database sharding
  • Add: Search engine (Elasticsearch)
  • Add: Object storage (S3)

1M - 10M users:

  • Add: Kafka for event streaming
  • Add: Service mesh (Istio)
  • Add: Multi-region setup
  • Add: Data warehouse (Snowflake)

10M+ users (FAANG scale):

  • Everything above
  • Add: Custom CDN
  • Add: Custom message protocols
  • Add: Advanced ML systems

By Use Case

E-commerce:

  • Must have: Database, Cache, API Gateway, Object Storage (product images)
  • Nice to have: Search engine (product search), Message queue (order processing)

Social Media:

  • Must have: Database, Cache, CDN (images), Message queue (notifications)
  • Nice to have: Kafka (activity tracking), Search (content search)

Video Streaming:

  • Must have: CDN (mandatory!), Object Storage (videos), Adaptive bitrate
  • Nice to have: Analytics, Recommendation engine

Real-time Chat:

  • Must have: WebSocket servers, Message queue, Cache (presence)
  • Nice to have: Push notifications, Read receipts


Remember: Start simple, add complexity only when needed. Not every system needs every component!


Last Updated: 2026-04-08