Distributed Systems Glossary: Essential Concepts with Real-World Examples
Distributed System
A distributed system contains multiple nodes that are physically separate but linked together by a network. These nodes communicate and coordinate to appear as a single coherent system to end users.
Real-world examples:
- Netflix: Content delivery across global CDN nodes
- WhatsApp: Message routing through distributed server clusters
- Google Search: Distributed indexing and query processing
- Uber: Real-time location tracking and ride matching across regions
Why distributed? Single machines have physical limits. Distributed systems enable:
- Geographic distribution (lower latency for global users)
- Fault isolation (one failure doesn't bring down everything)
- Cost efficiency (commodity hardware vs. expensive supercomputers)
Scalability
The ability to handle increased load by adding resources without redesigning the system architecture.
Horizontal Scaling (Scale Out)
Adding more machines to distribute the load.
Pros:
- No theoretical limit
- Better fault tolerance
- Can scale incrementally
Cons:
- Increased complexity (load balancing, data consistency)
- Network overhead
Examples:
- Cassandra: Add nodes to the cluster seamlessly
- Kafka: Add brokers to handle more message throughput
- Microservices: Deploy more container instances
Vertical Scaling (Scale Up)
Adding more power (CPU, RAM, disk) to existing machines.
Pros:
- Simpler architecture
- No distributed system complexity
- Lower network latency
Cons:
- Hardware limits (can't infinitely upgrade)
- Single point of failure
- Downtime during upgrades
- Expensive at high end
Examples:
- PostgreSQL: Upgrade to larger EC2 instance
- Redis: Increase memory for larger cache
Real-world approach: Most systems use both. Start with vertical scaling for simplicity, add horizontal scaling when needed.
Reliability
The probability that a system continues delivering services correctly even when components fail.
Key principle: Eliminate single points of failure through redundancy.
Techniques:
- Replication: Multiple copies of data/services (Netflix stores each video on multiple servers)
- Failover: Automatic switching to backup systems (AWS RDS Multi-AZ)
- Checkpointing: Save state periodically for recovery (Kafka consumer offsets)
- Circuit breakers: Prevent cascade failures (Hystrix in microservices)
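A minimal circuit-breaker sketch in plain Java (no Hystrix or Resilience4j; the failure threshold and cool-down values are illustrative assumptions):

```java
import java.util.function.Supplier;

// Minimal circuit breaker: after too many consecutive failures, calls are
// short-circuited to a fallback for a cool-down period instead of hammering
// the failing downstream service.
public class CircuitBreaker {
    private final int failureThreshold;
    private final long openDurationMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long openDurationMillis) {
        this.failureThreshold = failureThreshold;
        this.openDurationMillis = openDurationMillis;
    }

    public synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        boolean open = consecutiveFailures >= failureThreshold
                && System.currentTimeMillis() - openedAt < openDurationMillis;
        if (open) {
            return fallback.get();          // fail fast, protect the downstream service
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;        // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();  // trip the breaker
            }
            return fallback.get();
        }
    }
}
```

In production you would normally reach for a library such as Resilience4j rather than hand-rolling this, but the idea is the same: stop calling a dependency that is clearly failing.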
Trade-off: Reliability costs money and complexity. Over-engineering for 99.999% uptime when 99.9% suffices wastes resources.
Availability
The percentage of time a system is operational and accessible.
Common SLA targets:
| Availability | Downtime/Year | Downtime/Month | Use Case |
|---|---|---|---|
| 99% (2 nines) | 3.65 days | 7.2 hours | Internal tools |
| 99.9% (3 nines) | 8.76 hours | 43.2 minutes | Standard web apps |
| 99.99% (4 nines) | 52.56 minutes | 4.32 minutes | Payment systems |
| 99.999% (5 nines) | 5.26 minutes | 25.9 seconds | Critical infrastructure |
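The downtime figures follow directly from the percentages; a quick arithmetic check (a small Java sketch, using a 30-day month):

```java
// Converts an availability percentage into allowed downtime per year and per month.
public class DowntimeCalculator {
    public static void main(String[] args) {
        double[] targets = {99.0, 99.9, 99.99, 99.999};
        double hoursPerYear = 365 * 24;     // 8,760 hours
        double hoursPerMonth = 30 * 24;     // 720 hours (30-day month)
        for (double availability : targets) {
            double downFraction = 1 - availability / 100.0;
            System.out.printf("%s%% -> %.2f hours/year, %.2f minutes/month%n",
                    availability, downFraction * hoursPerYear,
                    downFraction * hoursPerMonth * 60);
        }
    }
}
```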
How to improve availability:
- Redundancy: Multiple servers in parallel (see the sketch after this list)
  - 4 servers at 95% uptime each → ≈99.9994% combined availability
- Load balancing: Distribute traffic across healthy nodes (AWS ELB, NGINX)
- Health checks: Detect and route around failures automatically
- Geographic distribution: Multi-region deployments (AWS, GCP regions)
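A quick check of the redundancy math above (assuming server failures are independent, which real deployments only approximate):

```java
// Combined availability of N redundant servers: the system is down only when all are down.
public class RedundancyMath {
    public static void main(String[] args) {
        double perServer = 0.95;                              // 95% uptime each
        int servers = 4;
        double allDown = Math.pow(1 - perServer, servers);    // 0.05^4 = 0.00000625
        double combined = 1 - allDown;                        // ≈ 0.999994
        System.out.printf("Combined availability: %.4f%%%n", combined * 100);  // 99.9994%
    }
}
```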
Key insight: Reliability ≠ Availability
- Reliable but not available: System works correctly when up, but frequently down for maintenance
- Available but not reliable: System is always up but returns wrong results
Example: Google aims for 99.99% availability for Gmail, meaning less than 1 hour downtime per year.
Consistency
All nodes see the same data at the same time. After a write completes, all subsequent reads return the updated value.
Consistency models:
Strong Consistency
Reads always return the most recent write. Feels like a single machine.
- Use case: Banking transactions, inventory management
- Example: Google Spanner, traditional RDBMS
- Trade-off: Higher latency, lower availability (CAP theorem)
Eventual Consistency
Reads may return stale data temporarily, but all replicas converge eventually.
- Use case: Social media feeds, DNS, shopping carts
- Example: DynamoDB, Cassandra, S3
- Trade-off: Better performance and availability, but complexity in handling conflicts
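That "complexity in handling conflicts" usually comes down to a merge rule for replicas that have diverged. A minimal last-write-wins sketch (one common rule; the nested `Versioned` record is just for illustration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Last-write-wins (LWW) merge: when two replicas hold different values for the
// same key, the value with the newer timestamp wins. Simple, but concurrent
// writes can be silently discarded - one reason eventual consistency adds complexity.
public class LwwRegisterStore {
    record Versioned(String value, long timestampMillis) {}

    private final Map<String, Versioned> data = new ConcurrentHashMap<>();

    public void put(String key, String value, long timestampMillis) {
        data.merge(key, new Versioned(value, timestampMillis),
                (current, incoming) ->
                        incoming.timestampMillis() > current.timestampMillis() ? incoming : current);
    }

    public String get(String key) {
        Versioned v = data.get(key);
        return v == null ? null : v.value();
    }
}
```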
Real-world example:
- Twitter: Your tweet might not appear immediately to all followers (eventual consistency is acceptable)
- Bank transfer: Money must be deducted and credited atomically (strong consistency required)
Common pitfall: Over-engineering for strong consistency when eventual consistency suffices. Instagram doesn't need strong consistency for like counts.
Latency vs Throughput
Two critical performance metrics that often trade off against each other.
Latency
Time to complete a single operation (response time).
Examples:
- Google Search: < 200ms for search results
- Trading systems: < 10ms for order execution
- Gaming: < 50ms for responsive gameplay
How to reduce:
- Caching (Redis, CDN)
- Geographic distribution (edge servers)
- Async processing (don't wait for non-critical operations)
- Database indexing
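A sketch of the cache-aside pattern behind the caching item above (plain Java, with an in-memory map standing in for Redis and a caller-supplied loader standing in for the database query):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside: check the cache first, fall back to the slow source on a miss,
// then keep the result so the next read is fast.
public class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();  // stand-in for Redis
    private final Function<K, V> slowSource;                    // e.g. a database query

    public CacheAside(Function<K, V> slowSource) {
        this.slowSource = slowSource;
    }

    public V get(K key) {
        return cache.computeIfAbsent(key, slowSource);  // miss -> load -> cache
    }
}

// Usage (loadFromDatabase is a hypothetical loader):
//   CacheAside<Long, String> userCache = new CacheAside<>(id -> loadFromDatabase(id));
```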
Throughput
Number of operations completed per unit time.
Examples:
- Kafka: Millions of messages/second
- Netflix: 250 million hours of content streamed/day
- Payment gateway: Thousands of transactions/second
How to improve:
- Batching requests
- Parallel processing
- Horizontal scaling
- Connection pooling
The trade-off:
- Batching increases throughput but adds latency
- Processing requests individually reduces latency but limits throughput
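A sketch of the batching side of that trade-off (plain Java; the batch size is an illustrative assumption, similar in spirit to Kafka producer batching):

```java
import java.util.ArrayList;
import java.util.List;

// Collects individual requests into batches. Larger batches raise throughput
// (fewer round trips), but each request may wait for the batch to fill,
// which adds latency. A real batcher would also flush on a timer.
public class Batcher {
    private final int maxBatchSize;
    private final List<String> buffer = new ArrayList<>();

    public Batcher(int maxBatchSize) {
        this.maxBatchSize = maxBatchSize;
    }

    public synchronized void submit(String request) {
        buffer.add(request);
        if (buffer.size() >= maxBatchSize) {
            flush();                      // one network call for the whole batch
        }
    }

    public synchronized void flush() {
        if (buffer.isEmpty()) return;
        System.out.println("Sending batch of " + buffer.size() + " requests");
        buffer.clear();
    }
}
```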
Example: YouTube prioritizes throughput (serve millions of videos) over latency (buffering a few seconds is acceptable). High-frequency trading prioritizes latency (microseconds matter) over throughput.
Observability & Manageability
The ability to understand system behavior and diagnose issues quickly.
Three pillars:
1. Metrics
Numeric measurements over time.
- Examples: CPU usage, request rate, error rate, latency percentiles
- Tools: Prometheus, Grafana, CloudWatch
2. Logs
Discrete events with context.
- Examples: Error messages, request traces, audit logs
- Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk
- Best practice: Structured logging with correlation IDs
3. Traces
Request flow across distributed services.
- Examples: Track a single API call through 10 microservices
- Tools: Jaeger, Zipkin, AWS X-Ray
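A minimal sketch of the correlation-ID practice from the Logs pillar, assuming SLF4J with MDC (Mapped Diagnostic Context), a common Java logging setup; the `CheckoutHandler` class is hypothetical:

```java
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Attach a correlation ID to every log line for a request, so the same ID can
// be searched across services when tracing a failure.
public class CheckoutHandler {
    private static final Logger log = LoggerFactory.getLogger(CheckoutHandler.class);

    public void handle(String orderId) {
        MDC.put("correlationId", UUID.randomUUID().toString());
        try {
            log.info("Processing order {}", orderId);   // correlationId added via the log pattern
            // ... call payment, inventory, shipping services, passing the ID along ...
        } finally {
            MDC.clear();                                 // avoid leaking IDs across threads
        }
    }
}
```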
Real-world impact:
- Netflix: Chaos engineering (intentionally break things) to ensure quick recovery
- Google: SRE practices with error budgets and SLOs
- Amazon: "Two-pizza teams" with full ownership including on-call
Key metric: MTTR (Mean Time To Recovery) - How fast can you fix issues? Often more important than preventing all failures.
Fault Tolerance
The ability to continue operating correctly despite component failures.
Murphy's Law in distributed systems: If something can fail, it will fail.
Common failures:
- Network partitions (nodes can't communicate)
- Disk failures
- Process crashes
- Slow nodes ("gray failures" - not dead but degraded)
- Data center outages
Fault tolerance techniques:
1. Replication
Multiple copies of data/services.
- Leader-follower: One primary, multiple replicas (PostgreSQL, MongoDB)
- Multi-leader / leaderless: Multiple nodes accept writes (Cassandra, DynamoDB)
2. Consensus Algorithms
Ensure agreement despite failures.
- Raft: Easier to understand (etcd, Consul)
- Paxos: More complex but proven (Google Chubby)
3. Graceful Degradation
Reduce functionality instead of complete failure.
- Example: Netflix shows cached recommendations when recommendation service is down
4. Bulkheads
Isolate failures to prevent cascade.
- Example: Separate thread pools for different services
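A minimal bulkhead sketch using a separate bounded thread pool per downstream dependency (plain Java; the pool sizes and downstream calls are illustrative placeholders):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Bulkhead: each downstream dependency gets its own bounded thread pool, so a
// slow payment service can exhaust only its own pool, not the threads used
// for recommendations.
public class Bulkheads {
    private final ExecutorService paymentPool = Executors.newFixedThreadPool(10);
    private final ExecutorService recommendationPool = Executors.newFixedThreadPool(5);

    public Future<String> chargeCard(String orderId) {
        return paymentPool.submit(() -> callPaymentService(orderId));
    }

    public Future<String> recommend(String userId) {
        return recommendationPool.submit(() -> callRecommendationService(userId));
    }

    // Hypothetical downstream calls, shown as placeholders.
    private String callPaymentService(String orderId) { return "charged:" + orderId; }
    private String callRecommendationService(String userId) { return "recs-for:" + userId; }
}
```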
Real-world: AWS designs for "everything fails all the time" - entire availability zones can go down without service interruption.
Partitioning (Sharding)
Splitting data across multiple nodes to handle datasets larger than a single machine can store.
Why partition?
- Dataset too large for one machine (Twitter has billions of tweets)
- Query load too high for one machine
- Regulatory requirements (data locality)
Partitioning strategies:
1. Range-based
Split by key ranges (A-M on server 1, N-Z on server 2).
- Pro: Range queries are efficient
- Con: Risk of hot spots (uneven distribution)
2. Hash-based
Use hash function to distribute data.
- Pro: Even distribution
- Con: Range queries require scanning all partitions
- Example: Cassandra uses consistent hashing
3. Geographic
Partition by region.
- Example: European user data in EU servers (GDPR compliance)
Challenge: Cross-partition queries are expensive. Design your partition key carefully based on access patterns.
Real-world: Instagram partitions user data by user ID. Your photos are on specific shards, enabling fast lookups.
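A sketch of hash-based sharding applied to the user-ID example (plain Java; a production system would typically use consistent hashing or a lookup table so that adding shards doesn't reshuffle every key):

```java
// Hash-based sharding: route each user ID to a shard so all of that user's
// data lives on one node and lookups touch a single shard.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    public int shardFor(long userId) {
        // floorMod keeps the result non-negative even for negative hash values
        return Math.floorMod(Long.hashCode(userId), numShards);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(16);
        System.out.println("User 42 -> shard " + router.shardFor(42L));
    }
}
```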
CAP Theorem
A distributed system can guarantee at most two of the following three properties:
- Consistency: All nodes see the same data
- Availability: Every request gets a response
- Partition tolerance: System works despite network failures
In practice: Network partitions happen, so you must choose between consistency and availability.
CP systems (Consistency + Partition tolerance):
- Banking, inventory systems
- Example: HBase, MongoDB (with strong consistency)
AP systems (Availability + Partition tolerance):
- Social media, caching, analytics
- Example: Cassandra, DynamoDB, Riak
Real-world nuance: Most systems offer tunable consistency (DynamoDB, Cassandra) - you choose per operation.
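Tunable consistency is usually reasoned about as a quorum condition: with N replicas, W write acknowledgements, and R read acknowledgements, a read is guaranteed to overlap the latest acknowledged write whenever R + W > N. A small illustrative check (the W/R combinations mirror Cassandra-style ONE/QUORUM/ALL levels, but the code is not tied to any driver):

```java
// Quorum rule: with N replicas, a read sees the latest acknowledged write
// when R + W > N. Lower R or W for speed, raise them for stronger
// guarantees - chosen per operation.
public class QuorumCheck {
    static boolean readYourWrites(int replicas, int writeAcks, int readAcks) {
        return readAcks + writeAcks > replicas;
    }

    public static void main(String[] args) {
        int n = 3;
        System.out.println("W=1, R=1: " + readYourWrites(n, 1, 1));  // false -> eventual
        System.out.println("W=2, R=2: " + readYourWrites(n, 2, 2));  // true  -> quorum reads and writes
        System.out.println("W=3, R=1: " + readYourWrites(n, 3, 1));  // true  -> write ALL, read ONE
    }
}
```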
Key Takeaways
- No silver bullet: Every design decision is a trade-off
- Start simple: Don't build for Google scale on day one
- Measure everything: You can't improve what you don't measure
- Design for failure: Assume everything will fail
- Understand your requirements: Does your use case need strong consistency? What's acceptable downtime?