Distributed Systems Glossary: Essential Concepts with Real-World Examples
Distributed System
A distributed system contains multiple nodes that are physically separate but linked together by a network. These nodes communicate and coordinate to appear as a single coherent system to end users.
Real-world examples:
- Netflix: Content delivery across global CDN nodes
- WhatsApp: Message routing through distributed server clusters
- Google Search: Distributed indexing and query processing
- Uber: Real-time location tracking and ride matching across regions
Why distributed? Single machines have physical limits. Distributed systems enable:
- Geographic distribution (lower latency for global users)
- Fault isolation (one failure doesn't bring down everything)
- Cost efficiency (commodity hardware vs. expensive supercomputers)
Scalability
The ability to handle increased load by adding resources without redesigning the system architecture.
Horizontal Scaling (Scale Out)
Adding more machines to distribute the load.
Pros:
- No theoretical limit
- Better fault tolerance
- Can scale incrementally
Cons:
- Increased complexity (load balancing, data consistency)
- Network overhead
Examples:
- Cassandra: Add nodes to the cluster seamlessly
- Kafka: Add brokers to handle more message throughput
- Microservices: Deploy more container instances
Vertical Scaling (Scale Up)
Adding more power (CPU, RAM, disk) to existing machines.
Pros:
- Simpler architecture
- No distributed system complexity
- Lower network latency
Cons:
- Hardware limits (can't infinitely upgrade)
- Single point of failure
- Downtime during upgrades
- Expensive at high end
Examples:
- PostgreSQL: Upgrade to larger EC2 instance
- Redis: Increase memory for larger cache
Real-world approach: Most systems use both. Start with vertical scaling for simplicity, add horizontal scaling when needed.
Reliability
The probability that a system continues delivering services correctly even when components fail.
Key principle: Eliminate single points of failure through redundancy.
Techniques:
- Replication: Multiple copies of data/services (Netflix stores each video on multiple servers)
- Failover: Automatic switching to backup systems (AWS RDS Multi-AZ)
- Checkpointing: Save state periodically for recovery (Kafka consumer offsets)
- Circuit breakers: Prevent cascade failures (Hystrix in microservices)
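A minimal circuit-breaker sketch in plain Java (no Hystrix or Resilience4j; the failure threshold and cool-down values are illustrative assumptions):

```java
import java.util.function.Supplier;

// Minimal circuit breaker: after too many consecutive failures, calls are
// short-circuited to a fallback for a cool-down period instead of hammering
// the failing downstream service.
public class CircuitBreaker {
    private final int failureThreshold;
    private final long openDurationMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long openDurationMillis) {
        this.failureThreshold = failureThreshold;
        this.openDurationMillis = openDurationMillis;
    }

    public synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        boolean open = consecutiveFailures >= failureThreshold
                && System.currentTimeMillis() - openedAt < openDurationMillis;
        if (open) {
            return fallback.get();          // fail fast, protect the downstream service
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;        // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();  // trip the breaker
            }
            return fallback.get();
        }
    }
}
```

In production you would normally reach for a library such as Resilience4j rather than hand-rolling this, but the idea is the same: stop calling a dependency that is clearly failing.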
Trade-off: Reliability costs money and complexity. Over-engineering for 99.999% uptime when 99.9% suffices wastes resources.
Availability
The percentage of time a system is operational and accessible.
Common SLA targets:
| Availability | Downtime/Year | Downtime/Month | Use Case |
|---|---|---|---|
| 99% (2 nines) | 3.65 days | 7.2 hours | Internal tools |
| 99.9% (3 nines) | 8.76 hours | 43.2 minutes | Standard web apps |
| 99.99% (4 nines) | 52.56 minutes | 4.32 minutes | Payment systems |
| 99.999% (5 nines) | 5.26 minutes | 25.9 seconds | Critical infrastructure |
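The downtime figures follow directly from the percentages; a quick arithmetic check (a small Java sketch, using a 30-day month):

```java
// Converts an availability percentage into allowed downtime per year and per month.
public class DowntimeCalculator {
    public static void main(String[] args) {
        double[] targets = {99.0, 99.9, 99.99, 99.999};
        double hoursPerYear = 365 * 24;     // 8,760 hours
        double hoursPerMonth = 30 * 24;     // 720 hours (30-day month)
        for (double availability : targets) {
            double downFraction = 1 - availability / 100.0;
            System.out.printf("%s%% -> %.2f hours/year, %.2f minutes/month%n",
                    availability, downFraction * hoursPerYear,
                    downFraction * hoursPerMonth * 60);
        }
    }
}
```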
How to improve availability:
- Redundancy: Multiple servers in parallel (see the sketch after this list)
  - 4 servers at 95% uptime each → ≈99.9994% combined availability
- Load balancing: Distribute traffic across healthy nodes (AWS ELB, NGINX)
- Health checks: Detect and route around failures automatically
- Geographic distribution: Multi-region deployments (AWS, GCP regions)
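A quick check of the redundancy math above (assuming server failures are independent, which real deployments only approximate):

```java
// Combined availability of N redundant servers: the system is down only when all are down.
public class RedundancyMath {
    public static void main(String[] args) {
        double perServer = 0.95;                              // 95% uptime each
        int servers = 4;
        double allDown = Math.pow(1 - perServer, servers);    // 0.05^4 = 0.00000625
        double combined = 1 - allDown;                        // ≈ 0.999994
        System.out.printf("Combined availability: %.4f%%%n", combined * 100);  // 99.9994%
    }
}
```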
Key insight: Reliability ≠ Availability
- Reliable but not available: System works correctly when up, but frequently down for maintenance
- Available but not reliable: System is always up but returns wrong results
Example: Google aims for 99.99% availability for Gmail, meaning less than 1 hour downtime per year.
Consistency
All nodes see the same data at the same time. After a write completes, all subsequent reads return the updated value.
Consistency models:
Strong Consistency
Reads always return the most recent write. Feels like a single machine.
- Use case: Banking transactions, inventory management
- Example: Google Spanner, traditional RDBMS
- Trade-off: Higher latency, lower availability (CAP theorem)
Eventual Consistency
Reads may return stale data temporarily, but all replicas converge eventually.
- Use case: Social media feeds, DNS, shopping carts
- Example: DynamoDB, Cassandra, S3
- Trade-off: Better performance and availability, but complexity in handling conflicts
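That "complexity in handling conflicts" usually comes down to a merge rule for replicas that have diverged. A minimal last-write-wins sketch (one common rule; the nested `Versioned` record is just for illustration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Last-write-wins (LWW) merge: when two replicas hold different values for the
// same key, the value with the newer timestamp wins. Simple, but concurrent
// writes can be silently discarded - one reason eventual consistency adds complexity.
public class LwwRegisterStore {
    record Versioned(String value, long timestampMillis) {}

    private final Map<String, Versioned> data = new ConcurrentHashMap<>();

    public void put(String key, String value, long timestampMillis) {
        data.merge(key, new Versioned(value, timestampMillis),
                (current, incoming) ->
                        incoming.timestampMillis() > current.timestampMillis() ? incoming : current);
    }

    public String get(String key) {
        Versioned v = data.get(key);
        return v == null ? null : v.value();
    }
}
```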
Real-world example:
- Twitter: Your tweet might not appear immediately to all followers (eventual consistency is acceptable)
- Bank transfer: Money must be deducted and credited atomically (strong consistency required)
Common pitfall: Over-engineering for strong consistency when eventual consistency suffices. Instagram doesn't need strong consistency for like counts.
Latency vs Throughput
Two critical performance metrics that often trade off against each other.
Latency
Time to complete a single operation (response time).
Examples:
- Google Search: < 200ms for search results
- Trading systems: < 10ms for order execution
- Gaming: < 50ms for responsive gameplay
How to reduce:
- Caching (Redis, CDN)
- Geographic distribution (edge servers)
- Async processing (don't wait for non-critical operations)
- Database indexing
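A sketch of the cache-aside pattern behind the caching item above (plain Java, with an in-memory map standing in for Redis and a caller-supplied loader standing in for the database query):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside: check the cache first, fall back to the slow source on a miss,
// then keep the result so the next read is fast.
public class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();  // stand-in for Redis
    private final Function<K, V> slowSource;                    // e.g. a database query

    public CacheAside(Function<K, V> slowSource) {
        this.slowSource = slowSource;
    }

    public V get(K key) {
        return cache.computeIfAbsent(key, slowSource);  // miss -> load -> cache
    }
}

// Usage (loadFromDatabase is a hypothetical loader):
//   CacheAside<Long, String> userCache = new CacheAside<>(id -> loadFromDatabase(id));
```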
Throughput
Number of operations completed per unit time.
Examples:
- Kafka: Millions of messages/second
- Netflix: 250 million hours of content streamed/day
- Payment gateway: Thousands of transactions/second
How to improve:
- Batching requests
- Parallel processing
- Horizontal scaling
- Connection pooling
The trade-off:
- Batching increases throughput but adds latency
- Processing requests individually reduces latency but limits throughput
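A sketch of the batching side of that trade-off (plain Java; the batch size is an illustrative assumption, similar in spirit to Kafka producer batching):

```java
import java.util.ArrayList;
import java.util.List;

// Collects individual requests into batches. Larger batches raise throughput
// (fewer round trips), but each request may wait for the batch to fill,
// which adds latency. A real batcher would also flush on a timer.
public class Batcher {
    private final int maxBatchSize;
    private final List<String> buffer = new ArrayList<>();

    public Batcher(int maxBatchSize) {
        this.maxBatchSize = maxBatchSize;
    }

    public synchronized void submit(String request) {
        buffer.add(request);
        if (buffer.size() >= maxBatchSize) {
            flush();                      // one network call for the whole batch
        }
    }

    public synchronized void flush() {
        if (buffer.isEmpty()) return;
        System.out.println("Sending batch of " + buffer.size() + " requests");
        buffer.clear();
    }
}
```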
Example: YouTube prioritizes throughput (serve millions of videos) over latency (buffering a few seconds is acceptable). High-frequency trading prioritizes latency (microseconds matter) over throughput.
Observability & Manageability
The ability to understand system behavior and diagnose issues quickly.
Three pillars:
1. Metrics
Numeric measurements over time.
- Examples: CPU usage, request rate, error rate, latency percentiles
- Tools: Prometheus, Grafana, CloudWatch
2. Logs
Discrete events with context.
- Examples: Error messages, request traces, audit logs
- Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk
- Best practice: Structured logging with correlation IDs
3. Traces
Request flow across distributed services.
- Examples: Track a single API call through 10 microservices
- Tools: Jaeger, Zipkin, AWS X-Ray
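A minimal sketch of the correlation-ID practice from the Logs pillar, assuming SLF4J with MDC (Mapped Diagnostic Context), a common Java logging setup; the `CheckoutHandler` class is hypothetical:

```java
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Attach a correlation ID to every log line for a request, so the same ID can
// be searched across services when tracing a failure.
public class CheckoutHandler {
    private static final Logger log = LoggerFactory.getLogger(CheckoutHandler.class);

    public void handle(String orderId) {
        MDC.put("correlationId", UUID.randomUUID().toString());
        try {
            log.info("Processing order {}", orderId);   // correlationId added via the log pattern
            // ... call payment, inventory, shipping services, passing the ID along ...
        } finally {
            MDC.clear();                                 // avoid leaking IDs across threads
        }
    }
}
```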
Real-world impact:
- Netflix: Chaos engineering (intentionally break things) to ensure quick recovery
- Google: SRE practices with error budgets and SLOs
- Amazon: "Two-pizza teams" with full ownership including on-call
Key metric: MTTR (Mean Time To Recovery) - How fast can you fix issues? Often more important than preventing all failures.
Fault Tolerance
The ability to continue operating correctly despite component failures.
Murphy's Law in distributed systems: If something can fail, it will fail.
Common failures:
- Network partitions (nodes can't communicate)
- Disk failures
- Process crashes
- Slow nodes ("gray failures" - not dead but degraded)
- Data center outages
Fault tolerance techniques:
1. Replication
Multiple copies of data/services.
- Leader-follower: One primary, multiple replicas (PostgreSQL, MongoDB)
- Multi-leader / leaderless: Multiple nodes accept writes (Cassandra, DynamoDB)
2. Consensus Algorithms
Ensure agreement despite failures.
- Raft: Easier to understand (etcd, Consul)
- Paxos: More complex but proven (Google Chubby)
3. Graceful Degradation
Reduce functionality instead of complete failure.
- Example: Netflix shows cached recommendations when recommendation service is down
4. Bulkheads
Isolate failures to prevent cascade.
- Example: Separate thread pools for different services
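A minimal bulkhead sketch using a separate bounded thread pool per downstream dependency (plain Java; the pool sizes and downstream calls are illustrative placeholders):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Bulkhead: each downstream dependency gets its own bounded thread pool, so a
// slow payment service can exhaust only its own pool, not the threads used
// for recommendations.
public class Bulkheads {
    private final ExecutorService paymentPool = Executors.newFixedThreadPool(10);
    private final ExecutorService recommendationPool = Executors.newFixedThreadPool(5);

    public Future<String> chargeCard(String orderId) {
        return paymentPool.submit(() -> callPaymentService(orderId));
    }

    public Future<String> recommend(String userId) {
        return recommendationPool.submit(() -> callRecommendationService(userId));
    }

    // Hypothetical downstream calls, shown as placeholders.
    private String callPaymentService(String orderId) { return "charged:" + orderId; }
    private String callRecommendationService(String userId) { return "recs-for:" + userId; }
}
```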
Real-world: AWS designs for "everything fails all the time" - entire availability zones can go down without service interruption.
Partitioning (Sharding)
Splitting data across multiple nodes to handle datasets larger than a single machine can store.
Why partition?
- Dataset too large for one machine (Twitter has billions of tweets)
- Query load too high for one machine
- Regulatory requirements (data locality)
Partitioning strategies:
1. Range-based
Split by key ranges (A-M on server 1, N-Z on server 2).
- Pro: Range queries are efficient
- Con: Risk of hot spots (uneven distribution)
2. Hash-based
Use hash function to distribute data.
- Pro: Even distribution
- Con: Range queries require scanning all partitions
- Example: Cassandra uses consistent hashing
3. Geographic
Partition by region.
- Example: European user data in EU servers (GDPR compliance)
Challenge: Cross-partition queries are expensive. Design your partition key carefully based on access patterns.
Real-world: Instagram partitions user data by user ID. Your photos are on specific shards, enabling fast lookups.
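A sketch of hash-based sharding applied to the user-ID example (plain Java; a production system would typically use consistent hashing or a lookup table so that adding shards doesn't reshuffle every key):

```java
// Hash-based sharding: route each user ID to a shard so all of that user's
// data lives on one node and lookups touch a single shard.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    public int shardFor(long userId) {
        // floorMod keeps the result non-negative even for negative hash values
        return Math.floorMod(Long.hashCode(userId), numShards);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(16);
        System.out.println("User 42 -> shard " + router.shardFor(42L));
    }
}
```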
CAP Theorem
A distributed system can guarantee at most two of the following three properties:
- Consistency: All nodes see the same data
- Availability: Every request gets a response
- Partition tolerance: System works despite network failures
In practice: Network partitions happen, so you must choose between consistency and availability.
CP systems (Consistency + Partition tolerance):
- Banking, inventory systems
- Example: HBase, MongoDB (with strong consistency)
AP systems (Availability + Partition tolerance):
- Social media, caching, analytics
- Example: Cassandra, DynamoDB, Riak
Real-world nuance: Most systems offer tunable consistency (DynamoDB, Cassandra) - you choose per operation.
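Tunable consistency is usually reasoned about as a quorum condition: with N replicas, W write acknowledgements, and R read acknowledgements, a read is guaranteed to overlap the latest acknowledged write whenever R + W > N. A small illustrative check (the W/R combinations mirror Cassandra-style ONE/QUORUM/ALL levels, but the code is not tied to any driver):

```java
// Quorum rule: with N replicas, a read sees the latest acknowledged write
// when R + W > N. Lower R or W for speed, raise them for stronger
// guarantees - chosen per operation.
public class QuorumCheck {
    static boolean readYourWrites(int replicas, int writeAcks, int readAcks) {
        return readAcks + writeAcks > replicas;
    }

    public static void main(String[] args) {
        int n = 3;
        System.out.println("W=1, R=1: " + readYourWrites(n, 1, 1));  // false -> eventual
        System.out.println("W=2, R=2: " + readYourWrites(n, 2, 2));  // true  -> quorum reads and writes
        System.out.println("W=3, R=1: " + readYourWrites(n, 3, 1));  // true  -> write ALL, read ONE
    }
}
```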
Key Takeaways
- No silver bullet: Every design decision is a trade-off
- Start simple: Don't build for Google scale on day one
- Measure everything: You can't improve what you don't measure
- Design for failure: Assume everything will fail
- Understand your requirements: Does your use case need strong consistency? What's acceptable downtime?