Back-of-the-Envelope Calculations

You are in a system design interview. The interviewer says "Design Instagram." You start describing your architecture, and the interviewer asks: "How much storage do you need for photos?"
Your mind goes blank. You know Instagram stores billions of photos, but what does that actually mean in terabytes? Should you admit you're guessing, or confidently throw out a number and hope for the best?
We've all been there: the moment when abstract concepts like scale suddenly need to become concrete numbers.
This is where most candidates lose points—not because they don't know architecture patterns, but because they can't do basic math about systems. They throw out random numbers or, worse, say "we'll figure it out later."
Here's the thing: you don't need to know the exact answer. What interviewers want to see is that you can think about the problem systematically. I've sat on both sides of these interviews, and I can tell you that showing your reasoning matters more than getting the perfect number.
Back-of-the-envelope calculations are quick, rough estimates that help you make informed design decisions. They're not about getting exact numbers; they're about understanding the order of magnitude. Are we dealing with gigabytes or petabytes? Hundreds of requests or millions?
This lesson teaches you how to do these calculations quickly and confidently.
From the trenches: At Google, every design doc includes a "back-of-the-envelope" section (source). Senior engineers routinely spot fatal flaws in 30 seconds by running the numbers. Systems that look elegant on paper sometimes need 10,000 database servers before code is written. Math saves time.
What You Will Learn
- The essential numbers every engineer should memorize
- How to estimate storage requirements
- How to calculate queries per second (QPS)
- How to estimate bandwidth needs
- How to determine server counts
- Common calculation patterns for interviews
- How to sanity-check your estimates
The Numbers You Must Know
Before you can estimate anything, you need reference points. Memorize the numbers below. They're your toolkit.
Time Conversions
```plaintext
1 second = 1,000 milliseconds (ms)
1 minute = 60 seconds
1 hour   = 3,600 seconds ≈ 4,000 (for quick math)
1 day    = 86,400 seconds ≈ 100,000 (10^5)
1 month  = 2.6 million seconds ≈ 2.5 × 10^6
1 year   = 31.5 million seconds ≈ 30 × 10^6
```
Why this matters: When someone says 100 million requests per day, you need to convert to requests per second to understand load.
```plaintext
100M requests/day ÷ 100K seconds/day (rough estimate) = 1,000 requests/second
```
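If you want to double-check these conversions yourself, here's a tiny Python sketch of the same shortcut; the constants are the rounded values above, not exact figures:

```python
# Rounded constants for quick estimation (not exact values).
SECONDS_PER_DAY = 100_000       # actual: 86,400
SECONDS_PER_MONTH = 2_500_000   # actual: ~2.6 million
SECONDS_PER_YEAR = 30_000_000   # actual: ~31.5 million

def per_second(daily_count: float) -> float:
    """Convert a daily total into a rough per-second rate."""
    return daily_count / SECONDS_PER_DAY

print(per_second(100_000_000))  # 100M requests/day -> ~1,000 requests/second
```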
Data Size Conversions
```plaintext
1 Byte = 8 bits
1 KB = 1,000 bytes ≈ 10^3
1 MB = 1,000 KB ≈ 10^6
1 GB = 1,000 MB ≈ 10^9
1 TB = 1,000 GB ≈ 10^12
1 PB = 1,000 TB ≈ 10^15
```
Common data sizes:
| Data Type | Typical Size |
|---|---|
| Character (ASCII) | 1 byte |
| Character (UTF-8, with emojis) | 1-4 bytes |
| Integer | 4-8 bytes |
| UUID | 16 bytes |
| Short URL/ID | 7-10 bytes |
| Tweet (140 chars) | ~280 bytes |
| Average web page | 2-3 MB |
| Average photo (compressed) | 200 KB - 2 MB |
| Average video (1 min, compressed) | 10-50 MB |
Latency Numbers
These help you understand where time goes in a system:
```plaintext
L1 cache reference:           1 ns
L2 cache reference:           4 ns
RAM reference:                100 ns
SSD random read:              15,000 ns (15 μs)
HDD random read:              2,000,000 ns (2 ms)
Round trip same datacenter:   500,000 ns (0.5 ms)
Round trip cross-continent:   150,000,000 ns (150 ms)
```
The key insight: Network and disk are SLOW compared to memory. That's why caching exists.
Classic reference: These numbers are from Jeff Dean's famous Numbers Everyone Should Know talk. They're slightly outdated (from ~2010), but the relative differences still hold. Memory is still 1000× faster than disk, and cross-continent calls are still painfully slow.
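To see how much these gaps matter in practice, here's a small sketch that mixes cache hits (served from RAM) with misses (falling through to SSD) to get an average read latency; the hit rates are illustrative assumptions, not measurements:

```python
# Rough latency figures from the table above, in nanoseconds.
RAM_NS = 100
SSD_NS = 15_000

def avg_read_latency_ns(cache_hit_rate: float) -> float:
    """Weighted average: hits are served from RAM, misses fall through to SSD."""
    return cache_hit_rate * RAM_NS + (1 - cache_hit_rate) * SSD_NS

for rate in (0.0, 0.90, 0.99):
    print(f"hit rate {rate:.0%}: {avg_read_latency_ns(rate):,.0f} ns")
# 0% -> 15,000 ns, 90% -> 1,590 ns, 99% -> 249 ns
```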
Server Capacities (Rough Estimates)
```plaintext
Single web server (modern):   1,000 - 10,000 requests/second
Single database server:       1,000 - 5,000 queries/second
Redis (single instance):      100,000+ operations/second
Kafka (single broker):        100,000+ messages/second
```
These vary wildly based on workload, but give you a starting point.
The Estimation Framework
For any system, you typically estimate four things:
- Storage - How much data do we need to store?
- QPS - How many queries/requests per second?
- Bandwidth - How much data flows through the system?
- Servers - How many machines do we need?
Let me walk through each with examples.
Estimating Storage
Let's start with something familiar. Think about your phone's photo library. If you're like me, you probably have a few thousand photos taking up maybe 20-30 GB. Now imagine Instagram, with 500 million people uploading photos every single day.
How do we go from "I have 3,000 photos" to "Instagram needs 150 petabytes"? Let me show you.
The formula is simple:
```plaintext
Storage = (Number of items) × (Size per item) × (Time period)
```
Example: Instagram Photo Storage
Given:
- 500 million daily active users (DAU)
- 10% of users post a photo each day
- Average photo size: 500 KB (after compression)
- Store photos for 5 years
Step 1: Photos per day
```plaintext
500M users × 10% posting = 50 million photos/day
```
Step 2: Storage per day
```plaintext
50M photos × 500 KB = 25 TB/day
```
Step 3: Storage for 5 years
```plaintext
25 TB/day × 365 days × 5 years = 45 PB
```
Step 4: Add overhead (replication, metadata)
- 3x replication: 45 PB × 3 = 135 PB
- Add 10% for metadata: ~150 PB total
The answer: ~150 petabytes for photo storage.
Wait, did we just say 150 petabytes? That sounds insane. But here's why it makes sense: 50 million photos per day, each half a megabyte, for five years, with 3x replication. The math doesn't lie.
Real-world note: Actual storage is often 2-5× raw data due to indexes, backups, logs, and internal data structures. We accounted for replication above, but production systems also need backup snapshots, database indexes, and write-ahead logs. For critical data, total storage can easily hit 5× the raw data size.
Pro tip: In interviews, round aggressively. 500 KB becomes 0.5 MB. 365 becomes 400. The goal is speed and correct order of magnitude, not precision.
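If it helps, here's the same estimate as a minimal Python sketch; every input (DAU, posting rate, photo size, retention, replication, metadata overhead) is an assumption from this example, not a real Instagram figure:

```python
def storage_estimate(dau, post_rate, item_size_bytes, years,
                     replication=3, metadata_overhead=0.10):
    """Items/day × size × retention, then replication and metadata overhead."""
    items_per_day = dau * post_rate
    bytes_per_day = items_per_day * item_size_bytes
    raw_bytes = bytes_per_day * 365 * years
    return raw_bytes * replication * (1 + metadata_overhead)

total = storage_estimate(dau=500_000_000, post_rate=0.10,
                         item_size_bytes=500_000, years=5)
print(f"{total / 1e15:.0f} PB")  # ~150 PB
```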
Example: Twitter Messages
Given:
- 300 million DAU
- Each user posts 2 tweets/day on average
- Average tweet: 200 bytes (text + metadata)
- Store for 10 years
Calculation:
```plaintext
Tweets per day:    300M × 2 = 600M tweets/day
Storage per day:   600M × 200 bytes = 120 GB/day
Storage per year:  120 GB × 365 = ~44 TB/year
Storage for 10y:   44 TB × 10 = 440 TB ≈ 0.5 PB
```
Text is cheap. Photos and videos are expensive.
Estimating QPS (Queries Per Second)
Okay, we've calculated storage, and it's quite massive. Now let's figure out the other critical number: how many requests per second will your system actually handle?
This is where things get interesting. QPS tells you whether you need 3 servers or 300.
```plaintext
QPS = (Daily active users × Actions per user per day) / Seconds per day
Peak QPS = Average QPS × Peak multiplier (usually 2-3x)
```
Example: URL Shortener
Given:
- 100 million URLs created per month
- Read-to-write ratio: 100:1
Write QPS:
```plaintext
100M URLs/month ÷ 2.5M seconds/month = 40 writes/second
Peak: 40 × 2 = 80 writes/second
```
Read QPS:
```plaintext
40 writes/second × 100 reads/write = 4,000 reads/second
Peak: 4,000 × 2 = 8,000 reads/second
```
The insight: 8,000 reads/second is significant but manageable with caching. If the cache hit rate is 90%, the database only sees 800 queries/second.
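Here's the same arithmetic as a short sketch; the monthly volume, read/write ratio, peak multiplier, and cache hit rate are the assumptions from this example:

```python
SECONDS_PER_MONTH = 2_500_000  # rounded

def url_shortener_qps(urls_per_month, read_write_ratio,
                      peak_multiplier=2, cache_hit_rate=0.90):
    """Return (avg write QPS, peak read QPS, read QPS that reaches the database)."""
    write_qps = urls_per_month / SECONDS_PER_MONTH
    peak_read_qps = write_qps * read_write_ratio * peak_multiplier
    db_read_qps = peak_read_qps * (1 - cache_hit_rate)
    return write_qps, peak_read_qps, db_read_qps

print(url_shortener_qps(100_000_000, 100))
# 40 writes/second, 8,000 peak reads/second, ~800 reads/second to the database
```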
Example: Social Media Feed
Given:
- 500 million DAU
- Users check feed 10 times per day
- Each feed load fetches 20 posts
Feed requests per day:
```plaintext
500M users × 10 checks = 5 billion feed requests/day
```
QPS:
```plaintext
5B requests ÷ 100K seconds/day = 50,000 requests/second
```
Database queries (if not cached):
```plaintext
50,000 requests × 20 posts = 1,000,000 queries/second
```
Stop for a second and think about what this means: one million queries per second. No wonder Facebook invested billions in infrastructure. This is the moment in interviews where I see people's eyes light up—when the math actually reveals something surprising about why systems are built the way they are.
This is why social media companies invest heavily in caching and pre-computation. No database can handle a million queries per second directly.
From the trenches: Twitter's feed generation was originally computed on every request. As they scaled, this became impossible. They moved to a fan-out on write model where feeds are pre-computed and cached (technical writeup). When read QPS exceeds database capacity, you pre-compute or fail.
Estimating Bandwidth
Alright, you've calculated storage and QPS. But here's the question everyone forgets to ask: How much data is actually flowing through your network?
Bandwidth is data flowing through your system per second, and it's often the hidden bottleneck.
```plaintext
Bandwidth = QPS × Size per request
```
Example: Video Streaming Service
Given:
- 1 million concurrent viewers
- Average video bitrate: 5 Mbps (megabits per second)
Bandwidth:
```plaintext
1M viewers × 5 Mbps = 5 Tbps (terabits per second)
```
Convert to bytes:
```plaintext
5 Tbps ÷ 8 = 625 GB/second
```
This is massive bandwidth. That's why video services use CDNs extensively—they distribute this load across hundreds of edge servers worldwide.
Real numbers: Netflix peaks at ~200 Gbps per edge server, with thousands of servers globally. YouTube serves over 1 billion hours of video daily. These numbers are only possible because of CDNs—no central datacenter could handle this bandwidth.
Example: API Service
Given:
- 10,000 API requests/second
- Average response size: 10 KB
Bandwidth:
```plaintext
10,000 requests × 10 KB = 100 MB/second outbound
```
Very manageable. A single server with a 1 Gbps connection handles 125 MB/second.
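Both bandwidth calculations come down to one multiplication; here's a hedged sketch using the numbers from these two examples (the bitrate and response size are the stated assumptions):

```python
def streaming_bandwidth_tbps(viewers, bitrate_mbps):
    """Concurrent viewers × per-stream bitrate, converted to terabits/second."""
    return viewers * bitrate_mbps / 1_000_000

def api_bandwidth_mb_per_s(qps, response_kb):
    """Requests per second × response size, in megabytes/second."""
    return qps * response_kb / 1_000

print(streaming_bandwidth_tbps(1_000_000, 5))  # 5.0 Tbps
print(api_bandwidth_mb_per_s(10_000, 10))      # 100.0 MB/second
```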
Estimating Server Count
Okay, we've calculated storage (massive), QPS (also massive), and bandwidth (you get it). But here's the question everyone actually cares about: how many servers will this take, and what is it going to cost?
I remember reviewing a design doc early in my career where someone proposed a single database for a system expecting 10 million users. The senior engineer didn't explain why it wouldn't work; instead, he just pulled out a piece of paper and ran the numbers in 30 seconds. The database would need to handle 50,000 queries per second. The proposed system could handle maybe 5,000. That kind of math saves you from building something that would crash on launch day and helps you make better architectural decisions early in the design process.
That's the power of back-of-envelope calculations.
```plaintext
Servers needed = Total QPS / QPS per server
```
Example: Web Application
Given:
- 50,000 requests/second peak
- Single server handles 5,000 requests/second
- Want 3x headroom for growth
Calculation:
```plaintext
Minimum servers: 50,000 / 5,000 = 10 servers
With headroom:   10 × 3 = 30 servers
```
But also consider:
- At least 2 servers per availability zone for redundancy
- If 3 AZs: minimum 6 servers just for redundancy
- Final answer: 30 application servers across 3 AZs
Example: Database Capacity
Given:
- 10,000 read queries/second
- 100 write queries/second
- Single PostgreSQL instance handles 3,000 queries/second
Primary database:
```plaintext
1 primary for writes (100 QPS is easy)
```
Read replicas:
```plaintext
10,000 reads / 3,000 per replica = 3.3 → 4 read replicas
```
But if you add a cache with 90% hit rate:
```plaintext
10,000 reads × 10% miss rate = 1,000 reads hit database
1,000 reads / 3,000 per replica = 0.33 → 1 replica is enough
```
The insight: Caching dramatically reduces database server requirements. You'll notice that a 90% cache hit rate just saved you from buying 3 extra database servers. That's real money.
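Here's a small sketch of that replica math; the 3,000 queries/second per instance is the assumed capacity from this example, not a benchmark:

```python
import math

def read_replicas_needed(read_qps, qps_per_replica=3_000, cache_hit_rate=0.0):
    """Replicas needed for the reads that the cache does not absorb."""
    db_reads = read_qps * (1 - cache_hit_rate)
    return max(1, math.ceil(db_reads / qps_per_replica))

print(read_replicas_needed(10_000))                       # 4 without a cache
print(read_replicas_needed(10_000, cache_hit_rate=0.90))  # 1 with a 90% hit rate
```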
From the trenches: Amazon's common pattern is "cache everything, invalidate intelligently." Services routinely achieve 95-99% cache hit rates. Teams regularly reduce database clusters from 50 servers to 5 just by adding proper caching. The ROI is massive.
Common Calculation Patterns
These patterns appear repeatedly in interviews. Memorize the approach.
Pattern 1: Daily to Per-Second
```plaintext
Per second = Daily count / 100,000
```
Example: 500 million daily requests = 5,000 requests/second
Pattern 2: Storage Over Time
```plaintext
Total storage = Daily storage × Days × Replication factor
```
Example: 10 GB/day for 3 years with 3x replication = 10 GB × 1,000 days × 3 = 30 TB
Note: Notice how I round up to the nearest convenient power of 10 (3 years ≈ 1,000 days).
Pattern 3: Read/Write Ratio Impact
Most systems are read-heavy (10:1 to 1000:1). Always ask about this ratio.
If you have 1,000 writes/second and 100:1 read ratio:
- Write load: 1,000/second (hits primary DB)
- Read load: 100,000/second (can be cached and replicated)
Pattern 4: Cache Hit Rate Impact
```plaintext
Database load = Total reads × (1 - cache hit rate)
```
Example: 100,000 reads/second with 95% cache hit rate = 100,000 × 0.05 = 5,000 reads/second to database
Pattern 5: Concurrent Users to QPS
```plaintext
QPS = Concurrent users × Requests per user per second
```
Example: 100,000 concurrent users, each making 1 request every 5 seconds = 100,000 / 5 = 20,000 requests/second
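If you want the patterns in one place, here's a sketch that captures most of them as tiny helper functions; the constants are the rounded values used throughout this lesson:

```python
SECONDS_PER_DAY = 100_000  # rounded from 86,400

def daily_to_per_second(daily_count):                      # Pattern 1
    return daily_count / SECONDS_PER_DAY

def storage_over_time(daily_bytes, days, replication=3):   # Pattern 2
    return daily_bytes * days * replication

def db_load_after_cache(total_reads, hit_rate):            # Pattern 4
    return total_reads * (1 - hit_rate)

def concurrent_to_qps(concurrent_users, seconds_between_requests):  # Pattern 5
    return concurrent_users / seconds_between_requests

print(daily_to_per_second(500_000_000))    # ~5,000 requests/second
print(db_load_after_cache(100_000, 0.95))  # ~5,000 reads/second reach the database
print(concurrent_to_qps(100_000, 5))       # 20,000 requests/second
```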
Red Flags
Your estimate is too high if:
- You need more than 1,000 servers for a startup
- Storage exceeds 1 PB for a non-video service
- QPS exceeds 1 million without caching
Your estimate is too low if:
- A popular consumer app needs only 1 database server
- Video storage is measured in terabytes, not petabytes
- Peak traffic equals average traffic
Quick Sanity Checks
- Does the number make sense? If your photo storage estimate is 10 GB for Instagram, something is wrong.
- Can money buy it? 100 PB of storage is ~$1M/month on AWS. Possible for big companies, unlikely for startups.
- Does the architecture support it? If you calculated 1M QPS but drew one database, there's a mismatch.
Worked Example: Design a URL Shortener
Let me walk through a complete estimation for a URL shortener like Bitly.
Step 1: Gather Requirements
- 100 million URLs shortened per month
- URLs accessed for 5 years on average
- Read-to-write ratio: 100:1
Step 2: QPS Calculations
Writes:
```plaintext
100M/month ÷ 2.5M seconds/month = 40 URL creations/second
Peak (2x): 80/second
```
Reads:
```plaintext
40 writes × 100 read ratio = 4,000 redirects/second
Peak (2x): 8,000/second
```
Step 3: Storage Calculations
Total URLs over 5 years:
```plaintext
100M/month × 12 months × 5 years = 6 billion URLs
```
Storage per URL:
- Short code: 7 bytes
- Long URL: 100 bytes average
- Timestamp: 8 bytes
- User ID: 8 bytes
- Total: ~125 bytes, round to 150 bytes with overhead
Total storage:
```plaintext
6 billion URLs × 150 bytes = 900 GB ≈ 1 TB
With 3x replication: 3 TB
```
Step 4: Bandwidth Calculations
Read bandwidth:
```plaintext
8,000 requests × 150 bytes = 1.2 MB/second
```
Step 5: Server Estimation
Application servers:
```plaintext
8,000 QPS ÷ 5,000 per server = 1.6 → 3 servers (for redundancy)
```
Database:
```plaintext
With 90% cache hit rate: 800 queries/second
1 PostgreSQL instance handles this easily
Add 1 replica for failover
```
Cache (Redis):
```plaintext
7,200 cache hits/second
Single Redis instance handles 100K+ ops/second
1 Redis, maybe 1 replica
```
Final Summary
| Component | Count | Justification |
|---|---|---|
| App Servers | 3-6 | 8K QPS, redundancy |
| Database (Primary) | 1 | 80 writes/second is trivial |
| Database (Replica) | 1 | Failover + some reads |
| Cache (Redis) | 1-2 | 7K ops/second |
| Storage | 3 TB | 6B URLs, 3x replication |
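To tie the worked example together, here's an end-to-end Python sketch that reproduces the headline numbers; every input is one of the assumptions stated in the steps above:

```python
import math

SECONDS_PER_MONTH = 2_500_000  # rounded

# Assumptions from the worked example above.
urls_per_month = 100_000_000
read_write_ratio = 100
peak_multiplier = 2
bytes_per_url = 150
years = 5
replication = 3
cache_hit_rate = 0.90
qps_per_app_server = 5_000

write_qps = urls_per_month / SECONDS_PER_MONTH                  # 40 writes/second
peak_read_qps = write_qps * read_write_ratio * peak_multiplier  # 8,000 reads/second
total_urls = urls_per_month * 12 * years                        # 6 billion URLs
storage_tb = total_urls * bytes_per_url * replication / 1e12    # ~2.7 TB replicated
db_read_qps = peak_read_qps * (1 - cache_hit_rate)              # ~800 queries/second
app_servers = math.ceil(peak_read_qps / qps_per_app_server)     # 2, before redundancy

print(write_qps, peak_read_qps, round(storage_tb, 1),
      round(db_read_qps), app_servers)
```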
Interview Estimates vs Real-World Estimates
The calculations you've learned will get you through interviews. But real production systems have complexities that interview math ignores. Here's what changes when you go from whiteboard to production.
1. Traffic Is Bursty, Not Smooth
Interview assumption:
- Uniform traffic throughout the day
- 2-3× multiplier for peak
- Predictable growth
Real-world reality:
- Traffic spikes from notifications, viral posts, or marketing campaigns
- Cron jobs create artificial load spikes
- Retries during partial outages amplify load
- Holiday shopping, sports events, breaking news = 10-50× normal load
What this means: In interviews, you say 2× peak multiplier. In production, you plan for 10× short bursts for critical paths. That's why systems have circuit breakers, rate limiting, and auto-scaling—they're built for chaos, not averages.
2. Cost Math Is Missing (But Critical)
In interviews, nobody asks can you afford this? In real life, that's often the first question.
Quick cost pattern:
```plaintext
Monthly cost ≈ (Storage × $/GB) + (Bandwidth × $/GB) + (Compute × $/hour)
```
Example: Your Instagram calculation showed 150 PB of photo storage.
```plaintext
150 PB = 150,000 TB
At ~$0.023/GB/month (AWS S3): ~$3.5M/month just for storage
Add bandwidth, compute, databases: ~$5-10M/month
```
The insight: Only companies with massive revenue can afford this. If you're designing a startup photo app, you need compression, deduplication, or a completely different model. Cost constraints drive architecture decisions.
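A rough sketch of that cost pattern, limited to the storage term; the per-GB price is the approximation quoted above, and real pricing varies by tier and region:

```python
# Assumed price, roughly in line with S3 standard storage.
STORAGE_PRICE_PER_GB_MONTH = 0.023  # USD, an approximation

def monthly_storage_cost_usd(petabytes):
    """Storage slice of the monthly bill, ignoring bandwidth and compute."""
    gigabytes = petabytes * 1_000_000
    return gigabytes * STORAGE_PRICE_PER_GB_MONTH

print(f"${monthly_storage_cost_usd(150):,.0f}/month")  # ~$3.45M for 150 PB
```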
3. Latency Chains Kill Performance
You listed latency numbers earlier. Now let's apply them to show why architecture matters.
Example: Synchronous Service Chain
If Service A → B → C are all synchronous calls:
- Service A calls B: 50 ms
- Service B calls C: 50 ms
- Service C queries DB: 10 ms
- Response travels back: 50 ms
Total latency: ~160 ms (best case)
But that's P50. At P99 with retries:
- Each service P99: 200 ms
- Add retry logic: 2× worst case
- P99 latency: 400-600 ms
Why this matters:
- 3 sync calls = unusable mobile experience
- 5+ sync calls = timeouts and cascading failures
- This is why event-driven architecture exists
In interviews, you draw the boxes. In production, you measure the latency and rewrite the whole thing async.
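To make the chain arithmetic concrete, here's a small sketch that sums per-hop latencies for a synchronous path; the hop values are the assumed numbers from this example, not measurements:

```python
def chain_latency_ms(hops_ms):
    """Synchronous calls add up: total latency is the sum of every hop."""
    return sum(hops_ms)

# Assumed best-case hops: A->B, B->C, C->DB query, response path back.
p50_hops = [50, 50, 10, 50]
print(chain_latency_ms(p50_hops))  # 160 ms best case

# Assumed tail case: each of the three service calls hits its ~200 ms P99.
p99_hops = [200, 200, 200]
print(chain_latency_ms(p99_hops))  # ~600 ms before any retries
```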
4. Hidden Multipliers Everywhere
Interview math: 1 TB of user data = 1 TB storage
Real production storage for that same 1 TB:
- Raw data: 1 TB
- Indexes: +30-50% (1.5 TB total)
- Replication: ×3 (4.5 TB)
- Backups/snapshots: +20% (5.4 TB)
- Write-ahead logs: +10% (6 TB)
- Actual storage: 6× the raw data
Similarly for compute:
- Interview: Need 10 servers
- Real: 10 servers + 3 for failover + 5 for staging + 2 for testing = 20 servers (this may be a bit exaggerated, depending on the team and its practices)
The gap: Interview math is 1×. Production math is 3-6×.
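Here's a sketch of those storage multipliers applied in sequence; the overhead percentages are the illustrative figures from this section, not universal constants:

```python
def production_storage_tb(raw_tb, index_overhead=0.5, replication=3,
                          backup_overhead=0.2, wal_overhead=0.1):
    """Stack the hidden multipliers on top of the raw data size."""
    with_indexes = raw_tb * (1 + index_overhead)
    replicated = with_indexes * replication
    with_backups = replicated * (1 + backup_overhead)
    return with_backups * (1 + wal_overhead)

print(round(production_storage_tb(1.0), 1))  # ~5.9 TB, roughly 6× the raw 1 TB
```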
Common Mistakes to Avoid
By the way, I still sometimes second-guess my estimates mid-interview. The difference now is that I can explain my reasoning and adjust when I spot mistakes. That's what we're really teaching here—not perfection, but structured thinking.
Here are the mistakes I see most often (and yes, I've made all of them):
Mistake 1: Forgetting Peak vs Average
Average traffic is useless for capacity planning. Always multiply by 2-3x for peak.
```plaintext
Wrong: 1,000 QPS average → need 1 server handling 1,000 QPS
Right: 1,000 QPS average → 2,500 QPS peak → need servers for 2,500 QPS
```
Mistake 2: Ignoring Replication
Storage estimates without replication are incomplete. Most production systems use 3x replication.
```plaintext
Wrong: 1 TB of data → need 1 TB storage
Right: 1 TB of data → 3 TB with replication → 3.5 TB with backups
```
Mistake 3: Not Considering Read vs Write
Lumping all operations together hides the real bottleneck.
```plaintext
Wrong: 100,000 operations/second → need big database
Right: 99,000 reads (cacheable) + 1,000 writes → cache + modest database
```
Mistake 4: Over-Precision
Spending 5 minutes calculating exact numbers is wasted time.
```plaintext
Wrong: 86,400 seconds/day × 365.25 days/year = 31,557,600 seconds/year
Right: ~30 million seconds/year (close enough)
```
Key Takeaways
- Memorize the reference numbers. Time conversions, data sizes, and latency numbers are your toolkit.
- Follow the framework. Storage → QPS → Bandwidth → Servers. Same pattern every time.
- Round aggressively. Use powers of 10. 86,400 becomes 100,000. Speed beats precision.
- Account for peak traffic. Multiply average by 2-3x. Systems fail during peaks, not averages.
- Sanity check against reality. Compare your estimates to known systems. If something feels off, it probably is.
- Show your work in interviews. The calculation process matters more than the final number.
Practice Problems
Alright, your turn. Grab a piece of paper (or your Notes app). Let's see if you can work through these without looking at the reference numbers. Don't worry about being perfect; just see if you can get within the right order of magnitude:
- Twitter DM Storage:
  - 200 million DAU, 10% send DMs daily
  - Average 5 DMs per active user
  - Average DM size: 500 bytes
  - Store for 3 years
  - How much storage?
- E-commerce QPS:
  - 50 million DAU
  - Each user views 20 products per session
  - Each product view makes 3 API calls
  - What is peak QPS?
- Video Platform Bandwidth:
  - 10 million concurrent viewers
  - 60% watch at 1080p (8 Mbps)
  - 40% watch at 720p (4 Mbps)
  - Total bandwidth needed?
I'll leave validating your answers to you; sanity-check them against the reference numbers above.
Additional Resources
For deeper learning on capacity planning and estimation:
Essential reads:
- Latency Numbers Every Programmer Should Know - Jeff Dean's classic reference
- System Design Primer - Comprehensive back-of-envelope guide
- AWS Architecture Blog - Real-world capacity planning
What's Next
Here's the truth: the first time you do these calculations, you'll probably feel slow and uncertain. That's normal. But after practicing with 5-10 system design problems, something clicks. You'll start seeing these patterns everywhere in the apps you use, in the news about cloud costs, in your own work. These numbers stop being abstract and start being tools.
The next time an interviewer asks how much storage Instagram needs, you won't panic. You'll smile, grab your imaginary piece of paper, and start calculating.
Ready to keep building? Next up: Scalability Fundamentals—where we take these numbers and figure out how to actually build systems that don't fall over when they grow. You'll learn vertical vs horizontal scaling, stateless design, and the key decisions that determine whether a system scales gracefully or collapses under load.