Back-of-the-Envelope Calculations

You are in a system design interview. The interviewer says "Design Instagram." You start describing your architecture, and the interviewer asks: "How much storage do you need for photos?"
Your mind goes blank. You know Instagram stores billions of photos, but what does that actually mean in terabytes? Should you admit you're guessing, or confidently throw out a number and hope for the best?
We've all been there: the moment when abstract concepts like scale suddenly need to become concrete numbers.
This is where most candidates lose points—not because they don't know architecture patterns, but because they can't do basic math about systems. They throw out random numbers or, worse, say "we'll figure it out later."
Here's the thing: you don't need to know the exact answer. What interviewers want to see is that you can think about the problem systematically. I've sat on both sides of these interviews, and I can tell you that showing your reasoning matters more than getting the perfect number.
Back-of-the-envelope calculations are quick, rough estimates that help you make informed design decisions. They're not about getting exact numbers; they're about understanding the order of magnitude. Are we dealing with gigabytes or petabytes? Hundreds of requests or millions?
This lesson teaches you how to do these calculations quickly and confidently.
From the trenches: At Google, every design doc includes a "back-of-the-envelope" section (source). Senior engineers routinely spot fatal flaws in 30 seconds by running the numbers. Systems that look elegant on paper sometimes need 10,000 database servers before code is written. Math saves time.
What You Will Learn
- The essential numbers every engineer should memorize
- How to estimate storage requirements
- How to calculate queries per second (QPS)
- How to estimate bandwidth needs
- How to determine server counts
- Common calculation patterns for interviews
- How to sanity-check your estimates
The Numbers You Must Know
Before you can estimate anything, you need reference points. Memorize the numbers below. They're your toolkit.
Time Conversions
```plaintext
1 second = 1,000 milliseconds (ms)
1 minute = 60 seconds
1 hour   = 3,600 seconds ≈ 4,000 (for quick math)
1 day    = 86,400 seconds ≈ 100,000 (10^5)
1 month  = 2.6 million seconds ≈ 2.5 × 10^6
1 year   = 31.5 million seconds ≈ 30 × 10^6
```
Why this matters: When someone says 100 million requests per day, you need to convert to requests per second to understand load.
```plaintext
100M requests/day ÷ 100K seconds/day (rough estimate) = 1,000 requests/second
```
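If you want to double-check these conversions yourself, here's a tiny Python sketch of the same shortcut; the constants are the rounded values above, not exact figures:

```python
# Rounded constants for quick estimation (not exact values).
SECONDS_PER_DAY = 100_000       # actual: 86,400
SECONDS_PER_MONTH = 2_500_000   # actual: ~2.6 million
SECONDS_PER_YEAR = 30_000_000   # actual: ~31.5 million

def per_second(daily_count: float) -> float:
    """Convert a daily total into a rough per-second rate."""
    return daily_count / SECONDS_PER_DAY

print(per_second(100_000_000))  # 100M requests/day -> ~1,000 requests/second
```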
Data Size Conversions
```plaintext
1 Byte = 8 bits
1 KB = 1,000 bytes ≈ 10^3
1 MB = 1,000 KB ≈ 10^6
1 GB = 1,000 MB ≈ 10^9
1 TB = 1,000 GB ≈ 10^12
1 PB = 1,000 TB ≈ 10^15
```
Common data sizes:
| Data Type | Typical Size |
|---|---|
| Character (ASCII) | 1 byte |
| Character (UTF-8, with emojis) | 1-4 bytes |
| Integer | 4-8 bytes |
| UUID | 16 bytes |
| Short URL/ID | 7-10 bytes |
| Tweet (140 chars) | ~280 bytes |
| Average web page | 2-3 MB |
| Average photo (compressed) | 200 KB - 2 MB |
| Average video (1 min, compressed) | 10-50 MB |
Latency Numbers
These help you understand where time goes in a system:
```plaintext
L1 cache reference:           1 ns
L2 cache reference:           4 ns
RAM reference:                100 ns
SSD random read:              15,000 ns (15 μs)
HDD random read:              2,000,000 ns (2 ms)
Round trip same datacenter:   500,000 ns (0.5 ms)
Round trip cross-continent:   150,000,000 ns (150 ms)
```
The key insight: Network and disk are SLOW compared to memory. That's why caching exists.
Classic reference: These numbers are from Jeff Dean's famous Numbers Everyone Should Know talk. They're slightly outdated (from ~2010), but the relative differences still hold. Memory is still 1000× faster than disk, and cross-continent calls are still painfully slow.
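To see how much these gaps matter in practice, here's a small sketch that mixes cache hits (served from RAM) with misses (falling through to SSD) to get an average read latency; the hit rates are illustrative assumptions, not measurements:

```python
# Rough latency figures from the table above, in nanoseconds.
RAM_NS = 100
SSD_NS = 15_000

def avg_read_latency_ns(cache_hit_rate: float) -> float:
    """Weighted average: hits are served from RAM, misses fall through to SSD."""
    return cache_hit_rate * RAM_NS + (1 - cache_hit_rate) * SSD_NS

for rate in (0.0, 0.90, 0.99):
    print(f"hit rate {rate:.0%}: {avg_read_latency_ns(rate):,.0f} ns")
# 0% -> 15,000 ns, 90% -> 1,590 ns, 99% -> 249 ns
```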
Server Capacities (Rough Estimates)
```plaintext
Single web server (modern):   1,000 - 10,000 requests/second
Single database server:       1,000 - 5,000 queries/second
Redis (single instance):      100,000+ operations/second
Kafka (single broker):        100,000+ messages/second
```
These vary wildly based on workload, but give you a starting point.
The Estimation Framework
For any system, you typically estimate four things:
- Storage - How much data do we need to store?
- QPS - How many queries/requests per second?
- Bandwidth - How much data flows through the system?
- Servers - How many machines do we need?
Let me walk through each with examples.
Estimating Storage
Let's start with something familiar. Think about your phone's photo library. If you're like me, you probably have a few thousand photos taking up maybe 20-30 GB. Now imagine Instagram, with 500 million people uploading photos every single day.
How do we go from "I have 3,000 photos" to "Instagram needs 150 petabytes"? Let me show you.
The formula is simple:
```plaintext
Storage = (Number of items) × (Size per item) × (Time period)
```
Example: Instagram Photo Storage
Given:
- 500 million daily active users (DAU)
- 10% of users post a photo each day
- Average photo size: 500 KB (after compression)
- Store photos for 5 years
Step 1: Photos per day
```plaintext
500M users × 10% posting = 50 million photos/day
```
Step 2: Storage per day
```plaintext
50M photos × 500 KB = 25 TB/day
```
Step 3: Storage for 5 years
```plaintext
25 TB/day × 365 days × 5 years = 45 PB
```
Step 4: Add overhead (replication, metadata)
- 3x replication: 45 PB × 3 = 135 PB
- Add 10% for metadata: ~150 PB total
The answer: ~150 petabytes for photo storage.
Wait, did we just say 150 petabytes? That sounds insane. But here's why it makes sense: 50 million photos per day, each half a megabyte, for five years, with 3x replication. The math doesn't lie.
Real-world note: Actual storage is often 2-5× raw data due to indexes, backups, logs, and internal data structures. We accounted for replication above, but production systems also need backup snapshots, database indexes, and write-ahead logs. For critical data, total storage can easily hit 5× the raw data size.
Pro tip: In interviews, round aggressively. 500 KB becomes 0.5 MB. 365 becomes 400. The goal is speed and correct order of magnitude, not precision.
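If it helps, here's the same estimate as a minimal Python sketch; every input (DAU, posting rate, photo size, retention, replication, metadata overhead) is an assumption from this example, not a real Instagram figure:

```python
def storage_estimate(dau, post_rate, item_size_bytes, years,
                     replication=3, metadata_overhead=0.10):
    """Items/day × size × retention, then replication and metadata overhead."""
    items_per_day = dau * post_rate
    bytes_per_day = items_per_day * item_size_bytes
    raw_bytes = bytes_per_day * 365 * years
    return raw_bytes * replication * (1 + metadata_overhead)

total = storage_estimate(dau=500_000_000, post_rate=0.10,
                         item_size_bytes=500_000, years=5)
print(f"{total / 1e15:.0f} PB")  # ~150 PB
```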
Example: Twitter Messages
Given:
- 300 million DAU
- Each user posts 2 tweets/day on average
- Average tweet: 200 bytes (text + metadata)
- Store for 10 years
Calculation:
```plaintext
Tweets per day:    300M × 2 = 600M tweets/day
Storage per day:   600M × 200 bytes = 120 GB/day
Storage per year:  120 GB × 365 = ~44 TB/year
Storage for 10y:   44 TB × 10 = 440 TB ≈ 0.5 PB
```
Text is cheap. Photos and videos are expensive.
Estimating QPS (Queries Per Second)
Okay, we've calculated storage, and it's quite massive. Now let's figure out the other critical number: how many requests per second will your system actually handle?
This is where things get interesting. QPS tells you whether you need 3 servers or 300.
```plaintext
QPS = (Daily active users × Actions per user per day) / Seconds per day
Peak QPS = Average QPS × Peak multiplier (usually 2-3x)
```
Example: URL Shortener
Given:
- 100 million URLs created per month
- Read-to-write ratio: 100:1
Write QPS:
```plaintext
100M URLs/month ÷ 2.5M seconds/month = 40 writes/second
Peak: 40 × 2 = 80 writes/second
```
Read QPS:
```plaintext
40 writes/second × 100 reads/write = 4,000 reads/second
Peak: 4,000 × 2 = 8,000 reads/second
```
The insight: 8,000 reads/second is significant but manageable with caching. If the cache hit rate is 90%, the database only sees 800 queries/second.
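Here's the same arithmetic as a short sketch; the monthly volume, read/write ratio, peak multiplier, and cache hit rate are the assumptions from this example:

```python
SECONDS_PER_MONTH = 2_500_000  # rounded

def url_shortener_qps(urls_per_month, read_write_ratio,
                      peak_multiplier=2, cache_hit_rate=0.90):
    """Return (avg write QPS, peak read QPS, read QPS that reaches the database)."""
    write_qps = urls_per_month / SECONDS_PER_MONTH
    peak_read_qps = write_qps * read_write_ratio * peak_multiplier
    db_read_qps = peak_read_qps * (1 - cache_hit_rate)
    return write_qps, peak_read_qps, db_read_qps

print(url_shortener_qps(100_000_000, 100))
# 40 writes/second, 8,000 peak reads/second, ~800 reads/second to the database
```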
Example: Social Media Feed
Given:
- 500 million DAU
- Users check feed 10 times per day
- Each feed load fetches 20 posts
Feed requests per day:
```plaintext
500M users × 10 checks = 5 billion feed requests/day
```
QPS:
```plaintext
5B requests ÷ 100K seconds/day = 50,000 requests/second
```
Database queries (if not cached):
```plaintext
50,000 requests × 20 posts = 1,000,000 queries/second
```
Stop for a second and think about what this means: one million queries per second. No wonder Facebook invested billions in infrastructure. This is the moment in interviews where I see people's eyes light up—when the math actually reveals something surprising about why systems are built the way they are.
This is why social media companies invest heavily in caching and pre-computation. No database can handle a million queries per second directly.
From the trenches: Twitter's feed generation was originally computed on every request. As they scaled, this became impossible. They moved to a fan-out on write model where feeds are pre-computed and cached (technical writeup). When read QPS exceeds database capacity, you pre-compute or fail.
Estimating Bandwidth
Alright, you've calculated storage and QPS. But here's the question everyone forgets to ask: How much data is actually flowing through your network?
Bandwidth is data flowing through your system per second, and it's often the hidden bottleneck.
```plaintext
Bandwidth = QPS × Size per request
```
Example: Video Streaming Service
Given:
- 1 million concurrent viewers
- Average video bitrate: 5 Mbps (megabits per second)
Bandwidth:
```plaintext
1M viewers × 5 Mbps = 5 Tbps (terabits per second)
```
Convert to bytes:
```plaintext
5 Tbps ÷ 8 = 625 GB/second
```
This is massive bandwidth. That's why video services use CDNs extensively—they distribute this load across hundreds of edge servers worldwide.
Real numbers: Netflix peaks at ~200 Gbps per edge server, with thousands of servers globally. YouTube serves over 1 billion hours of video daily. These numbers are only possible because of CDNs—no central datacenter could handle this bandwidth.
Example: API Service
Given:
- 10,000 API requests/second
- Average response size: 10 KB
Bandwidth:
```plaintext
10,000 requests × 10 KB = 100 MB/second outbound
```
Very manageable. A single server with a 1 Gbps connection handles 125 MB/second.
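Both bandwidth calculations come down to one multiplication; here's a hedged sketch using the numbers from these two examples (the bitrate and response size are the stated assumptions):

```python
def streaming_bandwidth_tbps(viewers, bitrate_mbps):
    """Concurrent viewers × per-stream bitrate, converted to terabits/second."""
    return viewers * bitrate_mbps / 1_000_000

def api_bandwidth_mb_per_s(qps, response_kb):
    """Requests per second × response size, in megabytes/second."""
    return qps * response_kb / 1_000

print(streaming_bandwidth_tbps(1_000_000, 5))  # 5.0 Tbps
print(api_bandwidth_mb_per_s(10_000, 10))      # 100.0 MB/second
```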
Estimating Server Count
Okay, we've calculated storage (massive), QPS (also massive), and bandwidth (you get it). But here's the question everyone actually cares about: how many servers will this take, and what is it going to cost?
I remember reviewing a design doc early in my career where someone proposed a single database for a system expecting 10 million users. The senior engineer didn't explain why it wouldn't work; instead, he just pulled out a piece of paper and ran the numbers in 30 seconds. The database would need to handle 50,000 queries per second. The proposed system could handle maybe 5,000. That kind of math saves you from building something that would crash on launch day and helps you make better architectural decisions early in the design process.
That's the power of back-of-envelope calculations.
```plaintext
Servers needed = Total QPS / QPS per server
```
Example: Web Application
Given:
- 50,000 requests/second peak
- Single server handles 5,000 requests/second
- Want 3x headroom for growth
Calculation:
```plaintext
Minimum servers: 50,000 / 5,000 = 10 servers
With headroom:   10 × 3 = 30 servers
```
But also consider:
- At least 2 servers per availability zone for redundancy
- If 3 AZs: minimum 6 servers just for redundancy
- Final answer: 30 application servers across 3 AZs
Example: Database Capacity
Given:
- 10,000 read queries/second
- 100 write queries/second
- Single PostgreSQL instance handles 3,000 queries/second
Primary database:
```plaintext
1 primary for writes (100 QPS is easy)
```
Read replicas:
```plaintext
10,000 reads / 3,000 per replica = 3.3 → 4 read replicas
```
But if you add a cache with 90% hit rate:
```plaintext
10,000 reads × 10% miss rate = 1,000 reads hit database
1,000 reads / 3,000 per replica = 0.33 → 1 replica is enough
```
The insight: Caching dramatically reduces database server requirements. You'll notice that a 90% cache hit rate just saved you from buying 3 extra database servers. That's real money.
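Here's a small sketch of that replica math; the 3,000 queries/second per instance is the assumed capacity from this example, not a benchmark:

```python
import math

def read_replicas_needed(read_qps, qps_per_replica=3_000, cache_hit_rate=0.0):
    """Replicas needed for the reads that the cache does not absorb."""
    db_reads = read_qps * (1 - cache_hit_rate)
    return max(1, math.ceil(db_reads / qps_per_replica))

print(read_replicas_needed(10_000))                       # 4 without a cache
print(read_replicas_needed(10_000, cache_hit_rate=0.90))  # 1 with a 90% hit rate
```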
From the trenches: Amazon's common pattern is "cache everything, invalidate intelligently." Services routinely achieve 95-99% cache hit rates. Teams regularly reduce database clusters from 50 servers to 5 just by adding proper caching. The ROI is massive.
Common Calculation Patterns
These patterns appear repeatedly in interviews. Memorize the approach.
Pattern 1: Daily to Per-Second
```plaintext
Per second = Daily count / 100,000
```
Example: 500 million daily requests = 5,000 requests/second
Pattern 2: Storage Over Time
```plaintext
Total storage = Daily storage × Days × Replication factor
```
Example: 10 GB/day for 3 years with 3x replication = 10 GB × 1,000 days × 3 = 30 TB
Note: Notice how I round up to the nearest convenient power of 10 (3 years ≈ 1,000 days).
Pattern 3: Read/Write Ratio Impact
Most systems are read-heavy (10:1 to 1000:1). Always ask about this ratio.
If you have 1,000 writes/second and 100:1 read ratio:
- Write load: 1,000/second (hits primary DB)
- Read load: 100,000/second (can be cached and replicated)
Pattern 4: Cache Hit Rate Impact
```plaintext
Database load = Total reads × (1 - cache hit rate)
```
Example: 100,000 reads/second with 95% cache hit rate = 100,000 × 0.05 = 5,000 reads/second to database
Pattern 5: Concurrent Users to QPS
```plaintext
QPS = Concurrent users × Requests per user per second
```
Example: 100,000 concurrent users, each making 1 request every 5 seconds = 100,000 / 5 = 20,000 requests/second
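If you want the patterns in one place, here's a sketch that captures most of them as tiny helper functions; the constants are the rounded values used throughout this lesson:

```python
SECONDS_PER_DAY = 100_000  # rounded from 86,400

def daily_to_per_second(daily_count):                      # Pattern 1
    return daily_count / SECONDS_PER_DAY

def storage_over_time(daily_bytes, days, replication=3):   # Pattern 2
    return daily_bytes * days * replication

def db_load_after_cache(total_reads, hit_rate):            # Pattern 4
    return total_reads * (1 - hit_rate)

def concurrent_to_qps(concurrent_users, seconds_between_requests):  # Pattern 5
    return concurrent_users / seconds_between_requests

print(daily_to_per_second(500_000_000))    # ~5,000 requests/second
print(db_load_after_cache(100_000, 0.95))  # ~5,000 reads/second reach the database
print(concurrent_to_qps(100_000, 5))       # 20,000 requests/second
```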
Red Flags
Your estimate is too high if:
- You need more than 1,000 servers for a startup
- Storage exceeds 1 PB for a non-video service
- QPS exceeds 1 million without caching
Your estimate is too low if:
- A popular consumer app needs only 1 database server
- Video storage is measured in terabytes, not petabytes
- Peak traffic equals average traffic
Quick Sanity Checks
- Does the number make sense? If your photo storage estimate is 10 GB for Instagram, something is wrong.
- Can money buy it? 100 PB of storage is ~$1M/month on AWS. Possible for big companies, unlikely for startups.
- Does the architecture support it? If you calculated 1M QPS but drew one database, there's a mismatch.
Worked Example: Design a URL Shortener
Let me walk through a complete estimation for a URL shortener like Bitly.
Step 1: Gather Requirements
- 100 million URLs shortened per month
- URLs accessed for 5 years on average
- Read-to-write ratio: 100:1
Step 2: QPS Calculations
Writes:
```plaintext
100M/month ÷ 2.5M seconds/month = 40 URL creations/second
Peak (2x): 80/second
```
Reads:
```plaintext
40 writes × 100 read ratio = 4,000 redirects/second
Peak (2x): 8,000/second
```
Step 3: Storage Calculations
Total URLs over 5 years:
```plaintext
100M/month × 12 months × 5 years = 6 billion URLs
```
Storage per URL:
- Short code: 7 bytes
- Long URL: 100 bytes average
- Timestamp: 8 bytes
- User ID: 8 bytes
- Total: ~125 bytes, round to 150 bytes with overhead
Total storage:
```plaintext
6 billion URLs × 150 bytes = 900 GB ≈ 1 TB
With 3x replication: 3 TB
```
Step 4: Bandwidth Calculations
Read bandwidth:
```plaintext
8,000 requests × 150 bytes = 1.2 MB/second
```
Step 5: Server Estimation
Application servers:
```plaintext
8,000 QPS ÷ 5,000 per server = 1.6 → 3 servers (for redundancy)
```
Database:
```plaintext
With 90% cache hit rate: 800 queries/second
1 PostgreSQL instance handles this easily
Add 1 replica for failover
```
Cache (Redis):
```plaintext
7,200 cache hits/second
Single Redis instance handles 100K+ ops/second
1 Redis, maybe 1 replica
```
Final Summary
| Component | Count | Justification |
|---|---|---|
| App Servers | 3-6 | 8K QPS, redundancy |
| Database (Primary) | 1 | 80 writes/second is trivial |
| Database (Replica) | 1 | Failover + some reads |
| Cache (Redis) | 1-2 | 7K ops/second |
| Storage | 3 TB | 6B URLs, 3x replication |
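To tie the worked example together, here's an end-to-end Python sketch that reproduces the headline numbers; every input is one of the assumptions stated in the steps above:

```python
import math

SECONDS_PER_MONTH = 2_500_000  # rounded

# Assumptions from the worked example above.
urls_per_month = 100_000_000
read_write_ratio = 100
peak_multiplier = 2
bytes_per_url = 150
years = 5
replication = 3
cache_hit_rate = 0.90
qps_per_app_server = 5_000

write_qps = urls_per_month / SECONDS_PER_MONTH                  # 40 writes/second
peak_read_qps = write_qps * read_write_ratio * peak_multiplier  # 8,000 reads/second
total_urls = urls_per_month * 12 * years                        # 6 billion URLs
storage_tb = total_urls * bytes_per_url * replication / 1e12    # ~2.7 TB replicated
db_read_qps = peak_read_qps * (1 - cache_hit_rate)              # ~800 queries/second
app_servers = math.ceil(peak_read_qps / qps_per_app_server)     # 2, before redundancy

print(write_qps, peak_read_qps, round(storage_tb, 1),
      round(db_read_qps), app_servers)
```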
Interview Estimates vs Real-World Estimates
The calculations you've learned will get you through interviews. But real production systems have complexities that interview math ignores. Here's what changes when you go from whiteboard to production.
1. Traffic Is Bursty, Not Smooth
Interview assumption:
- Uniform traffic throughout the day
- 2-3× multiplier for peak
- Predictable growth
Real-world reality:
- Traffic spikes from notifications, viral posts, or marketing campaigns
- Cron jobs create artificial load spikes
- Retries during partial outages amplify load
- Holiday shopping, sports events, breaking news = 10-50× normal load
What this means: In interviews, you say 2× peak multiplier. In production, you plan for 10× short bursts for critical paths. That's why systems have circuit breakers, rate limiting, and auto-scaling—they're built for chaos, not averages.
2. Cost Math Is Missing (But Critical)
In interviews, nobody asks can you afford this? In real life, that's often the first question.
Quick cost pattern:
```plaintext
Monthly cost ≈ (Storage × $/GB) + (Bandwidth × $/GB) + (Compute × $/hour)
```
Example: Your Instagram calculation showed 150 PB of photo storage.
```plaintext
150 PB = 150,000 TB
At ~$0.023/GB/month (AWS S3): ~$3.5M/month just for storage
Add bandwidth, compute, databases: ~$5-10M/month
```
The insight: Only companies with massive revenue can afford this. If you're designing a startup photo app, you need compression, deduplication, or a completely different model. Cost constraints drive architecture decisions.
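A rough sketch of that cost pattern, limited to the storage term; the per-GB price is the approximation quoted above, and real pricing varies by tier and region:

```python
# Assumed price, roughly in line with S3 standard storage.
STORAGE_PRICE_PER_GB_MONTH = 0.023  # USD, an approximation

def monthly_storage_cost_usd(petabytes):
    """Storage slice of the monthly bill, ignoring bandwidth and compute."""
    gigabytes = petabytes * 1_000_000
    return gigabytes * STORAGE_PRICE_PER_GB_MONTH

print(f"${monthly_storage_cost_usd(150):,.0f}/month")  # ~$3.45M for 150 PB
```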
3. Latency Chains Kill Performance
You listed latency numbers earlier. Now let's apply them to show why architecture matters.
Example: Synchronous Service Chain
If Service A → B → C are all synchronous calls:
- Service A calls B: 50 ms
- Service B calls C: 50 ms
- Service C queries DB: 10 ms
- Response travels back: 50 ms
Total latency: ~160 ms (best case)
But that's P50. At P99 with retries:
- Each service P99: 200 ms
- Add retry logic: 2× worst case
- P99 latency: 400-600 ms
Why this matters:
- 3 sync calls = unusable mobile experience
- 5+ sync calls = timeouts and cascading failures
- This is why event-driven architecture exists
In interviews, you draw the boxes. In production, you measure the latency and rewrite the whole thing async.
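To make the chain arithmetic concrete, here's a small sketch that sums per-hop latencies for a synchronous path; the hop values are the assumed numbers from this example, not measurements:

```python
def chain_latency_ms(hops_ms):
    """Synchronous calls add up: total latency is the sum of every hop."""
    return sum(hops_ms)

# Assumed best-case hops: A->B, B->C, C->DB query, response path back.
p50_hops = [50, 50, 10, 50]
print(chain_latency_ms(p50_hops))  # 160 ms best case

# Assumed tail case: each of the three service calls hits its ~200 ms P99.
p99_hops = [200, 200, 200]
print(chain_latency_ms(p99_hops))  # ~600 ms before any retries
```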
4. Hidden Multipliers Everywhere
Interview math: 1 TB of user data = 1 TB storage
Real production storage for that same 1 TB:
- Raw data: 1 TB
- Indexes: +30-50% (1.5 TB total)
- Replication: ×3 (4.5 TB)
- Backups/snapshots: +20% (5.4 TB)
- Write-ahead logs: +10% (6 TB)
- Actual storage: 6× the raw data
Similarly for compute:
- Interview: Need 10 servers
- Real: 10 servers + 3 for failover + 5 for staging + 2 for testing = 20 servers (this may be a bit exaggerated, depending on the team and its practices)
The gap: Interview math is 1×. Production math is 3-6×.
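Here's a sketch of those storage multipliers applied in sequence; the overhead percentages are the illustrative figures from this section, not universal constants:

```python
def production_storage_tb(raw_tb, index_overhead=0.5, replication=3,
                          backup_overhead=0.2, wal_overhead=0.1):
    """Stack the hidden multipliers on top of the raw data size."""
    with_indexes = raw_tb * (1 + index_overhead)
    replicated = with_indexes * replication
    with_backups = replicated * (1 + backup_overhead)
    return with_backups * (1 + wal_overhead)

print(round(production_storage_tb(1.0), 1))  # ~5.9 TB, roughly 6× the raw 1 TB
```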
Common Mistakes to Avoid
By the way, I still sometimes second-guess my estimates mid-interview. The difference now is that I can explain my reasoning and adjust when I spot mistakes. That's what we're really teaching here—not perfection, but structured thinking.
Here are the mistakes I see most often (and yes, I've made all of them):
Mistake 1: Forgetting Peak vs Average
Average traffic is useless for capacity planning. Always multiply by 2-3x for peak.
```plaintext
Wrong: 1,000 QPS average → need 1 server handling 1,000 QPS
Right: 1,000 QPS average → 2,500 QPS peak → need servers for 2,500 QPS
```
Mistake 2: Ignoring Replication
Storage estimates without replication are incomplete. Most production systems use 3x replication.
```plaintext
Wrong: 1 TB of data → need 1 TB storage
Right: 1 TB of data → 3 TB with replication → 3.5 TB with backups
```
Mistake 3: Not Considering Read vs Write
Lumping all operations together hides the real bottleneck.
```plaintext
Wrong: 100,000 operations/second → need big database
Right: 99,000 reads (cacheable) + 1,000 writes → cache + modest database
```
Mistake 4: Over-Precision
Spending 5 minutes calculating exact numbers is wasted time.
```plaintext
Wrong: 86,400 seconds/day × 365.25 days/year = 31,557,600 seconds/year
Right: ~30 million seconds/year (close enough)
```
Key Takeaways
- Memorize the reference numbers. Time conversions, data sizes, and latency numbers are your toolkit.
- Follow the framework. Storage → QPS → Bandwidth → Servers. Same pattern every time.
- Round aggressively. Use powers of 10. 86,400 becomes 100,000. Speed beats precision.
- Account for peak traffic. Multiply average by 2-3x. Systems fail during peaks, not averages.
- Sanity check against reality. Compare your estimates to known systems. If something feels off, it probably is.
- Show your work in interviews. The calculation process matters more than the final number.
Practice Problems
Alright, your turn. Grab a piece of paper (or your Notes app). Let's see if you can work through these without looking at the reference numbers. Don't worry about being perfect; just see if you can get within the right order of magnitude:
- Twitter DM Storage:
  - 200 million DAU, 10% send DMs daily
  - Average 5 DMs per active user
  - Average DM size: 500 bytes
  - Store for 3 years
  - How much storage?
- E-commerce QPS:
  - 50 million DAU
  - Each user views 20 products per session
  - Each product view makes 3 API calls
  - What is peak QPS?
- Video Platform Bandwidth:
  - 10 million concurrent viewers
  - 60% watch at 1080p (8 Mbps)
  - 40% watch at 720p (4 Mbps)
  - Total bandwidth needed?
I'll leave validating your answers to you; sanity-check them against the reference numbers above.
Additional Resources
For deeper learning on capacity planning and estimation:
Essential reads:
- Latency Numbers Every Programmer Should Know - Jeff Dean's classic reference
- System Design Primer - Comprehensive back-of-envelope guide
- AWS Architecture Blog - Real-world capacity planning
What's Next
Here's the truth: the first time you do these calculations, you'll probably feel slow and uncertain. That's normal. But after practicing with 5-10 system design problems, something clicks. You'll start seeing these patterns everywhere in the apps you use, in the news about cloud costs, in your own work. These numbers stop being abstract and start being tools.
The next time an interviewer asks how much storage Instagram needs, you won't panic. You'll smile, grab your imaginary piece of paper, and start calculating.
Ready to keep building? Next up: Scalability Fundamentals—where we take these numbers and figure out how to actually build systems that don't fall over when they grow. You'll learn vertical vs horizontal scaling, stateless design, and the key decisions that determine whether a system scales gracefully or collapses under load.