Cost Optimization | System Design Fundamentals

You built the system. It works. It scales. It's secure. Then you get the cloud bill.

$47,000. For one month.

Welcome to the club. Every engineering team has that moment where they realize cloud computing isn't cheap; it just delays the expense.

The good news: most teams waste 30-40% of their cloud spend. Finding that waste isn't hard once you know where to look.

What You Will Learn

Where cloud costs actually come from
The biggest sources of waste (and how to find them)
Right-sizing: paying for what you use
Reserved instances and spot pricing
Architecture decisions that save money
Building a cost-aware culture
Real numbers from real companies

The Grocery Store Analogy

Think about how you shop for groceries.

The expensive way: Grab whatever looks good. Buy stuff you already have at home. Let food expire. Order delivery every time.

The smart way: Check what you need. Buy in bulk for staples. Use what you buy. Pick up groceries on your way home.

Same food. Half the cost.

Cloud spending works the same way. The services are the same. The architecture can be the same. But small choices add up to huge differences.

Where Does the Money Go?

Let's break down a typical cloud bill.

Category	Typical % of Bill
Compute (EC2, VMs)	40-60%
Storage (S3, EBS)	15-25%
Database (RDS, DynamoDB)	10-20%
Data Transfer	5-15%
Other (Queues, Lambda, etc.)	5-10%

Compute is usually the biggest chunk. Start there.

The Low-Hanging Fruit

1. Turn Off What You're Not Using

Sounds obvious. But you'd be amazed.

Development environments running 24/7

Your dev and staging servers don't need to run at 3 AM.

plaintext
# Turn off at 8 PM, turn on at 8 AM
# 12 hours off = 50% savings on dev environments

AWS, GCP, and Azure all have scheduled start/stop. Set it up once, save forever.
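To see how much a schedule is worth before setting one up, a rough calculation helps. The sketch below is illustrative only: `schedule_savings` and its parameters are hypothetical names, and it counts compute hours only.

```python
HOURS_PER_WEEK = 24 * 7


def schedule_savings(on_hours_per_day: float, run_on_weekends: bool) -> float:
    """Return the fraction of compute hours saved versus running 24/7."""
    if not 0 <= on_hours_per_day <= 24:
        raise ValueError("on_hours_per_day must be between 0 and 24")
    days_on = 7 if run_on_weekends else 5
    on_hours = on_hours_per_day * days_on
    return 1 - on_hours / HOURS_PER_WEEK


# 8 AM to 8 PM every day: half the hours are off.
every_day = schedule_savings(12, run_on_weekends=True)

# Same schedule, but also off all weekend.
weekdays_only = schedule_savings(12, run_on_weekends=False)
```

Disks attached to stopped instances are still billed, so the real saving on the whole environment is a little lower than the compute figure.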

Orphaned resources

EBS volumes from deleted instances
Old snapshots nobody needs
Load balancers pointing to nothing
IP addresses allocated but unused

These accumulate quietly. Run a cleanup sweep monthly.
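A cleanup sweep can start as a simple filter. This is a minimal sketch, assuming you already have a list of volume records shaped like the entries in an EC2 `describe_volumes` response (`VolumeId`, `Size` in GB, `State`). The sample data and the $0.08 per GB-month price are made up for illustration.

```python
def find_unattached_volumes(volumes: list) -> list:
    """Return volumes in the 'available' state, meaning attached to nothing."""
    return [v for v in volumes if v.get("State") == "available"]


def monthly_waste(volumes: list, price_per_gb_month: float) -> float:
    """Estimate what the given volumes cost per month."""
    return sum(v["Size"] for v in volumes) * price_per_gb_month


# Hypothetical sample data, not real resources.
sample = [
    {"VolumeId": "vol-aaa", "Size": 100, "State": "in-use"},
    {"VolumeId": "vol-bbb", "Size": 500, "State": "available"},
    {"VolumeId": "vol-ccc", "Size": 200, "State": "available"},
]

orphans = find_unattached_volumes(sample)
waste = monthly_waste(orphans, price_per_gb_month=0.08)  # assumed price
```

Treat the output as a review list, not a delete list. Someone may be keeping a detached volume on purpose.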

That experiment from 6 months ago

Someone spun up a Kubernetes cluster to test something. They forgot about it. It's been running for 6 months at $2,000/month.

Tag everything. Review untagged resources monthly. Delete ruthlessly.

2. Right-Size Your Instances

Most instances are oversized. Way oversized.

The common pattern:

Day 1: "Let's start with a big instance to be safe."
Day 30: Traffic is 10% of what we estimated.
Day 365: Still running the big instance. Nobody wants to touch it.

The fix:

Check actual utilization:

CPU averaging 15%? You're 3-4x oversized.
Memory at 20%? Same problem.

plaintext
Before: m5.4xlarge (16 vCPU, 64GB RAM) - $0.768/hour = $560/month
After:  m5.xlarge (4 vCPU, 16GB RAM) - $0.192/hour = $140/month

Savings: $420/month per instance

Multiply by 50 instances and you're saving $21,000/month.

How to right-size:

Enable detailed monitoring (CloudWatch, etc.)
Look at p95 CPU and memory over 30 days
Pick the smallest instance that handles your peak + 20% headroom
Automate scaling for spikes instead of provisioning for peak
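The steps above can be sketched as a small selection function. This is an illustration under assumptions: the catalog is a hand-written list (check current prices), `right_size` is a hypothetical helper, and utilization samples are percentages of the current instance's capacity.

```python
HOURS_PER_MONTH = 730  # common convention for monthly estimates

# Illustrative catalog, smallest first: (name, vCPUs, memory GB, on-demand $/hour).
CATALOG = [
    ("m5.large", 2, 8, 0.096),
    ("m5.xlarge", 4, 16, 0.192),
    ("m5.2xlarge", 8, 32, 0.384),
    ("m5.4xlarge", 16, 64, 0.768),
]


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer form of ceil(0.95 * n)
    return ordered[rank - 1]


def right_size(cpu_pct, mem_pct, current_vcpus, current_mem_gb, headroom=0.20):
    """Pick the smallest catalog entry covering p95 CPU and memory plus headroom."""
    need_vcpus = p95(cpu_pct) / 100 * current_vcpus * (1 + headroom)
    need_mem = p95(mem_pct) / 100 * current_mem_gb * (1 + headroom)
    for name, vcpus, mem_gb, hourly in CATALOG:
        if vcpus >= need_vcpus and mem_gb >= need_mem:
            return name, hourly * HOURS_PER_MONTH
    raise ValueError("no instance in the catalog is large enough")


# An m5.4xlarge sitting at 15% CPU and 20% memory.
name, monthly_cost = right_size([15] * 100, [20] * 100, 16, 64)
```

With those inputs the function lands on the m5.xlarge from the example above. Real workloads also have network and disk limits, which this sketch ignores.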

3. Use Reserved Instances and Savings Plans

On-demand pricing is expensive. It's the convenience store markup.

The options:

Type	Commitment	Savings
On-Demand	None	0% (full price)
Savings Plans	1 or 3 years	30-60%
Reserved Instances	1 or 3 years	30-72%
Spot Instances	None (can be interrupted)	60-90%

What to reserve:

Look at your baseline. What's always running?

Production databases
Core API servers
Baseline traffic handlers

Reserve those. Save 40-60%.

What to keep on-demand:

Unpredictable workloads. New services. Anything that might change.
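One way to find the baseline is to look at how many instances were running in each hour and reserve only the floor. The sketch below assumes a single instance type and a flat discount; `reservation_plan` and its inputs are hypothetical.

```python
def reservation_plan(hourly_instance_counts, on_demand_rate, reserved_discount):
    """Reserve the always-on baseline; run everything above it on-demand."""
    baseline = min(hourly_instance_counts)
    hours = len(hourly_instance_counts)
    reserved_rate = on_demand_rate * (1 - reserved_discount)

    all_on_demand = sum(hourly_instance_counts) * on_demand_rate
    reserved_cost = baseline * hours * reserved_rate
    burst_cost = sum(n - baseline for n in hourly_instance_counts) * on_demand_rate

    return {
        "baseline_instances": baseline,
        "all_on_demand": all_on_demand,
        "with_reservation": reserved_cost + burst_cost,
    }


# One day: 4 instances overnight, 10 during the day, 40% reservation discount.
plan = reservation_plan([4] * 12 + [10] * 12, on_demand_rate=0.192,
                        reserved_discount=0.40)
```

Using the minimum is deliberately conservative. A low percentile over a few months of data may be a better floor if you have occasional dips.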

Spot instances:

AWS sells spare capacity cheap. But it can take that capacity back with two minutes' notice.

Good for:

Batch processing
CI/CD pipelines
Development environments
Stateless workers that can restart

Not good for:

Production databases
User-facing traffic (unless you have fallback)
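What makes a worker safe for spot is that it can pick up where it left off. Here is a minimal sketch of that idea. The `checkpoint` dict stands in for durable storage such as a database row or an object in S3, and `interrupted` stands in for whatever check you use to detect the interruption notice. Both are placeholders, not a real API.

```python
def run_batch(items, handle, checkpoint, interrupted):
    """Process items from the saved position; stop cleanly when interrupted.

    Returns True if the batch finished, False if it stopped early.
    """
    start = checkpoint.get("next_index", 0)
    for i in range(start, len(items)):
        if interrupted():
            return False
        handle(items[i])
        # Record progress only after the item is fully handled.
        checkpoint["next_index"] = i + 1
    return True
```

If the instance dies between `handle` and the checkpoint write, that one item runs again on restart, so the handler should be safe to repeat.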

4. Storage Lifecycle Policies

Not all data is equal. Last week's logs matter. Last year's logs probably don't.

S3 storage classes (prices change over time):

Class	Cost (per GB/month)	Use for
Standard	$0.023	Frequently accessed
Infrequent Access	$0.0125	Monthly access
Glacier	$0.004	Archives (minutes to retrieve)
Glacier Deep Archive	$0.00099	Archives (hours to retrieve)
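To get a feel for what those per-GB differences mean on a real bill, here is a small calculation using the prices from the table. The class keys and the 10 TB split are made up for illustration, and the sketch leaves out retrieval fees and minimum storage durations.

```python
# Prices per GB-month, copied from the table above.
PRICE_PER_GB_MONTH = {
    "standard": 0.023,
    "infrequent_access": 0.0125,
    "glacier": 0.004,
    "deep_archive": 0.00099,
}


def monthly_storage_cost(gb_by_class):
    """Sum the monthly cost for a mapping of storage class -> GB stored."""
    return sum(PRICE_PER_GB_MONTH[cls] * gb for cls, gb in gb_by_class.items())


# 10 TB kept entirely in Standard.
all_standard = monthly_storage_cost({"standard": 10_000})

# The same 10 TB, with most of it moved to colder classes (hypothetical split).
tiered = monthly_storage_cost(
    {"standard": 1_000, "infrequent_access": 2_000, "glacier": 7_000}
)

savings_fraction = 1 - tiered / all_standard
```

The saving depends entirely on how much of your data is cold, so plug in your own split.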

Lifecycle policy example:

plaintext
Day 1-30: Standard (frequently accessed)
Day 31-90: Infrequent Access
Day 91-365: Glacier
Day 366+: Delete or Deep Archive

Set this up once. Depending on how much of your data is old, your storage bill can drop 50-70%.
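As a rough sketch, the policy above could be written as a lifecycle configuration like the one below. The dict follows the shape that boto3's `put_bucket_lifecycle_configuration` takes for its `LifecycleConfiguration` argument, as far as I know. The rule ID, the prefix, and the `build_lifecycle_config` helper are assumptions for illustration.

```python
def build_lifecycle_config(prefix, archive_instead_of_delete=False):
    """Build a lifecycle configuration: IA at 30 days, Glacier at 90,
    then delete (or move to Deep Archive) at 365."""
    transitions = [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"},
    ]
    rule = {
        "ID": "tier-then-expire",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": transitions,
    }
    if archive_instead_of_delete:
        transitions.append({"Days": 365, "StorageClass": "DEEP_ARCHIVE"})
    else:
        rule["Expiration"] = {"Days": 365}
    return {"Rules": [rule]}


# Hypothetical prefix; logs older than a year are deleted.
config = build_lifecycle_config("logs/")
```

Try a rule like this on a test bucket first. An expiration rule deletes data, and there is no undo.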

Database storage:

Delete old backups (do you need 365 daily backups?)
Remove unused indexes (they take space too)
Archive old records to cheaper storage

Architecture Decisions That Save Money

Some decisions have long-term cost implications.

Serverless vs Always-On

Always-on (EC2, containers):

Pay whether there's traffic or not
Good for: steady, predictable traffic

Serverless (Lambda, Cloud Functions):

Pay per request
Good for: spiky or low traffic

The math:

Lambda: $0.0000002 per request + compute time
EC2 t3.small: $0.0208/hour = $15/month

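If you count only the request charge, the break-even is about 75 million requests/month ($15 ÷ $0.0000002). Compute time pulls that number down, sometimes a lot, so run the math with your own memory size and duration. Below the break-even, Lambda might be cheaper. Above it, EC2 wins.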
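Here is the break-even worked out with compute time included. It is a sketch under assumptions: the per-GB-second price is an assumed list price (check current pricing), the free tier is ignored, and it takes for granted that one t3.small can actually handle the load.

```python
REQUEST_PRICE = 0.0000002       # $ per request, from the text
GB_SECOND_PRICE = 0.0000166667  # $ per GB-second, assumed list price
EC2_MONTHLY = 15.0              # t3.small, from the text


def lambda_monthly_cost(requests, memory_mb, avg_duration_ms):
    """Request charge plus compute charge for one month."""
    gb_seconds = requests * (memory_mb / 1024) * (avg_duration_ms / 1000)
    return requests * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE


def break_even_requests(memory_mb, avg_duration_ms):
    """Monthly request count at which Lambda costs the same as the EC2 instance."""
    cost_per_request = lambda_monthly_cost(1, memory_mb, avg_duration_ms)
    return EC2_MONTHLY / cost_per_request


request_charge_only = break_even_requests(128, 0)
small_fast_function = break_even_requests(128, 100)
bigger_slower_function = break_even_requests(512, 200)
```

With these assumed prices, a 128 MB function running 100 ms breaks even at roughly half the request-only figure, and a 512 MB function running 200 ms breaks even far earlier.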

But Lambda has hidden costs:

Cold starts (might need more memory)
Higher per-request cost at scale
Vendor lock-in

Multi-Region: Do You Need It?

Multi-region is great for reliability. It's also 2x (or more) the cost.

Questions to ask:

What's your uptime requirement? (99.9% vs 99.99%)
What's the cost of downtime per hour?
Does multi-region actually help your users?

For many companies, one region with multiple availability zones is enough. True multi-region is for companies where an hour of downtime costs more than the infrastructure.
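Those questions turn into simple arithmetic. The sketch below converts an uptime target into hours of allowed downtime per year, then compares the cost of the downtime you would avoid with the extra infrastructure spend. The function names and the dollar figures are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def allowed_downtime_hours(uptime_percent):
    """Hours of downtime per year that still meet the uptime target."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def multi_region_pays_off(extra_infra_per_year, downtime_cost_per_hour,
                          hours_avoided_per_year):
    """True if the avoided downtime is worth more than the extra spend."""
    return downtime_cost_per_hour * hours_avoided_per_year > extra_infra_per_year


three_nines = allowed_downtime_hours(99.9)    # about 8.76 hours/year
four_nines = allowed_downtime_hours(99.99)    # about 0.876 hours/year
hours_avoided = three_nines - four_nines

# Hypothetical: $120,000/year extra for a second region.
worth_it_at_50k = multi_region_pays_off(120_000, 50_000, hours_avoided)
worth_it_at_5k = multi_region_pays_off(120_000, 5_000, hours_avoided)
```

This treats the uptime targets as what you would actually get, which is optimistic. It is a starting point for the conversation, not a forecast.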

Caching: The Cheapest Optimization

A cache hit is 10-100x cheaper than a database query.

Before: 1,000 requests/second × database query = huge database bill

After: 950 cache hits + 50 database queries = small cache bill + tiny database bill

Redis on ElastiCache costs ~$12/month for a small instance. That can eliminate thousands in database costs.
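The pattern behind those numbers is usually cache-aside: check the cache, and only go to the database on a miss. Here is a minimal in-memory sketch. `CacheAside` and `load_from_db` are hypothetical names, and a plain dict stands in for Redis.

```python
class CacheAside:
    """Check the cache first; on a miss, load from the database and remember it."""

    def __init__(self, load_from_db):
        self._load = load_from_db
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._load(key)  # the expensive call we want to avoid
        self._cache[key] = value
        return value


# 1,000 requests spread over 50 distinct keys.
cache = CacheAside(load_from_db=lambda key: f"row-{key}")
for i in range(1000):
    cache.get(i % 50)
```

Only the first request for each key reaches the database. A real cache also needs expiry and invalidation, which this sketch leaves out.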

CDN for Static Content

Serving images from your server is expensive. Every request costs compute and bandwidth.

Serving from a CDN:

Cheaper per GB
Faster for users
Less load on your servers

CloudFront, Cloudflare, etc. often pay for themselves immediately.

Data Transfer: The Hidden Cost

Data transfer is the cost nobody thinks about until the bill arrives.

The Charges

Transfer	Cost
Into AWS	Free
Within same AZ	Free
Between AZs in same region	$0.01/GB in each direction
Between regions	$0.02-0.09/GB
Out to internet	$0.09/GB

That $0.09/GB adds up fast if you're serving video or large files.
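A quick estimate makes the point. The prices below are copied from the table and treated as flat rates, which is a simplification. The path names and the traffic volumes are made up.

```python
# $ per GB, from the table above (cross-region uses the low end of the range).
PRICE_PER_GB = {
    "into_aws": 0.0,
    "same_az": 0.0,
    "cross_az": 0.01,
    "cross_region": 0.02,
    "internet_out": 0.09,
}


def monthly_transfer_cost(gb_by_path):
    """Sum the transfer cost for a mapping of path -> GB moved per month."""
    return sum(PRICE_PER_GB[path] * gb for path, gb in gb_by_path.items())


# Hypothetical video site: 50 TB out to users, 20 TB between AZs.
video_site = monthly_transfer_cost({"internet_out": 50_000, "cross_az": 20_000})
```

Nearly all of that bill is internet egress, which is why the CDN and compression advice below matters most for media-heavy services.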

How to Reduce It

Use a CDN: CloudFront's data transfer is cheaper than EC2's. Plus caching.

Keep traffic in-region: Multi-AZ for reliability, not multi-region unless you need it.

Compress everything: Gzip responses. Compress images. Every byte costs money.

VPC endpoints: Talking to S3 from EC2? Use a VPC endpoint instead of going out through a NAT gateway or the public internet. Gateway endpoints for S3 have no hourly charge, and they avoid NAT data processing fees.

Building Cost-Aware Culture

Tools don't save money. People do.

Make Costs Visible

If developers don't see costs, they won't optimize.

Put cloud spending on dashboards
Share monthly cost reports with the team
Show cost per service/team

When people see "our service costs $8,000/month," they start asking "why?"

Tag Everything

Tags let you track costs by team, project, environment.

plaintext
Name: api-server-prod
Team: platform
Environment: production
Project: checkout
Owner: alice@company.com

Now you can answer:

What does the checkout project cost?
How much is Team Platform spending?
What's our production vs development ratio?
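Tags only answer those questions if every resource has them. Here is a small sketch of a tag check and a cost rollup. The resource records, with `tags` and `monthly_cost` fields, are a made-up shape, not a cloud provider's format.

```python
REQUIRED_TAGS = ("Name", "Team", "Environment", "Project", "Owner")


def missing_tags(tags):
    """Return the required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


def cost_by_tag(resources, tag_key):
    """Total monthly cost grouped by the value of one tag."""
    totals = {}
    for resource in resources:
        group = resource["tags"].get(tag_key, "untagged")
        totals[group] = totals.get(group, 0) + resource["monthly_cost"]
    return totals


# Hypothetical inventory.
resources = [
    {"tags": {"Team": "platform", "Project": "checkout"}, "monthly_cost": 800},
    {"tags": {"Team": "platform", "Project": "search"}, "monthly_cost": 300},
    {"tags": {}, "monthly_cost": 2000},
]
by_project = cost_by_tag(resources, "Project")
```

The "untagged" bucket is often the most interesting line in the report. That is where forgotten experiments tend to hide.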

Set Budgets and Alerts

AWS, GCP, and Azure all have budget alerts.

plaintext
Alert at 50%: "Heads up, halfway through budget"
Alert at 80%: "Getting close, review spending"
Alert at 100%: "Over budget, investigate now"

Catching a runaway bill at 50% is way better than catching it at 300%.
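The threshold logic itself is tiny. This sketch mirrors the three alert levels above. `budget_alert` is a hypothetical helper, not a cloud provider API.

```python
# Highest threshold first, so the most severe matching message wins.
THRESHOLDS = [
    (1.00, "Over budget, investigate now"),
    (0.80, "Getting close, review spending"),
    (0.50, "Heads up, halfway through budget"),
]


def budget_alert(spend, budget):
    """Return the alert message for the current spend, or None if under 50%."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    for threshold, message in THRESHOLDS:
        if ratio >= threshold:
            return message
    return None
```

In practice you would let the provider's budget service send these. The useful part is agreeing on the thresholds and on who gets paged.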

Review Regularly

Monthly cost review:

Compare to last month and budget
Identify biggest increases
Find unused resources
Evaluate reservation opportunities

Takes an hour. Saves thousands.
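The "identify biggest increases" step is easy to automate. This is a minimal sketch, assuming you can export cost per service for two months as plain dictionaries. The service names and figures are made up.

```python
def biggest_increases(last_month, this_month, top_n=3):
    """Return up to top_n (service, increase) pairs, largest increase first."""
    services = set(last_month) | set(this_month)
    deltas = {s: this_month.get(s, 0) - last_month.get(s, 0) for s in services}
    increases = [(s, d) for s, d in deltas.items() if d > 0]
    # Sort by size of increase, then by name so ties come out in a stable order.
    return sorted(increases, key=lambda item: (-item[1], item[0]))[:top_n]


# Hypothetical monthly costs per service.
last_month = {"compute": 20_000, "storage": 6_000, "database": 5_000}
this_month = {"compute": 21_000, "storage": 9_500, "database": 4_800,
              "data_transfer": 1_200}

top = biggest_increases(last_month, this_month)
```

A service that is new this month counts as a full increase, which is usually what you want to see at the top of a review.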

Real Numbers

The "We Never Got That Traffic" Story

A team I know ran their AWS bill through a reality check. Two years earlier, they'd provisioned for traffic that never came. The classic "let's be safe" overprovisioning.

What they found:

An entire staging environment nobody had touched in months
RDS with provisioned IOPS sized for 10x their actual load
Elasticsearch cluster running hot standby that wasn't needed
Storage allocated for growth that never happened

They deleted the unused environment, downgraded the database IOPS, right-sized Elasticsearch, and trimmed the storage buffers.

Result: $3,500/month saved. That's $42,000/year back in the budget, just by cutting things they weren't actually using.

The awkward part? This stuff had been running for two years. That's potentially $84,000 spent on resources nobody needed. The lesson: revisit your "just in case" provisioning. If the traffic didn't come in two years, it's probably not coming.

The Optimization Checklist

Quick wins (do this week):

Identify and delete unused resources
Schedule dev/staging to turn off at night
Enable S3 lifecycle policies
Set up budget alerts

Medium effort (do this month):

Right-size oversized instances
Evaluate reserved instances for stable workloads
Add CloudFront for static content
Implement auto-scaling

Ongoing:

Monthly cost review
Tag all new resources
Review reservation utilization
Track cost per service/team

The Bottom Line

Cloud cost optimization isn't about being cheap. It's about not wasting money.

Turn off what you're not using. Dev environments at night. Orphaned resources. Old experiments.

Right-size everything. Most instances are 2-4x bigger than needed.

Commit to what's stable. Reserved instances save 40-60% on predictable workloads.

Use lifecycle policies. Not all data needs hot storage.

Make costs visible. What gets measured gets managed.

Review regularly. An hour a month can save thousands.

The money you save on infrastructure is money you can spend on features, people, or profit. Don't leave it on the table.

What's Next

You've learned all the building blocks. Now it's time to see how they fit together in a real design.

Next up is Putting It All Together where we'll walk through a complete system design from requirements to architecture, showing how all these concepts combine into a coherent solution.