Cost Optimization
You built the system. It works. It scales. It's secure. Then you get the cloud bill.
$47,000. For one month.
Welcome to the club. Every engineering team has that moment where they realize cloud computing isn't cheap; it just delays the expense.
The good news: most teams waste 30-40% of their cloud spend. Finding that waste isn't hard once you know where to look.
What You Will Learn
- Where cloud costs actually come from
- The biggest sources of waste (and how to find them)
- Right-sizing: paying for what you use
- Reserved instances and spot pricing
- Architecture decisions that save money
- Building a cost-aware culture
- Real numbers from real companies
The Grocery Store Analogy
Think about how you shop for groceries.
The expensive way: Grab whatever looks good. Buy stuff you already have at home. Let food expire. Order delivery every time.
The smart way: Check what you need. Buy in bulk for staples. Use what you buy. Pick up groceries on your way home.
Same food. Half the cost.
Cloud spending works the same way. The services are the same. The architecture can be the same. But small choices add up to huge differences.
Where Does the Money Go?
Let's break down a typical cloud bill.
| Category | Typical % of Bill |
|---|---|
| Compute (EC2, VMs) | 40-60% |
| Storage (S3, EBS) | 15-25% |
| Database (RDS, DynamoDB) | 10-20% |
| Data Transfer | 5-15% |
| Other (Queues, Lambda, etc.) | 5-10% |
Compute is usually the biggest chunk. Start there.
The Low-Hanging Fruit
1. Turn Off What You're Not Using
Sounds obvious. But you'd be amazed.
Development environments running 24/7
Your dev and staging servers don't need to run at 3 AM.
Turn them off at 8 PM and back on at 8 AM.
12 hours off = 50% savings on dev environments.
AWS, GCP, and Azure all have scheduled start/stop. Set it up once, save forever.
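To see how much a schedule is worth, here is a minimal sketch of the arithmetic. The function name and the weekend option are illustrative assumptions, and the result covers hourly compute charges only (attached storage is typically still billed while an instance is stopped).

```python
HOURS_PER_WEEK = 24 * 7


def scheduled_savings(off_hours_per_weekday: float, off_on_weekends: bool = False) -> float:
    """Return the fraction of weekly running hours avoided by a stop/start schedule."""
    off_hours = off_hours_per_weekday * 5
    if off_on_weekends:
        # Assumption: the environment stays off for the entire weekend.
        off_hours += 24 * 2
    else:
        # Otherwise the nightly schedule also applies on Saturday and Sunday.
        off_hours += off_hours_per_weekday * 2
    return off_hours / HOURS_PER_WEEK


nightly = scheduled_savings(12)
nights_and_weekends = scheduled_savings(12, off_on_weekends=True)
print(f"Nights only: {nightly:.0%}")                    # 50%
print(f"Nights and weekends: {nights_and_weekends:.0%}")  # 64%
```

If nobody works weekends, keeping the environment off from Friday night to Monday morning pushes the savings from half to roughly two thirds.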
Orphaned resources
- EBS volumes from deleted instances
- Old snapshots nobody needs
- Load balancers pointing to nothing
- IP addresses allocated but unused
These accumulate quietly. Run a cleanup sweep monthly.
That experiment from 6 months ago
Someone spun up a Kubernetes cluster to test something. They forgot about it. It's been running for 6 months at $2,000/month.
Tag everything. Review untagged resources monthly. Delete ruthlessly.
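A monthly tag review can be as simple as filtering an inventory export. The sketch below assumes a hypothetical list of resources with `id`, `monthly_cost`, and `tags` fields, and an assumed policy of three required tags; real inventories would come from your cloud provider's billing or resource APIs.

```python
# Assumed tagging policy: every resource must carry these keys.
REQUIRED_TAGS = {"Team", "Owner", "Environment"}


def missing_tags(resource: dict) -> set:
    """Return the required tag keys this resource does not have."""
    return REQUIRED_TAGS - set(resource.get("tags", {}))


def review_queue(resources: list) -> list:
    """Resources missing any required tag, most expensive first."""
    flagged = [r for r in resources if missing_tags(r)]
    return sorted(flagged, key=lambda r: r["monthly_cost"], reverse=True)


# Hypothetical inventory for illustration.
inventory = [
    {"id": "old-volume", "monthly_cost": 40, "tags": {"Team": "platform"}},
    {"id": "k8s-experiment", "monthly_cost": 2000, "tags": {}},
    {
        "id": "api-server-prod",
        "monthly_cost": 560,
        "tags": {"Team": "platform", "Owner": "alice@company.com", "Environment": "production"},
    },
]

queue = review_queue(inventory)
for resource in queue:
    print(resource["id"], resource["monthly_cost"], sorted(missing_tags(resource)))
```

Sorting by cost puts the forgotten $2,000/month cluster at the top of the list, which is where the review time should go first.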
2. Right-Size Your Instances
Most instances are oversized. Way oversized.
The common pattern:
- Day 1: "Let's start with a big instance to be safe."
- Day 30: Traffic is 10% of what we estimated.
- Day 365: Still running the big instance. Nobody wants to touch it.
The fix:
Check actual utilization:
- CPU averaging 15%? You're 3-4x oversized.
- Memory at 20%? Same problem.
- Before: m5.4xlarge (16 vCPU, 64 GB RAM) at $0.768/hour = $560/month
- After: m5.xlarge (4 vCPU, 16 GB RAM) at $0.192/hour = $140/month
- Savings: $420/month per instance
Multiply by 50 instances and you're saving $21,000/month.
How to right-size:
- Enable detailed monitoring (CloudWatch, etc.)
- Look at p95 CPU and memory over 30 days
- Pick the smallest instance that handles your peak + 20% headroom
- Automate scaling for spikes instead of provisioning for peak
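The steps above can be sketched as a small calculation. This is a rough illustration that looks at CPU only; the size table is a hand-written subset of the m5 family, and the sample data is made up. A real check would apply the same logic to memory and use samples exported from your monitoring tool.

```python
import math

# Illustrative subset of m5 sizes as (name, vCPUs), smallest first.
SIZES = [("m5.large", 2), ("m5.xlarge", 4), ("m5.2xlarge", 8), ("m5.4xlarge", 16)]


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of the samples."""
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def right_size(cpu_percent_samples: list, current_vcpus: int, headroom: float = 0.20) -> str:
    """Pick the smallest size that covers p95 CPU demand plus headroom."""
    busy_vcpus = p95(cpu_percent_samples) / 100 * current_vcpus
    needed = busy_vcpus * (1 + headroom)
    for name, vcpus in SIZES:
        if vcpus >= needed:
            return name
    return SIZES[-1][0]  # nothing smaller fits; stay on the largest size


# Made-up 30-day summary: mostly 15% CPU with occasional 20% peaks on 16 vCPUs.
samples = [15] * 90 + [20] * 10
suggestion = right_size(samples, current_vcpus=16)
print(suggestion)  # m5.xlarge
```

Here a p95 of 20% on 16 vCPUs is 3.2 busy vCPUs, or 3.84 with 20% headroom, so a 4 vCPU instance is enough.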
3. Use Reserved Instances and Savings Plans
On-demand pricing is expensive. It's the convenience store markup.
The options:
| Type | Commitment | Savings |
|---|---|---|
| On-Demand | None | 0% (full price) |
| Savings Plans | 1 or 3 years | 30-60% |
| Reserved Instances | 1 or 3 years | 30-72% |
| Spot Instances | None (can be interrupted) | 60-90% |
What to reserve:
Look at your baseline. What's always running?
- Production databases
- Core API servers
- Baseline traffic handlers
Reserve those. Save 40-60%.
What to keep on-demand:
Unpredictable workloads. New services. Anything that might change.
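To see why you reserve the baseline rather than the peak, here is a minimal sketch comparing three strategies. The usage pattern is hypothetical, the rate is the m5.xlarge on-demand price used earlier, and the 40% discount is an assumption taken from the low end of the range above.

```python
def blended_hourly_cost(hourly_instance_counts, reserved_count, on_demand_rate, reserved_discount):
    """Average hourly cost when `reserved_count` instances are committed."""
    reserved_rate = on_demand_rate * (1 - reserved_discount)
    total = 0.0
    for running in hourly_instance_counts:
        overflow = max(0, running - reserved_count)
        # Reserved capacity is paid for every hour, whether it is used or not.
        total += reserved_count * reserved_rate + overflow * on_demand_rate
    return total / len(hourly_instance_counts)


# Hypothetical day: 10 instances for 16 quiet hours, 25 for 8 peak hours.
usage = [10] * 16 + [25] * 8
rate = 0.192      # on-demand $/hour
discount = 0.40   # assumed reservation discount

all_on_demand = blended_hourly_cost(usage, 0, rate, discount)
reserve_baseline = blended_hourly_cost(usage, min(usage), rate, discount)
reserve_peak = blended_hourly_cost(usage, max(usage), rate, discount)

print(f"All on-demand:    ${all_on_demand:.3f}/hour")
print(f"Reserve baseline: ${reserve_baseline:.3f}/hour")
print(f"Reserve peak:     ${reserve_peak:.3f}/hour")
```

With these numbers, reserving the baseline cuts the bill by about a quarter, while reserving for peak costs the same as paying on-demand for everything because most of the commitment sits idle.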
Spot instances:
AWS sells spare capacity cheap. But it can take that capacity back with a 2-minute warning.
Good for:
- Batch processing
- CI/CD pipelines
- Development environments
- Stateless workers that can restart
Not good for:
- Production databases
- User-facing traffic (unless you have fallback)
4. Storage Lifecycle Policies
Not all data is equal. Last week's logs matter. Last year's logs probably don't.
S3 storage classes (prices change over time):
| Class | Cost (per GB/month) | Use for |
|---|---|---|
| Standard | $0.023 | Frequently accessed |
| Infrequent Access | $0.0125 | Monthly access |
| Glacier | $0.004 | Archives (minutes to retrieve) |
| Glacier Deep Archive | $0.00099 | Archives (hours to retrieve) |
Lifecycle policy example:
- Day 1-30: Standard (frequently accessed)
- Day 31-90: Infrequent Access
- Day 91-365: Glacier
- Day 366+: Delete or Deep Archive
Set this up once. For data that is rarely read after the first month, your storage bill can drop 50-70%.
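A rough way to check the savings claim is to price the same data with and without the policy. The sketch below uses the per-GB prices from the table and a hypothetical workload of 1,000 GB written each month and kept for a year. It deliberately ignores retrieval fees, transition request charges, and minimum storage durations, all of which reduce the savings in practice.

```python
# Prices per GB-month, copied from the table above (they will drift over time).
PRICE_PER_GB_MONTH = {
    "Standard": 0.023,
    "Infrequent Access": 0.0125,
    "Glacier": 0.004,
    "Glacier Deep Archive": 0.00099,
}


def storage_class_for_age(age_days: int) -> str:
    """Map object age to a storage class, following the example policy."""
    if age_days <= 30:
        return "Standard"
    if age_days <= 90:
        return "Infrequent Access"
    if age_days <= 365:
        return "Glacier"
    return "Glacier Deep Archive"  # the policy could also delete at this point


def monthly_cost(batches, use_lifecycle: bool) -> float:
    """Monthly storage cost for (age_days, gigabytes) batches."""
    total = 0.0
    for age_days, gigabytes in batches:
        tier = storage_class_for_age(age_days) if use_lifecycle else "Standard"
        total += gigabytes * PRICE_PER_GB_MONTH[tier]
    return total


# Hypothetical: twelve monthly batches of 1,000 GB, aged 15, 45, ..., 345 days.
batches = [(15 + 30 * month, 1000) for month in range(12)]
flat = monthly_cost(batches, use_lifecycle=False)
tiered = monthly_cost(batches, use_lifecycle=True)
print(f"All Standard: ${flat:.2f}/month")
print(f"With lifecycle: ${tiered:.2f}/month ({1 - tiered / flat:.0%} less)")
```

For this workload the policy lands near the top of the 50-70% range, because nine of the twelve batches sit in the cheapest tier.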
Database storage:
- Delete old backups (do you need 365 daily backups?)
- Remove unused indexes (they take space too)
- Archive old records to cheaper storage
Architecture Decisions That Save Money
Some decisions have long-term cost implications.
Serverless vs Always-On
Always-on (EC2, containers):
- Pay whether there's traffic or not
- Good for: steady, predictable traffic
Serverless (Lambda, Cloud Functions):
- Pay per request
- Good for: spiky or low traffic
The math:
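- Lambda: $0.0000002 per request + compute time
- EC2 t3.small: $0.0208/hour = $15/month

On request charges alone, $15 buys about 75 million Lambda requests per month. Compute time is billed on top of that, so the real break-even is lower, and it drops further as functions run longer or use more memory. Below the break-even, Lambda might be cheaper. Above it, EC2 wins.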
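The 75 million figure follows from the request charge alone. Here is a sketch that also counts compute time. The per-GB-second price is an assumption based on published x86 list pricing and should be checked against the current price sheet; the free tier is ignored, and the comparison assumes a single t3.small could actually serve the load.

```python
REQUEST_PRICE = 0.0000002       # $ per request, from the text
GB_SECOND_PRICE = 0.0000166667  # $ per GB-second (assumed list price, verify before use)
EC2_MONTHLY = 0.0208 * 730      # t3.small at roughly 730 hours per month


def lambda_cost_per_request(avg_ms: float, memory_mb: int) -> float:
    """Request charge plus compute charge for one invocation."""
    gb_seconds = (avg_ms / 1000) * (memory_mb / 1024)
    return REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE


def break_even_requests(avg_ms: float, memory_mb: int) -> float:
    """Monthly request count at which Lambda costs the same as the EC2 instance."""
    return EC2_MONTHLY / lambda_cost_per_request(avg_ms, memory_mb)


requests_only = break_even_requests(avg_ms=0, memory_mb=128)
light = break_even_requests(avg_ms=100, memory_mb=128)
heavy = break_even_requests(avg_ms=200, memory_mb=512)

print(f"Request charge only: {requests_only / 1e6:.0f}M requests/month")
print(f"100 ms at 128 MB:    {light / 1e6:.0f}M requests/month")
print(f"200 ms at 512 MB:    {heavy / 1e6:.0f}M requests/month")
```

Under these assumptions a lightweight function breaks even at roughly half the request-only figure, and a heavier one at about a tenth of it.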
But Lambda has hidden costs:
- Cold starts (might need more memory)
- Higher per-request cost at scale
- Vendor lock-in
Multi-Region: Do You Need It?
Multi-region is great for reliability. It's also 2x (or more) the cost.
Questions to ask:
- What's your uptime requirement? (99.9% vs 99.99%)
- What's the cost of downtime per hour?
- Does multi-region actually help your users?
For many companies, one region with multiple availability zones is enough. True multi-region is for companies where an hour of downtime costs more than the infrastructure.
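The questions above reduce to a small calculation. This is a back-of-the-envelope sketch: the function names are illustrative, and the dollar figures in the usage lines are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def second_region_pays_off(extra_monthly_cost: float,
                           downtime_cost_per_hour: float,
                           downtime_hours_avoided_per_year: float) -> bool:
    """True if the avoided downtime is worth more than the extra infrastructure."""
    avoided_loss = downtime_cost_per_hour * downtime_hours_avoided_per_year
    return avoided_loss > extra_monthly_cost * 12


three_nines = allowed_downtime_hours(99.9)
four_nines = allowed_downtime_hours(99.99)
print(f"99.9%:  {three_nines:.2f} hours/year")
print(f"99.99%: {four_nines * 60:.0f} minutes/year")

# Hypothetical: a second region doubles a $47,000/month bill and avoids 8 hours of outage a year.
print(second_region_pays_off(47_000, 5_000, 8))    # downtime at $5,000/hour
print(second_region_pays_off(47_000, 100_000, 8))  # downtime at $100,000/hour
```

Going from three nines to four shrinks the yearly downtime budget from almost nine hours to under an hour, which is only worth doubling the bill when each hour of outage is very expensive.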
Caching: The Cheapest Optimization
A cache hit is 10-100x cheaper than a database query.
Before: 1,000 requests/second × database query = huge database bill
After: 950 cache hits + 50 database queries = small cache bill + tiny database bill
Redis on ElastiCache costs ~$12/month for a small instance. That can eliminate thousands in database costs.
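The before/after numbers can be reproduced with a toy read-through cache. This is an in-memory sketch with hypothetical names, not a Redis client: it has no expiry or eviction, and `slow_query` stands in for a real database call.

```python
class ReadThroughCache:
    """Serve repeated reads from memory and fall back to the database on a miss."""

    def __init__(self, load_from_db):
        self._load = load_from_db
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Miss: pay for one database query, then remember the result.
        self.misses += 1
        value = self._load(key)
        self._store[key] = value
        return value


db_calls = []


def slow_query(key):
    """Stand-in for an expensive database query."""
    db_calls.append(key)
    return f"row-{key}"


cache = ReadThroughCache(slow_query)
for i in range(1000):
    cache.get(i % 50)  # 1,000 requests spread over 50 distinct keys

print(f"hits={cache.hits} misses={cache.misses} db queries={len(db_calls)}")
```

With 50 distinct keys, only the first request for each key reaches the database, giving the 950 hits and 50 queries from the example. A production cache also needs a strategy for expiring stale entries.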
CDN for Static Content
Serving images from your server is expensive. Every request costs compute and bandwidth.
Serving from a CDN:
- Cheaper per GB
- Faster for users
- Less load on your servers
CloudFront, Cloudflare, etc. often pay for themselves immediately.
Data Transfer: The Hidden Cost
Data transfer is the cost nobody thinks about until the bill arrives.
The Charges
| Transfer | Cost |
|---|---|
| Into AWS | Free |
| Within same AZ | Free (over private IPs) |
| Between AZs in same region | $0.01/GB in each direction |
| Between regions | $0.02-0.09/GB |
| Out to internet | $0.09/GB |
That $0.09/GB adds up fast if you're serving video or large files.
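A quick estimate makes the point concrete. The rates below come from the table; the traffic volumes are hypothetical, and the sketch ignores volume tiers and any free allowance.

```python
# $ per GB, taken from the table above.
PRICE_PER_GB = {
    "same_az": 0.0,
    "cross_az": 0.01,
    "internet": 0.09,
}


def transfer_cost(gigabytes: float, route: str) -> float:
    """Monthly cost of moving `gigabytes` over the given route."""
    return gigabytes * PRICE_PER_GB[route]


# Hypothetical month: 50 TB of video served to users, 20 TB of chatter between AZs.
video_egress = transfer_cost(50_000, "internet")
cross_az_chatter = transfer_cost(20_000, "cross_az")
print(f"Video egress: ${video_egress:,.0f}")
print(f"Cross-AZ traffic: ${cross_az_chatter:,.0f}")
```

Fifty terabytes of video is a modest amount for a media product, and at the internet rate it already costs several thousand dollars a month before any compute or storage.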
How to Reduce It
Use a CDN: CloudFront's data transfer is cheaper than EC2's. Plus caching.
Keep traffic in-region: Multi-AZ for reliability, not multi-region unless you need it.
Compress everything: Gzip responses. Compress images. Every byte costs money.
VPC endpoints: Talking to S3 from EC2? Use a VPC endpoint instead of routing through a NAT gateway or the public internet. A gateway endpoint for S3 has no hourly or per-GB charge, so it is faster and usually cheaper.
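To illustrate the "compress everything" point, here is a small sketch using Python's standard `gzip` module on a made-up JSON payload. The exact ratio depends heavily on the data, so treat the output as an illustration rather than a benchmark.

```python
import gzip
import json

# Hypothetical API response: repetitive JSON compresses well.
records = [{"id": i, "status": "active", "region": "us-east-1"} for i in range(1000)]
payload = json.dumps(records).encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)

print(f"Original:   {len(payload):,} bytes")
print(f"Compressed: {len(compressed):,} bytes ({ratio:.0%} of original)")

# The client gets back exactly the same bytes after decompressing.
restored = gzip.decompress(compressed)
```

In practice you would usually enable compression in the web server, load balancer, or CDN rather than in application code. Already-compressed formats such as JPEG or MP4 gain little from gzip.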
Building Cost-Aware Culture
Tools don't save money. People do.
Make Costs Visible
If developers don't see costs, they won't optimize.
- Put cloud spending on dashboards
- Share monthly cost reports with the team
- Show cost per service/team
When people see "our service costs $8,000/month," they start asking "why?"
Tag Everything
Tags let you track costs by team, project, environment.
- Name: api-server-prod
- Team: platform
- Environment: production
- Project: checkout
- Owner: alice@company.com
Now you can answer:
- What does the checkout project cost?
- How much is Team Platform spending?
- What's our production vs development ratio?
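Once resources carry tags, answering those questions is a group-by. A minimal sketch, assuming a hypothetical billing export where each row has a `monthly_cost` and a `tags` dictionary:

```python
from collections import defaultdict


def cost_by_tag(resources: list, tag_key: str) -> dict:
    """Total monthly cost per value of `tag_key`; missing tags go to 'untagged'."""
    totals = defaultdict(float)
    for resource in resources:
        value = resource.get("tags", {}).get(tag_key, "untagged")
        totals[value] += resource["monthly_cost"]
    return dict(totals)


# Hypothetical billing export.
resources = [
    {"monthly_cost": 560, "tags": {"Team": "platform", "Project": "checkout", "Environment": "production"}},
    {"monthly_cost": 140, "tags": {"Team": "platform", "Project": "checkout", "Environment": "development"}},
    {"monthly_cost": 300, "tags": {"Team": "search", "Project": "search", "Environment": "production"}},
    {"monthly_cost": 2000, "tags": {}},
]

by_project = cost_by_tag(resources, "Project")
by_environment = cost_by_tag(resources, "Environment")
print(by_project)
print(by_environment)
```

The `untagged` bucket is worth watching on its own: when it is the largest line, the tagging policy is not being followed.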
Set Budgets and Alerts
AWS, GCP, and Azure all have budget alerts.
- Alert at 50%: "Heads up, halfway through budget"
- Alert at 80%: "Getting close, review spending"
- Alert at 100%: "Over budget, investigate now"
Catching a runaway bill at 50% is way better than catching it at 300%.
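The threshold logic itself is tiny. The sketch below mirrors the three alert levels above and adds a naive straight-line forecast; the function names are illustrative, and in practice you would configure this in your provider's budgeting tool rather than run your own script.

```python
# Thresholds checked from highest to lowest so the most severe message wins.
ALERTS = [
    (1.00, "Over budget, investigate now"),
    (0.80, "Getting close, review spending"),
    (0.50, "Heads up, halfway through budget"),
]


def budget_alert(spend_to_date: float, monthly_budget: float):
    """Return the alert message for the current spend, or None if under 50%."""
    used = spend_to_date / monthly_budget
    for threshold, message in ALERTS:
        if used >= threshold:
            return message
    return None


def projected_month_end(spend_to_date: float, day_of_month: int, days_in_month: int = 30) -> float:
    """Straight-line projection: assumes spending continues at the same daily rate."""
    return spend_to_date / day_of_month * days_in_month


# Hypothetical: $4,000 spent by day 10 against an $8,000 budget.
print(budget_alert(4_000, 8_000))
print(projected_month_end(4_000, day_of_month=10))
```

The forecast is the more useful signal here: hitting 50% on day 10 projects to $12,000 for the month, which is the runaway bill you want to catch early.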
Review Regularly
Monthly cost review:
- Compare to last month and budget
- Identify biggest increases
- Find unused resources
- Evaluate reservation opportunities
Takes an hour. Saves thousands.
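Part of that hour can be automated. Here is a minimal sketch of the "identify biggest increases" step, assuming two hypothetical dictionaries of cost per service for consecutive months.

```python
def biggest_increases(last_month: dict, this_month: dict, top: int = 3) -> list:
    """Services whose cost grew the most, as (service, increase) pairs."""
    services = set(last_month) | set(this_month)
    deltas = {s: this_month.get(s, 0) - last_month.get(s, 0) for s in services}
    increases = [(s, delta) for s, delta in deltas.items() if delta > 0]
    return sorted(increases, key=lambda pair: pair[1], reverse=True)[:top]


# Hypothetical monthly totals per service, in dollars.
last_month = {"compute": 20_000, "storage": 6_000, "database": 5_000, "data transfer": 3_000}
this_month = {
    "compute": 21_000,
    "storage": 6_000,
    "database": 9_000,
    "data transfer": 2_500,
    "kubernetes experiment": 2_000,
}

report = biggest_increases(last_month, this_month)
for service, increase in report:
    print(f"{service}: +${increase:,}")
```

New services show up as increases from zero, so a forgotten experiment appears in the report the first month it runs.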
Real Numbers
The "We Never Got That Traffic" Story
A team I know ran their AWS bill through a reality check. Two years earlier, they'd provisioned for traffic that never came. The classic "let's be safe" overprovisioning.
What they found:
- An entire staging environment nobody had touched in months
- RDS with provisioned IOPS sized for 10x their actual load
- Elasticsearch cluster running hot standby that wasn't needed
- Storage allocated for growth that never happened
They deleted the unused environment, downgraded the database IOPS, right-sized Elasticsearch, and trimmed the storage buffers.
Result: $3,500/month saved. That's $42,000/year gone, just by cutting things they weren't actually using.
The awkward part? This stuff had been running for two years. That's potentially $84,000 spent on resources nobody needed.
The lesson: revisit your "just in case" provisioning. If the traffic didn't come in two years, it's probably not coming.
The Optimization Checklist
Quick wins (do this week):
- Identify and delete unused resources
- Schedule dev/staging to turn off at night
- Enable S3 lifecycle policies
- Set up budget alerts
Medium effort (do this month):
- Right-size oversized instances
- Evaluate reserved instances for stable workloads
- Add CloudFront for static content
- Implement auto-scaling
Ongoing:
- Monthly cost review
- Tag all new resources
- Review reservation utilization
- Track cost per service/team
The Bottom Line
Cloud cost optimization isn't about being cheap. It's about not wasting money.
Turn off what you're not using. Dev environments at night. Orphaned resources. Old experiments.
Right-size everything. Most instances are 2-4x bigger than needed.
Commit to what's stable. Reserved instances save 40-60% on predictable workloads.
Use lifecycle policies. Not all data needs hot storage.
Make costs visible. What gets measured gets managed.
Review regularly. An hour a month can save thousands.
The money you save on infrastructure is money you can spend on features, people, or profit. Don't leave it on the table.
What's Next
You've learned all the building blocks. Now it's time to see how they fit together in a real design.
Next up is Putting It All Together where we'll walk through a complete system design from requirements to architecture, showing how all these concepts combine into a coherent solution.