Database Fundamentals

Our coffee shop has been running smoothly. Customers find us (DNS), traffic is distributed (load balancing), and popular drinks are pre-made (caching). But there's one thing we haven't talked about: where do we store the actual business data?
There's quite a lot of data to store: customer names, order history, and the inventory itself. This data shouldn't just disappear when you restart your server.
This is where databases come into the picture. And honestly, picking the wrong database is one of the easiest ways to screw up your architecture. I've seen it happen more times than I'd like to admit.
Get it right early, and you're fine. Get it wrong, and you're looking at a painful migration 18 months later when you're already at scale. Ask me how I know.
This lesson isn't about SQL syntax. It's about understanding what databases are actually good at, where they fall apart, and how to pick one without overthinking it.
What You Will Learn
- Why relational databases (MySQL, PostgreSQL) are still the default choice
- When NoSQL databases make sense (and when they don't)
- How indexes work and why they matter
- ACID transactions and why they're important
- Database scaling patterns: replicas, sharding, and when to use each
- How to choose the right database for your use case
Relational Databases: The Default Choice
Relational databases are like spreadsheets on steroids. Tables with rows and columns. Everything has a place. Everything follows rules.
MySQL and PostgreSQL have been around forever. They're boring but that's actually a good thing.
```sql
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    name VARCHAR(100)
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    total DECIMAL(10, 2)
);
```
Why They Win
Data integrity. The database won't let you insert garbage:
```sql
INSERT INTO orders (user_id) VALUES (99999);
-- ERROR: foreign key violation (user doesn't exist)
```
This saved me from so many bugs when I was starting out.
JOINs just work:
```sql
SELECT users.name, orders.total
FROM users
JOIN orders ON orders.user_id = users.id;
```
Transactions. Either everything happens or nothing happens:
```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```
If the second UPDATE fails, the first one gets rolled back. Money doesn't vanish.
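The all-or-nothing behavior is easy to see with SQLite from Python. This is just a sketch: the table and the CHECK constraint that forces the failure are illustrative, not from the lesson's schema.

```python
import sqlite3

# In-memory database for illustration; a file-backed one behaves the same.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "  id INTEGER PRIMARY KEY,"
    "  balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES (1, 500), (2, 100)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
        # This violates the CHECK constraint (balance would go negative)...
        conn.execute("UPDATE accounts SET balance = balance - 999 WHERE id = 2")
except sqlite3.IntegrityError:
    pass

# ...so the first UPDATE was rolled back too: account 1 still has 500.
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(500,), (100,)]
```

The `with conn:` block is what gives you atomicity here: either both UPDATEs commit, or neither does.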
You can ask questions you didn't plan for. Need a report on users who ordered in the last 30 days? Just write a query. With NoSQL, you better hope you designed for that query upfront.
Where They Struggle
Scaling horizontally sucks. These databases were built for one big server. Try to split data across multiple servers and good luck: JOINs break, transactions break, and everything gets complicated.
Schema changes are scary. An ALTER TABLE on a large table can take a long time and may block writes while it runs.
Rigid schemas. If your data structure keeps changing, you're fighting the database.
NoSQL: When Relational Doesn't Fit
NoSQL isn't better or worse than relational. It's different: it trades some guarantees for other benefits.
Here's the thing: relational databases are great when your data is structured and you need to query it in flexible ways. But what if you're storing user activity logs? Or sensor data from IoT devices? Or a product catalog where every product has wildly different attributes?
That's where NoSQL makes sense. Sometimes.
Document Databases (MongoDB, DynamoDB)
Instead of rigid tables, you store JSON documents. Each document can have different fields.
```json
{
  "_id": "user_123",
  "email": "alice@example.com",
  "preferences": {
    "theme": "dark",
    "notifications": true
  },
  "orders": [
    {"id": "order_1", "total": 99.99, "items": 3},
    {"id": "order_2", "total": 49.99, "items": 1}
  ]
}
```
Use it for:
- Content management systems where every article has different fields
- Product catalogs (a shirt has size/color, a laptop has CPU/RAM/storage)
- User profiles that evolve over time
Don't use it for:
- Anything that needs JOINs across collections
- Financial transactions
- Reports that aggregate across multiple document types
MongoDB is popular because it's easy to get started with.
Key-Value Stores (Redis, DynamoDB)
It's literally just key-value pairs. Like a hash map that's persistent and distributed.
```plaintext
SET user:123:session "abc123xyz"
GET user:123:session → "abc123xyz"

SET user:123:cart '{"items": [1, 2, 3]}'
GET user:123:cart → '{"items": [1, 2, 3]}'
```
Use it for:
- Session storage
- Caching (we covered this in the previous lesson)
- Leaderboards
- Rate limiting
Don't use it for:
- Anything where you need to search by value
- Relationships
Redis is the most common one. It's really fast with sub-millisecond latency.
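To make the semantics concrete, here's a toy in-memory key-value store with Redis-style lazy TTL expiry. The `TinyKV` class is a made-up illustration of the model, not how Redis is actually implemented.

```python
import time

class TinyKV:
    """A toy key-value store: SET/GET with optional TTL (illustration only)."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.monotonic() > expires:
            del self._data[key]  # lazy expiry on read, like Redis does
            return None
        return value

kv = TinyKV()
kv.set("user:123:session", "abc123xyz", ttl=1800)  # session expires in 30 min
print(kv.get("user:123:session"))  # abc123xyz
print(kv.get("user:999:session"))  # None -- lookups are by exact key only
```

Notice there's no way to ask "which users have a cart containing item 2?" without scanning every key. That's the "don't search by value" limitation in miniature.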
Wide-Column (Cassandra, ScyllaDB)
Wide-column stores like Cassandra are built for massive write throughput. We're talking hundreds of thousands of writes per second.
Good for:
- Time-series data (metrics, logs, IoT sensor data)
- Event logging at massive scale
- Write-heavy workloads (100K+ writes/second)
Bad for:
- Complex queries and aggregations
- JOINs
- Transactions across rows
When to use it: anything where you're writing constantly and querying by time ranges.
I worked on a system that ingested around 200K events per second. Cassandra handled it seamlessly.
Graph Databases (Neo4j)
The analogy: A web of sticky notes connected by strings. Great for following connections.
```plaintext
(Alice)-[:FRIEND]->(Bob)-[:WORKS_AT]->(Google)
(Alice)-[:FRIEND]->(Carol)-[:WORKS_AT]->(Google)

Query: "Find friends of Alice who work at Google"
```
Good for:
- Social networks (friends of friends)
- Recommendation engines
- Fraud detection (finding suspicious patterns)
Bad for:
- Simple CRUD operations
- Large-scale data that doesn't have relationships
- Horizontal scaling
When to use it: if you're constantly traversing relationships ("friends of friends who like X"), graph databases are way faster than doing recursive JOINs in SQL.
But honestly? Most apps don't need this.
Quick Comparison
| Type | Best For | Don't Use For |
|---|---|---|
| Relational (MySQL/Postgres) | Most applications | Massive write throughput |
| Document (MongoDB) | Flexible schemas | Complex queries across collections |
| Key-Value (Redis) | Cache, sessions | Searching by value |
| Wide-Column (Cassandra) | Time-series at scale | Transactions |
| Graph (Neo4j) | Social networks | Simple CRUD apps |
Indexing: The Library Card Catalog
Imagine a library with 1 million books, but no catalog system. To find "Database Design Patterns," you'd have to walk through every shelf, checking every book. That's a full table scan and it's painfully slow.
Now imagine the library has a card catalog organized alphabetically by title. You look up "Database," find the card, and it tells you "Shelf 47, Position 12." You walk directly there. That's an index.
How Database Indexes Work
Without an index, the database checks every row:
```plaintext
Query: Find user with email 'alice@example.com'
Check row 1... no
Check row 2... no
Check row 3... no
... (1 million rows later)
Check row 847,293... yes! Found it.
```
With an index on the email column:
```plaintext
Query: Find user with email 'alice@example.com'
Look up 'alice@example.com' in index → Row 847,293
Jump directly to row 847,293 → Found it!
```
The index is a sorted data structure (usually a B-tree) that maps values to row locations. Lookups go from O(n) to O(log n).
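You can feel the O(n) vs O(log n) difference with a sorted list standing in for the index. This is a sketch: a real B-tree is a disk-friendly tree, not a Python list, but the "binary search instead of scan" idea is the same.

```python
import bisect

# 100,000 "rows": (row_id, email). The table itself is in no useful order.
rows = [(i, f"user{i}@example.com") for i in range(100_000)]

# The "index": (email, row_id) pairs sorted by email, like a B-tree's leaves.
index = sorted((email, row_id) for row_id, email in rows)
emails = [e for e, _ in index]  # parallel list for bisect

def index_lookup(email):
    """O(log n): binary search the sorted index, like a B-tree probe."""
    i = bisect.bisect_left(emails, email)
    if i < len(emails) and emails[i] == email:
        return index[i][1]
    return None

def full_scan(email):
    """O(n): check every row, like a Seq Scan."""
    for row_id, e in rows:
        if e == email:
            return row_id
    return None

target = "user84729@example.com"
assert index_lookup(target) == full_scan(target) == 84729
```

Both functions return the same row; the index version just gets there in ~17 comparisons instead of ~85,000.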
Types of Indexes
Primary key index (automatic): Every table should have a primary key. The database automatically creates an index on it.
Secondary index (you create these):
```sql
CREATE INDEX idx_users_email ON users(email);
```
Composite index (multiple columns):
```sql
CREATE INDEX idx_orders_user_status ON orders(user_id, status);

-- This uses the index:
SELECT * FROM orders WHERE user_id = 123 AND status = 'pending';

-- This also uses the index:
SELECT * FROM orders WHERE user_id = 123;

-- This CANNOT use the index:
SELECT * FROM orders WHERE status = 'pending';
```
Composite indexes work left-to-right. An index on (user_id, status) helps queries on user_id or user_id + status. But NOT queries on just status.
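The left-to-right rule falls out of how the index is sorted. A Python sketch, with sorted tuples standing in for the composite index (the rows here are made up for illustration):

```python
import bisect

# A composite index on (user_id, status) is sorted by user_id first.
idx = sorted([(123, "pending"), (123, "shipped"),
              (456, "pending"), (789, "shipped")])

# Leftmost prefix works: all rows for user_id=123 are contiguous in the
# sort order, so one binary search finds the whole range.
lo = bisect.bisect_left(idx, (123,))
hi = bisect.bisect_left(idx, (124,))
print(idx[lo:hi])  # [(123, 'pending'), (123, 'shipped')]

# But rows with status='pending' are scattered across the sort order,
# so a status-only query has to scan everything -- no binary search helps.
pending = [row for row in idx if row[1] == "pending"]  # full scan
```

Same data structure, two very different access costs, purely because of column order.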
I've debugged this exact issue multiple times. The query is slow, you check, and there's an index on the column but it's the second column in a composite index.
The Index Trade-off
Indexes aren't free:
- They take disk space (the index itself is data)
- They slow down writes (every INSERT/UPDATE has to update the indexes too)
Index columns you actually query. If you have WHERE email = ... in your queries, index email.
Don't index everything. I've seen tables with 15 indexes. Half of them were never used. Just slowing down writes for no reason. Don't index columns you never query.
Always Use EXPLAIN
Use EXPLAIN to check if your index is actually being used:
```sql
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
```
If you see "Seq Scan" on a big table, you need an index. If you see "Index Scan," you're good.
Transactions and ACID
ACID is an acronym but honestly I just remember:
- Atomicity: All or nothing. No half-done transactions.
- Consistency: No invalid states.
- Isolation: Transactions don't step on each other.
- Durability: Once committed, it's permanent. Even if the server crashes.
Isolation Levels
| Level | What You Get | Performance |
|---|---|---|
| Read Uncommitted | Can see uncommitted changes | Fastest |
| Read Committed | Only see committed changes | Good balance |
| Repeatable Read | Same query = same results | Slower |
| Serializable | Complete isolation | Slowest |
Most databases default to "Read Committed" and that's fine for 99% of use cases.
The only time I've changed isolation levels was for a financial system that needed Serializable. Everything else? Default is fine.
Distributed Transactions
Distributed transactions (across multiple databases) are a pain. Most systems just accept eventual consistency.
2PC (Two-Phase Commit): Coordinator ensures all commit or all abort. Slow, rarely used at scale.
Saga pattern: Local transactions with compensating actions. If step 2 fails, undo step 1.
Eventual consistency: Accept temporary inconsistency, resolve later.
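The saga idea fits in a few lines. A minimal sketch (the step names are hypothetical, and a production saga also needs durable logging and retries for the compensations themselves):

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs. Returns True on success."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo already-completed steps in reverse order (best effort).
            for undo in reversed(done):
                undo()
            return False
        done.append(compensate)
    return True

log = []

def charge_card():       log.append("charged")
def refund_card():       log.append("refunded")
def reserve_inventory(): raise RuntimeError("out of stock")  # simulated failure
def release_inventory(): log.append("released")

ok = run_saga([(charge_card, refund_card),
               (reserve_inventory, release_inventory)])
print(ok, log)  # False ['charged', 'refunded']
```

Step 2 fails, so step 1's compensation (the refund) runs. Note what you give up versus a real transaction: there's a window where the card was charged but nothing was reserved.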
Scaling Patterns
Vertical Scaling (Bigger Server)
Just get a bigger server. More RAM, faster CPU, SSDs.
A good PostgreSQL server can handle 10-15K queries per second. That's enough for most apps. Don't overthink scaling.
Read Replicas
One primary database for writes. Multiple read replicas for reads.
This is the most common scaling pattern. Works great for read-heavy workloads (which is most web apps).
Catch: Replication lag. Replicas are usually a few milliseconds behind. Sometimes seconds if there's a spike. So if a user updates their profile and immediately refreshes, they might see old data.
You can handle this by routing reads for recently-written data to the primary. Or just accept it. Most users don't notice a 100ms delay.
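That routing trick is often called "read-your-writes". A sketch of the idea (the `Router` class and the 5-second stickiness window are illustrative choices, and a real system would track this per session, not in one process):

```python
import time

class Router:
    """Send reads to a replica, unless this user wrote recently --
    then read from the primary to avoid stale replica data."""
    def __init__(self, stickiness_seconds=5.0):
        self.stickiness = stickiness_seconds
        self.last_write = {}  # user_id -> timestamp of their last write

    def record_write(self, user_id):
        self.last_write[user_id] = time.monotonic()

    def pick(self, user_id):
        last = self.last_write.get(user_id)
        if last is not None and time.monotonic() - last < self.stickiness:
            return "primary"  # replica may still be lagging behind this write
        return "replica"

r = Router()
print(r.pick("u1"))   # replica -- no recent write
r.record_write("u1")  # user updates their profile
print(r.pick("u1"))   # primary -- inside the stickiness window
```

The window just needs to comfortably exceed your typical replication lag.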
Sharding (Last Resort)
Sharding means splitting your data across multiple databases.
Example: Users 1-1M go to Database 1. Users 1M-2M go to Database 2.
Why it sucks:
- Can't JOIN across shards
- Can't do transactions across shards
- Rebalancing is a nightmare
- Schema changes hit every shard
Don't shard until you absolutely have to. Try everything else first: bigger servers, read replicas, caching, optimizing queries.
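The routing itself is the easy part; the pain is everything around it. A hash-based routing sketch in Python (the shard count is a made-up example):

```python
import hashlib

N_SHARDS = 4  # hypothetical shard count

def shard_for(user_id: str) -> int:
    """Pick a shard from a stable hash of the key, modulo shard count."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_SHARDS

# Every key deterministically maps to exactly one shard...
assert shard_for("user_123") == shard_for("user_123")

# ...but bump N_SHARDS from 4 to 5 and most keys suddenly map to a
# *different* shard. That's why naive rebalancing is a nightmare
# (consistent hashing reduces, but doesn't eliminate, the data movement).
```

Using `hashlib` rather than Python's built-in `hash()` matters here: the routing must be stable across processes and restarts.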
Choosing a Database
The Pragmatic Approach
Start with MySQL. Seriously.
MySQL is the world's most popular open-source database for a reason. It's simple to set up, well-documented, runs everywhere, and handles the needs of 90% of applications. Facebook, Twitter, and Airbnb all started with MySQL.
When you outgrow MySQL or need more advanced features, PostgreSQL is the natural next step. PostgreSQL offers:
- Better support for complex queries
- JSON documents (JSONB column type)
- Full-text search built-in
- Geographic data (PostGIS extension)
- More advanced data types (arrays, hstore, etc.)
But don't start with PostgreSQL just because it's "more powerful." MySQL's simplicity is a feature, not a bug. You can always migrate later when you actually need those advanced capabilities.
Decision Guide
| Need | Choice |
|---|---|
| Starting a new web app | MySQL |
| Need advanced features | PostgreSQL |
| Session/cache | Redis |
| Write-heavy at scale | Cassandra, ScyllaDB |
| Full-text search | Elasticsearch |
| Analytics | ClickHouse, BigQuery |
| Social graph | Neo4j + MySQL/PostgreSQL |
Questions to Ask
- Access patterns: How will you query? Complex JOINs or simple lookups?
- Data structure: Structured tables or flexible documents?
- Scale: Single server sufficient? Write-heavy?
- Consistency: Strong required or eventual acceptable?
Performance Tips
Profile first. EXPLAIN ANALYZE before optimizing.
Index strategically. Columns you query. Not everything.
Connection pooling. Opening connections is expensive. Reuse them.
Batch operations:
```sql
-- Slow: 1000 separate inserts
INSERT INTO logs (msg) VALUES ('log 1');
INSERT INTO logs (msg) VALUES ('log 2');
-- ...

-- Fast: one batched insert
INSERT INTO logs (msg) VALUES ('log 1'), ('log 2'), ...;
```
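The same idea from application code, using SQLite's `executemany` as a stand-in. This is just an illustration: with an in-memory SQLite database the gap is modest, but against a networked database the batched version also saves 999 round trips.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (msg TEXT)")

# Slow pattern: one statement per row.
for i in range(1000):
    conn.execute("INSERT INTO logs (msg) VALUES (?)", (f"log {i}",))

# Batched: one call, one prepared statement, reused for every row.
conn.executemany("INSERT INTO logs (msg) VALUES (?)",
                 [(f"log {i}",) for i in range(1000)])
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0])  # 2000
```

Committing once per batch, rather than once per row, is usually the bigger win in practice.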
Denormalize carefully. Sometimes duplicating data avoids expensive JOINs. Trade-off: data can become stale.
Some Real World Examples
Discord's database migration was painful
Discord stored billions of chat messages in MongoDB. It worked fine initially, but as they scaled, MongoDB couldn't keep up. Reading the last 50 messages in a channel was taking 40-125ms. That's noticeable lag.
They spent months migrating to Cassandra. Why Cassandra? Because chat messages are basically time-series data. You're always appending new messages and reading recent ones. Cassandra is built for exactly this pattern.
After the migration, reads dropped to 5ms. Writes became sub-millisecond. But the migration itself? Months of work, careful data movement, running both databases in parallel.
The lesson here isn't use Cassandra. It's that your access pattern matters more than anything else. MongoDB is fine for lots of things. Just not billions of time-series records.
Source: Discord Engineering Blog
Key Takeaways
Relational is the default. Start with MySQL for simplicity. Graduate to PostgreSQL when you need advanced features.
NoSQL solves specific problems. Use it when relational genuinely can't meet your needs, not because it's trendy.
Indexes make reads fast, writes slow. Index columns you query frequently. Use EXPLAIN to verify.
Scaling is a progression. Vertical → Read replicas → Caching → Sharding. Don't skip steps.
Choose based on access patterns. How you query matters more than how much data you have.
What's Next
You understand the fundamentals of databases: what they're good at, how indexes and transactions work, and how scaling progresses. But what happens when one server genuinely isn't enough?
Next up: Data Partitioning & Replication — how to split data across multiple databases and keep copies in sync when one server isn't enough.