Networking Fundamentals

Remember our coffee shop from the last lesson? We scaled from one location to five. Customers are distributed across locations. But here's a question we skipped: how does a customer even find your coffee shop?
In the physical world, they use Google Maps. Type "Best Coffee Shop", get an address, follow directions. Simple.
In the internet world, the same thing happens, just with different names. Your browser needs to find where pranaybathini.com actually lives. It needs directions. It needs to establish a connection. And it needs to do all of this securely.
This is networking. And when someone says "the app is slow," understanding networking is how you figure out whether the problem is the coffee shop (your server) or the directions to get there (the network).
What You Will Learn
- How DNS works (the internet's Google Maps)
- Why TCP connections take time to establish
- The difference between latency and bandwidth (and why it matters)
- How HTTP and HTTPS work
- Connection pooling and why it's crucial for performance
- How to debug common network issues
- Timeout strategies that prevent cascading failures
The Journey of a Request: Following the Directions
When you visit a website, here's what actually happens:
```plaintext
1. Browser asks DNS: "What's the IP for pranaybathini.com?"
2. Browser opens TCP connection to that IP
3. Browser negotiates TLS encryption (HTTPS)
4. Browser sends HTTP request
5. Server responds
6. Browser renders the page
```
Each step can fail. Each step can be slow. Let's understand each one.
DNS: The Internet's Address Book
Think of DNS as the contact list on your phone. You don't memorize phone numbers. You save a number as "Mom" and your phone knows to dial 555-123-4567.
DNS works the same way. Humans remember names (google.com). Computers need numbers (142.250.80.14). DNS translates between them.
How it works: When you type pranaybathini.com, your browser asks a DNS resolver (like your phone's contact list). The resolver might already know the answer (cached). If not, it asks a chain of servers until it reaches the domain's authoritative DNS server, which says pranaybathini.com lives at 64.29.17.65.
This lookup typically takes 10-100ms for a fresh request, but results are cached at multiple levels (browser, operating system, ISP), so repeated lookups are nearly instant.
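You can watch this from code. Here's a minimal sketch using only Python's standard library (example.com stands in for any domain); run it twice and the second lookup is often much faster because of caching along the way:

```python
import socket
import time

start = time.perf_counter()
# getaddrinfo triggers the system resolver: either a cache hit or a full DNS lookup
results = socket.getaddrinfo("example.com", 443, proto=socket.IPPROTO_TCP)
elapsed_ms = (time.perf_counter() - start) * 1000

for *_, sockaddr in results:
    print("Resolved IP:", sockaddr[0])
print(f"Lookup took {elapsed_ms:.1f} ms")
```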
Why DNS Matters for System Design
DNS-based load balancing is a simple way to distribute traffic. Configure your DNS to return different IPs for the same domain:
```plaintext
Request 1: pranaybathini.com → 10.0.0.1 (Server A)
Request 2: pranaybathini.com → 10.0.0.2 (Server B)
```
But it has serious limitations:
- No health checks: DNS happily returns IP addresses of dead servers
- Slow updates: Caching means changes take minutes to hours to propagate
- No intelligence: Can't route based on server load or capacity
GeoDNS is smarter: it returns different IPs based on where the user is located. A user in India gets an IP for your Mumbai datacenter. A user in the US gets your Virginia datacenter. This reduces latency by routing users to nearby servers.
TTL (Time To Live) controls caching duration. Short TTLs (60 seconds) let you change IPs quickly during outages. Long TTLs (1 day) reduce DNS lookup overhead but make failover slow.
Rule of thumb: For critical services, use 1-5 minute TTLs. You want the ability to redirect traffic quickly when things break.
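To inspect a record's TTL yourself, one option is the third-party dnspython package (an assumption here, not something the site above requires):

```python
# Assumes the third-party dnspython package: pip install dnspython
import dns.resolver

answer = dns.resolver.resolve("example.com", "A")
print("IPs:", [rr.address for rr in answer])
print("TTL:", answer.rrset.ttl, "seconds")  # how long resolvers may cache this answer
```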
TCP: The Polite Introduction
Imagine you're calling someone on the phone. Before you can talk, there's a ritual:
- You: "Hello?"
- Them: "Hello, who's this?"
- You: "It's Pranay, can we talk?"
- Them: "Sure, go ahead."
Only then do you start the actual conversation. TCP works the same way with its three-way handshake:
```plaintext
Client: "Hey, want to talk?"  (SYN)
Server: "Sure, let's talk"    (SYN-ACK)
Client: "Great, here we go"   (ACK)
```
This handshake guarantees both sides are ready. But it costs time: one full round trip before any actual data flows.
For a server in the same datacenter, this is ~1ms. No big deal. For a server across the world? That's 150ms of just saying hello. For every new connection.
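You can measure the setup cost directly. A minimal sketch with Python's standard library (the timing also includes the DNS lookup unless it's cached):

```python
import socket
import time

start = time.perf_counter()
# create_connection completes the full three-way handshake before returning
conn = socket.create_connection(("example.com", 443), timeout=5)
elapsed_ms = (time.perf_counter() - start) * 1000
conn.close()

print(f"Connection setup took {elapsed_ms:.1f} ms")
```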
UDP is the rude alternative. No handshake, no guarantees. It just starts sending data and hopes for the best. Packets can arrive out of order or not at all. But it's faster. Use UDP for real-time applications (video calls, games) where "pretty good most of the time" beats "perfect but delayed".
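In code, the contrast is stark. A UDP sender just fires a datagram at an address (the IP and port below are hypothetical), with no handshake and no delivery confirmation:

```python
import socket

# No handshake: the first packet IS the data, and nothing confirms it arrived
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"player position: x=10, y=42", ("203.0.113.7", 9999))  # hypothetical game server
sock.close()
```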
Latency vs Bandwidth: The Highway Analogy
These two concepts confuse people constantly. Let me make it simple.
Latency is how long it takes for data to travel from A to B. Think of it as the length of a highway. A 100-kilometer highway takes time to drive, no matter how many lanes you add.
Some numbers to give perspective:
```plaintext
Same datacenter:  0.5 ms      (across the room)
Same region:      5-20 ms     (across the city)
Cross-continent:  50-100 ms   (New York to LA)
Around the world: 150-300 ms  (New York to Tokyo)
```
Light in fiber travels ~200km per millisecond. New York to London is ~5,500km = 27ms one way, 55ms round trip. No amount of money or engineering beats physics.
Bandwidth is how much data can flow at once. Think of it as how many lanes the highway has. More lanes = more cars at the same time.
Some numbers to give perspective:
```plaintext
Home internet: 100 Mbps - 1 Gbps  (2-4 lanes)
Datacenter:    10-100 Gbps        (hundreds of lanes)
```
The key insight: More bandwidth doesn't reduce latency. They're different problems with different solutions.
HTTP: The Conversation Protocol
HTTP is how your browser talks to servers. Think of it like a formal letter exchange:
The Request (your letter):
```plaintext
POST /api/orders HTTP/1.1            ← Method + Path + Version
Host: api.coffeeshop.com             ← Which server
Authorization: Bearer abc123         ← Who you are
Content-Type: application/json       ← What format

{"drink": "latte", "size": "large"}  ← The actual content
```
The Response (their reply):
```plaintext
HTTP/1.1 201 Created                 ← Status code
Content-Type: application/json

{"order_id": 456, "status": "preparing"}
```
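Here's the same exchange as code, a sketch using Python's standard library (api.coffeeshop.com and the token are the placeholder values from the example above):

```python
import http.client
import json

conn = http.client.HTTPSConnection("api.coffeeshop.com")  # placeholder host
conn.request("POST", "/api/orders",
             body=json.dumps({"drink": "latte", "size": "large"}),
             headers={"Authorization": "Bearer abc123",
                      "Content-Type": "application/json"})
response = conn.getresponse()
print(response.status, response.reason)  # e.g. 201 Created
print(response.read().decode())
conn.close()
```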
HTTP Methods: What You're Asking For
| Method | Purpose | Example |
|---|---|---|
| GET | Retrieve data | Get user profile |
| POST | Create something new | Place an order |
| PUT | Replace entirely | Update entire profile |
| PATCH | Partial update | Change just the email |
| DELETE | Remove | Cancel an order |
Status Codes: What Happened
Think of these as the tone of the reply:
2xx - Success (thumbs up)
- 200 OK - Here's what you asked for
- 201 Created - Made the new thing you wanted
- 204 No Content - Done, nothing to say
3xx - Redirect (go elsewhere)
- 301 Moved Permanently - It's at a new address forever
- 302 Found - Temporarily somewhere else
4xx - Client Error (you messed up)
- 400 Bad Request - Your request doesn't make sense
- 401 Unauthorized - Who are you? Log in first
- 403 Forbidden - I know who you are, but you can't do this
- 404 Not Found - That doesn't exist
- 429 Too Many Requests - Slow down, you're being rate limited
5xx - Server Error (we messed up)
- 500 Internal Server Error - Something broke on our end
- 502 Bad Gateway - The server behind us is broken
- 503 Service Unavailable - We're overloaded or down for maintenance
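The practical payoff: the status code class tells a client whether retrying makes sense. A simplified sketch:

```python
def client_action(status: int) -> str:
    """Map a status code class to what a client should do (a simplified sketch)."""
    if 200 <= status < 300:
        return "success"
    if 300 <= status < 400:
        return "follow the redirect"
    if status == 429:
        return "back off, then retry"   # retrying immediately makes rate limiting worse
    if 400 <= status < 500:
        return "fix the request"        # retrying the same request fails the same way
    return "retry with backoff"         # 5xx: often transient, a retry may succeed
```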
HTTP Versions: Getting Faster
HTTP/1.1: One request at a time per connection. Browsers work around this by opening 6 parallel connections.
HTTP/2: Multiplexes multiple requests on one connection. Compresses headers. The modern default.
HTTP/3: Uses QUIC instead of TCP for even faster connection setup. Emerging for latency-sensitive applications like video streaming.
HTTPS: Not Optional
Without HTTPS, anyone on the network can read your data, modify it in transit, or impersonate your server. HTTPS is not optional for any production system.
HTTPS = HTTP + TLS encryption. TLS requires a handshake to establish encryption:
- TLS 1.2: 2 round trips before data flows
- TLS 1.3: 1 round trip (or 0 with session resumption)
For a 100ms round trip, TLS 1.2 adds 200ms. TLS 1.3 cuts this in half. Use TLS 1.3.
Certificate management: Certificates expire. Automate renewal with Let's Encrypt or AWS ACM. Monitor expiration dates as expired certs cause outages.
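Both points are easy to check from code. A sketch using Python's standard library that reports the negotiated TLS version and the certificate's expiry date:

```python
import socket
import ssl

ctx = ssl.create_default_context()
with socket.create_connection(("example.com", 443), timeout=5) as raw:
    with ctx.wrap_socket(raw, server_hostname="example.com") as tls:
        print("TLS version:", tls.version())       # e.g. "TLSv1.3"
        cert = tls.getpeercert()
        print("Cert expires:", cert["notAfter"])   # the date to monitor
```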
TLS termination: Most systems decrypt HTTPS at the load balancer. Simpler for backends, though they see plaintext internally. If you need end-to-end encryption, terminate at the application (costs more CPU).
Connection Management: Don't Rebuild the Road Every Trip
Remember all those steps to establish a connection? DNS lookup, TCP handshake, TLS handshake. For a server 50ms away, that's 150-200ms before any actual data flows.
Now imagine doing that for every single request. User clicks a button? 200ms of handshaking. Loads an image? Another 200ms. Fetches data? 200ms more. Your app feels sluggish even though your server responds in 5ms.
Keep-Alive: Leave the Phone Line Open
Old phones required dialing for each call. Modern phones can keep the line open.
HTTP/1.1 introduced keep-alive connections:
```plaintext
Without keep-alive: [dial][talk][hang up] -> [dial][talk][hang up] -> [dial][talk][hang up]
With keep-alive:    [dial][talk][talk][talk][talk]...[hang up later]
```
One handshake, many requests. Modern browsers and servers enable this by default.
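If you use the third-party requests library (an assumption, not a requirement), a Session gets you keep-alive for free: the first call dials, later calls reuse the open line:

```python
# Assumes the third-party requests library: pip install requests
import requests

session = requests.Session()
for i in range(5):
    # Only the first iteration pays for DNS + TCP + TLS; the rest reuse the connection
    r = session.get("https://example.com/")
    print(i, r.status_code)
session.close()
```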
Connection Pooling: A Fleet of Open Lines
For server-to-server communication, connection pooling is essential. Instead of opening a new connection for each request, maintain a pool of ready-to-use connections:
```python
# Slow: new connection every time (like dialing for each call)
for url in urls:
    conn = open_connection(url)       # 150ms overhead
    conn.request(url)                 # 5ms actual work
    conn.close()

# Fast: reuse connections (pool of open lines)
pool = ConnectionPool(max_size=20)
for url in urls:
    conn = pool.get_connection()      # Nearly instant
    conn.request(url)                 # 5ms actual work
    pool.return_connection(conn)
```
Watch for connection leaks: If code borrows a connection but forgets to return it, the pool slowly drains until nothing works. Always use try/finally:
```python
def query():
    conn = pool.get_connection()
    try:
        return conn.query("SELECT ...")
    finally:
        pool.return_connection(conn)  # Always return!
```
Server-Side Connection Limits
Every open connection consumes server memory. Servers enforce limits:
```plaintext
MySQL:      max_connections = 151   (default)
PostgreSQL: max_connections = 100   (default)
Redis:      maxclients = 10000      (default)
```
When these limits are hit, new connections wait or fail. I've seen production outages where the database looked broken not because queries were slow, but because the connection limit was exhausted. Check your connection pool sizes and server limits.
Latency Budgets
When your system is slow, you need to know where time goes. Break down a request:
```plaintext
Total: 200ms
├── Network to server: 50ms
├── App processing:    10ms
├── Database query:    30ms
├── Network to client: 50ms
└── Buffer:            60ms
```
If any component exceeds its budget, the request blows past its overall latency target.
Measure percentiles, not averages. P50 (median) is fine, but P99 matters more. An average of 50ms hides the fact that 1% of users wait 2 seconds.
```plaintext
Average: 50ms   (looks great!)
P99:     2000ms (1% of users are furious)
```
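A small sketch showing how a single slow outlier vanishes in the average but dominates P99 (the latency numbers are made up to illustrate the point):

```python
import statistics

# 99 fast requests and one very slow one: made-up illustrative data
latencies_ms = [40] * 99 + [2000]

print("avg:", statistics.mean(latencies_ms), "ms")                  # ~60ms, looks fine
print("p50:", statistics.median(latencies_ms), "ms")                # 40ms
print("p99:", statistics.quantiles(latencies_ms, n=100)[98], "ms")  # the outlier shows up
```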
Reducing Latency
| Technique | What it fixes |
|---|---|
| Caching | Avoids repeated slow work |
| Connection pooling | Eliminates connection setup |
| CDN | Reduces network distance |
| Async processing | Removes work from critical path |
| Database indexes | Speeds up queries |
Some latency is physics. To serve global users fast, put servers near them (CDNs, multi-region deployment).
Timeouts: Non-Negotiable
Every network call needs a timeout. Without one, a stuck dependency blocks your service forever.
Recommended timeouts:
- Database queries: 5-30 seconds
- Internal API calls: 1-5 seconds
- External API calls: 5-10 seconds
- User-facing requests: 30 seconds total
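Applying a timeout is usually a one-liner. A sketch with Python's standard library (the unroutable IP is hypothetical, chosen to force a hang):

```python
import socket

try:
    # Without timeout=3, this connect could hang for minutes
    conn = socket.create_connection(("10.255.255.1", 80), timeout=3)  # hypothetical dead host
    conn.close()
except socket.timeout:
    print("gave up after 3 seconds instead of hanging forever")
except OSError as exc:
    print("failed fast:", exc)
```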
Cascading timeouts matter: If Service A calls B calls C:
```plaintext
A's timeout to B: 5 seconds
B's timeout to C: 3 seconds (must be less)
```
If B's timeout is longer than A's, A gives up before B even finishes. Always make downstream timeouts shorter than upstream.
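One way to enforce this is to pass a shrinking deadline down the call chain instead of fixed per-hop timeouts. A sketch (the service names and the 0.5s safety margin are illustrative):

```python
import time

def call_c(deadline: float):
    remaining = deadline - time.monotonic()
    print(f"C must finish within {remaining:.1f}s")

def call_b(deadline: float):
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("A's budget exhausted before calling B")
    # Hand C slightly less than what's left, so C gives up before B does
    call_c(deadline - 0.5)

call_b(time.monotonic() + 5.0)  # A allows 5 seconds end to end
```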
Network Architecture Basics
Public vs Private Networks
Only load balancers need public IPs. App servers and databases stay in private subnets. Databases should never be directly internet-accessible.
Service-to-Service Communication
Direct HTTP: Simple but tight coupling. Service A must know Service B's location.
Message queue: Service A → Queue → Service B. Decoupled and async, but adds latency.
Service mesh: Sidecars handle discovery, load balancing, and encryption. More infrastructure, but cleaner application code.
Debugging Network Issues
High latency:
- Check user locations (maybe they're far from servers)
- Measure DNS lookup time (should be <50ms)
- Count connections being opened (should reuse)
- Trace the request path (find slow dependencies)
Connection timeouts:
- Server overloaded? Scale up.
- Connection pool exhausted? Increase pool size or fix leaks.
- Firewall blocking? Check security groups.
Intermittent failures:
- Often DNS issues or connection pool exhaustion
- Add retries with exponential backoff (see the sketch after this list)
- Check if you're hitting server connection limits
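Here's what "retries with exponential backoff" looks like as a minimal sketch (fetch_profile is a hypothetical flaky call):

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=0.5):
    """Retry a flaky network call with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise                                     # out of retries: surface the failure
            delay = base_delay * (2 ** attempt)           # 0.5s, 1s, 2s, ...
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids retry stampedes

# Usage (hypothetical): profile = with_retries(lambda: fetch_profile(user_id))
```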
Key Takeaways
DNS is the first step. Misconfigured DNS causes hard-to-debug failures. Keep TTLs appropriate for your failover needs.
TCP adds latency for reliability. Connection setup takes round trips. Reuse connections with pooling and keep-alive.
Latency and bandwidth are different. Latency is physics (distance). Bandwidth is money (bigger pipe). No amount of bandwidth fixes cross-continent latency.
HTTPS is mandatory. Use TLS 1.3. Automate certificate management.
Every network call needs a timeout. Cascade timeouts correctly so downstream is shorter than upstream.
Keep databases private. Only load balancers need public IPs.
What's Next
Now that you understand how data travels across networks, the next question is: when multiple servers exist, how do you decide which one handles each request? Next up: Load Balancing, where we talk about algorithms, health checks, and making sure traffic goes to healthy servers.