Message Queues & Event Streaming
Our coffee shop has grown, and now we have a problem. When a customer places an order, we need to do several things: process the payment, update the inventory, send a confirmation email, notify the kitchen, and update the loyalty points. If we do all of this before telling the customer "Order received!", they're standing there for ten seconds staring at a loading bar. That's a bad user experience.
Worse: if any one of those steps fails, the whole order fails. If the email service is down, the customer can't order a coffee.
What we need is a way to say "Order received!" immediately and handle the rest in the background. The customer gets instant feedback, and the other tasks happen asynchronously.
This is where message queues and event streaming come into play. They decouple services so they don't have to wait for each other.
What You Will Learn
- Why synchronous communication creates fragile systems
- Message queues: the postal service of software
- Event streaming: the newspaper model
- When to use queues vs streams
- Popular tools: RabbitMQ, SQS, Kafka
- Patterns for reliable message processing
- Common pitfalls and how to avoid them
The Problem with Synchronous Communication
The Chain of Dependency
In a synchronous system, services call each other directly and wait for responses:
Total time: Sum of all service response times.
If Email Service is slow: Everything waits.
If Email Service is down: The entire order fails.
This is like a relay race where every runner must hand off perfectly. One stumble and the whole team loses.
The Pizza Shop Analogy
Imagine you are ordering a pizza by phone. Here is what the synchronous version looks like:
- You call and say "I want a pepperoni pizza"
- The cashier says "Hold on" and walks to the kitchen
- The cashier waits while the chef makes the pizza
- The cashier waits while the delivery driver delivers it
- After you receive the pizza, the cashier says "Okay, your pizza has been delivered. That'll be 50 bucks."
You've been on hold for 45 minutes. Sounds ridiculous, right?
Of course, this doesn't happen in the real world. It actually looks something like this:
- You call and say "I want a pepperoni pizza"
- Cashier: "Got it! Order #47. It'll be there in 30 minutes." (hangs up)
- Cashier writes the order on a ticket and puts it in the queue
- Kitchen picks up the ticket when ready
- Driver picks up pizza when it's done
The cashier is free to take more orders. The kitchen works at its own pace. The driver doesn't need to know who ordered what. That ticket is a message queue.
Message Queues:
How It Works
- Producer sends a message to the queue
- Queue stores the message until someone retrieves it
- Consumer pulls messages and processes them
The producer and consumer don't need to be running at the same time. They don't even need to know about each other.
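The producer/consumer handoff above can be sketched with Python's standard-library `queue` module; this is a minimal in-process illustration of the idea, not a real broker:

```python
import queue
import threading

order_queue = queue.Queue()

def producer():
    # Enqueue a message and return immediately -- no waiting on the consumer
    order_queue.put({"order_id": 123, "item": "Latte"})

def consumer(results):
    # Pull a message off the queue and process it at the consumer's own pace
    message = order_queue.get()
    results.append(f"processed order {message['order_id']}")
    order_queue.task_done()

results = []
producer()  # the producer finishes even if no consumer is running yet
worker = threading.Thread(target=consumer, args=(results,))
worker.start()
order_queue.join()  # block until every enqueued message is marked done
worker.join()
print(results)  # ['processed order 123']
```

Note that the producer returned before the consumer ran: the queue holds the message in between, which is exactly the decoupling the pizza ticket provides.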
Message Queue Properties
Persistence: Messages survive queue restarts. If the server reboots, your messages aren't lost.
Acknowledgment: Consumers confirm when they've processed a message. If a consumer crashes mid-processing, the message goes back to the queue.
Ordering: Messages are typically processed in order (FIFO), though this varies by implementation.
At-least-once delivery: The queue guarantees the message will be delivered at least once. Your consumer might see duplicates.
Event Streaming:
Event streaming is different from message queues. Think of it like a newspaper vs mail.
Message Queue (Mail):
- Addressed to a specific recipient
- Once received, it's gone
- Recipient must be known upfront
Event Stream (Newspaper):
- Published for anyone interested
- Stays available for a period
- New subscribers can read past issues
How It Works
Key differences:
- Multiple consumers can read the same event
- Events are retained for a configurable time (hours, days, weeks)
- Consumers track their position in the stream
- New consumers can start from the beginning or any point
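The differences above can be illustrated with a toy in-memory log. Each consumer keeps its own offset, so many consumers can read the same events and a late-arriving consumer can replay from the start. This is a sketch of the model, not how any real streaming platform is implemented:

```python
class EventStream:
    """Append-only log: events are retained, not deleted on read."""
    def __init__(self):
        self.events = []

    def publish(self, event):
        self.events.append(event)

    def read_from(self, offset):
        # Any consumer can read from any position, including 0 (full replay)
        return self.events[offset:]

stream = EventStream()
stream.publish({"type": "OrderCreated", "orderId": 1})
stream.publish({"type": "OrderCreated", "orderId": 2})

# An existing consumer tracks its own position in the stream
analytics_offset = 0
seen_by_analytics = stream.read_from(analytics_offset)
analytics_offset += len(seen_by_analytics)

# A brand-new consumer added later can still replay everything
late_consumer = stream.read_from(0)

print(len(seen_by_analytics), len(late_consumer))  # 2 2
```

Reading never removes an event, which is the key contrast with a queue: consumption is a matter of where your offset points, not whether the data still exists.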
Coffee Shop Example: Events
When an order is placed, we publish an OrderCreated event:
```json
{
  "eventType": "OrderCreated",
  "timestamp": "2024-03-15T10:30:00Z",
  "data": {
    "orderId": "order-123",
    "customerId": "cust-456",
    "items": [{"name": "Latte", "price": 4.50}],
    "total": 4.50
  }
}
```
Multiple services subscribe to this event:
- Payment Service: Charges the customer
- Inventory Service: Decrements stock
- Analytics Service: Updates dashboards
- Loyalty Service: Awards points
- Kitchen Display: Shows the order
Each service processes the event independently. If we add a new Marketing Service next month, it can subscribe and even replay past events to catch up.
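The fan-out above can be sketched as a simple dispatch loop: every subscriber gets its own copy of the event. The handlers here are hypothetical stand-ins for the services listed:

```python
subscribers = []

def subscribe(handler):
    subscribers.append(handler)

def publish(event):
    # Every subscriber receives the same event independently
    for handler in subscribers:
        handler(event)

handled_by = []
subscribe(lambda e: handled_by.append(("payment", e["data"]["orderId"])))
subscribe(lambda e: handled_by.append(("inventory", e["data"]["orderId"])))
subscribe(lambda e: handled_by.append(("loyalty", e["data"]["orderId"])))

publish({"eventType": "OrderCreated", "data": {"orderId": "order-123"}})
print(handled_by)  # all three services saw the same event
```

Adding that hypothetical Marketing Service is just one more `subscribe` call; the publisher never changes.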
Queues vs Streams: When to Use Each
| Aspect | Message Queue | Event Stream |
|---|---|---|
| Delivery | To one consumer | To many consumers |
| Retention | Until consumed | For a time period |
| Replay | No | Yes |
| Use case | Task distribution | Event broadcast |
| Example | "Process this payment" | "An order was placed" |
Use Message Queues When:
- Work distribution: You have tasks that need to be done once by one worker
- Load leveling: Buffer requests during traffic spikes
- Guaranteed processing: Each message must be handled by exactly one worker (pair this with idempotent consumers, since delivery is at-least-once)
Example: Email sending, image processing, report generation
Use Event Streams When:
- Multiple consumers: Different services need the same data
- Event sourcing: You want a complete history of what happened
- Real-time analytics: Dashboard updates, monitoring
- Replay capability: New services need to process historical events
Example: User activity tracking, audit logs, real-time feeds
Popular Tools
RabbitMQ: The Reliable Workhorse
Traditional message queue. Great for task distribution.
```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='order_queue')

# Producer
channel.basic_publish(
    exchange='',
    routing_key='order_queue',
    body='{"order_id": 123}'
)

# Consumer
def callback(ch, method, properties, body):
    process_order(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='order_queue', on_message_callback=callback)
channel.start_consuming()
```
Strengths: Flexible routing, multiple protocols, mature ecosystem.
Use for: Task queues, RPC, complex routing logic
Amazon SQS: The Managed Option
Fully managed queue service. No infrastructure to manage.
```python
import boto3

sqs = boto3.client('sqs')

# Send message
sqs.send_message(
    QueueUrl='https://sqs.../my-queue',
    MessageBody='{"order_id": 123}'
)

# Receive and process messages, deleting each one after it succeeds
response = sqs.receive_message(QueueUrl='https://sqs.../my-queue')
for message in response.get('Messages', []):
    process(message['Body'])
    sqs.delete_message(
        QueueUrl='https://sqs.../my-queue',
        ReceiptHandle=message['ReceiptHandle']
    )
```
Strengths: Zero ops, scales automatically, cheap.
Use for: AWS-native applications, simple queue needs.
Apache Kafka: The Event Streaming Giant
Distributed event streaming platform. Built for scale.
```python
from kafka import KafkaProducer, KafkaConsumer

# Producer
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('orders', b'{"order_id": 123}')

# Consumer
consumer = KafkaConsumer('orders', bootstrap_servers='localhost:9092')
for message in consumer:
    process(message.value)
```
Strengths: Massive throughput, replay capability, exactly-once semantics.
Use for: Event streaming, real-time analytics, high-volume data pipelines.
Quick Comparison
| Tool | Type | Managed Option | Throughput | Complexity |
|---|---|---|---|---|
| RabbitMQ | Queue | CloudAMQP | Moderate | Medium |
| Amazon SQS | Queue | Native AWS | High | Low |
| Apache Kafka | Stream | Confluent, MSK | Very High | High |
| Redis Streams | Stream | Redis Cloud | High | Medium |
Patterns for Reliable Processing
Pattern 1: Idempotent Consumers
With at-least-once delivery, consumers might see the same message twice. Make operations idempotent (safe to repeat).
Bad:
```python
def process_order(order):
    charge_customer(order.total)  # Charges twice if the message is replayed!
```
Good:
```python
def process_order(order):
    if already_processed(order.id):
        return  # Skip duplicate
    charge_customer(order.total)
    mark_processed(order.id)
```
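Made concrete with an in-memory "processed" set, the idempotent version can be run end to end. `charge_customer` and the processed-ID store here are stand-ins for a real payment call and a real database:

```python
processed_ids = set()
charges = []

def charge_customer(order_id, total):
    charges.append((order_id, total))

def process_order(order):
    # Deduplicate on the order ID so a redelivered message is a no-op
    if order["id"] in processed_ids:
        return
    charge_customer(order["id"], order["total"])
    processed_ids.add(order["id"])

order = {"id": "order-123", "total": 4.50}
process_order(order)
process_order(order)  # simulated redelivery: at-least-once in action
print(len(charges))  # 1 -- the customer was charged exactly once
```

In production the processed-ID store must survive restarts (a database table or cache), since an in-memory set vanishes with the process.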
Pattern 2: Dead Letter Queues
What happens when a message fails processing repeatedly? Don't lose it. Move it to a dead letter queue for investigation.
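The retry-then-park flow can be sketched with two in-memory queues; the retry limit and message shape are illustrative choices, not a standard:

```python
import queue

MAX_ATTEMPTS = 3
main_queue = queue.Queue()
dead_letter_queue = queue.Queue()

def process(message):
    if message["body"] == "malformed":
        raise ValueError("cannot parse message")

def consume_one():
    message = main_queue.get()
    try:
        process(message)
    except ValueError:
        message["attempts"] += 1
        if message["attempts"] >= MAX_ATTEMPTS:
            # Park the failing message for investigation instead of looping forever
            dead_letter_queue.put(message)
        else:
            main_queue.put(message)  # requeue for another attempt

main_queue.put({"body": "malformed", "attempts": 0})
while not main_queue.empty():
    consume_one()

print(dead_letter_queue.qsize())  # 1 -- failed message preserved, not lost
```

Most brokers (RabbitMQ, SQS) can do this routing for you via queue configuration, so the consumer code never has to manage the dead letter queue directly.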
Pattern 3: Outbox Pattern
Ensure database writes and message publishing are atomic.
Problem: You update the database, then publish a message. If the app crashes between them, the message is lost.
Solution: Write the message to an outbox table in the same transaction. A separate process reads the outbox and publishes.
```sql
BEGIN TRANSACTION;

INSERT INTO orders (id, total) VALUES (123, 45.00);

INSERT INTO outbox (event_type, payload)
VALUES ('OrderCreated', '{"id": 123}');

COMMIT;
```
A background worker polls the outbox and publishes to the queue.
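A minimal version of that worker can be sketched with SQLite standing in for the application database; the table layout and the `published` flag are illustrative choices:

```python
import sqlite3

# In-memory database standing in for the application's real store
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)""")

# The business write and the outbox write commit together, atomically
with db:
    db.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (123, 45.00))
    db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
               ("OrderCreated", '{"id": 123}'))

published = []

def poll_outbox():
    # Background worker: publish pending rows, then mark them as done
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, event_type, payload in rows:
        published.append((event_type, payload))  # stand-in for a queue publish
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

poll_outbox()
print(published)  # [('OrderCreated', '{"id": 123}')]
```

If the app crashes after the transaction but before the poll, the event is still sitting in the outbox table and will be published on the next poll, which is the whole point of the pattern.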
Pattern 4: Consumer Groups
Multiple instances of a consumer sharing the work.
Each message goes to ONE worker in the group. Scale by adding more workers.
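The "one message, one worker" property falls out naturally when several workers pull from a shared queue; a small threaded sketch:

```python
import queue
import threading

work_queue = queue.Queue()
results = []
lock = threading.Lock()

def worker(name):
    while True:
        task = work_queue.get()
        if task is None:  # shutdown signal
            work_queue.task_done()
            break
        with lock:
            results.append((name, task))  # each task handled by exactly one worker
        work_queue.task_done()

for i in range(10):
    work_queue.put(i)

workers = [threading.Thread(target=worker, args=(f"worker-{n}",)) for n in range(3)]
for t in workers:
    t.start()
for _ in workers:
    work_queue.put(None)  # one shutdown signal per worker
work_queue.join()
for t in workers:
    t.join()

print(sorted(task for _, task in results))  # every task processed exactly once
```

Scaling out is just starting more worker threads (or, in production, more consumer processes); the queue handles the distribution.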
Common Pitfalls
Pitfall 1: Unbounded Queues
If producers are consistently faster than consumers, the queue grows forever. Eventually you run out of memory or disk.
Fix:
- Monitor queue depth
- Set max queue size (reject or drop old messages)
- Scale consumers automatically based on queue depth
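The "set a max size" fix can be shown with the standard-library queue, whose `maxsize` rejects new messages rather than growing without bound; real brokers expose the same idea through configuration:

```python
import queue

# maxsize bounds memory; put_nowait raises queue.Full instead of growing forever
bounded = queue.Queue(maxsize=2)

accepted, rejected = 0, 0
for i in range(5):
    try:
        bounded.put_nowait(i)
        accepted += 1
    except queue.Full:
        rejected += 1  # backpressure: the producer learns it must slow down

print(accepted, rejected)  # 2 3
```

Rejecting at enqueue time turns a silent, unbounded backlog into an explicit signal the producer can react to.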
Pitfall 2: Poison Messages
A malformed message can crash a consumer. The consumer restarts, pulls the same message, and crashes again, in an endless loop.
Fix:
- Limit retry attempts
- Use dead letter queues
- Validate messages before processing
Pitfall 3: Ordering Assumptions
Most queues don't guarantee strict ordering, especially with multiple consumers.
Fix:
- If order matters, use a single consumer or partitioned queues
- Include sequence numbers in messages
- Design consumers to handle out-of-order messages
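The sequence-number approach can be sketched as a small reorder buffer: hold early arrivals until every lower-numbered message has been delivered. The message shape here is illustrative:

```python
def reorder(messages):
    """Buffer out-of-order messages and emit them by sequence number."""
    buffer = {}
    next_seq = 1
    delivered = []
    for message in messages:
        buffer[message["seq"]] = message
        # Flush every contiguous message that is now available
        while next_seq in buffer:
            delivered.append(buffer.pop(next_seq))
            next_seq += 1
    return delivered

# Messages arrive out of order (e.g. from competing consumers)
arrivals = [{"seq": 2, "body": "b"}, {"seq": 1, "body": "a"}, {"seq": 3, "body": "c"}]
ordered = reorder(arrivals)
print([m["seq"] for m in ordered])  # [1, 2, 3]
```

The trade-off is latency and memory: message 2 waits in the buffer until message 1 shows up, so in practice you also need a timeout or gap-handling policy.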
Pitfall 4: Fire and Forget
Publishing a message and assuming it will be processed. What if the queue is down? What if the consumer crashes?
Fix:
- Use persistent/durable queues
- Implement acknowledgments
- Monitor processing metrics
- Alert on growing queues or high failure rates
Key Takeaways
Message queues decouple producers and consumers:
- Producers don't wait for consumers
- Traffic spikes are buffered
- Failed consumers don't crash producers
Event streams broadcast events to multiple subscribers:
- Multiple services can react to the same event
- Events are retained for replay
- New services can process historical data
Choose queues for task distribution (do this work once).
Choose streams for event broadcasting (everyone needs to know).
Make consumers idempotent. Messages may be delivered more than once.
Use dead letter queues. Don't lose failed messages.
Monitor everything. Queue depth, processing time, failure rates.
What's Next
We've covered how services communicate asynchronously. But what about the interface between services and the outside world? How do you design APIs that are easy to use, maintain, and scale? That's API Design Best Practices, coming up next.