When I was a junior engineer, I thought the SAGA pattern was magic.
Every blog and YouTube video made it sound so elegant:

“Break your transactions into smaller steps! Add compensations! Voilà — distributed consistency!”

It looked beautiful in diagrams.
Then I saw it in production.
And within a few months, I learned the hard truth: SAGA isn’t a design pattern — it’s a discipline.

Here’s what senior architects understand about SAGA that juniors usually don’t 👇

1️⃣ SAGA Is About Boundaries, Not Just Rollbacks

When juniors talk about SAGA, they focus on the rollback.
When seniors design with SAGA, they focus on boundaries — defining where each business transaction starts and ends.

Bad design (a chain of systems):

Order → Payment → Inventory → Notification

Good design (one business intention with explicit steps):

(OrderService)
   ↳ createOrder()
   ↳ reserveInventory()
   ↳ requestPayment()
   ↳ sendConfirmation()

A senior architect asks:

“Should Payment and Inventory really belong in the same SAGA? Or should they be separate workflows?”

That distinction alone saves you from half the complexity later.

Takeaway:
Every SAGA should represent one business intention, not one system flow.

2️⃣ Choreography Is Beautiful — Until It Isn’t

Everyone starts with event-driven SAGA (choreography).
It’s simple and decoupled — until you have 7 services shouting at each other and nobody knows who’s in charge.

ASCII view 👇

Order → Payment → Inventory → Notification → Refund → Loyalty

Looks fine in a demo.
In production, it becomes event spaghetti.

Senior architects switch to orchestration when complexity crosses a threshold:

SAGA Orchestrator
     |
     +--> Order
     +--> Payment
     +--> Inventory
     +--> Notification

Takeaway:
Use choreography for simple flows.
Use orchestration for controlled, traceable transactions.
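To make the orchestration side concrete, here is a minimal sketch of a coordinator that runs steps in order and, on the first failure, compensates the completed steps in reverse. Everything in it (SagaStep, DemoStep, SimpleOrchestrator, the trace strings) is a hypothetical name invented for illustration, and the steps run in-process rather than over a broker.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical step contract: a forward action plus its compensation.
interface SagaStep {
    String name();
    void execute();      // throws a RuntimeException on failure
    void compensate();
}

// Stand-in step for the demo; a real step would send a command to a service.
final class DemoStep implements SagaStep {
    private final String name;
    private final boolean fails;

    DemoStep(String name, boolean fails) {
        this.name = name;
        this.fails = fails;
    }

    public String name() { return name; }

    public void execute() {
        if (fails) throw new IllegalStateException(name + " failed");
    }

    public void compensate() { /* nothing to undo in the demo */ }
}

// Runs steps in order. On the first failure it compensates the steps that
// already completed, most recent first, and records everything in one trace.
final class SimpleOrchestrator {
    private final List<String> trace = new ArrayList<>();

    boolean run(List<? extends SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute();
                trace.add("done:" + step.name());
                completed.push(step);
            } catch (RuntimeException e) {
                trace.add("failed:" + step.name());
                while (!completed.isEmpty()) {
                    SagaStep done = completed.pop();
                    done.compensate();
                    trace.add("compensated:" + done.name());
                }
                return false;
            }
        }
        return true;
    }

    List<String> trace() { return List.copyOf(trace); }
}
```

The point of the single trace is traceability: with choreography, the same history is scattered across every service's event log.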

3️⃣ Compensation Logic Is Business Logic

Junior engineers treat compensating transactions as “undo” operations.
Senior architects know that compensation = real business behavior.

Example:
Refunding a payment isn’t undoing a charge. It’s a new financial transaction with its own audit trail, rules, and validation.

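// Compensation for the payment step. The refund is recorded as its own
// transaction; it does not erase the original charge.
public void compensatePayment(String orderId) {
    Payment payment = paymentRepo.findByOrder(orderId);
    refundService.refund(payment);
}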

Takeaway:
Don’t treat compensations as “rollback()”.
Treat them as first-class business processes.
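One way to picture "a refund is a new transaction" is an append-only ledger. The sketch below is an assumption-heavy illustration, not how any particular payment system works: Ledger, Entry, and the cents-based amounts are all hypothetical. The refund validates that a charge exists, appends a new entry that points back at it, and is safe to call twice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical append-only ledger: entries are added, never edited or deleted.
final class Ledger {
    enum Type { CHARGE, REFUND }

    record Entry(String id, String orderId, Type type, long amountCents, String refersTo) {}

    private final List<Entry> entries = new ArrayList<>();

    Entry charge(String orderId, long amountCents) {
        return append(orderId, Type.CHARGE, amountCents, null);
    }

    // Compensation for a charge: validate, then record a new REFUND entry
    // that points back at the charge. Calling it twice returns the same refund.
    Entry refund(String orderId) {
        Entry charge = find(orderId, Type.CHARGE)
            .orElseThrow(() -> new IllegalStateException("no charge for order " + orderId));
        Optional<Entry> existing = find(orderId, Type.REFUND);
        if (existing.isPresent()) {
            return existing.get();
        }
        return append(orderId, Type.REFUND, charge.amountCents(), charge.id());
    }

    // Net amount still held for the order: charges minus refunds.
    long balanceCents(String orderId) {
        long total = 0;
        for (Entry e : entries) {
            if (!e.orderId().equals(orderId)) continue;
            total += (e.type() == Type.CHARGE) ? e.amountCents() : -e.amountCents();
        }
        return total;
    }

    List<Entry> entries() { return List.copyOf(entries); }

    private Entry append(String orderId, Type type, long amountCents, String refersTo) {
        Entry e = new Entry("e" + (entries.size() + 1), orderId, type, amountCents, refersTo);
        entries.add(e);
        return e;
    }

    private Optional<Entry> find(String orderId, Type type) {
        return entries.stream()
            .filter(e -> e.orderId().equals(orderId) && e.type() == type)
            .findFirst();
    }
}
```

After a charge and a refund the ledger holds two entries and a zero balance, so the audit trail shows both what happened and what reversed it.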

4️⃣ SAGA State Belongs in the Database, Not Memory

Juniors often keep SAGA state in memory, in a cache, or implied by a broker position (Redis, a Kafka offset).
Then a pod restarts — and the whole transaction disappears.

Seniors persist everything:

  • Current step
  • Correlation ID
  • Retry count
  • Timestamp
  • Status (PENDING / SUCCESS / COMPENSATED)

CREATE TABLE saga_state (
  saga_id VARCHAR PRIMARY KEY,
  step_name VARCHAR,
  status VARCHAR,
  retry_count INT,
  updated_at TIMESTAMP
);

Takeaway:
Treat your SAGA state as durable workflow data — not transient messaging metadata.
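In application code, the row above might look something like this. It is a sketch under assumptions: SagaState, SagaStateStore, and the in-memory store are hypothetical names, and the map only stands in for the saga_state table so the example runs without a database. The idea it shows is that a worker reads and writes state through a store, so a restarted worker can pick up where the last one stopped.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

enum SagaStatus { PENDING, SUCCESS, COMPENSATED }

// One row of saga_state, as an immutable value.
record SagaState(String sagaId, String stepName, SagaStatus status, int retryCount, Instant updatedAt) {

    static SagaState start(String sagaId, String firstStep) {
        return new SagaState(sagaId, firstStep, SagaStatus.PENDING, 0, Instant.now());
    }

    // Moving to a new step resets the retry counter.
    SagaState movedTo(String nextStep) {
        return new SagaState(sagaId, nextStep, SagaStatus.PENDING, 0, Instant.now());
    }

    SagaState retried() {
        return new SagaState(sagaId, stepName, status, retryCount + 1, Instant.now());
    }

    SagaState finished(SagaStatus finalStatus) {
        return new SagaState(sagaId, stepName, finalStatus, retryCount, Instant.now());
    }
}

// Hypothetical storage port; a real implementation would read and write the table.
interface SagaStateStore {
    void save(SagaState state);
    Optional<SagaState> find(String sagaId);
}

// In-memory stand-in so the sketch runs without a database. Not durable.
final class InMemorySagaStateStore implements SagaStateStore {
    private final Map<String, SagaState> rows = new ConcurrentHashMap<>();

    public void save(SagaState state) { rows.put(state.sagaId(), state); }

    public Optional<SagaState> find(String sagaId) {
        return Optional.ofNullable(rows.get(sagaId));
    }
}
```

Saving the new state before dispatching the next command is what lets a restarted pod resume instead of losing the transaction.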

5️⃣ A Retry Policy Is Not Optional

This one separates architects from developers.

Junior code:

try {
   paymentService.charge(order);
} catch (Exception e) {
   throw new RuntimeException();
}

Senior code:

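import org.springframework.retry.support.RetryTemplate;

// Up to 3 attempts in total, with a fixed 2-second pause between them.
RetryTemplate template = RetryTemplate.builder()
   .maxAttempts(3)
   .fixedBackoff(2000)
   .build();
template.execute(ctx -> paymentService.charge(order));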

SAGA = distributed systems, and distributed systems = failure.
You can’t “hope” for success; you design for failure.

Takeaway:
Every step must have a retry & timeout policy — even compensations.
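The snippet above covers retries but not the timeout half of the takeaway. Here is a rough sketch of both in plain Java, deliberately without Spring Retry so it runs standalone. The Retry class and its method names are hypothetical, it retries on any RuntimeException (real code would retry only transient errors), and it assumes the wrapped call is idempotent.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class Retry {
    private Retry() {}

    // Runs the action on another thread and gives up waiting after timeoutMs.
    // On timeout or failure, join() throws an unchecked CompletionException.
    // Note: the timed-out task is not cancelled; it may still finish later.
    static <T> T withTimeout(long timeoutMs, Supplier<T> action) {
        return CompletableFuture.supplyAsync(action)
            .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
            .join();
    }

    // Calls the action up to maxAttempts times, doubling the pause after each failure.
    static <T> T withBackoff(int maxAttempts, long initialDelayMs, Supplier<T> action) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be at least 1");
        long delayMs = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) break;
                sleep(delayMs);
                delayMs *= 2;
            }
        }
        throw last; // the failure from the final attempt
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```

A compensation can be wrapped the same way, for example Retry.withBackoff(3, 500, () -> Retry.withTimeout(2000, refundCall)), where refundCall is whatever supplier performs the refund.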

6️⃣ SAGA Needs Observability More Than Orchestration

When SAGA fails in production, it doesn’t fail loudly.
It fails quietly — one step times out, another retries forever, another half-compensates.

Senior architects obsess over traceability:

  • Correlation IDs in every log
  • OpenTelemetry spans for every step
  • Dead-letter queues with structured JSON

SAGA_ID=12345
Step=PaymentService
Status=TIMEOUT
Retry=2

Takeaway:
If you can’t trace your SAGA flow end-to-end, you’re flying blind.
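As a small illustration of the structured side, the same fields can be emitted as one JSON line per step so they are searchable by saga ID. This is only a sketch: SagaLog and the field names are made up here, and the hand-rolled escaping covers just quotes and backslashes, so a real service would more likely lean on its logging framework or a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class SagaLog {
    private SagaLog() {}

    // Builds one JSON log line; the field names are illustrative.
    static String event(String sagaId, String step, String status, int retry) {
        Map<String, Object> fields = new LinkedHashMap<>(); // keeps insertion order
        fields.put("saga_id", sagaId);
        fields.put("step", step);
        fields.put("status", status);
        fields.put("retry", retry);
        return fields.entrySet().stream()
            .map(e -> quote(e.getKey()) + ":" + render(e.getValue()))
            .collect(Collectors.joining(",", "{", "}"));
    }

    // Numbers are written bare; everything else is written as a quoted string.
    private static String render(Object value) {
        return (value instanceof Number) ? value.toString() : quote(String.valueOf(value));
    }

    // Escapes only backslashes and double quotes; a JSON library handles the rest.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

Calling SagaLog.event("12345", "PaymentService", "TIMEOUT", 2) yields {"saga_id":"12345","step":"PaymentService","status":"TIMEOUT","retry":2}.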

7️⃣ The Orchestrator Must Be Dumb, Not God

One of our first mistakes was making the orchestrator “smart.”
It didn’t just sequence steps: it validated business rules and even made API calls directly.

That was a disaster.

Senior architects know:

The orchestrator shouldn’t do work — it should coordinate work.

Orchestrator
   |
   +-- Command: Reserve Inventory
   +-- Command: Process Payment
   +-- Command: Send Notification

Each service still owns its own logic and data.
The orchestrator only manages the dance.

Takeaway:
Keep your SAGA orchestrator dumb, and keep its state in the database rather than in the process.
That’s how you scale and debug it safely.
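One way to keep an orchestrator honest is to reduce it to a routing table: given the last step and its outcome, look up the next command, and nothing more. The sketch below assumes hypothetical names throughout (OrderSagaRoutes, the START marker, and the ReleaseInventory compensation command are not from any real system).

```java
import java.util.Map;
import java.util.Optional;

final class OrderSagaRoutes {
    enum Outcome { SUCCEEDED, FAILED }

    record Key(String step, Outcome outcome) {}

    // Pure routing data: (last step, outcome) -> next command to send.
    // No business rules live here; each service validates its own command.
    private static final Map<Key, String> NEXT = Map.of(
        new Key("START", Outcome.SUCCEEDED), "ReserveInventory",
        new Key("ReserveInventory", Outcome.SUCCEEDED), "ProcessPayment",
        new Key("ProcessPayment", Outcome.SUCCEEDED), "SendNotification",
        new Key("ProcessPayment", Outcome.FAILED), "ReleaseInventory"
    );

    private OrderSagaRoutes() {}

    // Empty means the saga has nothing left to dispatch.
    static Optional<String> nextCommand(String lastStep, Outcome outcome) {
        return Optional.ofNullable(NEXT.get(new Key(lastStep, outcome)));
    }
}
```

Because the table is plain data, it can be reviewed, diffed, and tested without spinning up a single service.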

8️⃣ Testing SAGA Is an Engineering Skill in Itself

Juniors write happy-path unit tests.
Seniors write failure simulations.

We literally built an internal “Chaos Saga Tester” that randomly fails a step in the flow to see what happens.

FAIL_STEP=Inventory ./runSagaTest.sh

Takeaway:
Don’t test if SAGA works.
Test if it fails gracefully.
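The internals of our tester aren't shown here, but the core idea of fault injection fits in a few lines. The sketch below is a guess at the general shape, not our tool: FaultInjector is a hypothetical class that wraps a step body and throws when the step name matches a configured value, for instance one read from a FAIL_STEP environment variable.

```java
final class FaultInjector {
    private final String failStep; // may be null, meaning "fail nothing"

    FaultInjector(String failStep) {
        this.failStep = failStep;
    }

    // Reads the step to fail from the FAIL_STEP environment variable, if set.
    static FaultInjector fromEnv() {
        return new FaultInjector(System.getenv("FAIL_STEP"));
    }

    // Wraps a step body so it throws instead of running when its name
    // matches the configured step; other steps run unchanged.
    Runnable wrap(String stepName, Runnable body) {
        return () -> {
            if (stepName.equals(failStep)) {
                throw new IllegalStateException("injected failure in " + stepName);
            }
            body.run();
        };
    }
}
```

A test suite can then loop over every step name, inject a failure at each one in turn, and assert that the saga ends fully compensated.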

9️⃣ Not Everything Needs a SAGA

This is the hill most senior architects will die on.

Every new engineer loves to say:

“We should make it event-driven and SAGA-based!”

But sometimes… a local transaction + retry queue is enough.

Use SAGA when:

  • You have 3+ services in a logical workflow
  • You need business-level rollback
  • You can tolerate eventual consistency

Otherwise, don’t add unnecessary moving parts.

Takeaway:
The best architecture is the simplest one that works.

🔟 SAGA Success Is Cultural, Not Just Technical

This is what juniors rarely see.
SAGA only works if every team in the microservice mesh agrees on contracts, compensation rules, and event semantics.

Otherwise, you end up with distributed trust issues.
Payment team compensates differently than Inventory.
Notifications get triggered twice.
And everyone blames Kafka. 😅

Takeaway:
SAGA design isn’t just code — it’s organizational alignment.
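The "notifications triggered twice" problem is usually an agreement problem: does every consumer treat redelivery as normal? One common convention is an idempotent handler keyed by message ID. The sketch below assumes hypothetical names (IdempotentHandler, messageId) and keeps seen IDs in memory only to stay runnable; a real consumer would persist them alongside its own data.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

final class IdempotentHandler<T> {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Consumer<T> delegate;

    IdempotentHandler(Consumer<T> delegate) {
        this.delegate = delegate;
    }

    // Returns true if the message was processed, false if it was a duplicate.
    boolean handle(String messageId, T payload) {
        if (!seen.add(messageId)) {
            return false; // already handled, skip the side effect
        }
        try {
            delegate.accept(payload);
            return true;
        } catch (RuntimeException e) {
            seen.remove(messageId); // allow a redelivery to try again
            throw e;
        }
    }
}
```

If every team agrees that handlers behave this way, at-least-once delivery stops being something to blame on the broker.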

ASCII Diagram — The “Mature” SAGA Flow

        +-----------------------+
        |   SAGA Orchestrator   |
        +-----------+-----------+
                    |
     +--------------+---------------+
     |              |               |
     v              v               v
+----------+   +-----------+   +--------------+
| Payment  |   | Inventory |   | Notification |
+----------+   +-----------+   +--------------+
     |              |               |
     +-------> Compensation <-------+

Each service:

  • Knows how to recover
  • Reports state
  • Doesn’t rely on a central coordinator for logic

That’s how you make SAGA reliable.

Final Thoughts

When you’ve seen a few production SAGA disasters, you stop being romantic about distributed systems.
You stop chasing the “perfect” orchestration.
You start caring about idempotency, retries, traceability, and simplicity.

That’s what senior architects know:

SAGA isn’t about perfection — it’s about resilience.

It’s not a pattern you implement.
It’s a habit you practice.
And when done right, it’s the difference between a system that scales quietly… and one that falls apart on a Monday morning.

Responses (3)

I'd also add that a common orchestration mistake is using one SAGA orchestrator for everything, which creates a central point of failure. If there are multiple domains, give each its own orchestrator so that a failure in one does not spread to the others.


Great


Painful ai slop.