Saga Pattern
A saga coordinates a distributed transaction as a sequence of local transactions across services. If any step fails, compensating transactions undo the previously completed steps.
Saga = planning a wedding across multiple vendors. If the caterer cancels, you can't 'undo' the cake tasting — you call each vendor to cancel and get refunds one by one.
Sagas come in two flavors: choreography (each service publishes events and reacts to other services' events) and orchestration (a central saga orchestrator directs each step via commands). Choreography is decentralized and has no single coordinator to fail, but the overall flow is implicit in the event subscriptions and becomes hard to reason about for complex flows. Orchestration makes the process explicit and simplifies error handling, but introduces a central coordinator that can become a bottleneck. Compensating transactions must be idempotent and must account for side effects that were already visible externally before the rollback (for example, a confirmation email that has already been sent). A saga does not provide isolation or all-or-nothing atomicity in the ACID sense: intermediate states are observable, and the system is only eventually consistent.
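To make the choreography flavor concrete, here is a minimal sketch under stated assumptions: the `EventBus` class is an in-memory, synchronous stand-in for a real message broker, and the event names (`OrderPlaced`, `InventoryReserved`, `PaymentCharged`, `PaymentFailed`, `InventoryReleased`) and handler names are hypothetical. No service calls another directly; each one only reacts to events.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBus:
    """In-memory, synchronous stand-in for a message broker (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.published: List[str] = []  # event types, in publication order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.published.append(event_type)
        for handler in list(self._handlers[event_type]):
            handler(payload)


def wire_services(bus: EventBus, payment_ok: bool) -> None:
    """Subscribe each hypothetical service to the events it reacts to."""

    def inventory_on_order_placed(order: dict) -> None:
        bus.publish("InventoryReserved", order)

    def payment_on_inventory_reserved(order: dict) -> None:
        # payment_ok simulates the outcome of the charge.
        bus.publish("PaymentCharged" if payment_ok else "PaymentFailed", order)

    def inventory_on_payment_failed(order: dict) -> None:
        # Compensation: inventory reacts to the failure event on its own.
        bus.publish("InventoryReleased", order)

    bus.subscribe("OrderPlaced", inventory_on_order_placed)
    bus.subscribe("InventoryReserved", payment_on_inventory_reserved)
    bus.subscribe("PaymentFailed", inventory_on_payment_failed)
```

Note that the end-to-end flow is not written down anywhere in this sketch; it only emerges from the subscriptions, which is the "hard to reason about" cost of choreography.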
Saga state must be persisted so the orchestrator or the choreography participants can resume after a crash. The orchestrator itself must be idempotent: receiving the same completion event twice must not advance the saga twice. This is typically handled via a state machine stored in a database, with optimistic locking (for example, a version column) on the row for each saga ID, so that a duplicate or concurrent update is rejected instead of applied. A critical edge case: compensating transactions can themselves fail. You need a 'stuck saga' detection process that alerts on sagas that haven't progressed for longer than their timeout. Temporal (temporal.io) solves many saga implementation problems by providing durable execution with automatic retries and built-in state persistence.
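A persisted, idempotent state machine with optimistic locking could be sketched as follows. This is an illustration under assumptions, not a reference implementation: it uses Python's built-in `sqlite3` as a stand-in for a production database, and the table layout, state names, and function names (`open_store`, `start_saga`, `on_step_completed`) are hypothetical.

```python
import sqlite3

# Hypothetical linear state machine for one saga type.
STATES = ["STARTED", "INVENTORY_RESERVED", "PAYMENT_CHARGED", "COMPLETED"]


def open_store() -> sqlite3.Connection:
    """Create an in-memory store; a real system would use a durable database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE saga ("
        " saga_id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL,"
        " version INTEGER NOT NULL)"
    )
    return conn


def start_saga(conn: sqlite3.Connection, saga_id: str) -> None:
    conn.execute("INSERT INTO saga VALUES (?, 'STARTED', 0)", (saga_id,))
    conn.commit()


def on_step_completed(conn: sqlite3.Connection, saga_id: str, new_state: str) -> bool:
    """Advance the saga only if new_state directly follows the stored state.

    Returns True if the saga advanced, False for a duplicate, out-of-order,
    or concurrently superseded event.
    """
    row = conn.execute(
        "SELECT state, version FROM saga WHERE saga_id = ?", (saga_id,)
    ).fetchone()
    if row is None:
        raise KeyError(saga_id)
    state, version = row
    if STATES.index(new_state) != STATES.index(state) + 1:
        return False  # idempotency: a replayed event does not advance the saga
    # Optimistic lock: the update applies only if nobody changed the row
    # since we read it (the version still matches).
    cur = conn.execute(
        "UPDATE saga SET state = ?, version = version + 1"
        " WHERE saga_id = ? AND version = ?",
        (new_state, saga_id, version),
    )
    conn.commit()
    return cur.rowcount == 1
```

Delivering the same `INVENTORY_RESERVED` completion twice advances the saga once; the second call returns `False` and leaves the row unchanged.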
For an e-commerce order flow (reserve inventory → charge payment → confirm shipment), I use an orchestrated saga with the saga state persisted in Postgres. Each step is a command sent to the target service. On failure at any step, the orchestrator executes compensating commands for the completed steps in reverse order: refund the payment if it was already charged, then release the inventory reservation. I use idempotency keys on every command and the Temporal workflow engine to handle retries, timeouts, and state durability.
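The reverse-order compensation logic of such an orchestrator can be sketched in a few lines. This is a simplified in-memory illustration, not the Temporal API and not a durable implementation: `Step`, `SagaResult`, `run_saga`, and the step functions are hypothetical names, and the `carrier_down` flag only simulates a shipment failure.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # forward local transaction
    compensate: Callable[[dict], None]  # domain-level undo of that step


@dataclass
class SagaResult:
    succeeded: bool
    log: List[str] = field(default_factory=list)


def run_saga(steps: List[Step], ctx: dict) -> SagaResult:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    result = SagaResult(succeeded=True)
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception as exc:
            result.succeeded = False
            result.log.append(f"failed:{step.name}:{exc}")
            # The failed step itself is not compensated, only earlier ones.
            for done in reversed(completed):
                done.compensate(ctx)
                result.log.append(f"compensated:{done.name}")
            break
        completed.append(step)
        result.log.append(f"done:{step.name}")
    return result


def reserve_inventory(ctx: dict) -> None:
    ctx["reserved"] = True


def release_inventory(ctx: dict) -> None:
    ctx["reserved"] = False


def charge_payment(ctx: dict) -> None:
    ctx["charged"] = True


def refund_payment(ctx: dict) -> None:
    ctx["charged"] = False


def confirm_shipment(ctx: dict) -> None:
    if ctx.get("carrier_down"):
        raise RuntimeError("carrier unavailable")
    ctx["shipped"] = True


def cancel_shipment(ctx: dict) -> None:
    ctx["shipped"] = False


ORDER_STEPS = [
    Step("reserve_inventory", reserve_inventory, release_inventory),
    Step("charge_payment", charge_payment, refund_payment),
    Step("confirm_shipment", confirm_shipment, cancel_shipment),
]
```

Running `run_saga(ORDER_STEPS, {"carrier_down": True})` refunds the payment and then releases the inventory. The sketch deliberately omits persistence, retries, and handling of a compensation that itself raises; those are the parts a durable workflow engine is meant to cover.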
Trying to implement sagas without compensating transactions. You cannot simply 'roll back' a distributed transaction: the inventory service has already committed its decrement. Compensating transactions are domain operations (re-increment inventory), not technical rollbacks.
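As a sketch of what "domain operation" means here, the hypothetical `Inventory` class below models a reservation and its compensation as two separate business operations, both keyed by a reservation ID that acts as an idempotency key. The class and method names are assumptions for illustration only.

```python
from typing import Dict


class Inventory:
    """Toy inventory service with an idempotent reserve/release pair."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._reservations: Dict[str, int] = {}  # reservation_id -> quantity held

    def reserve(self, reservation_id: str, qty: int) -> None:
        """Forward step: commit a decrement of the available stock."""
        if reservation_id in self._reservations:
            return  # duplicate command: already applied
        if qty > self.stock:
            raise ValueError("insufficient stock")
        self.stock -= qty
        self._reservations[reservation_id] = qty

    def release(self, reservation_id: str) -> None:
        """Compensation: a new operation that re-increments the stock."""
        qty = self._reservations.pop(reservation_id, None)
        if qty is not None:  # unknown or already released: no-op
            self.stock += qty
```

The release does not rewind the earlier decrement; it is a second committed change that restores the quantity, and calling it twice has the same effect as calling it once.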