Reliability

We build for failure so production stays predictable: a billing state machine with idempotent webhooks (Dodo), job retries and dead-letter queues (Trigger.dev), usage thresholds with overrun handling, and operational safety through runbooks and circuit breakers. One page on how we avoid double charges, silent drops, and lock-in, and how we recover when things go wrong.

Reliability topics

Billing failure recovery
Dodo Payments and a billing state machine. No double charges, no lost state when providers fail or webhooks are delayed. Grace periods and documented failure playbooks.

Subscriptions and usage move through explicit states (trialing, active, past_due, canceled). Every billing event is recorded in the ledger. Failed payments trigger configurable grace periods and dunning; we don't cut access immediately. Webhook handling is idempotent (keyed by idempotency keys) so retries and duplicate delivery don't create duplicate charges. We document failure playbooks for payment failures and webhook delays so you can recover predictably.
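The explicit-states idea can be sketched as a small transition table. This is a minimal illustration, not the actual implementation: the state and event names beyond the four listed states are assumptions, and in production each accepted event would also be appended to the ledger.

```typescript
// Minimal billing state machine sketch (illustrative names, not production code).
// Only listed transitions are allowed; any other event is rejected, so a
// delayed or out-of-order event can't push a subscription into a bad state.
type BillingState = "trialing" | "active" | "past_due" | "canceled";
type BillingEvent = "activate" | "payment_failed" | "payment_recovered" | "cancel";

const transitions: Record<BillingState, Partial<Record<BillingEvent, BillingState>>> = {
  trialing: { activate: "active", cancel: "canceled" },
  active:   { payment_failed: "past_due", cancel: "canceled" },
  past_due: { payment_recovered: "active", cancel: "canceled" },
  canceled: {}, // terminal: no events accepted
};

function transition(state: BillingState, event: BillingEvent): BillingState {
  const next = transitions[state][event];
  if (next === undefined) {
    throw new Error(`Illegal transition: ${event} in state ${state}`);
  }
  return next; // the caller records the event in the ledger alongside the new state
}
```

Rejecting unknown transitions loudly (instead of ignoring them) is what makes delayed webhooks safe: a stale event simply fails validation rather than silently overwriting newer state.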

Webhook safety
Payloads signed and verified; handlers idempotent. Duplicate or out-of-order delivery doesn't corrupt data. We store processed keys and reject duplicates.

We sign webhook payloads and recommend verification on your side. Retries use exponential backoff with a cap. Critical handlers (billing, notifications) are idempotent: we store processed idempotency keys and reject duplicates so provider retries are safe. Charge and invoice creation are idempotent. We document replay and dead-letter handling so you can recover from backlog or outages without data corruption.
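A handler combining both safeguards, signature verification first, then the idempotency-key gate, might look like the following sketch. The header format, key shape, and in-memory store are assumptions for illustration; in production the processed-key set would live in a durable store.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Idempotent webhook handler sketch (illustrative, not the actual API).
// 1) Verify the HMAC signature so forged payloads are rejected.
// 2) Check the idempotency key so duplicate deliveries are no-ops.
const processedKeys = new Set<string>(); // in production: a durable store

function verifySignature(payload: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(payload).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b); // constant-time compare
}

function handleWebhook(
  payload: string,
  signature: string,
  idempotencyKey: string,
  secret: string,
): "processed" | "duplicate" | "rejected" {
  if (!verifySignature(payload, signature, secret)) return "rejected";
  if (processedKeys.has(idempotencyKey)) return "duplicate"; // provider retry: safe no-op
  processedKeys.add(idempotencyKey);
  // ...apply the billing side effect exactly once here...
  return "processed";
}
```

Returning a success status for duplicates (rather than an error) matters: it stops the provider from retrying a delivery that has already been applied.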

Job retries and dead-letter
Trigger.dev: durable jobs, exponential backoff, and a dashboard. Failed jobs move to dead-letter so nothing is dropped silently. Inspect and replay when ready.

Background jobs run on Trigger.dev with configurable retry and max attempts. After exhaustion, jobs move to a dead-letter state so you can inspect payloads and replay once the root cause is fixed. We enforce idempotency for billing-related jobs so retries don't double-apply side effects. The Trigger.dev dashboard gives you visibility into status, failures, and replay.
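The retry-then-dead-letter flow can be sketched generically. Note this is not Trigger.dev's actual API; it's a hand-rolled illustration of the pattern: retry with exponential backoff, and on exhaustion park the payload where it can be inspected and replayed.

```typescript
// Generic retry-with-dead-letter sketch (illustrative; Trigger.dev provides
// this behavior for you, with its own configuration and dashboard).
type DeadLetter<T> = { payload: T; error: string; attempts: number };

async function runWithRetry<T>(
  payload: T,
  job: (p: T) => Promise<void>,
  maxAttempts: number,
  deadLetterQueue: DeadLetter<T>[],
  baseDelayMs = 100,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await job(payload);
      return true;
    } catch (err) {
      if (attempt === maxAttempts) {
        // Exhausted: keep the payload and error for inspection and replay.
        deadLetterQueue.push({ payload, error: String(err), attempts: attempt });
        return false; // nothing is dropped silently
      }
      // Exponential backoff: 100ms, 200ms, 400ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  return false;
}
```

The key property is that failure is a recorded state, not an absence: every exhausted job leaves behind its payload, its error, and its attempt count.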

Usage overrun and threshold alerts
Configurable thresholds fire before hard limits. Soft caps, overrun policies, and metered usage so billing stays accurate and you stay in control.

When usage approaches or exceeds limits, we fire threshold alerts so you can act before hitting hard stops. You can set soft caps, allow overrun with overage billing, or hard-stop — configurable per product. Usage events are ingested, aggregated, and fed into the billing pipeline so metered usage and limits stay accurate. No silent breakage; clear errors when validation fails.
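One way to express that per-product configuration is a single evaluation step that returns both the alerts to fire and the allow/deny decision. The names and thresholds below are assumptions for illustration, not the real config schema.

```typescript
// Usage threshold evaluation sketch (illustrative names, not the real schema).
type OverrunPolicy = "hard_stop" | "bill_overage";

interface UsageDecision {
  alerts: number[];     // thresholds (fractions of the limit) that have been crossed
  allowed: boolean;     // whether the request should proceed
  overageUnits: number; // units to bill as overage, if the policy allows overrun
}

function evaluateUsage(
  used: number,
  limit: number,
  thresholds: number[], // e.g. [0.8, 0.9, 1.0] -> alert at 80%, 90%, 100%
  policy: OverrunPolicy,
): UsageDecision {
  const ratio = used / limit;
  const alerts = thresholds.filter((t) => ratio >= t);
  if (used <= limit) return { alerts, allowed: true, overageUnits: 0 };
  return policy === "hard_stop"
    ? { alerts, allowed: false, overageUnits: 0 }                 // clear error, no silent breakage
    : { alerts, allowed: true, overageUnits: used - limit };      // metered overage flows to billing
}
```

Because alerts are computed from the same numbers that drive the allow/deny decision, the warning you receive at 80% and the limit you hit at 100% can never disagree.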

Operational safety
Feature flags, circuit breakers, runbooks, and read-only impersonation. When dependencies fail, you have a playbook and the tools to limit blast radius.

Kill switches and circuit breakers stop calling failing dependencies (e.g. payment provider, email API) and fail fast; after cooldown we probe again so you don't amplify outages. Runbooks live in the docs and cover provider down, queue backlog, billing reconciliation, and data reconciliation. Admin impersonation is read-only so you can debug without making changes. Report-a-problem sends user, page, and error context, so when things break you have both a playbook and the data to act on it.
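The breaker behavior described above, fail fast while open, probe once after cooldown, can be sketched as a small class. This is an illustrative minimum, not the production implementation; thresholds and cooldowns are placeholder values.

```typescript
// Minimal circuit-breaker sketch (illustrative, not production code).
// Closed: calls pass through. After `threshold` consecutive failures the
// breaker opens and calls fail fast. After `cooldownMs`, one probe is allowed
// through; if it succeeds the breaker closes, if it fails the breaker reopens.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold: number,
    private cooldownMs: number,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    let probing = false;
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast"); // don't hammer a down dependency
      }
      probing = true; // half-open: allow exactly this call through
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null; // success closes the breaker
      return result;
    } catch (err) {
      if (probing || ++this.failures >= this.threshold) {
        this.openedAt = this.now(); // open (or reopen after a failed probe)
        this.failures = 0;
      }
      throw err;
    }
  }
}
```

Failing fast while open is the point: a dependency that is already struggling gets breathing room instead of a retry storm, which is how you avoid amplifying an outage.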

See how these practices fit into the bigger picture: architecture and systems.
