Reliability
We build for failure so production stays predictable. This page covers four areas: the billing state machine and idempotent webhooks (Dodo), job retries and dead-lettering (Trigger.dev), usage thresholds and overrun handling, and operational safety through runbooks and circuit breakers. It explains how we avoid double charges, silent drops, and lock-in, and how we recover when things go wrong.
Reliability topics
Subscriptions and usage move through explicit states (trialing, active, past_due, canceled). Every billing event is recorded in the ledger. Failed payments trigger configurable grace periods and dunning; we don't cut access immediately. Webhook handling is idempotent: each event carries an idempotency key, so retries and duplicate deliveries don't create duplicate charges. We document failure playbooks for payment failures and webhook delays so you can recover predictably.
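As a rough illustration of the explicit states described above, here is a minimal sketch of a subscription state machine that records every event in a ledger. The four state names come from this page; the event names, the transition table, and the `applyEvent` helper are assumptions made for the example, not the product's actual implementation.

```typescript
// States are taken from the text; everything else here is hypothetical.
type SubscriptionState = "trialing" | "active" | "past_due" | "canceled";
type BillingEvent =
  | "payment_succeeded"
  | "payment_failed"
  | "grace_period_expired"
  | "canceled_by_user";

// Assumed transition table: any pair not listed leaves the state unchanged.
const transitions: Record<SubscriptionState, Partial<Record<BillingEvent, SubscriptionState>>> = {
  trialing: { payment_succeeded: "active", payment_failed: "past_due", canceled_by_user: "canceled" },
  active: { payment_failed: "past_due", canceled_by_user: "canceled" },
  // A failed payment does not cancel immediately; access ends only when the grace period expires.
  past_due: { payment_succeeded: "active", grace_period_expired: "canceled", canceled_by_user: "canceled" },
  canceled: {},
};

interface LedgerEntry {
  event: BillingEvent;
  from: SubscriptionState;
  to: SubscriptionState;
}

// Applies one event, appends it to the ledger, and returns the resulting state.
function applyEvent(
  state: SubscriptionState,
  event: BillingEvent,
  ledger: LedgerEntry[],
): SubscriptionState {
  const next = transitions[state][event] ?? state;
  ledger.push({ event, from: state, to: next });
  return next;
}
```

Keeping the transitions in one table makes illegal moves (for example, reviving a canceled subscription through a stray payment event) a no-op that still shows up in the ledger.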
We sign webhook payloads and recommend that you verify the signature on your side. Retries use exponential backoff with a cap. Critical handlers (billing, notifications) are idempotent: we store processed idempotency keys and reject duplicates so provider retries are safe. Charge and invoice creation are idempotent. We document replay and dead-letter handling so you can recover from a backlog or an outage without data corruption.
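The two mechanisms in this paragraph, duplicate rejection by idempotency key and capped exponential backoff, can be sketched as follows. This is a simplified illustration: `WebhookEvent`, `IdempotentHandler`, `backoffDelayMs`, and the default base and cap values are all hypothetical, and the in-memory `Set` stands in for whatever durable store a real deployment would use.

```typescript
// Hypothetical event shape; real provider payloads will differ.
interface WebhookEvent {
  idempotencyKey: string;
  type: string;
  amountCents: number;
}

class IdempotentHandler {
  // In-memory for the sketch; a real system would need a durable store
  // (for example, a table with a unique constraint on the key).
  private processed = new Set<string>();
  readonly charges: number[] = [];

  // Runs the side effect at most once per idempotency key.
  handle(event: WebhookEvent): "processed" | "duplicate" {
    if (this.processed.has(event.idempotencyKey)) {
      return "duplicate"; // retry or duplicate delivery: no second charge
    }
    this.charges.push(event.amountCents);
    this.processed.add(event.idempotencyKey);
    return "processed";
  }
}

// Delay before retry number `attempt` (0-based): doubles each time, never exceeds capMs.
// The 1 s base and 60 s cap are assumed values, not documented defaults.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

With these assumed defaults the delays run 1 s, 2 s, 4 s, and so on until they reach the 60 s cap. In a concurrent system the key check and the side effect would need to be atomic, which this single-threaded sketch does not address.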
Background jobs run on Trigger.dev with configurable retry policies and maximum attempts. Once attempts are exhausted, a job moves to a dead-letter state so you can inspect its payload and replay it after the root cause is fixed. We enforce idempotency for billing-related jobs so retries don't double-apply side effects. The Trigger.dev dashboard gives you visibility into status, failures, and replay.
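The retry-then-dead-letter flow can be illustrated with a small, framework-independent sketch. It deliberately does not use the Trigger.dev API; `Job`, `DeadLetter`, and `runWithRetries` are hypothetical names, and the loop retries immediately where a real runner would wait between attempts.

```typescript
interface Job<T> {
  id: string;
  payload: T;
}

// What gets kept for inspection and replay once retries are exhausted.
interface DeadLetter<T> {
  job: Job<T>;
  attempts: number;
  lastError: string;
}

// Runs the handler up to maxAttempts times. Returns true on success;
// on exhaustion, records the job in deadLetters and returns false.
function runWithRetries<T>(
  job: Job<T>,
  handler: (payload: T) => void,
  maxAttempts: number,
  deadLetters: DeadLetter<T>[],
): boolean {
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(job.payload);
      return true;
    } catch (err) {
      // Remember the most recent failure; a real runner would back off here.
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  deadLetters.push({ job, attempts: maxAttempts, lastError });
  return false;
}
```

Because the dead-letter entry keeps the original payload, replaying is just calling `runWithRetries` again with the stored job, which is safe only if the handler itself is idempotent.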
When usage approaches or exceeds limits, we fire threshold alerts so you can act before hitting hard stops. Per product, you can set a soft cap, allow overrun with overage billing, or enforce a hard stop. Usage events are ingested, aggregated, and fed into the billing pipeline so metered usage and limits stay accurate. Nothing breaks silently: when validation fails, you get a clear error.
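To make the three overrun options concrete, here is a hedged sketch of how a per-product policy might be evaluated. The type names, the `alertAtFraction` field, and the interpretation of a soft cap as "alert but keep serving, without billing overage" are assumptions for the example; the actual configuration fields may differ.

```typescript
type OverrunPolicy = "soft_cap" | "overage" | "hard_stop";

// Hypothetical per-product configuration.
interface UsageLimit {
  limit: number;                // included units per billing period
  alertAtFraction: number;      // e.g. 0.8 fires the alert at 80% of the limit
  policy: OverrunPolicy;
  overageUnitPriceCents: number;
}

interface UsageDecision {
  allowed: boolean;
  alert: boolean;
  overageCents: number;
}

// Decides whether a request is allowed, whether to alert, and what overage to bill.
function evaluateUsage(used: number, cfg: UsageLimit): UsageDecision {
  const alert = used >= cfg.limit * cfg.alertAtFraction;
  if (used <= cfg.limit) {
    return { allowed: true, alert, overageCents: 0 };
  }
  switch (cfg.policy) {
    case "hard_stop":
      return { allowed: false, alert, overageCents: 0 };
    case "overage":
      return { allowed: true, alert, overageCents: (used - cfg.limit) * cfg.overageUnitPriceCents };
    case "soft_cap":
      // Assumed meaning: keep serving and alert, but do not bill the excess.
      return { allowed: true, alert, overageCents: 0 };
  }
}
```

For example, with a limit of 1000 units, an overage price of 2 cents per unit, and the `overage` policy, 1100 units used yields an allowed request with 200 cents of overage.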
Kill switches and circuit breakers stop calls to failing dependencies (e.g. payment provider, email API) and fail fast; after a cooldown we probe again so you don't amplify outages. Runbooks live in the docs and cover provider outages, queue backlog, billing reconciliation, and data reconciliation. Admin impersonation is read-only so you can debug without making changes. Report-a-problem sends user, page, and error context. When things break, you have a documented path to recovery.
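The fail-fast, cooldown, and probe behavior described above follows the standard circuit breaker pattern. The sketch below is one minimal way to express it; the `CircuitBreaker` class, its threshold and cooldown parameters, and the injectable clock are hypothetical and are not taken from the product's code.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  // `now` is injectable so the cooldown can be tested without waiting.
  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  call<T>(fn: () => T): T {
    let probing = false;
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) {
        // Still cooling down: do not touch the failing dependency.
        throw new Error("circuit open: failing fast");
      }
      probing = true; // cooldown elapsed: let one probe call through
      this.state = "half_open";
    }
    try {
      const result = fn();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed probe reopens immediately; otherwise open at the threshold.
      if (probing || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

A caller would wrap each dependency call, for instance `breaker.call(() => sendEmail(message))` with a hypothetical `sendEmail`, and treat the "circuit open" error as a signal to queue the work or degrade gracefully.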
Common questions
See how these practices fit into the bigger picture: architecture and systems.