Backup & disaster recovery
The OP itself is stateless. Recovery posture is entirely a property of your op.Store — what's persistent, what's volatile, and what your recovery point objective (RPO) is per substore.
Backup priority
| Substore | Priority | RPO target | Why |
|---|---|---|---|
| Clients | critical | hours | RP relationships; recovery requires contacting RPs to reissue secrets |
| Users (your existing table) | critical | minutes | account loss is irrecoverable |
| Grants | critical | hours | consent records; loss forces every active user to re-consent |
| RefreshTokens | high | hours | active sessions; loss forces every active user to re-login |
| AccessTokens | medium | low priority | bounded by access-token TTL (default 5 min); natural recovery |
| IATs (DCR initial-access tokens) | medium | hours | issued out-of-band; loss requires reissue |
| RATs (DCR registration-access tokens) | medium | hours | required for the RP to read / update its own metadata |
| AuthorizationCodes | low | n/a | one-shot, ≤ 60 s lifetime; never restore |
| PARs | low | n/a | one-shot, ≤ 90 s lifetime; never restore |
| Sessions | embedder choice | depends on WithSessionDurabilityPosture | |
| Interactions | low | n/a | per-attempt; users retry |
| ConsumedJTIs | must not restore | n/a | restoring a jti set rolls back replay protection |
| Passkeys, TOTPs, EmailOTPs, Recovery | critical | minutes | MFA factor records; loss locks users out |
The "must not restore" row deserves emphasis: replay-protection substores are intentionally one-way write logs. Restoring them from backup re-opens the replay window between the backup point and now. Treat them like sequence counters, not data.
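The rollback effect can be shown with a toy model. The sketch below is not the library's ConsumedJTIs implementation; the jtiSet type and its methods are hypothetical stand-ins, used only to show that a jti consumed after the backup point is accepted a second time once the backup is restored.

```go
package main

import "fmt"

// jtiSet is a toy stand-in for a ConsumedJTIs substore: a one-way log of
// token identifiers that have already been presented once.
type jtiSet map[string]struct{}

// consume records jti and reports whether this was its first use.
// A false result means the presentation is a replay and must be rejected.
func (s jtiSet) consume(jti string) bool {
	if _, seen := s[jti]; seen {
		return false
	}
	s[jti] = struct{}{}
	return true
}

// snapshot copies the set, standing in for a backup (or a restore of one).
func (s jtiSet) snapshot() jtiSet {
	out := make(jtiSet, len(s))
	for k := range s {
		out[k] = struct{}{}
	}
	return out
}

// replayAcceptedAfterRestore consumes a jti after a backup is taken,
// "restores" the backup, and reports whether the same jti is accepted again.
func replayAcceptedAfterRestore() bool {
	live := jtiSet{}
	live.consume("jti-before-backup")
	backup := live.snapshot()

	// Consumed inside the window between the backup point and now.
	live.consume("jti-after-backup")

	// Restore: the live state rolls back to the backup.
	restored := backup.snapshot()
	return restored.consume("jti-after-backup")
}

func main() {
	fmt.Println("replay accepted after restore:", replayAcceptedAfterRestore())
}
```

The second consume succeeds because the restored set has no memory of anything that happened after the backup, which is the window the paragraph above warns about.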
What to skip
These are safe to discard on recovery — the OP rebuilds them from upstream state:
- DPoP server-nonce cache. Reseeded on first request after recovery.
- JAR / DPoP jti consumed set. See above: restoring it is actively harmful.
- PARs and authorization codes. TTL ≤ 90 s; expired anyway.
- Discovery cache (RP-side). RPs revalidate on kid mismatch during a JWKS rotation.
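One way to keep a restore script honest about these rules is to encode them as data. The following is a minimal sketch, assuming substores are identified by the names used in the priority table; the policy map and the plan function are illustrative and not part of the library. Substores missing from the map (Sessions and AccessTokens here) are reported as undecided so an operator makes that call explicitly.

```go
package main

import "fmt"

// restoreAction says what a recovery run should do with one substore.
type restoreAction int

const (
	restore restoreAction = iota // bring back from backup
	discard                      // start empty; short-lived or rebuilt on demand
)

// policy mirrors the backup-priority table. It is an illustrative structure,
// not something the library exposes.
var policy = map[string]restoreAction{
	"Clients":            restore,
	"Users":              restore,
	"Grants":             restore,
	"RefreshTokens":      restore,
	"IATs":               restore,
	"RATs":               restore,
	"Passkeys":           restore,
	"TOTPs":              restore,
	"EmailOTPs":          restore,
	"Recovery":           restore,
	"AuthorizationCodes": discard,
	"PARs":               discard,
	"Interactions":       discard,
	"ConsumedJTIs":       discard, // must not restore
}

// plan splits the substores found in a backup into three groups,
// preserving input order within each group.
func plan(backedUp []string) (toRestore, toDiscard, undecided []string) {
	for _, name := range backedUp {
		action, known := policy[name]
		switch {
		case !known:
			undecided = append(undecided, name)
		case action == restore:
			toRestore = append(toRestore, name)
		default:
			toDiscard = append(toDiscard, name)
		}
	}
	return toRestore, toDiscard, undecided
}

func main() {
	r, d, u := plan([]string{"Clients", "ConsumedJTIs", "Sessions", "PARs"})
	fmt.Println("restore:", r)
	fmt.Println("discard:", d)
	fmt.Println("undecided:", u)
}
```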
Backup mechanics per backend
SQL adapter (storeadapter/sql)
Standard DB backup tooling applies. Two modes that matter:
- Logical (mysqldump / pg_dump): point-in-time consistent snapshot. Use this if you can pause writes briefly during the snapshot.
- Physical (binlog / WAL streaming): continuous replication. Use this for sub-minute RPO.
The OP does not require any specific backup mode; pick what your DB ops team already runs.
Schema is documented in the source repo under op/storeadapter/sql/schema/.
Redis adapter (storeadapter/redis)
Redis stores only volatile substores in the bundled adapter:
- Sessions (eligible; WithSessionDurabilityPosture annotates)
- Interactions
- ConsumedJTIs
- DPoP nonce cache (if you implemented one)
For these, RDB snapshots and AOF persistence are the standard Redis options. Pick:
- AOF off, RDB off: pure volatile; sessions evict on restart. This is a valid posture; set WithSessionDurabilityPosture accordingly.
- RDB only: periodic snapshot; sessions survive restart but recent activity may be lost.
- AOF + fsync everysec: durability close to SQL; rare for volatile substores.
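For reference, the three postures map onto standard redis.conf directives (save, appendonly, appendfsync). The sketch below only produces the directive strings; the redisPosture type and redisConf function are hypothetical helpers, and the save interval of 900 seconds / 1 change is an assumed example value, not a recommendation from this library.

```go
package main

import "fmt"

// redisPosture names the three persistence choices listed above.
type redisPosture int

const (
	pureVolatile redisPosture = iota // AOF off, RDB off
	rdbOnly                          // periodic snapshot only
	aofEverysec                      // append-only file, fsync once per second
)

// redisConf returns redis.conf directives for a posture.
// The RDB interval ("save 900 1") is an illustrative value.
func redisConf(p redisPosture) []string {
	switch p {
	case pureVolatile:
		// An empty save argument disables RDB snapshots.
		return []string{`save ""`, "appendonly no"}
	case rdbOnly:
		// Snapshot if at least 1 key changed within 900 seconds.
		return []string{"save 900 1", "appendonly no"}
	case aofEverysec:
		return []string{"appendonly yes", "appendfsync everysec"}
	}
	return nil
}

func main() {
	for _, line := range redisConf(aofEverysec) {
		fmt.Println(line)
	}
}
```

Whichever posture you choose, keep the value passed to WithSessionDurabilityPosture in step with the Redis configuration actually deployed.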
Embedder-implemented store
If you wrote your own op.Store, the contract test suite at op/store/contract verifies your implementation against the same expectations the bundled adapters meet. The contract does not prescribe a backup shape — that's up to you.
Recovery procedure
Total loss of durable backend
- Restore the durable backend from backup.
- Bring up the OP against the restored store.
- Optionally, fan out an invalidation:
  - If the backup is older than WithAccessTokenTTL, every issued access token has expired; no action needed.
  - If the backup is older than the JWKS cache window, RPs may still verify tokens against keys you've since rotated; issue a manual JWKS refresh (curl from each RP) if you're paranoid.
- Notify users that re-login is required for any session established after the backup point.
Total loss of volatile backend (Redis)
- Bring up a fresh Redis instance.
- Restart the OP replicas.
- Active sessions — gone. Users re-login.
- The audit log shows a spike in bcl.no_sessions_for_subject for any logout fan-out invoked during the gap. Expected; resolve.
Partial loss (one substore corrupted)
- Stop writes to the affected substore.
- Restore that substore in isolation.
- Resume writes.
The transactional cluster invariant guarantees clients / codes / refresh tokens / access tokens / IATs share one backend, so partial restore inside that cluster is a SQL-level operation (table-level restore from binlog or PITR). Volatile substores are independent — losing Sessions doesn't impact RefreshTokens.
Cookie key recovery
If the cookie key is lost (HSM destroyed, secret manager wiped) every encrypted cookie in flight becomes invalid:
- Active browser sessions are dead — users re-login.
- Pending consent / interaction flows are dead — users restart.
- Refresh tokens are unaffected — the cookie key seals session cookies, not refresh tokens.
There is no recovery path for the cookie key itself. Treat it like an HSM-stored secret with the same redundancy posture as your signing keys.
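Why a lost key is unrecoverable can be seen with any authenticated encryption scheme. The sketch below uses AES-256-GCM from the Go standard library purely as an illustration; it is an assumption, not a description of how the library actually seals cookies, and seal / open are hypothetical helper names.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newKey returns a random 32-byte key (AES-256).
func newKey() []byte {
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		panic(err)
	}
	return k
}

// newGCM builds an AES-GCM AEAD for the given key.
func newGCM(key []byte) cipher.AEAD {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		panic(err)
	}
	return gcm
}

// seal encrypts plaintext and prepends the random nonce to the result.
func seal(key, plaintext []byte) []byte {
	gcm := newGCM(key)
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}
	return gcm.Seal(nonce, nonce, plaintext, nil)
}

// open splits off the nonce and decrypts; it fails if the key is wrong
// or the value was tampered with.
func open(key, sealed []byte) ([]byte, error) {
	gcm := newGCM(key)
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed value too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	oldKey, newKeyAfterLoss := newKey(), newKey()
	cookie := seal(oldKey, []byte("session=abc"))

	_, err := open(newKeyAfterLoss, cookie)
	fmt.Println("open with replacement key failed:", err != nil)
}
```

A cookie sealed under the old key cannot be opened with a replacement key, so every in-flight cookie is dead the moment the old key is gone.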
Signing key recovery
If the signing key is lost:
- New tokens cannot be issued until you provision a fresh op.Keyset.
- Existing tokens still verify against /jwks if the public half is recoverable from the JWKS document an RP has cached. The OP itself also verifies its own tokens, so losing the key is fatal.
- Refresh tokens are still mintable as opaque values, but the bound access tokens fail to sign.
Mitigation: store the private signing key in a service that has its own backup / replication (KMS, Vault), and never hold the only copy in process memory. The library accepts any crypto.Signer — see JWKS endpoint § HSM / KMS integration.
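As an illustration of what "any crypto.Signer" means, the sketch below wraps a signer behind an adapter type. The remoteSigner type is hypothetical: a real adapter would forward the digest to a KMS or Vault instead of delegating to a local key, and how the resulting signer is handed to op.Keyset is not shown because it depends on the library's API.

```go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"io"
)

// remoteSigner is a hypothetical adapter. In production, inner would be
// replaced by a client for a KMS / Vault that holds the private key; here
// a local ECDSA key stands in so the example is self-contained.
type remoteSigner struct {
	inner crypto.Signer
}

// Public returns the public half, which is all the process needs to hold.
func (s remoteSigner) Public() crypto.PublicKey { return s.inner.Public() }

// Sign signs a precomputed digest. A real adapter would send the digest
// to the remote service at this point.
func (s remoteSigner) Sign(r io.Reader, digest []byte, opts crypto.SignerOpts) ([]byte, error) {
	return s.inner.Sign(r, digest, opts)
}

// Compile-time check that the adapter satisfies crypto.Signer.
var _ crypto.Signer = remoteSigner{}

// signAndVerify signs msg through the adapter and verifies the signature
// using only the public key.
func signAndVerify(msg []byte) bool {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	var signer crypto.Signer = remoteSigner{inner: key}

	digest := sha256.Sum256(msg)
	sig, err := signer.Sign(rand.Reader, digest[:], crypto.SHA256)
	if err != nil {
		panic(err)
	}
	pub, ok := signer.Public().(*ecdsa.PublicKey)
	if !ok {
		return false
	}
	return ecdsa.VerifyASN1(pub, digest[:], sig)
}

func main() {
	fmt.Println("signature verifies:", signAndVerify([]byte("token payload")))
}
```

Because callers see only the crypto.Signer interface, the private key never needs to exist in process memory, which is the property the mitigation relies on.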
Rehearsal
Run a recovery drill before you go live:
- Take a backup at t0.
- Issue 100 tokens at t0 + 1 min.
- Restore the backup.
- Confirm the 100 issued tokens are unverifiable (which is the correct behaviour — the chain was rolled back).
- Confirm new login + token issuance works.
A 30-minute drill catches more issues than a written runbook ever will.
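The shape of the drill can be rehearsed against a toy store before running it on real infrastructure. This is a simulation only, assuming opaque tokens that are valid when the store has a record of them; the tokenStore type is a hypothetical in-memory stand-in and does not exercise a real op.Store or signed tokens.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// tokenStore is a toy in-memory stand-in for a durable token substore.
type tokenStore struct {
	tokens map[string]bool
}

func newTokenStore() *tokenStore {
	return &tokenStore{tokens: map[string]bool{}}
}

// issue mints a random opaque token and records it.
func (s *tokenStore) issue() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	id := hex.EncodeToString(b)
	s.tokens[id] = true
	return id
}

// verify reports whether the store knows the token.
func (s *tokenStore) verify(id string) bool { return s.tokens[id] }

// backup returns a copy of the current state.
func (s *tokenStore) backup() map[string]bool {
	out := make(map[string]bool, len(s.tokens))
	for k, v := range s.tokens {
		out[k] = v
	}
	return out
}

// restore replaces the current state with a copy of the backup.
func (s *tokenStore) restore(b map[string]bool) {
	s.tokens = make(map[string]bool, len(b))
	for k, v := range b {
		s.tokens[k] = v
	}
}

// runDrill follows the rehearsal steps and returns how many post-backup
// tokens were rejected after restore, and whether fresh issuance works.
func runDrill(n int) (rejected int, freshOK bool) {
	store := newTokenStore()
	snap := store.backup() // t0

	issued := make([]string, 0, n) // t0 + 1 min
	for i := 0; i < n; i++ {
		issued = append(issued, store.issue())
	}

	store.restore(snap)

	for _, id := range issued {
		if !store.verify(id) {
			rejected++
		}
	}
	fresh := store.issue()
	return rejected, store.verify(fresh)
}

func main() {
	rejected, freshOK := runDrill(100)
	fmt.Printf("rejected=%d freshOK=%v\n", rejected, freshOK)
}
```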