nick-doc/08 - Operations/Backend Funds Migration and Operational Runbooks.md
Siavash Sameni 81625d35d2 docs: AML scope note, human-blocked items, Task #11 pre-flight inventory
- Add AML scope note to Handoff - RN Multichain Probe (sanctions-only vs full KYT)
- Add human-blocked section with 3 precise next steps for owner
- Create Task 11 Pre-flight Inventory: library choice, dev/prod flow, admin UI gaps, backend gaps, risks, acceptance criteria
2026-05-28 20:42:42 +04:00

---
title: Backend Funds Migration and Operational Runbooks
tags: [operations, payments, migration, incidents]
created: 2026-05-24
---
# Backend Funds Migration and Operational Runbooks
These runbooks cover the selected backend/funds architecture defined in
[[Backend Core Stack Decision Record - 2026-05-24]].
> [!note] Historical migration context
> Sections that mention keeping SHKeeper active describe the migration period from the old payment rail to Request Network. Current new-payment operations should use [[Escrow Flow]], [[Request Network Integration Constraints]], and [[PRD - Decentralized Custody and Smart-Contract Escrow Roadmap]].
## 1. Migration runbook (legacy + provider migration)
### 1.1 Preflight
1. Verify provider credentials and webhook secrets are present in production `.env`.
2. Confirm canonical state docs are published and approved:
- [[Funds Ledger and Escrow State Machine Specification]]
- [[Payment Provider Adapter Spec]]
- [[Webhook Security Spec]]
3. Run validation report:
- count of legacy `Payment` rows by provider,
- duplicate provider keys,
- webhook processing success/error by provider (24h),
- unresolved `pending`/`processing` payments older than 30m.
4. Run a staging reconciliation simulation against a read-only copy.
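The row-level checks in step 3 can be sketched as a small report function. This is a minimal illustration over in-memory rows, not the project's actual tooling: the `PaymentRow` shape, its field names, and `preflight_report` are hypothetical, and the 24h webhook success/error counts are left out because they would come from the metrics system rather than the `Payment` table.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class PaymentRow:
    # Hypothetical shape of a legacy Payment row; real column names may differ.
    provider: str
    provider_key: str
    status: str
    created_at: datetime

STALE_AFTER = timedelta(minutes=30)
OPEN_STATUSES = {"pending", "processing"}

def preflight_report(rows, now):
    """Summarise legacy rows for the preflight validation report."""
    by_provider = Counter(r.provider for r in rows)
    key_counts = Counter((r.provider, r.provider_key) for r in rows)
    # A (provider, key) pair seen more than once is a duplicate provider key.
    duplicate_keys = sorted(k for k, n in key_counts.items() if n > 1)
    stale = [
        r for r in rows
        if r.status in OPEN_STATUSES and now - r.created_at > STALE_AFTER
    ]
    return {
        "count_by_provider": dict(by_provider),
        "duplicate_provider_keys": duplicate_keys,
        "stale_open_payments": len(stale),
    }
```

Passing `now` explicitly keeps the report reproducible when it is run against a read-only copy taken at a known time.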
### 1.2 Rollout sequence
1. Keep `PAYMENT_PROVIDER_MODE=read_only` and keep the SHKeeper intent path active.
2. Enable legacy+new provider dual routing via `PAYMENT_ENABLED_PROVIDERS`.
3. Activate adapter contract metrics (`payment-provider-events` and webhook counters) before any write routing.
4. Set `PAYMENT_PROVIDER_MODE=standard` for a 10-15% cohort.
5. Expand to full traffic only after two clean 24h windows.
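One way to pick the partial-traffic cohort is a deterministic hash bucket, so an account stays in or out of the cohort across requests and restarts. The sketch below is an assumption about how cohort selection could work; the function name, the salt, and keying on an account ID are all hypothetical and not taken from the adapter spec.

```python
import hashlib

def in_rollout_cohort(account_id: str, percent: float, salt: str = "provider-rollout") -> bool:
    """Return True when the account falls inside the rollout cohort.

    The bucket depends only on the salt and account ID, so the assignment
    is stable and raising the percentage only ever adds accounts.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{account_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, in 0.01% steps
    return bucket < percent * 100
```

Because the bucket is fixed per account, widening the cohort keeps every account that was already routed to the new provider, which avoids flapping between providers mid-rollout.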
### 1.3 Legacy read path and backfill
- Keep SHKeeper legacy read path enabled for at least one billing cycle after migration starts.
- Backfill task:
- map legacy SHKeeper records into `FundsAccount` references,
- write reconciliation notes into `metadata.providers`,
- leave source payment rows immutable.
- Do not delete legacy routing until backfill report is complete and reviewed.
### 1.4 Validation before enforcement
Before switching default provider:
- No critical webhook verification failures for 24h.
- DLQ volume stable for 2h.
- Provider status mismatch rate `< 0.5%` for 1000 samples.
- Manual reconciliation sample (`>= 10` rows) completed by two operators.
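The four criteria above can be encoded as a single gate that reports which conditions are still unmet. This is a sketch only: `EnforcementMetrics` and its field names are hypothetical, and the values would have to be filled in from the monitoring stack and the manual reconciliation log.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnforcementMetrics:
    # All field names are hypothetical placeholders for real metrics.
    critical_webhook_failures_24h: int
    dlq_stable_hours: float
    mismatched_samples: int
    total_samples: int
    manual_rows_reconciled: int
    manual_operators: int

def enforcement_blockers(m):
    """Return the unmet criteria; an empty list means the gate is open."""
    blockers = []
    if m.critical_webhook_failures_24h > 0:
        blockers.append("critical webhook verification failures in last 24h")
    if m.dlq_stable_hours < 2:
        blockers.append("DLQ volume not stable for 2h")
    if m.total_samples < 1000:
        blockers.append("fewer than 1000 status samples")
    elif m.mismatched_samples / m.total_samples >= 0.005:
        # The criterion is strictly below 0.5%, so exactly 0.5% still blocks.
        blockers.append("status mismatch rate not below 0.5%")
    if m.manual_rows_reconciled < 10 or m.manual_operators < 2:
        blockers.append("manual reconciliation sample incomplete")
    return blockers
```

Returning the list of blockers rather than a bare boolean gives the operators a ready-made note for the change record when the switch is deferred.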
### 1.5 Rollback criteria
Rollback immediately if any of:
- payout/hold/release path returns invariant violation,
- unresolved webhook failure burst above threshold,
- chain reconciliation mismatch above threshold from runbook table.
Rollback sequence:
1. Set `PAYMENT_ENABLED_PROVIDERS=shkeeper`.
2. Set `PAYMENT_DEFAULT_PROVIDER=shkeeper`.
3. Set `PAYMENT_PROVIDER_MODE=read_only` (pauses new routing).
4. Keep reconciliation active in dry-run mode.
5. Add a post-incident note to this doc and to [[Incident Response]].
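The three environment changes in the rollback sequence can be previewed before they are applied. The helper below is a hypothetical sketch that works on a plain dictionary of settings; how the values actually reach the running service (an `.env` edit, a redeploy, a config reload) is outside what this document specifies.

```python
ROLLBACK_OVERRIDES = {
    "PAYMENT_ENABLED_PROVIDERS": "shkeeper",
    "PAYMENT_DEFAULT_PROVIDER": "shkeeper",
    "PAYMENT_PROVIDER_MODE": "read_only",
}

def apply_rollback(env):
    """Return a copy of env with the rollback overrides applied; env is not mutated."""
    updated = dict(env)
    updated.update(ROLLBACK_OVERRIDES)
    return updated

def rollback_diff(env):
    """List the variables that would change as (name, old, new) tuples."""
    return [
        (name, env.get(name), new)
        for name, new in ROLLBACK_OVERRIDES.items()
        if env.get(name) != new
    ]
```

Recording the output of `rollback_diff` in the incident log makes it easy to restore the pre-rollback values once normal routing resumes.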
### 1.6 Webhook cutoff and manual reconciliation
- Keep auto-accepting historical webhook traffic through cutover.
- Set explicit cutoff only if new provider writes must stop:
1. pause new pay-in intents on new provider,
2. continue `handleProviderWebhook` in `read_only` mode,
3. run manual reconciliation for open windows,
4. re-enable normal routing after the backlog clears or an explicit incident decision is made.
## 2. Scenario runbooks
Each incident scenario uses the same structure:
`owner`, `trigger`, `detection`, `immediate`, `recovery`, `post`.
### 2.1 Failed webhook
- **Owner:** BL + DL
- **Trigger:** `webhook_failures` alert > warning threshold (see [[Webhook Security Spec]]).
- **Detection:** spike in 5xx/400 from callback route, DLQ growth.
- **Immediate:** isolate by provider; keep `PAYMENT_PROVIDER_MODE=read_only` for provider if mismatch continues.
- **Recovery:** re-process DLQ entries after fix; verify idempotent behavior; compare reconciliation counters.
- **Post:** update webhook retry tuning and security thresholds.
### 2.2 Duplicate payment
- **Owner:** BL
- **Trigger:** duplicate webhook deliveries or idempotent suppression above normal baseline.
- **Detection:** `duplicate` counter above expected and no payment state change.
- **Immediate:** confirm normalized status transition map; suppress with `deliveryId`/`idempotency` checks.
- **Recovery:** rerun reconciliation for affected IDs and clear manual holds if ledger still consistent.
- **Post:** validate provider event parsing and storage hashing.
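The `deliveryId` suppression mentioned under **Immediate** can be illustrated with a small in-memory log. This is a sketch under assumptions: `DeliveryLog` and `handle_delivery` are hypothetical names, and a production version would rely on a persistent store with a unique constraint on the provider and delivery ID pair rather than a Python set.

```python
class DeliveryLog:
    """In-memory stand-in for a persistent store with a unique key constraint."""

    def __init__(self):
        self._seen = set()
        self.duplicates = 0  # feeds the duplicate counter used for detection

    def record(self, provider, delivery_id):
        """Return True for a first delivery and False for a repeat."""
        key = (provider, delivery_id)
        if key in self._seen:
            self.duplicates += 1
            return False
        self._seen.add(key)
        return True

def handle_delivery(log, provider, delivery_id, apply_event):
    """Apply the event once per (provider, delivery_id); repeats are suppressed."""
    if not log.record(provider, delivery_id):
        return "duplicate"
    apply_event()
    return "applied"
```

Keying on the provider as well as the delivery ID avoids false duplicates when two providers happen to issue the same identifier.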
### 2.3 Missing payment
- **Owner:** BL + Operations
- **Trigger:** customer support report or reconciliation finding.
- **Detection:** provider reference exists, no canonical payment row or no final ledger balance.
- **Immediate:** freeze payouts/release for request ID; run `searchProviderPayments`.
- **Recovery:** create missing read-model entry or corrective ledger adjustment after evidence capture.
- **Post:** add a regression test case for the same provider/providerReference combination.
### 2.4 Stuck release
- **Owner:** BL + Ops
- **Trigger:** a release stays pending for more than 30m without reaching `released`.
- **Detection:** payout task with no terminal transition for `release`/`refund`.
- **Immediate:** confirm chain confirmation and provider payout status.
- **Recovery:** if confirmed on-chain but not reflected, run manual reconciliation + status repair; else reissue payout instruction.
- **Post:** update alerting and retry windows.
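The detection rule and the recovery branch for a stuck release can be sketched as follows. The task dictionary shape, the set of terminal statuses, and the action labels are assumptions made for illustration; only the 30-minute limit and the `release`/`refund` kinds come from the runbook.

```python
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(minutes=30)
PAYOUT_KINDS = {"release", "refund"}
TERMINAL_STATUSES = {"completed", "failed"}  # assumed terminal states

def find_stuck_payouts(tasks, now):
    """Return IDs of release/refund tasks with no terminal status after 30 minutes."""
    return [
        t["id"]
        for t in tasks
        if t["kind"] in PAYOUT_KINDS
        and t["status"] not in TERMINAL_STATUSES
        and now - t["started_at"] > STUCK_AFTER
    ]

def recovery_action(confirmed_on_chain):
    """Map the on-chain check to the recovery step named in the runbook."""
    return "reconcile_and_repair_status" if confirmed_on_chain else "reissue_payout"
```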
### 2.5 Disputed release attempt
- **Owner:** Security + BL + Admin
- **Trigger:** release/recover API invoked while dispute hold is active.
- **Detection:** `disputeId` active + attempted release/refund action.
- **Immediate:** block action, notify admin team and dispute owner.
- **Recovery:** clear dispute hold only by approved dispute resolution path.
- **Post:** enforce ownership check in orchestration and admin UI guardrail.
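The block-on-active-dispute rule can be expressed as a guard that runs before any release or refund. This is a hypothetical sketch: the payment is modelled as a dictionary carrying `disputeId`, and the set of active dispute IDs is assumed to come from the dispute service.

```python
class DisputeHoldError(Exception):
    """Raised when a funds movement is attempted under an active dispute hold."""

GUARDED_ACTIONS = {"release", "refund"}

def guard_funds_action(payment, action, active_dispute_ids):
    """Raise if the action would move funds while the payment's dispute is active."""
    dispute_id = payment.get("disputeId")
    if (
        action in GUARDED_ACTIONS
        and dispute_id is not None
        and dispute_id in active_dispute_ids
    ):
        raise DisputeHoldError(f"{action} blocked: dispute {dispute_id} is active")
    return True
```

Raising an exception rather than returning `False` makes it harder for a caller to ignore the hold by forgetting to check the result.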
### 2.6 Compromised admin
- **Owner:** Security Lead
- **Trigger:** unexpected high-value payouts or admin credentials alert.
- **Detection:** step-up auth failures, role anomalies, IP anomalies.
- **Immediate:** suspend admin account, rotate sessions, set `PAYMENT_PROVIDER_MODE=read_only`.
- **Recovery:** review all recent ledger entries and release/refund actions; reverse via admin correction only after legal/ownership approval.
- **Post:** enforce 2-person approval for high-risk payout thresholds and step-up audit check.
### 2.7 Leaked API key / webhook secret
- **Owner:** Security + DevOps
- **Trigger:** secret scanner, log leak, or suspected rotation request.
- **Detection:** unauthorized provider 401s, unknown callbacks, or external exposure confirmation.
- **Immediate:** rotate affected key and disable old key immediately.
- **Recovery:** replay missed webhooks in dry-run mode after trust restored; compare reconciliation totals.
- **Post:** document incident and rotate staging/CI secret references.
### 2.8 Provider outage
- **Owner:** DevOps + BL
- **Trigger:** repeated provider 5xx/timeouts.
- **Detection:** error budget alert and queue growth.
- **Immediate:** set `PAYMENT_ENABLED_PROVIDERS=shkeeper` or set `PAYMENT_PROVIDER_MODE=read_only`; notify users of degraded mode.
- **Recovery:** drain queue and replay DLQ when provider recovers.
- **Post:** update provider outage SLA runbook and status page text.
### 2.9 Chain/RPC outage
- **Owner:** BL + DevOps
- **Trigger:** chain verification failures and elevated wallet lookup latency.
- **Detection:** verification timeout/error and `chain_verification_stale` trend.
- **Immediate:** pause non-critical chain-dependent actions; preserve pending proofs.
- **Recovery:** rerun verification jobs against fallback RPC endpoint.
- **Post:** update RPC provider fallback policy and retry tuning.
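Rerunning verification against a fallback endpoint can be sketched as an ordered retry over a list of endpoints. The function name and the `verify` callable are hypothetical; the sketch assumes that an unreachable endpoint surfaces as `TimeoutError` or `ConnectionError`, which depends on the RPC client actually in use.

```python
def verify_with_fallback(endpoints, verify):
    """Try each RPC endpoint in order and return (endpoint, result) for the first success.

    Only transport-level failures trigger a fallback; any other exception
    (for example a proof that is genuinely invalid) propagates unchanged.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, verify(endpoint)
        except (TimeoutError, ConnectionError) as exc:
            errors.append((endpoint, repr(exc)))
    raise RuntimeError(f"all RPC endpoints failed: {errors}")
```

Returning the endpoint alongside the result lets the verification job record which provider produced the proof, which helps when tuning the fallback policy afterwards.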
### 2.10 Suspicious payment proof
- **Owner:** Security + BL
- **Trigger:** tx hash proof with mismatched `to`, amount, or network.
- **Detection:** verify script rejects or manual proof review.
- **Immediate:** mark payment disputed, require operator review.
- **Recovery:** keep payment in `pending`, notify buyer and seller, trigger manual support workflow.
- **Post:** add automated proof assertions to webhook and verifier tests.
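The trigger condition (a proof whose `to`, amount, or network does not match) can be checked field by field. The `Transfer` type and `proof_mismatches` are hypothetical names; comparing addresses case-insensitively assumes hex-style addresses where case carries no meaning, which would need revisiting for chains with case-sensitive address formats.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class Transfer:
    to: str
    amount: Decimal  # Decimal avoids float rounding on token amounts
    network: str

def proof_mismatches(expected, observed):
    """Return the names of fields where the observed transfer differs from the expected one."""
    issues = []
    # Assumption: addresses are hex strings, so case differences are not meaningful.
    if expected.to.lower() != observed.to.lower():
        issues.append("to")
    if expected.amount != observed.amount:
        issues.append("amount")
    if expected.network != observed.network:
        issues.append("network")
    return issues
```

A non-empty result is what would route the payment to operator review; the list itself can be attached to the review ticket as evidence.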
### 2.11 npm/package compromise
- **Owner:** CTO + Security
- **Trigger:** new high/critical advisory tied to runtime packages.
- **Detection:** audit alerts or security team notification.
- **Immediate:** pause releases in affected components; freeze `watchtower` auto-update if needed.
- **Recovery:** patch the dependency and redeploy, then run smoke tests, including the transaction smoke path.
- **Post:** verify lockfile, provenance, and dependency policy compliance.
## 3. Incident owner roster
Production launch must replace these role owners with named responders in the
primary on-call schedule. Until then, escalation is by role:
- Payments: backend lead for payment/ledger services
- Security: security owner from [[Security Ownership and Launch Decision Criteria]]
- DevOps: deployment owner for production infrastructure
- Admin lead: operations lead for dispute and payout approval workflows
## 4. Post-incident mandatory actions
- incident log + root-cause and action items,
- metric threshold updates if needed,
- runbook improvements (this document),
- verification that [[Monitoring]] and [[Incident Response]] links are still accurate.
## Related
- [[Incident Response]]
- [[Monitoring]]
- [[Deployment]]
- [[Backup & Recovery]]
- [[Webhook Security Spec]]
- [[Payment Provider Adapter Spec]]
- [[Backend Core Stack Decision Record - 2026-05-24]]