--- title: Backend Funds Migration and Operational Runbooks tags: [operations, payments, migration, incidents] created: 2026-05-24 --- # Backend Funds Migration and Operational Runbooks These runbooks cover the selected backend/funds architecture defined in [[Backend Core Stack Decision Record - 2026-05-24]]. > [!note] Historical migration context > Sections that mention keeping SHKeeper active describe the migration period from the old payment rail to Request Network. Current new-payment operations should use [[Escrow Flow]], [[Request Network Integration Constraints]], and [[PRD - Decentralized Custody and Smart-Contract Escrow Roadmap]]. ## 1. Migration runbook (legacy + provider migration) ### 1.1 Preflight 1. Verify provider credentials and webhook secrets are present in production `.env`. 2. Confirm canonical state docs are published and approved: - [[Funds Ledger and Escrow State Machine Specification]] - [[Payment Provider Adapter Spec]] - [[Webhook Security Spec]] 3. Run validation report: - count of legacy `Payment` rows by provider, - duplicate provider keys, - webhook processing success/error by provider (24h), - unresolved `pending`/`processing` payments older than 30m. 4. Run a staging reconciliation simulation against a read-only copy. ### 1.2 Rollout sequence 1. Keep `PAYMENT_MODE=read_only` and keep SHKeeper intent path active. 2. Enable legacy+new provider dual routing via `PAYMENT_ENABLED_PROVIDERS`. 3. Activate adapter contract metrics (`payment-provider-events` and webhook counters) before any write routing. 4. Set `PAYMENT_PROVIDER_MODE=standard` for a 10–15% cohort. 5. Expand to full traffic only after two clean 24h windows. ### 1.3 Legacy read path and backfill - Keep SHKeeper legacy read path enabled for at least one billing cycle after migration starts. - Backfill task: - map legacy SHKeeper records into `FundsAccount` references, - write reconciliation notes into `metadata.providers`, - leave source payment rows immutable. - Do not delete legacy routing until backfill report is complete and reviewed. ### 1.4 Validation before enforcement Before switching default provider: - No critical webhook verification failures for 24h. - DLQ volume stable for 2h. - Provider status mismatch rate `< 0.5%` for 1000 samples. - Manual reconciliation sample (`>= 10` rows) completed by two operators. ### 1.5 Rollback criteria Rollback immediately if any of: - payout/hold/release path returns invariant violation, - unresolved webhook failure burst above threshold, - chain reconciliation mismatch above threshold from runbook table. Rollback command sequence: 1. `PAYMENT_ENABLED_PROVIDERS=shkeeper` 2. `PAYMENT_DEFAULT_PROVIDER=shkeeper` 3. `PAYMENT_PROVIDER_MODE=read_only` (pause new routing), 4. keep reconciliation active in dry-run mode, 5. post-incident note in this doc + [[Incident Response]]. ### 1.6 Webhook cutoff and manual reconciliation - Keep auto-accepting historical webhook traffic through cutover. - Set explicit cutoff only if new provider writes must stop: 1. pause new pay-in intents on new provider, 2. continue `handleProviderWebhook` in `read_only` mode, 3. run manual reconciliation for open windows, 4. re-enable normal routing after backlog clear or explicit incident decision. ## 2. Scenario runbooks Each incident scenario uses the same structure: `owner`, `trigger`, `detection`, `immediate`, `recovery`, `post`. ### 2.1 Failed webhook - **Owner:** BL + DL - **Trigger:** `webhook_failures` alert > warning threshold (see [[Webhook Security Spec]]). - **Detection:** spike in 5xx/400 from callback route, DLQ growth. - **Immediate:** isolate by provider; keep `PAYMENT_PROVIDER_MODE=read_only` for provider if mismatch continues. - **Recovery:** re-process DLQ entries after fix; verify idempotent behavior; compare reconciliation counters. - **Post:** update webhook retry tuning and security thresholds. ### 2.2 Duplicate payment - **Owner:** BL - **Trigger:** duplicate webhook deliveries or idempotent suppression above normal baseline. - **Detection:** `duplicate` counter above expected and no payment state change. - **Immediate:** confirm normalized status transition map; suppress with `deliveryId`/`idempotency` checks. - **Recovery:** rerun reconciliation for affected IDs and clear manual holds if ledger still consistent. - **Post:** validate provider event parsing and storage hashing. ### 2.3 Missing payment - **Owner:** BL + Operations - **Trigger:** customer support report or reconciliation finding. - **Detection:** provider reference exists, no canonical payment row or no final ledger balance. - **Immediate:** freeze payouts/release for request ID; run `searchProviderPayments`. - **Recovery:** create missing read-model entry or corrective ledger adjustment after evidence capture. - **Post:** add missing-test case for same provider/providerReference combination. ### 2.4 Stuck release - **Owner:** BL + Ops - **Trigger:** `released` pending >30m. - **Detection:** payout task with no terminal transition for `release`/`refund`. - **Immediate:** confirm chain confirmation and provider payout status. - **Recovery:** if confirmed on-chain but not reflected, run manual reconciliation + status repair; else reissue payout instruction. - **Post:** update alerting and retry windows. ### 2.5 Disputed release attempt - **Owner:** Security + BL + Admin - **Trigger:** release/recover API invoked while dispute hold is active. - **Detection:** `disputeId` active + attempted release/refund action. - **Immediate:** block action, notify admin team and dispute owner. - **Recovery:** clear dispute hold only by approved dispute resolution path. - **Post:** enforce ownership check in orchestration and admin UI guardrail. ### 2.6 Compromised admin - **Owner:** Security Lead - **Trigger:** unexpected high-value payouts or admin credentials alert. - **Detection:** step-up auth failures, role anomalies, IP anomalies. - **Immediate:** suspend admin account, rotate sessions, set `PAYMENT_PROVIDER_MODE=read_only`. - **Recovery:** review all recent ledger entries and release/refund actions; reverse via admin correction only after legal/ownership approval. - **Post:** enforce 2-person approval for high-risk payout thresholds and step-up audit check. ### 2.7 Leaked API key / webhook secret - **Owner:** Security + DevOps - **Trigger:** secret scanner, log leak, or suspected rotation request. - **Detection:** unauthorized provider 401s, unknown callbacks, or external exposure confirmation. - **Immediate:** rotate affected key and disable old key immediately. - **Recovery:** replay missed webhooks in dry-run mode after trust restored; compare reconciliation totals. - **Post:** document incident and rotate staging/CI secret references. ### 2.8 Provider outage - **Owner:** DevOps + BL - **Trigger:** repeated provider 5xx/timeouts. - **Detection:** error budget alert and queue growth. - **Immediate:** switch `PAYMENT_ENABLED_PROVIDERS=shkeeper` or `read_only`, notify users of degraded mode. - **Recovery:** drain queue and replay DLQ when provider recovers. - **Post:** update provider outage SLA runbook and status page text. ### 2.9 Chain/RPC outage - **Owner:** BL + DevOps - **Trigger:** chain verification failures and elevated wallet lookup latency. - **Detection:** verification timeout/error and `chain_verification_stale` trend. - **Immediate:** pause non-critical chain-dependent actions; preserve pending proofs. - **Recovery:** rerun verification jobs against fallback RPC endpoint. - **Post:** update RPC provider fallback policy and retry tuning. ### 2.10 Suspicious payment proof - **Owner:** Security + BL - **Trigger:** tx hash proof with mismatched `to`, amount, or network. - **Detection:** verify script rejects or manual proof review. - **Immediate:** mark payment disputed, require operator review. - **Recovery:** keep payment in `pending`, notify buyer and seller, trigger manual support workflow. - **Post:** add automated proof assertions to webhook and verifier tests. ### 2.11 npm/package compromise - **Owner:** CTO + Security - **Trigger:** new high/critical advisory tied to runtime packages. - **Detection:** audit alerts or security team notification. - **Immediate:** pause releases in affected components; freeze `watchtower` auto-update if needed. - **Recovery:** patch dependency and redeploy, then run smoke tests and transaction smoke path. - **Post:** verify lockfile, provenance, and dependency policy compliance. ## 3. Incident owner roster Production launch must replace these role owners with named responders in the primary on-call schedule. Until then, escalation is by role: - Payments: backend lead for payment/ledger services - Security: security owner from [[Security Ownership and Launch Decision Criteria]] - DevOps: deployment owner for production infrastructure - Admin lead: operations lead for dispute and payout approval workflows ## 4. Post-incident mandatory actions - incident log + root-cause and action items, - metric threshold updates if needed, - runbook improvements (this document), - verification that [[Monitoring]] and [[Incident Response]] links are still accurate. ## Related - [[Incident Response]] - [[Monitoring]] - [[Deployment]] - [[Backup & Recovery]] - [[Webhook Security Spec]] - [[Payment Provider Adapter Spec]] - [[Backend Core Stack Decision Record - 2026-05-24]]