---
title: Backend Funds Migration and Operational Runbooks
tags: [operations, payments, migration, incidents]
created: 2026-05-24
---

# Backend Funds Migration and Operational Runbooks

These runbooks cover the selected backend/funds architecture defined in
[[Backend Core Stack Decision Record - 2026-05-24]].

> [!note] Historical migration context
> Sections that mention keeping SHKeeper active describe the migration period from the old payment rail to Request Network. Current new-payment operations should use [[Escrow Flow]], [[Request Network Integration Constraints]], and [[PRD - Decentralized Custody and Smart-Contract Escrow Roadmap]].

## 1. Migration runbook (legacy + provider migration)

### 1.1 Preflight

1. Verify provider credentials and webhook secrets are present in production `.env`.
2. Confirm canonical state docs are published and approved:
   - [[Funds Ledger and Escrow State Machine Specification]]
   - [[Payment Provider Adapter Spec]]
   - [[Webhook Security Spec]]
3. Run validation report:
   - count of legacy `Payment` rows by provider,
   - duplicate provider keys,
   - webhook processing success/error by provider (24h),
   - unresolved `pending`/`processing` payments older than 30m.
4. Run a staging reconciliation simulation against a read-only copy.

### 1.2 Rollout sequence

1. Keep `PAYMENT_MODE=read_only` and keep SHKeeper intent path active.
2. Enable legacy+new provider dual routing via `PAYMENT_ENABLED_PROVIDERS`.
3. Activate adapter contract metrics (`payment-provider-events` and webhook counters) before any write routing.
4. Set `PAYMENT_PROVIDER_MODE=standard` for a 10–15% cohort.
5. Expand to full traffic only after two clean 24h windows.

### 1.3 Legacy read path and backfill

- Keep SHKeeper legacy read path enabled for at least one billing cycle after migration starts.
- Backfill task:
  - map legacy SHKeeper records into `FundsAccount` references,
  - write reconciliation notes into `metadata.providers`,
  - leave source payment rows immutable.
- Do not delete legacy routing until backfill report is complete and reviewed.

### 1.4 Validation before enforcement

Before switching default provider:

- No critical webhook verification failures for 24h.
- DLQ volume stable for 2h.
- Provider status mismatch rate `< 0.5%` for 1000 samples.
- Manual reconciliation sample (`>= 10` rows) completed by two operators.

### 1.5 Rollback criteria

Rollback immediately if any of:

- payout/hold/release path returns invariant violation,
- unresolved webhook failure burst above threshold,
- chain reconciliation mismatch above threshold from runbook table.

Rollback command sequence:

1. `PAYMENT_ENABLED_PROVIDERS=shkeeper`
2. `PAYMENT_DEFAULT_PROVIDER=shkeeper`
3. `PAYMENT_PROVIDER_MODE=read_only` (pause new routing),
4. keep reconciliation active in dry-run mode,
5. post-incident note in this doc + [[Incident Response]].

### 1.6 Webhook cutoff and manual reconciliation

- Keep auto-accepting historical webhook traffic through cutover.
- Set explicit cutoff only if new provider writes must stop:
  1. pause new pay-in intents on new provider,
  2. continue `handleProviderWebhook` in `read_only` mode,
  3. run manual reconciliation for open windows,
  4. re-enable normal routing after backlog clear or explicit incident decision.

## 2. Scenario runbooks

Each incident scenario uses the same structure:
`owner`, `trigger`, `detection`, `immediate`, `recovery`, `post`.

### 2.1 Failed webhook
- **Owner:** BL + DL
- **Trigger:** `webhook_failures` alert > warning threshold (see [[Webhook Security Spec]]).
- **Detection:** spike in 5xx/400 from callback route, DLQ growth.
- **Immediate:** isolate by provider; keep `PAYMENT_PROVIDER_MODE=read_only` for provider if mismatch continues.
- **Recovery:** re-process DLQ entries after fix; verify idempotent behavior; compare reconciliation counters.
- **Post:** update webhook retry tuning and security thresholds.

### 2.2 Duplicate payment
- **Owner:** BL
- **Trigger:** duplicate webhook deliveries or idempotent suppression above normal baseline.
- **Detection:** `duplicate` counter above expected and no payment state change.
- **Immediate:** confirm normalized status transition map; suppress with `deliveryId`/`idempotency` checks.
- **Recovery:** rerun reconciliation for affected IDs and clear manual holds if ledger still consistent.
- **Post:** validate provider event parsing and storage hashing.

### 2.3 Missing payment
- **Owner:** BL + Operations
- **Trigger:** customer support report or reconciliation finding.
- **Detection:** provider reference exists, no canonical payment row or no final ledger balance.
- **Immediate:** freeze payouts/release for request ID; run `searchProviderPayments`.
- **Recovery:** create missing read-model entry or corrective ledger adjustment after evidence capture.
- **Post:** add missing-test case for same provider/providerReference combination.

### 2.4 Stuck release
- **Owner:** BL + Ops
- **Trigger:** `released` pending >30m.
- **Detection:** payout task with no terminal transition for `release`/`refund`.
- **Immediate:** confirm chain confirmation and provider payout status.
- **Recovery:** if confirmed on-chain but not reflected, run manual reconciliation + status repair; else reissue payout instruction.
- **Post:** update alerting and retry windows.

### 2.5 Disputed release attempt
- **Owner:** Security + BL + Admin
- **Trigger:** release/recover API invoked while dispute hold is active.
- **Detection:** `disputeId` active + attempted release/refund action.
- **Immediate:** block action, notify admin team and dispute owner.
- **Recovery:** clear dispute hold only by approved dispute resolution path.
- **Post:** enforce ownership check in orchestration and admin UI guardrail.

### 2.6 Compromised admin
- **Owner:** Security Lead
- **Trigger:** unexpected high-value payouts or admin credentials alert.
- **Detection:** step-up auth failures, role anomalies, IP anomalies.
- **Immediate:** suspend admin account, rotate sessions, set `PAYMENT_PROVIDER_MODE=read_only`.
- **Recovery:** review all recent ledger entries and release/refund actions; reverse via admin correction only after legal/ownership approval.
- **Post:** enforce 2-person approval for high-risk payout thresholds and step-up audit check.

### 2.7 Leaked API key / webhook secret
- **Owner:** Security + DevOps
- **Trigger:** secret scanner, log leak, or suspected rotation request.
- **Detection:** unauthorized provider 401s, unknown callbacks, or external exposure confirmation.
- **Immediate:** rotate affected key and disable old key immediately.
- **Recovery:** replay missed webhooks in dry-run mode after trust restored; compare reconciliation totals.
- **Post:** document incident and rotate staging/CI secret references.

### 2.8 Provider outage
- **Owner:** DevOps + BL
- **Trigger:** repeated provider 5xx/timeouts.
- **Detection:** error budget alert and queue growth.
- **Immediate:** switch `PAYMENT_ENABLED_PROVIDERS=shkeeper` or `read_only`, notify users of degraded mode.
- **Recovery:** drain queue and replay DLQ when provider recovers.
- **Post:** update provider outage SLA runbook and status page text.

### 2.9 Chain/RPC outage
- **Owner:** BL + DevOps
- **Trigger:** chain verification failures and elevated wallet lookup latency.
- **Detection:** verification timeout/error and `chain_verification_stale` trend.
- **Immediate:** pause non-critical chain-dependent actions; preserve pending proofs.
- **Recovery:** rerun verification jobs against fallback RPC endpoint.
- **Post:** update RPC provider fallback policy and retry tuning.

### 2.10 Suspicious payment proof
- **Owner:** Security + BL
- **Trigger:** tx hash proof with mismatched `to`, amount, or network.
- **Detection:** verify script rejects or manual proof review.
- **Immediate:** mark payment disputed, require operator review.
- **Recovery:** keep payment in `pending`, notify buyer and seller, trigger manual support workflow.
- **Post:** add automated proof assertions to webhook and verifier tests.

### 2.11 npm/package compromise
- **Owner:** CTO + Security
- **Trigger:** new high/critical advisory tied to runtime packages.
- **Detection:** audit alerts or security team notification.
- **Immediate:** pause releases in affected components; freeze `watchtower` auto-update if needed.
- **Recovery:** patch dependency and redeploy, then run smoke tests and transaction smoke path.
- **Post:** verify lockfile, provenance, and dependency policy compliance.

## 3. Incident owner roster

Production launch must replace these role owners with named responders in the
primary on-call schedule. Until then, escalation is by role:

- Payments: backend lead for payment/ledger services
- Security: security owner from [[Security Ownership and Launch Decision Criteria]]
- DevOps: deployment owner for production infrastructure
- Admin lead: operations lead for dispute and payout approval workflows

## 4. Post-incident mandatory actions

- incident log + root-cause and action items,
- metric threshold updates if needed,
- runbook improvements (this document),
- verification that [[Monitoring]] and [[Incident Response]] links are still accurate.

## Related

- [[Incident Response]]
- [[Monitoring]]
- [[Deployment]]
- [[Backup & Recovery]]
- [[Webhook Security Spec]]
- [[Payment Provider Adapter Spec]]
- [[Backend Core Stack Decision Record - 2026-05-24]]