Files
nick-doc/08 - Operations/Backend Funds Migration and Operational Runbooks.md
Siavash Sameni 81625d35d2 docs: AML scope note, human-blocked items, Task #11 pre-flight inventory
- Add AML scope note to Handoff - RN Multichain Probe (sanctions-only vs full KYT)
- Add human-blocked section with 3 precise next steps for owner
- Create Task 11 Pre-flight Inventory: library choice, dev/prod flow, admin UI gaps, backend gaps, risks, acceptance criteria
2026-05-28 20:42:42 +04:00

9.2 KiB
Raw Permalink Blame History

title, tags, created
title tags created
Backend Funds Migration and Operational Runbooks
operations
payments
migration
incidents
2026-05-24

Backend Funds Migration and Operational Runbooks

These runbooks cover the selected backend/funds architecture defined in Backend Core Stack Decision Record - 2026-05-24.

[!note] Historical migration context Sections that mention keeping SHKeeper active describe the migration period from the old payment rail to Request Network. Current new-payment operations should use Escrow Flow, Request Network Integration Constraints, and PRD - Decentralized Custody and Smart-Contract Escrow Roadmap.

1. Migration runbook (legacy + provider migration)

1.1 Preflight

  1. Verify provider credentials and webhook secrets are present in production .env.
  2. Confirm canonical state docs are published and approved:
  3. Run validation report:
    • count of legacy Payment rows by provider,
    • duplicate provider keys,
    • webhook processing success/error by provider (24h),
    • unresolved pending/processing payments older than 30m.
  4. Run a staging reconciliation simulation against a read-only copy.

1.2 Rollout sequence

  1. Keep PAYMENT_MODE=read_only and keep SHKeeper intent path active.
  2. Enable legacy+new provider dual routing via PAYMENT_ENABLED_PROVIDERS.
  3. Activate adapter contract metrics (payment-provider-events and webhook counters) before any write routing.
  4. Set PAYMENT_PROVIDER_MODE=standard for a 1015% cohort.
  5. Expand to full traffic only after two clean 24h windows.

1.3 Legacy read path and backfill

  • Keep SHKeeper legacy read path enabled for at least one billing cycle after migration starts.
  • Backfill task:
    • map legacy SHKeeper records into FundsAccount references,
    • write reconciliation notes into metadata.providers,
    • leave source payment rows immutable.
  • Do not delete legacy routing until backfill report is complete and reviewed.

1.4 Validation before enforcement

Before switching default provider:

  • No critical webhook verification failures for 24h.
  • DLQ volume stable for 2h.
  • Provider status mismatch rate < 0.5% for 1000 samples.
  • Manual reconciliation sample (>= 10 rows) completed by two operators.

1.5 Rollback criteria

Rollback immediately if any of:

  • payout/hold/release path returns invariant violation,
  • unresolved webhook failure burst above threshold,
  • chain reconciliation mismatch above threshold from runbook table.

Rollback command sequence:

  1. PAYMENT_ENABLED_PROVIDERS=shkeeper
  2. PAYMENT_DEFAULT_PROVIDER=shkeeper
  3. PAYMENT_PROVIDER_MODE=read_only (pause new routing),
  4. keep reconciliation active in dry-run mode,
  5. post-incident note in this doc + Incident Response.

1.6 Webhook cutoff and manual reconciliation

  • Keep auto-accepting historical webhook traffic through cutover.
  • Set explicit cutoff only if new provider writes must stop:
    1. pause new pay-in intents on new provider,
    2. continue handleProviderWebhook in read_only mode,
    3. run manual reconciliation for open windows,
    4. re-enable normal routing after backlog clear or explicit incident decision.

2. Scenario runbooks

Each incident scenario uses the same structure: owner, trigger, detection, immediate, recovery, post.

2.1 Failed webhook

  • Owner: BL + DL
  • Trigger: webhook_failures alert > warning threshold (see Webhook Security Spec).
  • Detection: spike in 5xx/400 from callback route, DLQ growth.
  • Immediate: isolate by provider; keep PAYMENT_PROVIDER_MODE=read_only for provider if mismatch continues.
  • Recovery: re-process DLQ entries after fix; verify idempotent behavior; compare reconciliation counters.
  • Post: update webhook retry tuning and security thresholds.

2.2 Duplicate payment

  • Owner: BL
  • Trigger: duplicate webhook deliveries or idempotent suppression above normal baseline.
  • Detection: duplicate counter above expected and no payment state change.
  • Immediate: confirm normalized status transition map; suppress with deliveryId/idempotency checks.
  • Recovery: rerun reconciliation for affected IDs and clear manual holds if ledger still consistent.
  • Post: validate provider event parsing and storage hashing.

2.3 Missing payment

  • Owner: BL + Operations
  • Trigger: customer support report or reconciliation finding.
  • Detection: provider reference exists, no canonical payment row or no final ledger balance.
  • Immediate: freeze payouts/release for request ID; run searchProviderPayments.
  • Recovery: create missing read-model entry or corrective ledger adjustment after evidence capture.
  • Post: add missing-test case for same provider/providerReference combination.

2.4 Stuck release

  • Owner: BL + Ops
  • Trigger: released pending >30m.
  • Detection: payout task with no terminal transition for release/refund.
  • Immediate: confirm chain confirmation and provider payout status.
  • Recovery: if confirmed on-chain but not reflected, run manual reconciliation + status repair; else reissue payout instruction.
  • Post: update alerting and retry windows.

2.5 Disputed release attempt

  • Owner: Security + BL + Admin
  • Trigger: release/recover API invoked while dispute hold is active.
  • Detection: disputeId active + attempted release/refund action.
  • Immediate: block action, notify admin team and dispute owner.
  • Recovery: clear dispute hold only by approved dispute resolution path.
  • Post: enforce ownership check in orchestration and admin UI guardrail.

2.6 Compromised admin

  • Owner: Security Lead
  • Trigger: unexpected high-value payouts or admin credentials alert.
  • Detection: step-up auth failures, role anomalies, IP anomalies.
  • Immediate: suspend admin account, rotate sessions, set PAYMENT_PROVIDER_MODE=read_only.
  • Recovery: review all recent ledger entries and release/refund actions; reverse via admin correction only after legal/ownership approval.
  • Post: enforce 2-person approval for high-risk payout thresholds and step-up audit check.

2.7 Leaked API key / webhook secret

  • Owner: Security + DevOps
  • Trigger: secret scanner, log leak, or suspected rotation request.
  • Detection: unauthorized provider 401s, unknown callbacks, or external exposure confirmation.
  • Immediate: rotate affected key and disable old key immediately.
  • Recovery: replay missed webhooks in dry-run mode after trust restored; compare reconciliation totals.
  • Post: document incident and rotate staging/CI secret references.

2.8 Provider outage

  • Owner: DevOps + BL
  • Trigger: repeated provider 5xx/timeouts.
  • Detection: error budget alert and queue growth.
  • Immediate: switch PAYMENT_ENABLED_PROVIDERS=shkeeper or read_only, notify users of degraded mode.
  • Recovery: drain queue and replay DLQ when provider recovers.
  • Post: update provider outage SLA runbook and status page text.

2.9 Chain/RPC outage

  • Owner: BL + DevOps
  • Trigger: chain verification failures and elevated wallet lookup latency.
  • Detection: verification timeout/error and chain_verification_stale trend.
  • Immediate: pause non-critical chain-dependent actions; preserve pending proofs.
  • Recovery: rerun verification jobs against fallback RPC endpoint.
  • Post: update RPC provider fallback policy and retry tuning.

2.10 Suspicious payment proof

  • Owner: Security + BL
  • Trigger: tx hash proof with mismatched to, amount, or network.
  • Detection: verify script rejects or manual proof review.
  • Immediate: mark payment disputed, require operator review.
  • Recovery: keep payment in pending, notify buyer and seller, trigger manual support workflow.
  • Post: add automated proof assertions to webhook and verifier tests.

2.11 npm/package compromise

  • Owner: CTO + Security
  • Trigger: new high/critical advisory tied to runtime packages.
  • Detection: audit alerts or security team notification.
  • Immediate: pause releases in affected components; freeze watchtower auto-update if needed.
  • Recovery: patch dependency and redeploy, then run smoke tests and transaction smoke path.
  • Post: verify lockfile, provenance, and dependency policy compliance.

3. Incident owner roster

Production launch must replace these role owners with named responders in the primary on-call schedule. Until then, escalation is by role:

  • Payments: backend lead for payment/ledger services
  • Security: security owner from Security Ownership and Launch Decision Criteria
  • DevOps: deployment owner for production infrastructure
  • Admin lead: operations lead for dispute and payout approval workflows

4. Post-incident mandatory actions

  • incident log + root-cause and action items,
  • metric threshold updates if needed,
  • runbook improvements (this document),
  • verification that Monitoring and Incident Response links are still accurate.