- Add AML scope note to Handoff - RN Multichain Probe (sanctions-only vs full KYT) - Add human-blocked section with 3 precise next steps for owner - Create Task 11 Pre-flight Inventory: library choice, dev/prod flow, admin UI gaps, backend gaps, risks, acceptance criteria
9.2 KiB
title, tags, created
| title | tags | created | ||||
|---|---|---|---|---|---|---|
| Backend Funds Migration and Operational Runbooks |
|
2026-05-24 |
Backend Funds Migration and Operational Runbooks
These runbooks cover the selected backend/funds architecture defined in Backend Core Stack Decision Record - 2026-05-24.
[!note] Historical migration context Sections that mention keeping SHKeeper active describe the migration period from the old payment rail to Request Network. Current new-payment operations should use Escrow Flow, Request Network Integration Constraints, and PRD - Decentralized Custody and Smart-Contract Escrow Roadmap.
1. Migration runbook (legacy + provider migration)
1.1 Preflight
- Verify provider credentials and webhook secrets are present in production
.env. - Confirm canonical state docs are published and approved:
- Run validation report:
- count of legacy
Paymentrows by provider, - duplicate provider keys,
- webhook processing success/error by provider (24h),
- unresolved
pending/processingpayments older than 30m.
- count of legacy
- Run a staging reconciliation simulation against a read-only copy.
1.2 Rollout sequence
- Keep
PAYMENT_MODE=read_onlyand keep SHKeeper intent path active. - Enable legacy+new provider dual routing via
PAYMENT_ENABLED_PROVIDERS. - Activate adapter contract metrics (
payment-provider-eventsand webhook counters) before any write routing. - Set
PAYMENT_PROVIDER_MODE=standardfor a 10–15% cohort. - Expand to full traffic only after two clean 24h windows.
1.3 Legacy read path and backfill
- Keep SHKeeper legacy read path enabled for at least one billing cycle after migration starts.
- Backfill task:
- map legacy SHKeeper records into
FundsAccountreferences, - write reconciliation notes into
metadata.providers, - leave source payment rows immutable.
- map legacy SHKeeper records into
- Do not delete legacy routing until backfill report is complete and reviewed.
1.4 Validation before enforcement
Before switching default provider:
- No critical webhook verification failures for 24h.
- DLQ volume stable for 2h.
- Provider status mismatch rate
< 0.5%for 1000 samples. - Manual reconciliation sample (
>= 10rows) completed by two operators.
1.5 Rollback criteria
Rollback immediately if any of:
- payout/hold/release path returns invariant violation,
- unresolved webhook failure burst above threshold,
- chain reconciliation mismatch above threshold from runbook table.
Rollback command sequence:
PAYMENT_ENABLED_PROVIDERS=shkeeperPAYMENT_DEFAULT_PROVIDER=shkeeperPAYMENT_PROVIDER_MODE=read_only(pause new routing),- keep reconciliation active in dry-run mode,
- post-incident note in this doc + Incident Response.
1.6 Webhook cutoff and manual reconciliation
- Keep auto-accepting historical webhook traffic through cutover.
- Set explicit cutoff only if new provider writes must stop:
- pause new pay-in intents on new provider,
- continue
handleProviderWebhookinread_onlymode, - run manual reconciliation for open windows,
- re-enable normal routing after backlog clear or explicit incident decision.
2. Scenario runbooks
Each incident scenario uses the same structure:
owner, trigger, detection, immediate, recovery, post.
2.1 Failed webhook
- Owner: BL + DL
- Trigger:
webhook_failuresalert > warning threshold (see Webhook Security Spec). - Detection: spike in 5xx/400 from callback route, DLQ growth.
- Immediate: isolate by provider; keep
PAYMENT_PROVIDER_MODE=read_onlyfor provider if mismatch continues. - Recovery: re-process DLQ entries after fix; verify idempotent behavior; compare reconciliation counters.
- Post: update webhook retry tuning and security thresholds.
2.2 Duplicate payment
- Owner: BL
- Trigger: duplicate webhook deliveries or idempotent suppression above normal baseline.
- Detection:
duplicatecounter above expected and no payment state change. - Immediate: confirm normalized status transition map; suppress with
deliveryId/idempotencychecks. - Recovery: rerun reconciliation for affected IDs and clear manual holds if ledger still consistent.
- Post: validate provider event parsing and storage hashing.
2.3 Missing payment
- Owner: BL + Operations
- Trigger: customer support report or reconciliation finding.
- Detection: provider reference exists, no canonical payment row or no final ledger balance.
- Immediate: freeze payouts/release for request ID; run
searchProviderPayments. - Recovery: create missing read-model entry or corrective ledger adjustment after evidence capture.
- Post: add missing-test case for same provider/providerReference combination.
2.4 Stuck release
- Owner: BL + Ops
- Trigger:
releasedpending >30m. - Detection: payout task with no terminal transition for
release/refund. - Immediate: confirm chain confirmation and provider payout status.
- Recovery: if confirmed on-chain but not reflected, run manual reconciliation + status repair; else reissue payout instruction.
- Post: update alerting and retry windows.
2.5 Disputed release attempt
- Owner: Security + BL + Admin
- Trigger: release/recover API invoked while dispute hold is active.
- Detection:
disputeIdactive + attempted release/refund action. - Immediate: block action, notify admin team and dispute owner.
- Recovery: clear dispute hold only by approved dispute resolution path.
- Post: enforce ownership check in orchestration and admin UI guardrail.
2.6 Compromised admin
- Owner: Security Lead
- Trigger: unexpected high-value payouts or admin credentials alert.
- Detection: step-up auth failures, role anomalies, IP anomalies.
- Immediate: suspend admin account, rotate sessions, set
PAYMENT_PROVIDER_MODE=read_only. - Recovery: review all recent ledger entries and release/refund actions; reverse via admin correction only after legal/ownership approval.
- Post: enforce 2-person approval for high-risk payout thresholds and step-up audit check.
2.7 Leaked API key / webhook secret
- Owner: Security + DevOps
- Trigger: secret scanner, log leak, or suspected rotation request.
- Detection: unauthorized provider 401s, unknown callbacks, or external exposure confirmation.
- Immediate: rotate affected key and disable old key immediately.
- Recovery: replay missed webhooks in dry-run mode after trust restored; compare reconciliation totals.
- Post: document incident and rotate staging/CI secret references.
2.8 Provider outage
- Owner: DevOps + BL
- Trigger: repeated provider 5xx/timeouts.
- Detection: error budget alert and queue growth.
- Immediate: switch
PAYMENT_ENABLED_PROVIDERS=shkeeperorread_only, notify users of degraded mode. - Recovery: drain queue and replay DLQ when provider recovers.
- Post: update provider outage SLA runbook and status page text.
2.9 Chain/RPC outage
- Owner: BL + DevOps
- Trigger: chain verification failures and elevated wallet lookup latency.
- Detection: verification timeout/error and
chain_verification_staletrend. - Immediate: pause non-critical chain-dependent actions; preserve pending proofs.
- Recovery: rerun verification jobs against fallback RPC endpoint.
- Post: update RPC provider fallback policy and retry tuning.
2.10 Suspicious payment proof
- Owner: Security + BL
- Trigger: tx hash proof with mismatched
to, amount, or network. - Detection: verify script rejects or manual proof review.
- Immediate: mark payment disputed, require operator review.
- Recovery: keep payment in
pending, notify buyer and seller, trigger manual support workflow. - Post: add automated proof assertions to webhook and verifier tests.
2.11 npm/package compromise
- Owner: CTO + Security
- Trigger: new high/critical advisory tied to runtime packages.
- Detection: audit alerts or security team notification.
- Immediate: pause releases in affected components; freeze
watchtowerauto-update if needed. - Recovery: patch dependency and redeploy, then run smoke tests and transaction smoke path.
- Post: verify lockfile, provenance, and dependency policy compliance.
3. Incident owner roster
Production launch must replace these role owners with named responders in the primary on-call schedule. Until then, escalation is by role:
- Payments: backend lead for payment/ledger services
- Security: security owner from Security Ownership and Launch Decision Criteria
- DevOps: deployment owner for production infrastructure
- Admin lead: operations lead for dispute and payout approval workflows
4. Post-incident mandatory actions
- incident log + root-cause and action items,
- metric threshold updates if needed,
- runbook improvements (this document),
- verification that Monitoring and Incident Response links are still accurate.