Files

Siavash Sameni 81625d35d2 docs: AML scope note, human-blocked items, Task #11 pre-flight inventory

- Add AML scope note to Handoff - RN Multichain Probe (sanctions-only vs full KYT)
- Add human-blocked section with 3 precise next steps for owner
- Create Task 11 Pre-flight Inventory: library choice, dev/prod flow, admin UI gaps, backend gaps, risks, acceptance criteria

2026-05-28 20:42:42 +04:00

9.2 KiB

Raw Permalink Blame History

title, tags, created

title

Backend Funds Migration and Operational Runbooks

These runbooks cover the selected backend/funds architecture defined in Backend Core Stack Decision Record - 2026-05-24.

[!note] Historical migration context Sections that mention keeping SHKeeper active describe the migration period from the old payment rail to Request Network. Current new-payment operations should use Escrow Flow, Request Network Integration Constraints, and PRD - Decentralized Custody and Smart-Contract Escrow Roadmap.

1. Migration runbook (legacy + provider migration)

1.1 Preflight

Verify provider credentials and webhook secrets are present in production .env.
Confirm canonical state docs are published and approved:
Run validation report:
- count of legacy Payment rows by provider,
- duplicate provider keys,
- webhook processing success/error by provider (24h),
- unresolved pending/processing payments older than 30m.
Run a staging reconciliation simulation against a read-only copy.

1.2 Rollout sequence

Keep PAYMENT_MODE=read_only and keep SHKeeper intent path active.
Enable legacy+new provider dual routing via PAYMENT_ENABLED_PROVIDERS.
Activate adapter contract metrics (payment-provider-events and webhook counters) before any write routing.
Set PAYMENT_PROVIDER_MODE=standard for a 10–15% cohort.
Expand to full traffic only after two clean 24h windows.

1.3 Legacy read path and backfill

Keep SHKeeper legacy read path enabled for at least one billing cycle after migration starts.
Backfill task:
- map legacy SHKeeper records into FundsAccount references,
- write reconciliation notes into metadata.providers,
- leave source payment rows immutable.
Do not delete legacy routing until backfill report is complete and reviewed.

1.4 Validation before enforcement

Before switching default provider:

No critical webhook verification failures for 24h.
DLQ volume stable for 2h.
Provider status mismatch rate < 0.5% for 1000 samples.
Manual reconciliation sample (>= 10 rows) completed by two operators.

1.5 Rollback criteria

Rollback immediately if any of:

payout/hold/release path returns invariant violation,
unresolved webhook failure burst above threshold,
chain reconciliation mismatch above threshold from runbook table.

Rollback command sequence:

PAYMENT_ENABLED_PROVIDERS=shkeeper
PAYMENT_DEFAULT_PROVIDER=shkeeper
PAYMENT_PROVIDER_MODE=read_only (pause new routing),
keep reconciliation active in dry-run mode,
post-incident note in this doc + Incident Response.

1.6 Webhook cutoff and manual reconciliation

Keep auto-accepting historical webhook traffic through cutover.
Set explicit cutoff only if new provider writes must stop:
1. pause new pay-in intents on new provider,
2. continue handleProviderWebhook in read_only mode,
3. run manual reconciliation for open windows,
4. re-enable normal routing after backlog clear or explicit incident decision.

2. Scenario runbooks

Each incident scenario uses the same structure: owner, trigger, detection, immediate, recovery, post.

2.1 Failed webhook

Owner: BL + DL
Trigger: webhook_failures alert > warning threshold (see Webhook Security Spec).
Detection: spike in 5xx/400 from callback route, DLQ growth.
Immediate: isolate by provider; keep PAYMENT_PROVIDER_MODE=read_only for provider if mismatch continues.
Recovery: re-process DLQ entries after fix; verify idempotent behavior; compare reconciliation counters.
Post: update webhook retry tuning and security thresholds.

2.2 Duplicate payment

Owner: BL
Trigger: duplicate webhook deliveries or idempotent suppression above normal baseline.
Detection: duplicate counter above expected and no payment state change.
Immediate: confirm normalized status transition map; suppress with deliveryId/idempotency checks.
Recovery: rerun reconciliation for affected IDs and clear manual holds if ledger still consistent.
Post: validate provider event parsing and storage hashing.

2.3 Missing payment

Owner: BL + Operations
Trigger: customer support report or reconciliation finding.
Detection: provider reference exists, no canonical payment row or no final ledger balance.
Immediate: freeze payouts/release for request ID; run searchProviderPayments.
Recovery: create missing read-model entry or corrective ledger adjustment after evidence capture.
Post: add missing-test case for same provider/providerReference combination.

2.4 Stuck release

Owner: BL + Ops
Trigger: released pending >30m.
Detection: payout task with no terminal transition for release/refund.
Immediate: confirm chain confirmation and provider payout status.
Recovery: if confirmed on-chain but not reflected, run manual reconciliation + status repair; else reissue payout instruction.
Post: update alerting and retry windows.

2.5 Disputed release attempt

Owner: Security + BL + Admin
Trigger: release/recover API invoked while dispute hold is active.
Detection: disputeId active + attempted release/refund action.
Immediate: block action, notify admin team and dispute owner.
Recovery: clear dispute hold only by approved dispute resolution path.
Post: enforce ownership check in orchestration and admin UI guardrail.

2.6 Compromised admin

Owner: Security Lead
Trigger: unexpected high-value payouts or admin credentials alert.
Detection: step-up auth failures, role anomalies, IP anomalies.
Immediate: suspend admin account, rotate sessions, set PAYMENT_PROVIDER_MODE=read_only.
Recovery: review all recent ledger entries and release/refund actions; reverse via admin correction only after legal/ownership approval.
Post: enforce 2-person approval for high-risk payout thresholds and step-up audit check.

2.7 Leaked API key / webhook secret

Owner: Security + DevOps
Trigger: secret scanner, log leak, or suspected rotation request.
Detection: unauthorized provider 401s, unknown callbacks, or external exposure confirmation.
Immediate: rotate affected key and disable old key immediately.
Recovery: replay missed webhooks in dry-run mode after trust restored; compare reconciliation totals.
Post: document incident and rotate staging/CI secret references.

2.8 Provider outage

Owner: DevOps + BL
Trigger: repeated provider 5xx/timeouts.
Detection: error budget alert and queue growth.
Immediate: switch PAYMENT_ENABLED_PROVIDERS=shkeeper or read_only, notify users of degraded mode.
Recovery: drain queue and replay DLQ when provider recovers.
Post: update provider outage SLA runbook and status page text.

2.9 Chain/RPC outage

Owner: BL + DevOps
Trigger: chain verification failures and elevated wallet lookup latency.
Detection: verification timeout/error and chain_verification_stale trend.
Immediate: pause non-critical chain-dependent actions; preserve pending proofs.
Recovery: rerun verification jobs against fallback RPC endpoint.
Post: update RPC provider fallback policy and retry tuning.

2.10 Suspicious payment proof

Owner: Security + BL
Trigger: tx hash proof with mismatched to, amount, or network.
Detection: verify script rejects or manual proof review.
Immediate: mark payment disputed, require operator review.
Recovery: keep payment in pending, notify buyer and seller, trigger manual support workflow.
Post: add automated proof assertions to webhook and verifier tests.

2.11 npm/package compromise

Owner: CTO + Security
Trigger: new high/critical advisory tied to runtime packages.
Detection: audit alerts or security team notification.
Immediate: pause releases in affected components; freeze watchtower auto-update if needed.
Recovery: patch dependency and redeploy, then run smoke tests and transaction smoke path.
Post: verify lockfile, provenance, and dependency policy compliance.

3. Incident owner roster

Production launch must replace these role owners with named responders in the primary on-call schedule. Until then, escalation is by role:

Payments: backend lead for payment/ledger services
Security: security owner from Security Ownership and Launch Decision Criteria
DevOps: deployment owner for production infrastructure
Admin lead: operations lead for dispute and payout approval workflows

4. Post-incident mandatory actions

incident log + root-cause and action items,
metric threshold updates if needed,
runbook improvements (this document),
verification that Monitoring and Incident Response links are still accurate.

9.2 KiB Raw Permalink Blame History Unescape Escape