Complete task 4 backend security architecture docs
This commit is contained in:
@@ -0,0 +1,197 @@
|
||||
---
|
||||
title: Backend Funds Migration and Operational Runbooks
|
||||
tags: [operations, payments, migration, incidents]
|
||||
created: 2026-05-24
|
||||
---
|
||||
|
||||
# Backend Funds Migration and Operational Runbooks
|
||||
|
||||
These runbooks cover the selected backend/funds architecture defined in
|
||||
[[Backend Core Stack Decision Record - 2026-05-24]].
|
||||
|
||||
## 1. Migration runbook (legacy + provider migration)
|
||||
|
||||
### 1.1 Preflight
|
||||
|
||||
1. Verify provider credentials and webhook secrets are present in production `.env`.
|
||||
2. Confirm canonical state docs are published and approved:
|
||||
- [[Funds Ledger and Escrow State Machine Specification]]
|
||||
- [[Payment Provider Adapter Spec]]
|
||||
- [[Webhook Security Spec]]
|
||||
3. Run validation report:
|
||||
- count of legacy `Payment` rows by provider,
|
||||
- duplicate provider keys,
|
||||
- webhook processing success/error by provider (24h),
|
||||
- unresolved `pending`/`processing` payments older than 30m.
|
||||
4. Run a staging reconciliation simulation against a read-only copy.
|
||||
|
||||
### 1.2 Rollout sequence
|
||||
|
||||
1. Keep `PAYMENT_MODE=read_only` and keep SHKeeper intent path active.
|
||||
2. Enable legacy+new provider dual routing via `PAYMENT_ENABLED_PROVIDERS`.
|
||||
3. Activate adapter contract metrics (`payment-provider-events` and webhook counters) before any write routing.
|
||||
4. Set `PAYMENT_PROVIDER_MODE=standard` for a 10–15% cohort.
|
||||
5. Expand to full traffic only after two clean 24h windows.
|
||||
|
||||
### 1.3 Legacy read path and backfill
|
||||
|
||||
- Keep SHKeeper legacy read path enabled for at least one billing cycle after migration starts.
|
||||
- Backfill task:
|
||||
- map legacy SHKeeper records into `FundsAccount` references,
|
||||
- write reconciliation notes into `metadata.providers`,
|
||||
- leave source payment rows immutable.
|
||||
- Do not delete legacy routing until backfill report is complete and reviewed.
|
||||
|
||||
### 1.4 Validation before enforcement
|
||||
|
||||
Before switching default provider:
|
||||
|
||||
- No critical webhook verification failures for 24h.
|
||||
- DLQ volume stable for 2h.
|
||||
- Provider status mismatch rate `< 0.5%` for 1000 samples.
|
||||
- Manual reconciliation sample (`>= 10` rows) completed by two operators.
|
||||
|
||||
### 1.5 Rollback criteria
|
||||
|
||||
Rollback immediately if any of:
|
||||
|
||||
- payout/hold/release path returns invariant violation,
|
||||
- unresolved webhook failure burst above threshold,
|
||||
- chain reconciliation mismatch above threshold from runbook table.
|
||||
|
||||
Rollback command sequence:
|
||||
|
||||
1. `PAYMENT_ENABLED_PROVIDERS=shkeeper`
|
||||
2. `PAYMENT_DEFAULT_PROVIDER=shkeeper`
|
||||
3. `PAYMENT_PROVIDER_MODE=read_only` (pause new routing),
|
||||
4. keep reconciliation active in dry-run mode,
|
||||
5. post-incident note in this doc + [[Incident Response]].
|
||||
|
||||
### 1.6 Webhook cutoff and manual reconciliation
|
||||
|
||||
- Keep auto-accepting historical webhook traffic through cutover.
|
||||
- Set explicit cutoff only if new provider writes must stop:
|
||||
1. pause new pay-in intents on new provider,
|
||||
2. continue `handleProviderWebhook` in `read_only` mode,
|
||||
3. run manual reconciliation for open windows,
|
||||
4. re-enable normal routing after backlog clear or explicit incident decision.
|
||||
|
||||
## 2. Scenario runbooks
|
||||
|
||||
Each incident scenario uses the same structure:
|
||||
`owner`, `trigger`, `detection`, `immediate`, `recovery`, `post`.
|
||||
|
||||
### 2.1 Failed webhook
|
||||
- **Owner:** BL + DL
|
||||
- **Trigger:** `webhook_failures` alert > warning threshold (see [[Webhook Security Spec]]).
|
||||
- **Detection:** spike in 5xx/400 from callback route, DLQ growth.
|
||||
- **Immediate:** isolate by provider; keep `PAYMENT_PROVIDER_MODE=read_only` for provider if mismatch continues.
|
||||
- **Recovery:** re-process DLQ entries after fix; verify idempotent behavior; compare reconciliation counters.
|
||||
- **Post:** update webhook retry tuning and security thresholds.
|
||||
|
||||
### 2.2 Duplicate payment
|
||||
- **Owner:** BL
|
||||
- **Trigger:** duplicate webhook deliveries or idempotent suppression above normal baseline.
|
||||
- **Detection:** `duplicate` counter above expected and no payment state change.
|
||||
- **Immediate:** confirm normalized status transition map; suppress with `deliveryId`/`idempotency` checks.
|
||||
- **Recovery:** rerun reconciliation for affected IDs and clear manual holds if ledger still consistent.
|
||||
- **Post:** validate provider event parsing and storage hashing.
|
||||
|
||||
### 2.3 Missing payment
|
||||
- **Owner:** BL + Operations
|
||||
- **Trigger:** customer support report or reconciliation finding.
|
||||
- **Detection:** provider reference exists, no canonical payment row or no final ledger balance.
|
||||
- **Immediate:** freeze payouts/release for request ID; run `searchProviderPayments`.
|
||||
- **Recovery:** create missing read-model entry or corrective ledger adjustment after evidence capture.
|
||||
- **Post:** add missing-test case for same provider/providerReference combination.
|
||||
|
||||
### 2.4 Stuck release
|
||||
- **Owner:** BL + Ops
|
||||
- **Trigger:** `released` pending >30m.
|
||||
- **Detection:** payout task with no terminal transition for `release`/`refund`.
|
||||
- **Immediate:** confirm chain confirmation and provider payout status.
|
||||
- **Recovery:** if confirmed on-chain but not reflected, run manual reconciliation + status repair; else reissue payout instruction.
|
||||
- **Post:** update alerting and retry windows.
|
||||
|
||||
### 2.5 Disputed release attempt
|
||||
- **Owner:** Security + BL + Admin
|
||||
- **Trigger:** release/recover API invoked while dispute hold is active.
|
||||
- **Detection:** `disputeId` active + attempted release/refund action.
|
||||
- **Immediate:** block action, notify admin team and dispute owner.
|
||||
- **Recovery:** clear dispute hold only by approved dispute resolution path.
|
||||
- **Post:** enforce ownership check in orchestration and admin UI guardrail.
|
||||
|
||||
### 2.6 Compromised admin
|
||||
- **Owner:** Security Lead
|
||||
- **Trigger:** unexpected high-value payouts or admin credentials alert.
|
||||
- **Detection:** step-up auth failures, role anomalies, IP anomalies.
|
||||
- **Immediate:** suspend admin account, rotate sessions, set `PAYMENT_PROVIDER_MODE=read_only`.
|
||||
- **Recovery:** review all recent ledger entries and release/refund actions; reverse via admin correction only after legal/ownership approval.
|
||||
- **Post:** enforce 2-person approval for high-risk payout thresholds and step-up audit check.
|
||||
|
||||
### 2.7 Leaked API key / webhook secret
|
||||
- **Owner:** Security + DevOps
|
||||
- **Trigger:** secret scanner, log leak, or suspected rotation request.
|
||||
- **Detection:** unauthorized provider 401s, unknown callbacks, or external exposure confirmation.
|
||||
- **Immediate:** rotate affected key and disable old key immediately.
|
||||
- **Recovery:** replay missed webhooks in dry-run mode after trust restored; compare reconciliation totals.
|
||||
- **Post:** document incident and rotate staging/CI secret references.
|
||||
|
||||
### 2.8 Provider outage
|
||||
- **Owner:** DevOps + BL
|
||||
- **Trigger:** repeated provider 5xx/timeouts.
|
||||
- **Detection:** error budget alert and queue growth.
|
||||
- **Immediate:** switch `PAYMENT_ENABLED_PROVIDERS=shkeeper` or `read_only`, notify users of degraded mode.
|
||||
- **Recovery:** drain queue and replay DLQ when provider recovers.
|
||||
- **Post:** update provider outage SLA runbook and status page text.
|
||||
|
||||
### 2.9 Chain/RPC outage
|
||||
- **Owner:** BL + DevOps
|
||||
- **Trigger:** chain verification failures and elevated wallet lookup latency.
|
||||
- **Detection:** verification timeout/error and `chain_verification_stale` trend.
|
||||
- **Immediate:** pause non-critical chain-dependent actions; preserve pending proofs.
|
||||
- **Recovery:** rerun verification jobs against fallback RPC endpoint.
|
||||
- **Post:** update RPC provider fallback policy and retry tuning.
|
||||
|
||||
### 2.10 Suspicious payment proof
|
||||
- **Owner:** Security + BL
|
||||
- **Trigger:** tx hash proof with mismatched `to`, amount, or network.
|
||||
- **Detection:** verify script rejects or manual proof review.
|
||||
- **Immediate:** mark payment disputed, require operator review.
|
||||
- **Recovery:** keep payment in `pending`, notify buyer and seller, trigger manual support workflow.
|
||||
- **Post:** add automated proof assertions to webhook and verifier tests.
|
||||
|
||||
### 2.11 npm/package compromise
|
||||
- **Owner:** CTO + Security
|
||||
- **Trigger:** new high/critical advisory tied to runtime packages.
|
||||
- **Detection:** audit alerts or security team notification.
|
||||
- **Immediate:** pause releases in affected components; freeze `watchtower` auto-update if needed.
|
||||
- **Recovery:** patch dependency and redeploy, then run smoke tests and transaction smoke path.
|
||||
- **Post:** verify lockfile, provenance, and dependency policy compliance.
|
||||
|
||||
## 3. Incident owner roster
|
||||
|
||||
Production launch must replace these role owners with named responders in the
|
||||
primary on-call schedule. Until then, escalation is by role:
|
||||
|
||||
- Payments: backend lead for payment/ledger services
|
||||
- Security: security owner from [[Security Ownership and Launch Decision Criteria]]
|
||||
- DevOps: deployment owner for production infrastructure
|
||||
- Admin lead: operations lead for dispute and payout approval workflows
|
||||
|
||||
## 4. Post-incident mandatory actions
|
||||
|
||||
- incident log + root-cause and action items,
|
||||
- metric threshold updates if needed,
|
||||
- runbook improvements (this document),
|
||||
- verification that [[Monitoring]] and [[Incident Response]] links are still accurate.
|
||||
|
||||
## Related
|
||||
|
||||
- [[Incident Response]]
|
||||
- [[Monitoring]]
|
||||
- [[Deployment]]
|
||||
- [[Backup & Recovery]]
|
||||
- [[Webhook Security Spec]]
|
||||
- [[Payment Provider Adapter Spec]]
|
||||
- [[Backend Core Stack Decision Record - 2026-05-24]]
|
||||
Reference in New Issue
Block a user