nick-doc/09 - Audits/Security Ownership and Launch Decision Criteria.md
Siavash Sameni 4cf5c49274 docs(audit): align documentation with post-remediation backend reality
- Update data model enums to match backend models
- Update API reference auth requirements
- Add dispute module references and warning blocks
- Add 2026-05-24 audit remediation callout to Overview
- Generate task breakdowns and audit artifacts
- Add doc alignment report (.taskmaster/reports/)
2026-05-24 11:16:29 +04:00


| title | tags | created | status |
|---|---|---|---|
| Security Ownership and Launch Decision Criteria | audit, security, governance, launch, raci | 2026-05-24 | decision |

Security Ownership and Launch Decision Criteria

Decision document. Answers open questions 9 and 10 from Backend Stack Security and Refactor Assessment - 2026-05-24: who owns security decisions, and what must be true before public launch.

This document is binding for the Amanat platform launch cycle. Changes require written sign-off from the roles listed in Section 1.


1. Security Ownership RACI

Roles: PO = Product Owner, BL = Backend Lead, DI = DevOps/Infra, FL = Frontend Lead, SO = Security Owner (if designated), CTO = CTO/Leadership.

R = Responsible (does the work), A = Accountable (final decision authority), C = Consulted, I = Informed.

| Decision Area | PO | BL | DI | FL | SO | CTO |
|---|---|---|---|---|---|---|
| Authentication changes (token storage, session model, passkey scope) | I | R | C | C | A | I |
| Payment/funds changes (ledger, state machine, release/refund logic) | C | R | I | I | A | I |
| Provider integrations (SHKeeper, Request Network, new providers) | C | R | C | I | A | I |
| Webhook handling (signature verification, idempotency, DLQ) | I | R | C | I | A | I |
| Rate limiting (tiers, thresholds, enforcement points) | I | R | A | I | C | I |
| Admin access (role definitions, step-up auth, audit logging) | C | R | I | C | C | A |
| Dependency updates (lockfile policy, provenance, vulnerability triage) | I | R | A | C | C | I |
| Incident response (runbook ownership, escalation, postmortem) | I | C | R | I | A | I |
| Cross-cutting security architecture (service split, stack migration) | C | R | C | C | C | A |
| External penetration testing (scope, timing, vendor selection) | I | C | C | I | R | A |

RACI rules

  • If no Security Owner is designated, accountability for rows where SO is marked A defaults to CTO.
  • BL is responsible for all implementation work on backend security items. FL is responsible for frontend-side changes (cookie migration, CSP hardening, token storage) and is consulted on rows that affect the frontend.
  • DI owns rate limiting configuration, dependency pipeline, and infrastructure-level controls.
  • A role marked A must approve in writing (PR review, doc sign-off, or Slack confirmation logged in the decision register) before the change ships.
  • Any role marked R or A can escalate to CTO for final arbitration.
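The SO-to-CTO fallback rule can be expressed as a small lookup. The following is a minimal sketch, not part of any existing tooling: the `raci` object copies two rows from the table above, and the function name `accountableRole` is hypothetical.

```typescript
type Role = "PO" | "BL" | "DI" | "FL" | "SO" | "CTO";
type Raci = "R" | "A" | "C" | "I";

// One row of the RACI table: one letter per role.
type RaciRow = Record<Role, Raci>;

// Two rows copied from the table above, for illustration only.
const raci: Record<string, RaciRow> = {
  "Webhook handling": { PO: "I", BL: "R", DI: "C", FL: "I", SO: "A", CTO: "I" },
  "Rate limiting": { PO: "I", BL: "R", DI: "A", FL: "I", SO: "C", CTO: "I" },
};

// Returns the role whose written approval is needed before a change ships.
// If the row names SO as accountable but no Security Owner is designated,
// accountability falls back to CTO.
function accountableRole(area: string, securityOwnerDesignated: boolean): Role {
  const row = raci[area];
  if (row === undefined) throw new Error(`Unknown decision area: ${area}`);
  const roles = Object.keys(row) as Role[];
  const accountable = roles.find((role) => row[role] === "A");
  if (accountable === undefined) throw new Error(`No accountable role for: ${area}`);
  return accountable === "SO" && !securityOwnerDesignated ? "CTO" : accountable;
}
```

With no Security Owner designated, `accountableRole("Webhook handling", false)` resolves to CTO, while the rate limiting row stays with DI regardless.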

2. Launch Safety Gate Checklist

Each item is classified as:

  • Required -- blocks launch. Must be verified complete before any public-facing deployment.
  • Strongly Recommended -- should block launch. Can be accepted with a documented risk entry (risk description, owner, remediation deadline) signed by the accountable role from Section 1.
  • Deferred -- explicitly deferred to post-launch. Must appear in Section 5 (Deferred Decisions Register).
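The three classifications translate into a simple gate evaluation. This is a sketch under stated assumptions: the `ChecklistItem` shape and the `launchBlockers` name are hypothetical, and a real gate would also record who verified each item.

```typescript
type Classification = "Required" | "Strongly Recommended" | "Deferred";

interface ChecklistItem {
  id: string;                      // checklist number, e.g. "2.1.1"
  classification: Classification;
  verified: boolean;               // condition verified complete
  riskAcceptanceSigned?: boolean;  // documented risk entry signed by the accountable role
  deferredRegisterId?: string;     // register entry for Deferred items, e.g. "D-5"
}

// Returns the ids of checklist items that currently block launch.
function launchBlockers(items: ChecklistItem[]): string[] {
  const blockers: string[] = [];
  for (const item of items) {
    if (item.classification === "Required") {
      // Required items block launch until verified.
      if (!item.verified) blockers.push(item.id);
    } else if (item.classification === "Strongly Recommended") {
      // Either completed or covered by a signed risk acceptance.
      if (!item.verified && item.riskAcceptanceSigned !== true) blockers.push(item.id);
    } else {
      // Deferred items must appear in the Deferred Decisions Register.
      if (item.deferredRegisterId === undefined) blockers.push(item.id);
    }
  }
  return blockers;
}
```

An empty result means the checklist permits launch. Any returned id points at an item that still needs verification, a risk acceptance, or a register entry.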

2.1 Authentication and Session Hardening

| # | Condition | Classification | Source |
|---|---|---|---|
| 2.1.1 | All financial endpoints require Bearer JWT authentication | Required | Platform Logical Audit - 2026-05-24 item 3 |
| 2.1.2 | Ownership checks enforced on all :userId parameterized endpoints | Required | Platform Logical Audit - 2026-05-24 item 3 |
| 2.1.3 | Admin role checks enforced on all admin routes | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 Phase 0 |
| 2.1.4 | Test/demo payment and email endpoints disabled or auth-protected in production | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 Phase 0 |
| 2.1.5 | Access token lifetime reduced to 60 minutes or less | Strongly Recommended | Platform Logical Audit - 2026-05-24 item 10 |
| 2.1.6 | Refresh tokens moved to httpOnly cookies (or risk accepted with documented rationale) | Strongly Recommended | Security Architecture section 11 |
| 2.1.7 | Passkey/WebAuthn disabled in production until real cryptographic implementation is complete | Required | Platform Logical Audit - 2026-05-24 item 2 |
| 2.1.8 | Passkey RP ID set to production domain (not localhost) | Required | Security Architecture section 2.3 |
| 2.1.9 | Device/session revocation functional | Deferred | Post-launch auth hardening |

2.2 Payment and Funds Integrity

| # | Condition | Classification | Source |
|---|---|---|---|
| 2.2.1 | Dispute creation enforces escrow hold (disputed state) that blocks release and refund | Required | Platform Logical Audit - 2026-05-24 item 1 |
| 2.2.2 | Web3 verification decodes Transfer event and validates recipient, token contract, and amount | Required | Platform Logical Audit - 2026-05-24 item 4 |
| 2.2.3 | Payment mutations route through centralized service methods only (no direct controller mutation) | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 Phase 0 |
| 2.2.4 | Release/refund eligibility enforced through escrow state, not controller-level flags | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 payment risks |
| 2.2.5 | Seller cannot update offer price after acceptance | Strongly Recommended | Platform Logical Audit - 2026-05-24 item 18 |
| 2.2.6 | Immutable funds ledger operational for new payments | Deferred | Backend Stack Security and Refactor Assessment - 2026-05-24 Phase 2 |
| 2.2.7 | Provider-neutral payment abstraction layer | Deferred | Backend Stack Security and Refactor Assessment - 2026-05-24 Phase 2 |
| 2.2.8 | Payment state enums unified across data model, API, and flow documents | Required | Platform Logical Audit - 2026-05-24 item 9 |
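Items 2.2.1 and 2.2.4 both come down to deciding release/refund eligibility from escrow state alone. A minimal sketch of such a transition table follows. Only the `disputed` state is named in this document; the other state names and all action names are hypothetical.

```typescript
// Hypothetical state and action names, except "disputed", which the checklist names.
type EscrowState = "funded" | "disputed" | "released" | "refunded";
type EscrowAction = "release" | "refund" | "openDispute" | "resolveRelease" | "resolveRefund";

// Allowed transitions. Anything not listed is rejected.
// Note that "disputed" has no plain release or refund: the hold can only
// be lifted through an explicit dispute resolution.
const transitions: Record<EscrowState, Partial<Record<EscrowAction, EscrowState>>> = {
  funded: { release: "released", refund: "refunded", openDispute: "disputed" },
  disputed: { resolveRelease: "released", resolveRefund: "refunded" },
  released: {},
  refunded: {},
};

// Single entry point for escrow state changes; throws on a disallowed action.
function transition(state: EscrowState, action: EscrowAction): EscrowState {
  const next = transitions[state][action];
  if (next === undefined) {
    throw new Error(`Action "${action}" is not allowed in state "${state}"`);
  }
  return next;
}
```

Keeping the table in one place means a controller cannot release funds by setting a flag. It has to call `transition`, which rejects `release` and `refund` while a dispute is open.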

2.3 Authorization Enforcement

| # | Condition | Classification | Source |
|---|---|---|---|
| 2.3.1 | Every endpoint mapped to required role (public, authenticated, owner, admin) in authorization matrix | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 doc requirement 4 |
| 2.3.2 | assertRole or equivalent guard present in all admin and payment service methods | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 Phase 0 |
| 2.3.3 | Arbitrary userId from client no longer accepted for private data; server derives identity from JWT | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 Phase 0 |
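A guard of the kind item 2.3.2 describes, combined with the server-derived identity from 2.3.3, might look like the sketch below. The actual `assertRole` signature in the backend is not shown in this document, so the `AuthContext` shape, the role names, and the two service functions are assumptions for illustration.

```typescript
// Hypothetical shape of the identity extracted from a verified JWT.
type AppRole = "user" | "admin";
interface AuthContext {
  userId: string;
  roles: AppRole[];
}

interface Payment {
  id: string;
  buyerId: string;
  sellerId: string;
}

// Throws unless the authenticated caller holds the required role.
function assertRole(ctx: AuthContext, required: AppRole): void {
  if (!ctx.roles.includes(required)) {
    throw new Error(`Forbidden: role "${required}" required`);
  }
}

// Private data is filtered by the identity in ctx. There is deliberately
// no userId parameter for a client to supply.
function listOwnPayments(ctx: AuthContext, payments: Payment[]): Payment[] {
  return payments.filter((p) => p.buyerId === ctx.userId || p.sellerId === ctx.userId);
}

// Admin-only service method: the guard runs before any state is touched.
function approvePayout(ctx: AuthContext, payoutId: string): string {
  assertRole(ctx, "admin");
  return `payout ${payoutId} approved by ${ctx.userId}`;
}
```

Placing the guard inside the service method, rather than only in route middleware, keeps the check in force even if a new route forgets to attach the middleware.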

2.4 Rate Limiting

| # | Condition | Classification | Source |
|---|---|---|---|
| 2.4.1 | Global rate limiting enabled | Required | Platform Logical Audit - 2026-05-24 item 13 |
| 2.4.2 | Auth endpoints: 5 req/5 min/IP | Required | Security Architecture section 9 |
| 2.4.3 | Payment endpoints: 20 req/15 min/IP | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 Phase 0 |
| 2.4.4 | AI endpoints: 10 req/15 min/authenticated-user | Required | Platform Logical Audit - 2026-05-24 item 3 |
| 2.4.5 | File upload endpoints: 10 req/15 min/authenticated-user | Strongly Recommended | -- |
| 2.4.6 | Delivery confirmation code: max 5 verification attempts per 15 min per request | Required | Platform Logical Audit - 2026-05-24 item 8 |
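The tiers above can be captured as data and enforced with a fixed-window counter. The sketch below is in-memory and single-process, which is an assumption made to keep it self-contained; a deployment with several backend instances would likely need a shared store. The tier names and `createLimiter` are hypothetical. The global limit in 2.4.1 is omitted because no threshold is stated for it.

```typescript
const MINUTE_MS = 60_000;

// Thresholds copied from items 2.4.2 to 2.4.6. The key is an IP address,
// a user id, or a request id depending on the tier.
const tiers = {
  auth: { limit: 5, windowMs: 5 * MINUTE_MS },           // per IP
  payment: { limit: 20, windowMs: 15 * MINUTE_MS },      // per IP
  ai: { limit: 10, windowMs: 15 * MINUTE_MS },           // per authenticated user
  upload: { limit: 10, windowMs: 15 * MINUTE_MS },       // per authenticated user
  deliveryCode: { limit: 5, windowMs: 15 * MINUTE_MS },  // per request
} as const;

type TierName = keyof typeof tiers;

// Creates a fixed-window limiter. The returned function reports whether
// the call identified by (tier, key) is allowed at time nowMs.
function createLimiter() {
  const windows = new Map<string, { start: number; count: number }>();
  return function allow(tier: TierName, key: string, nowMs: number): boolean {
    const rule = tiers[tier];
    const id = `${tier}:${key}`;
    const current = windows.get(id);
    if (current === undefined || nowMs - current.start >= rule.windowMs) {
      // First call, or the previous window has expired: start a new window.
      windows.set(id, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= rule.limit) return false;
    current.count += 1;
    return true;
  };
}
```

Passing the clock in as `nowMs` keeps the limiter deterministic under test. Production code would typically pass `Date.now()`.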

2.5 Webhook Security

| # | Condition | Classification | Source |
|---|---|---|---|
| 2.5.1 | SHKeeper webhook uses raw-body HMAC verification (not reconstructed JSON) | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 webhook risks |
| 2.5.2 | Webhook handler is idempotent (duplicate delivery = no-op) | Required | Security Architecture section 5 |
| 2.5.3 | Webhook returns proper HTTP codes: 400 for bad input, 500 for server error, 200 for success | Required | Platform Logical Audit - 2026-05-24 item 11 |
| 2.5.4 | Webhook failures logged to dead-letter storage or alerting channel | Strongly Recommended | Platform Logical Audit - 2026-05-24 item 11 |
| 2.5.5 | Provider callbacks create reconciliation events; do not directly release funds | Strongly Recommended | Backend Stack Security and Refactor Assessment - 2026-05-24 webhook risks |
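The sketch below shows how items 2.5.1, 2.5.2, 2.5.3, and 2.5.5 could fit together in one handler. It rests on several assumptions that this document does not confirm: HMAC-SHA256 with a hex-encoded signature, an `eventId` field in the payload used as the idempotency key, and an invalid signature treated as bad input (400). The in-memory `Set` and array stand in for durable storage.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface ReconciliationEvent {
  eventId: string;
  payload: unknown;
}

// Verifies the signature over the exact bytes received, not over re-serialized JSON.
// Assumes HMAC-SHA256 and a hex-encoded signature.
function verifySignature(rawBody: Buffer, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const provided = Buffer.from(signatureHex, "hex");
  // timingSafeEqual requires equal lengths, so check that first.
  return provided.length === expected.length && timingSafeEqual(provided, expected);
}

// Returns the HTTP status code the webhook endpoint should send.
function handleWebhook(
  rawBody: Buffer,
  signatureHex: string,
  secret: string,
  seenEventIds: Set<string>,     // stand-in for a durable idempotency store
  queue: ReconciliationEvent[],  // stand-in for a reconciliation event store
): number {
  try {
    if (!verifySignature(rawBody, signatureHex, secret)) return 400;
    let payload: unknown;
    try {
      payload = JSON.parse(rawBody.toString("utf8"));
    } catch {
      return 400; // malformed JSON is bad input
    }
    // "eventId" is a hypothetical field name for the idempotency key.
    const eventId = (payload as { eventId?: unknown } | null)?.eventId;
    if (typeof eventId !== "string") return 400;
    if (seenEventIds.has(eventId)) return 200; // duplicate delivery is a no-op
    // Record a reconciliation event only. Funds are never released from here.
    queue.push({ eventId, payload });
    seenEventIds.add(eventId);
    return 200;
  } catch {
    return 500; // unexpected server-side failure
  }
}
```

In an Express application the raw bytes would have to be captured before any JSON body parser runs, for example by mounting `express.raw({ type: "application/json" })` on the webhook route.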

2.6 Socket.IO Authorization

| # | Condition | Classification | Source |
|---|---|---|---|
| 2.6.1 | Socket.IO room membership derived from authenticated socket identity, not client-supplied user IDs | Required | Platform Logical Audit - 2026-05-24 item 12 |
| 2.6.2 | Socket handshake requires valid JWT | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 realtime risks |
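The room-membership rule can be isolated from the Socket.IO wiring as pure functions, which also makes it easy to test. In this sketch the `user:<id>` room naming scheme, the `SocketIdentity` shape, and the `verifyToken` callback (a stand-in for real JWT verification) are all hypothetical.

```typescript
// Identity established once, at handshake time, from a verified JWT.
interface SocketIdentity {
  userId: string;
}

// verifyToken is a hypothetical stand-in for real JWT verification.
// It returns the identity for a valid token and null otherwise.
type TokenVerifier = (token: string) => SocketIdentity | null;

// Rejects the handshake unless a valid token is presented.
function authenticateHandshake(token: string | undefined, verifyToken: TokenVerifier): SocketIdentity {
  if (token === undefined || token === "") throw new Error("Handshake rejected: missing token");
  const identity = verifyToken(token);
  if (identity === null) throw new Error("Handshake rejected: invalid token");
  return identity;
}

// Rooms are computed from the authenticated identity only.
function roomsFor(identity: SocketIdentity): string[] {
  return [`user:${identity.userId}`];
}

// A client-requested room is honored only if the identity already entitles it.
function canJoin(identity: SocketIdentity, requestedRoom: string): boolean {
  return roomsFor(identity).includes(requestedRoom);
}
```

In a Socket.IO server these functions would typically be called from an `io.use(...)` middleware that reads the token from `socket.handshake.auth`, with the resulting rooms joined server-side rather than on client request.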

2.7 Supply-Chain Controls

| # | Condition | Classification | Source |
|---|---|---|---|
| 2.7.1 | Lockfile reviewed and updated for known vulnerable packages (Multer <2.1.0, Axios compromise, TanStack compromise) | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 supply-chain risks |
| 2.7.2 | npm audit / yarn audit run and all high/critical CVEs triaged | Required | Security Architecture section 12 |
| 2.7.3 | CI install mode uses frozen lockfile | Strongly Recommended | Backend Stack Security and Refactor Assessment - 2026-05-24 doc requirement 10 |
| 2.7.4 | No test/demo routes in production builds | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 Phase 0 |

2.8 Monitoring, Alerting, and Runbooks

| # | Condition | Classification | Source |
|---|---|---|---|
| 2.8.1 | Backend error monitoring active (Sentry or equivalent with source maps) | Strongly Recommended | Security Architecture section 12 |
| 2.8.2 | Structured logging for payment state transitions (actor, target, before/after) | Strongly Recommended | Security Architecture section 10 |
| 2.8.3 | Runbook exists for: failed webhook, duplicate payment, stuck release, compromised admin, leaked API key | Strongly Recommended | Backend Stack Security and Refactor Assessment - 2026-05-24 doc requirement 11 |
| 2.8.4 | Alerting for: repeated webhook signature failures, unusual payment volume, admin actions on own disputes | Strongly Recommended | -- |

2.9 External Penetration Testing

| # | Condition | Classification | Source |
|---|---|---|---|
| 2.9.1 | External pentest of payment + dispute + auth flows completed before general public launch | Strongly Recommended | Security Architecture section 12, Open question 9 |
| 2.9.2 | Pentest findings triaged; all critical/high items resolved or risk-accepted before launch | Required (if pentest performed) | -- |

2.10 Infrastructure and Operations

| # | Condition | Classification | Source |
|---|---|---|---|
| 2.10.1 | All dev-seeded credentials rotated | Required | Security Architecture section 12 |
| 2.10.2 | NODE_ENV=production confirmed in production backend | Required | Security Architecture section 12 |
| 2.10.3 | NEXT_PUBLIC_IS_DEVELOPMENT and ENABLE_DEBUG disabled in production | Required | Security Architecture section 12 |
| 2.10.4 | Production Watchtower pinned to versioned tag (not latest) | Strongly Recommended | Platform Logical Audit - 2026-05-24 item 27 |
| 2.10.5 | Committed or publicly visible secrets rotated | Required | Backend Stack Security and Refactor Assessment - 2026-05-24 Phase 0 |
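Items 2.10.2 and 2.10.3 lend themselves to an automated check at startup or in a deploy pipeline. The following is a sketch only: the function name is hypothetical, and it assumes a flag counts as "disabled" when it is unset or set to anything other than `true` or `1`.

```typescript
// Returns a list of configuration problems. An empty list means the
// environment passes items 2.10.2 and 2.10.3.
function productionConfigProblems(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  if (env.NODE_ENV !== "production") {
    problems.push(`NODE_ENV must be "production" (found: ${env.NODE_ENV ?? "unset"})`);
  }
  // Assumption: a flag is enabled only when its value is "true" or "1".
  for (const flag of ["NEXT_PUBLIC_IS_DEVELOPMENT", "ENABLE_DEBUG"]) {
    const value = (env[flag] ?? "").trim().toLowerCase();
    if (value === "true" || value === "1") {
      problems.push(`${flag} must be disabled in production`);
    }
  }
  return problems;
}
```

A backend could call `productionConfigProblems(process.env)` during boot and refuse to start if the list is non-empty.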

3. Launch Priority Decision

Decision: launch prioritizes immediate hardening of the current Node/Express stack. Backend-core redesign is deferred to post-launch.

Rationale

The audit findings in Backend Stack Security and Refactor Assessment - 2026-05-24 and Platform Logical Audit - 2026-05-24 identify the dominant risks as domain-level security failures, not framework-level weaknesses:

  1. The most dangerous issues are authorization and state-machine bugs, not Node/Express itself. Unauthenticated financial endpoints, client-controlled socket room membership, missing dispute-escrow holds, and broken Web3 verification are independent of the backend language.

  2. A rewrite does not fix the core problems. Moving to Go or Kotlin without first specifying the funds ledger, escrow state machine, and authorization matrix would transplant the same logic gaps into a new codebase. The audit explicitly states: "the larger issue is that the current backend mixes high-risk financial state transitions... in one Express application" -- but a rewrite that does not first solve the domain model problem is wasted effort.

  3. Hardening is faster. The Phase 0 actions from Backend Stack Security and Refactor Assessment - 2026-05-24 (disable unsafe routes, add auth checks, enable rate limiting, fix Web3 verification, fix Socket.IO auth) are discrete, testable tasks that can be completed in days, not months.

  4. The rewrite carries re-introduction risk. The product has working business flows. A full or partial rewrite risks reintroducing escrow and payment bugs that have already been found and can be fixed in place.

Concrete launch sequence

| Phase | Work | Timeline |
|---|---|---|
| Phase 0: Containment | Complete all Required items from Section 2 checklist. Disable unsafe routes, add auth/ownership enforcement, enable rate limiting, fix dispute-escrow hold, fix Web3 verification, fix Socket.IO auth, disable passkeys, rotate secrets. | Immediate |
| Phase 1: Documentation | Produce the 11 required documents listed in Backend Stack Security and Refactor Assessment - 2026-05-24 (threat model, funds ledger spec, escrow state machine, authorization matrix, payment provider adapter spec, webhook security spec, session/auth architecture, realtime auth spec, migration plan, supply-chain policy, operational runbooks). | Parallel with Phase 0 |
| Phase 2: Controlled launch | Public launch proceeds once all Required checklist items pass verification. Strongly Recommended items are either completed or have documented risk acceptances. | After Phase 0 |
| Phase 3: Payment/ledger extraction | Build provider-neutral payment layer and immutable ledger. This is the first post-launch engineering priority. | Post-launch |
| Phase 4: Core migration evaluation | Decide on Go/Kotlin backend-core rewrite based on team capacity, Phase 3 outcomes, and operational experience. No migration begins until Phase 3 is stable. | Post-launch, after Phase 3 |

4. External Penetration Testing Decision

Decision: yes, commission an external penetration test before general public launch.

Rationale

  • Amanat is a financial escrow platform handling crypto payments. The attack surface includes webhook processing, payment state machines, Web3 transaction verification, and fund release flows. This is materially different from a typical web application.
  • The audit identified critical findings (unauthenticated financial endpoints, Web3 verification bypass, dispute-escrow race condition) that an external tester would also find. An external pentest validates that the Phase 0 hardening actually closed these gaps.
  • Supply-chain compromise evidence from 2026 (Axios, TanStack, Express Multer) demonstrates an active threat to the npm ecosystem the platform depends on.

Timeline and scope

| Attribute | Value |
|---|---|
| When | After Phase 0 hardening is complete, before Phase 2 public launch |
| Scope | Payment flows (SHKeeper pay-in, Web3 verification, payout/release/refund), dispute/escrow state transitions, authentication (login, token refresh, OAuth, session management), admin operations, webhook handling, Socket.IO authorization |
| Out of scope | Marketplace browsing/listing, blog, points/leaderboard, file upload (assessed via code review instead) |
| Depth | Black-box or grey-box at tester's discretion, with access to API documentation and a funded test environment |
| Deliverable | Report with severity ratings, reproduction steps, and remediation recommendations. Findings mapped to checklist items in Section 2. |
| Gate | All critical and high findings must be resolved or risk-accepted (with CTO sign-off) before launch proceeds |

If pentest is delayed or unavailable

If the external pentest cannot be scheduled before the desired launch date, the following compensating controls must be in place:

  1. Complete internal code review of all payment, auth, and webhook code paths by someone other than the original author.
  2. Automated security test suite covering: unauthenticated access denial on all financial endpoints, webhook signature rejection, dispute-escrow hold enforcement, Web3 verification with wrong recipient/amount, Socket.IO unauthorized room join.
  3. Documented risk acceptance signed by CTO acknowledging that external validation was not performed.
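The "Web3 verification with wrong recipient/amount" case in the automated suite needs a validator to test against. The sketch below checks a single ERC-20 `Transfer` log without any Web3 library. The `TransferLog` and `ExpectedPayment` shapes are hypothetical, and treating the amount check as "at least the expected amount" is an assumption; the platform may require an exact match.

```typescript
// keccak256("Transfer(address,address,uint256)"), the ERC-20 Transfer event signature.
const TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef";

// Hypothetical minimal shape of a log entry from a transaction receipt.
interface TransferLog {
  address: string;   // contract that emitted the log
  topics: string[];  // [signature, from, to], each a 32-byte hex word
  data: string;      // uint256 amount as a 32-byte hex word
}

interface ExpectedPayment {
  tokenContract: string;
  recipient: string;
  minAmount: bigint; // in the token's smallest unit
}

// Indexed address topics are left-padded to 32 bytes; the address is the last 20.
function topicToAddress(topic: string): string {
  return "0x" + topic.slice(-40).toLowerCase();
}

// True only if the log is a Transfer from the expected token contract,
// to the expected recipient, for at least the expected amount.
function isExpectedTransfer(log: TransferLog, expected: ExpectedPayment): boolean {
  if (log.topics.length !== 3) return false;
  const [signature, , to] = log.topics;
  if (signature === undefined || to === undefined) return false;
  if (signature.toLowerCase() !== TRANSFER_TOPIC) return false;
  if (log.address.toLowerCase() !== expected.tokenContract.toLowerCase()) return false;
  if (topicToAddress(to) !== expected.recipient.toLowerCase()) return false;
  if (!/^0x[0-9a-fA-F]{64}$/.test(log.data)) return false;
  return BigInt(log.data) >= expected.minAmount;
}
```

A test suite would feed this function logs with a wrong recipient, a wrong token contract, and a short amount, and assert that each is rejected. A full verifier would also need to confirm transaction success and confirmation depth, which this sketch does not cover.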

5. Deferred Decisions Register

Every item deferred from the launch checklist is recorded here with an owner, risk statement, and decision deadline.

| # | Decision | Risk | Owner | Decision Deadline |
|---|---|---|---|---|
| D-1 | Move access/refresh tokens from localStorage to httpOnly cookies | XSS in any frontend dependency or user-generated content leads to full session hijack. Access token at 60 min expiry limits window, but refresh token at 30 days is high value. | SO (or BL if no SO) | Within 30 days post-launch |
| D-2 | Implement immutable funds ledger for new payments | Without a ledger, payment state is mutable and auditable only through application logs. Reconciliation depends on provider records. Overpayments, partial refunds, and fee calculations have no single source of truth. | BL | Phase 3 start (within 60 days post-launch) |
| D-3 | Build provider-neutral payment abstraction layer | Current SHKeeper coupling means changing providers requires modifying core business logic. Provider-specific metadata may become canonical state by accident. | BL | Phase 3 start (within 60 days post-launch) |
| D-4 | Implement real WebAuthn/passkey authentication | Passkeys remain disabled. Users limited to password + OAuth. No phishing-resistant second factor available. | BL | Within 90 days post-launch |
| D-5 | Device and session revocation | Users cannot revoke individual sessions. Compromised refresh token remains valid until natural expiry or password change. | BL | Within 60 days post-launch |
| D-6 | Admin step-up authentication for payouts and role changes | Admin with compromised session can approve payouts or escalate roles without additional verification. | CTO | Before platform processes real funds at volume |
| D-7 | Production staging pipeline (replace Watchtower auto-deploy on latest) | Unvalidated images promoted to production. No health check gate, no rollback automation. | DI | Within 30 days post-launch |
| D-8 | Frontend Docker image runtime configuration injection | Same image cannot be promoted across environments without rebuild. Increases risk of configuration drift or misbuilt production images. | FL | Within 45 days post-launch |
| D-9 | Webhook dead-letter queue and structured failure alerting | Failed webhooks are silently swallowed. Reconciliation depends on manual monitoring or provider retry behavior. | BL | Within 30 days post-launch |
| D-10 | Backend-core stack migration decision (Go, Kotlin, or remain TypeScript) | Continued npm supply-chain exposure for payment core. Express flexibility allows route-level exceptions to accumulate. Decision delayed until payment layer is stable and team capacity is assessed. | CTO | After Phase 3 stability milestone (target: 120 days post-launch) |
| D-11 | Append-only audit log for payment/payout/role-change operations | Payment actions are logged via ad-hoc logger calls, not a tamper-evident audit trail. Required for dispute resolution and regulatory confidence. | BL | Within 45 days post-launch |
| D-12 | ClamAV or equivalent virus scanning on user-uploaded files | Uploaded dispute evidence and attachments served to other users without content scanning. | DI | Within 60 days post-launch |

Governance

  • The accountable owner for each deferred item is responsible for tracking progress and raising blockers.
  • Items past their decision deadline without resolution escalate to CTO.
  • This register is reviewed at each engineering standup or weekly review until all items are resolved or reassigned.

Cross-references