nick-doc/09 - Audits/Security Ownership and Launch Decision Criteria.md
Siavash Sameni 4cf5c49274 docs(audit): align documentation with post-remediation backend reality
- Update data model enums to match backend models
- Update API reference auth requirements
- Add dispute module references and warning blocks
- Add 2026-05-24 audit remediation callout to Overview
- Generate task breakdowns and audit artifacts
- Add doc alignment report (.taskmaster/reports/)
2026-05-24 11:16:29 +04:00


---
title: Security Ownership and Launch Decision Criteria
tags: [audit, security, governance, launch, raci]
created: 2026-05-24
status: decision
---
# Security Ownership and Launch Decision Criteria
**Decision document.** Answers open questions 9 and 10 from [[Backend Stack Security and Refactor Assessment - 2026-05-24]]: who owns security decisions, and what must be true before public launch.

This document is binding for the Amanat platform launch cycle. Changes require written sign-off from the roles listed in Section 1.

---
## 1. Security Ownership RACI
Roles: **PO** = Product Owner, **BL** = Backend Lead, **DI** = DevOps/Infra, **FL** = Frontend Lead, **SO** = Security Owner (if designated), **CTO** = CTO/Leadership.
R = Responsible (does the work), A = Accountable (final decision authority), C = Consulted, I = Informed.
| Decision Area | PO | BL | DI | FL | SO | CTO |
|---|---|---|---|---|---|---|
| Authentication changes (token storage, session model, passkey scope) | I | R | C | C | A | I |
| Payment/funds changes (ledger, state machine, release/refund logic) | C | R | I | I | A | I |
| Provider integrations (SHKeeper, Request Network, new providers) | C | R | C | I | A | I |
| Webhook handling (signature verification, idempotency, DLQ) | I | R | C | I | A | I |
| Rate limiting (tiers, thresholds, enforcement points) | I | R | A | I | C | I |
| Admin access (role definitions, step-up auth, audit logging) | C | R | I | C | C | A |
| Dependency updates (lockfile policy, provenance, vulnerability triage) | I | R | A | C | C | I |
| Incident response (runbook ownership, escalation, postmortem) | I | C | R | I | A | I |
| Cross-cutting security architecture (service split, stack migration) | C | R | C | C | C | A |
| External penetration testing (scope, timing, vendor selection) | I | C | C | I | R | A |
### RACI rules
- If no Security Owner is designated, accountability for every row where **SO** holds **A** defaults to **CTO**.
- **BL** is responsible for all implementation work on backend security items. **FL** is responsible for frontend-side changes (cookie migration, CSP hardening, token storage) and is consulted on rows that affect the frontend.
- **DI** owns rate limiting configuration, dependency pipeline, and infrastructure-level controls.
- A role marked **A** must approve in writing (PR review, doc sign-off, or Slack confirmation logged in the decision register) before the change ships.
- Any role marked **R** or **A** can escalate to **CTO** for final arbitration.
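
The approval rule and the SO fallback above could be checked mechanically before a change ships. The following is a minimal sketch, not an existing tool: the `DecisionArea` keys, the `Approval` record, and both function names are hypothetical, and the `accountableFor` map simply copies the **A** entries from the table.

```typescript
// Roles from the RACI table above.
type Role = "PO" | "BL" | "DI" | "FL" | "SO" | "CTO";

// Hypothetical short keys for the ten decision areas in the table.
type DecisionArea =
  | "authentication" | "payments" | "providers" | "webhooks" | "rateLimiting"
  | "adminAccess" | "dependencies" | "incidentResponse" | "architecture" | "pentest";

// Accountable (A) role per area, copied from the table's A entries.
const accountableFor: Record<DecisionArea, Role> = {
  authentication: "SO",
  payments: "SO",
  providers: "SO",
  webhooks: "SO",
  rateLimiting: "DI",
  adminAccess: "CTO",
  dependencies: "DI",
  incidentResponse: "SO",
  architecture: "CTO",
  pentest: "CTO",
};

// Hypothetical shape of a written approval logged in the decision register.
interface Approval {
  role: Role;
  reference: string; // e.g. a PR review link or a sign-off note
}

// Applies the fallback rule: SO accountability moves to CTO when no SO is designated.
function resolveAccountable(area: DecisionArea, securityOwnerDesignated: boolean): Role {
  const role = accountableFor[area];
  return role === "SO" && !securityOwnerDesignated ? "CTO" : role;
}

// A change may ship only once the accountable role has a non-empty logged approval.
function mayShip(
  area: DecisionArea,
  approvals: Approval[],
  securityOwnerDesignated: boolean,
): boolean {
  const required = resolveAccountable(area, securityOwnerDesignated);
  return approvals.some((a) => a.role === required && a.reference.trim() !== "");
}
```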
---
## 2. Launch Safety Gate Checklist
Each item is classified as:
- **Required** -- blocks launch. Must be verified complete before any public-facing deployment.
- **Strongly Recommended** -- should block launch. Can be accepted with a documented risk entry (risk description, owner, remediation deadline) signed by the accountable role from Section 1.
- **Deferred** -- explicitly deferred to post-launch. Must appear in Section 5 (Deferred Decisions Register).
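
The three classifications above amount to a small decision rule per checklist item. A sketch of that rule follows, assuming a hypothetical `ChecklistItem` record; the field names are illustrative and do not come from any existing tracker.

```typescript
type Classification = "Required" | "Strongly Recommended" | "Deferred";

// Documented risk entry as described above: description, owner, deadline, signer.
interface RiskAcceptance {
  description: string;
  owner: string;
  remediationDeadline: string;
  signedBy: string; // accountable role from Section 1
}

// Hypothetical record for one row of the checklist.
interface ChecklistItem {
  id: string;
  classification: Classification;
  verified: boolean;
  riskAcceptance?: RiskAcceptance; // only meaningful for Strongly Recommended
  deferredRegisterId?: string;     // e.g. "D-5" for Deferred items
}

// Returns a human-readable reason for every item that still blocks launch.
function launchBlockers(items: ChecklistItem[]): string[] {
  const blockers: string[] = [];
  for (const item of items) {
    switch (item.classification) {
      case "Required":
        if (!item.verified) blockers.push(`${item.id}: required item not verified`);
        break;
      case "Strongly Recommended":
        if (!item.verified && item.riskAcceptance === undefined) {
          blockers.push(`${item.id}: neither completed nor risk-accepted`);
        }
        break;
      case "Deferred":
        if (item.deferredRegisterId === undefined) {
          blockers.push(`${item.id}: deferred without a register entry`);
        }
        break;
    }
  }
  return blockers;
}
```

An empty result means the gate passes; anything else lists what is still open.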
### 2.1 Authentication and Session Hardening
| # | Condition | Classification | Source |
|---|---|---|---|
| 2.1.1 | All financial endpoints require Bearer JWT authentication | Required | [[Platform Logical Audit - 2026-05-24]] item 3 |
| 2.1.2 | Ownership checks enforced on all `:userId` parameterized endpoints | Required | [[Platform Logical Audit - 2026-05-24]] item 3 |
| 2.1.3 | Admin role checks enforced on all admin routes | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] Phase 0 |
| 2.1.4 | Test/demo payment and email endpoints disabled or auth-protected in production | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] Phase 0 |
| 2.1.5 | Access token lifetime reduced to 60 minutes or less | Strongly Recommended | [[Platform Logical Audit - 2026-05-24]] item 10 |
| 2.1.6 | Refresh tokens moved to `httpOnly` cookies (or risk accepted with documented rationale) | Strongly Recommended | [[Security Architecture]] section 11 |
| 2.1.7 | Passkey/WebAuthn disabled in production until real cryptographic implementation is complete | Required | [[Platform Logical Audit - 2026-05-24]] item 2 |
| 2.1.8 | Passkey RP ID set to production domain (not `localhost`) | Required | [[Security Architecture]] section 2.3 |
| 2.1.9 | Device/session revocation functional | Deferred | Post-launch auth hardening |
### 2.2 Payment and Funds Integrity
| # | Condition | Classification | Source |
|---|---|---|---|
| 2.2.1 | Dispute creation enforces escrow hold (`disputed` state) that blocks release and refund | Required | [[Platform Logical Audit - 2026-05-24]] item 1 |
| 2.2.2 | Web3 verification decodes Transfer event and validates recipient, token contract, and amount | Required | [[Platform Logical Audit - 2026-05-24]] item 4 |
| 2.2.3 | Payment mutations route through centralized service methods only (no direct controller mutation) | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] Phase 0 |
| 2.2.4 | Release/refund eligibility enforced through escrow state, not controller-level flags | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] payment risks |
| 2.2.5 | Seller cannot update offer price after acceptance | Strongly Recommended | [[Platform Logical Audit - 2026-05-24]] item 18 |
| 2.2.6 | Immutable funds ledger operational for new payments | Deferred | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] Phase 2 |
| 2.2.7 | Provider-neutral payment abstraction layer | Deferred | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] Phase 2 |
| 2.2.8 | Payment state enums unified across data model, API, and flow documents | Required | [[Platform Logical Audit - 2026-05-24]] item 9 |
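
Items 2.2.1 and 2.2.4 both come down to one idea: release and refund eligibility is read from a transition table, and `disputed` has no direct path to either. The sketch below illustrates that idea only. Apart from `disputed`, every state and action name is hypothetical; the real state set belongs in the escrow state machine spec.

```typescript
// Only "disputed" is named in the checklist; the other states are assumptions.
type EscrowState = "funded" | "disputed" | "released" | "refunded";
type EscrowAction =
  | "release" | "refund" | "openDispute" | "resolveForSeller" | "resolveForBuyer";

// Single source of truth for which action is legal in which state.
const transitions: Record<EscrowState, Partial<Record<EscrowAction, EscrowState>>> = {
  funded: { release: "released", refund: "refunded", openDispute: "disputed" },
  // While disputed, plain release/refund are absent: only a resolution moves funds.
  disputed: { resolveForSeller: "released", resolveForBuyer: "refunded" },
  released: {},
  refunded: {},
};

function canApply(state: EscrowState, action: EscrowAction): boolean {
  return transitions[state][action] !== undefined;
}

// Returns the next state or throws, so callers cannot bypass the table with a flag.
function applyAction(state: EscrowState, action: EscrowAction): EscrowState {
  const next = transitions[state][action];
  if (next === undefined) {
    throw new Error(`Action "${action}" is not allowed in escrow state "${state}"`);
  }
  return next;
}
```

Under these assumptions, `applyAction("funded", "openDispute")` yields `"disputed"`, after which `applyAction("disputed", "release")` throws until the dispute is resolved.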
### 2.3 Authorization Enforcement
| # | Condition | Classification | Source |
|---|---|---|---|
| 2.3.1 | Every endpoint mapped to required role (public, authenticated, owner, admin) in authorization matrix | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] doc requirement 4 |
| 2.3.2 | `assertRole` or equivalent guard present in all admin and payment service methods | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] Phase 0 |
| 2.3.3 | Arbitrary `userId` from client no longer accepted for private data; server derives identity from JWT | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] Phase 0 |
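
Items 2.3.2 and 2.3.3 can be pictured as two small service-layer guards. This is a sketch under assumptions: `AuthContext` stands for whatever the backend extracts from a verified JWT, and `resolveTargetUserId` and `ForbiddenError` are hypothetical names. Only `assertRole` is named in the checklist, and its real signature may differ.

```typescript
type AppRole = "user" | "admin";

// Hypothetical identity object built from a verified JWT, never from the request body.
interface AuthContext {
  userId: string;
  roles: readonly AppRole[];
}

class ForbiddenError extends Error {
  readonly status = 403;
}

// Guard for admin and payment service methods (item 2.3.2).
function assertRole(ctx: AuthContext, role: AppRole): void {
  if (!ctx.roles.includes(role)) {
    throw new ForbiddenError(`Role "${role}" is required`);
  }
}

// Item 2.3.3: the server derives the target user from the token.
// A client-supplied ID is honoured only if it matches the caller or the caller is an admin.
function resolveTargetUserId(ctx: AuthContext, requestedUserId?: string): string {
  if (requestedUserId === undefined || requestedUserId === ctx.userId) {
    return ctx.userId;
  }
  assertRole(ctx, "admin");
  return requestedUserId;
}
```

Placing the guards in service methods rather than in individual controllers means a route added later cannot skip them by accident.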
### 2.4 Rate Limiting
| # | Condition | Classification | Source |
|---|---|---|---|
| 2.4.1 | Global rate limiting enabled | Required | [[Platform Logical Audit - 2026-05-24]] item 13 |
| 2.4.2 | Auth endpoints: 5 req/5 min/IP | Required | [[Security Architecture]] section 9 |
| 2.4.3 | Payment endpoints: 20 req/15 min/IP | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] Phase 0 |
| 2.4.4 | AI endpoints: 10 req/15 min/authenticated-user | Required | [[Platform Logical Audit - 2026-05-24]] item 3 |
| 2.4.5 | File upload endpoints: 10 req/15 min/authenticated-user | Strongly Recommended | -- |
| 2.4.6 | Delivery confirmation code: max 5 verification attempts per 15 min per request | Required | [[Platform Logical Audit - 2026-05-24]] item 8 |
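
The tiers above differ in limit, window, and key (IP, authenticated user, or request). A fixed-window counter is enough to illustrate that. The sketch below is in-memory and single-process, so it stands in for whatever rate-limiting middleware and shared store the backend actually uses; `FixedWindowLimiter` and the tier keys are hypothetical names, while the numbers are taken from the table.

```typescript
interface Tier {
  limit: number;    // max requests per window
  windowMs: number; // window length in milliseconds
}

type TierName = "auth" | "payment" | "ai" | "upload" | "deliveryCode";

const MINUTE = 60 * 1000;

// Thresholds copied from items 2.4.2 to 2.4.6.
const tiers: Record<TierName, Tier> = {
  auth: { limit: 5, windowMs: 5 * MINUTE },           // keyed by IP
  payment: { limit: 20, windowMs: 15 * MINUTE },      // keyed by IP
  ai: { limit: 10, windowMs: 15 * MINUTE },           // keyed by authenticated user
  upload: { limit: 10, windowMs: 15 * MINUTE },       // keyed by authenticated user
  deliveryCode: { limit: 5, windowMs: 15 * MINUTE },  // keyed by request ID
};

class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();

  // The clock is injectable so the behaviour can be tested without waiting.
  constructor(private readonly now: () => number = Date.now) {}

  // Returns true if the call is within the tier's limit for this key.
  allow(tier: TierName, key: string): boolean {
    const { limit, windowMs } = tiers[tier];
    const bucketKey = `${tier}:${key}`;
    const t = this.now();
    const current = this.windows.get(bucketKey);
    if (current === undefined || t - current.start >= windowMs) {
      this.windows.set(bucketKey, { start: t, count: 1 });
      return true;
    }
    if (current.count >= limit) return false;
    current.count += 1;
    return true;
  }
}
```

A fixed window allows short bursts at window boundaries; a sliding window or token bucket would be tighter if that matters for the auth tier.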
### 2.5 Webhook Security
| # | Condition | Classification | Source |
|---|---|---|---|
| 2.5.1 | SHKeeper webhook uses raw-body HMAC verification (not reconstructed JSON) | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] webhook risks |
| 2.5.2 | Webhook handler is idempotent (duplicate delivery = no-op) | Required | [[Security Architecture]] section 5 |
| 2.5.3 | Webhook returns proper HTTP codes: 400 for bad input, 500 for server error, 200 for success | Required | [[Platform Logical Audit - 2026-05-24]] item 11 |
| 2.5.4 | Webhook failures logged to dead-letter storage or alerting channel | Strongly Recommended | [[Platform Logical Audit - 2026-05-24]] item 11 |
| 2.5.5 | Provider callbacks create reconciliation events; do not directly release funds | Strongly Recommended | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] webhook risks |
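
Items 2.5.1 to 2.5.3 can be illustrated together. The sketch below assumes an HMAC-SHA256 signature sent as a hex string and a payload field named `eventId`; both are assumptions for illustration, and the actual SHKeeper header, algorithm, and payload shape must be taken from the provider's documentation. The in-memory `Set` stands in for a persistent deduplication store.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type Outcome = "processed" | "duplicate" | "rejected";

// Verifies the signature over the exact bytes received, before any JSON parsing.
function verifySignature(rawBody: Buffer, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Returns the HTTP status to send. Unexpected failures in the surrounding
// handler would map to 500 and are not modelled here.
function handleWebhook(
  rawBody: Buffer,
  signatureHex: string,
  secret: string,
  seenEventIds: Set<string>,
): { status: 200 | 400; outcome: Outcome } {
  if (!verifySignature(rawBody, signatureHex, secret)) {
    return { status: 400, outcome: "rejected" };
  }
  let eventId: unknown;
  try {
    eventId = JSON.parse(rawBody.toString("utf8")).eventId; // hypothetical field name
  } catch {
    return { status: 400, outcome: "rejected" };
  }
  if (typeof eventId !== "string") {
    return { status: 400, outcome: "rejected" };
  }
  if (seenEventIds.has(eventId)) {
    return { status: 200, outcome: "duplicate" }; // duplicate delivery is a no-op
  }
  seenEventIds.add(eventId);
  // A real handler would record a reconciliation event here (item 2.5.5),
  // not release funds directly.
  return { status: 200, outcome: "processed" };
}
```

Verifying against reconstructed JSON fails as soon as the provider's whitespace or key order differs from the re-serialized form, which is why the raw buffer has to be captured before the body parser runs.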
### 2.6 Socket.IO Authorization
| # | Condition | Classification | Source |
|---|---|---|---|
| 2.6.1 | Socket.IO room membership derived from authenticated socket identity, not client-supplied user IDs | Required | [[Platform Logical Audit - 2026-05-24]] item 12 |
| 2.6.2 | Socket handshake requires valid JWT | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] realtime risks |
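
The rule in 2.6.1 is easiest to see as a pure function: the set of rooms a socket may join is computed from its authenticated identity, and any room name sent by the client is only checked against that set. The sketch below leaves the Socket.IO wiring out; `SocketIdentity`, the `user:<id>` room naming, and the `admin` room are hypothetical.

```typescript
// Hypothetical identity attached to a socket after its handshake JWT is verified.
interface SocketIdentity {
  userId: string;
  roles: readonly string[];
}

// Rooms are derived from the identity, never from a client-supplied user ID.
function allowedRooms(identity: SocketIdentity): string[] {
  const rooms = [`user:${identity.userId}`];
  if (identity.roles.includes("admin")) rooms.push("admin");
  return rooms;
}

// A null identity models a handshake without a valid JWT (item 2.6.2).
function authorizeJoin(identity: SocketIdentity | null, requestedRoom: string): boolean {
  if (identity === null) return false;
  return allowedRooms(identity).includes(requestedRoom);
}
```

In a Socket.IO server this check would typically run in connection middleware and in any join handler, with the identity taken from the verified handshake token.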
### 2.7 Supply-Chain Controls
| # | Condition | Classification | Source |
|---|---|---|---|
| 2.7.1 | Lockfile reviewed and updated for known vulnerable packages (Multer <2.1.0, Axios compromise, TanStack compromise) | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] supply-chain risks |
| 2.7.2 | `npm audit` / `yarn audit` run and all high/critical CVEs triaged | Required | [[Security Architecture]] section 12 |
| 2.7.3 | CI install mode uses frozen lockfile | Strongly Recommended | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] doc requirement 10 |
| 2.7.4 | No test/demo routes in production builds | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] Phase 0 |
### 2.8 Monitoring, Alerting, and Runbooks
| # | Condition | Classification | Source |
|---|---|---|---|
| 2.8.1 | Backend error monitoring active (Sentry or equivalent with source maps) | Strongly Recommended | [[Security Architecture]] section 12 |
| 2.8.2 | Structured logging for payment state transitions (actor, target, before/after) | Strongly Recommended | [[Security Architecture]] section 10 |
| 2.8.3 | Runbook exists for: failed webhook, duplicate payment, stuck release, compromised admin, leaked API key | Strongly Recommended | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] doc requirement 11 |
| 2.8.4 | Alerting for: repeated webhook signature failures, unusual payment volume, admin actions on own disputes | Strongly Recommended | -- |
### 2.9 External Penetration Testing
| # | Condition | Classification | Source |
|---|---|---|---|
| 2.9.1 | External pentest of payment + dispute + auth flows completed before general public launch | Strongly Recommended | [[Security Architecture]] section 12, Open question 9 |
| 2.9.2 | Pentest findings triaged; all critical/high items resolved or risk-accepted before launch | Required (if pentest performed) | -- |
### 2.10 Infrastructure and Operations
| # | Condition | Classification | Source |
|---|---|---|---|
| 2.10.1 | All dev-seeded credentials rotated | Required | [[Security Architecture]] section 12 |
| 2.10.2 | `NODE_ENV=production` confirmed in production backend | Required | [[Security Architecture]] section 12 |
| 2.10.3 | `NEXT_PUBLIC_IS_DEVELOPMENT` and `ENABLE_DEBUG` disabled in production | Required | [[Security Architecture]] section 12 |
| 2.10.4 | Production Watchtower pinned to versioned tag (not `latest`) | Strongly Recommended | [[Platform Logical Audit - 2026-05-24]] item 27 |
| 2.10.5 | Committed or publicly visible secrets rotated | Required | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] Phase 0 |
---
## 3. Launch Priority Decision
**Decision: launch prioritizes immediate hardening of the current Node/Express stack. Backend-core redesign is deferred to post-launch.**
### Rationale
The audit findings in [[Backend Stack Security and Refactor Assessment - 2026-05-24]] and [[Platform Logical Audit - 2026-05-24]] identify the dominant risks as domain-level security failures, not framework-level weaknesses:
1. **The most dangerous issues are authorization and state-machine bugs**, not Node/Express itself. Unauthenticated financial endpoints, client-controlled socket room membership, missing dispute-escrow holds, and broken Web3 verification are independent of the backend language.
2. **A rewrite does not fix the core problems.** Moving to Go or Kotlin without first specifying the funds ledger, escrow state machine, and authorization matrix would transplant the same logic gaps into a new codebase. The audit explicitly states: "the larger issue is that the current backend mixes high-risk financial state transitions... in one Express application" -- but a rewrite that does not first solve the domain model problem is wasted effort.
3. **Hardening is faster.** The Phase 0 actions from [[Backend Stack Security and Refactor Assessment - 2026-05-24]] (disable unsafe routes, add auth checks, enable rate limiting, fix Web3 verification, fix Socket.IO auth) are discrete, testable tasks that can be completed in days, not months.
4. **The rewrite carries re-introduction risk.** The product has working business flows. A full or partial rewrite risks reintroducing escrow and payment bugs that have already been found and can be fixed in place.
### Concrete launch sequence
| Phase | Work | Timeline |
|---|---|---|
| **Phase 0: Containment** | Complete all Required items from Section 2 checklist. Disable unsafe routes, add auth/ownership enforcement, enable rate limiting, fix dispute-escrow hold, fix Web3 verification, fix Socket.IO auth, disable passkeys, rotate secrets. | Immediate |
| **Phase 1: Documentation** | Produce the 11 required documents listed in [[Backend Stack Security and Refactor Assessment - 2026-05-24]] (threat model, funds ledger spec, escrow state machine, authorization matrix, payment provider adapter spec, webhook security spec, session/auth architecture, realtime auth spec, migration plan, supply-chain policy, operational runbooks). | Parallel with Phase 0 |
| **Phase 2: Controlled launch** | Public launch proceeds once all Required checklist items pass verification. Strongly Recommended items are either completed or have documented risk acceptances. | After Phase 0 |
| **Phase 3: Payment/ledger extraction** | Build provider-neutral payment layer and immutable ledger. This is the first post-launch engineering priority. | Post-launch |
| **Phase 4: Core migration evaluation** | Decide on Go/Kotlin backend-core rewrite based on team capacity, Phase 3 outcomes, and operational experience. No migration begins until Phase 3 is stable. | Post-launch, after Phase 3 |
---
## 4. External Penetration Testing Decision
**Decision: yes, commission an external penetration test before general public launch.**
### Rationale
- Amanat is a financial escrow platform handling crypto payments. The attack surface includes webhook processing, payment state machines, Web3 transaction verification, and fund release flows. This is materially different from a typical web application.
- The audit identified critical findings (unauthenticated financial endpoints, Web3 verification bypass, dispute-escrow race condition) that an external tester would also find. An external pentest validates that the Phase 0 hardening actually closed these gaps.
- Evidence of supply-chain compromises in 2026 (Axios, TanStack, Express Multer) demonstrates an active threat to the npm ecosystem that the platform depends on.
### Timeline and scope
| Attribute | Value |
|---|---|
| **When** | After Phase 0 hardening is complete, before Phase 2 public launch |
| **Scope** | Payment flows (SHKeeper pay-in, Web3 verification, payout/release/refund), dispute/escrow state transitions, authentication (login, token refresh, OAuth, session management), admin operations, webhook handling, Socket.IO authorization |
| **Out of scope** | Marketplace browsing/listing, blog, points/leaderboard, file upload (assessed via code review instead) |
| **Depth** | Black-box or grey-box at tester's discretion, with access to API documentation and a funded test environment |
| **Deliverable** | Report with severity ratings, reproduction steps, and remediation recommendations. Findings mapped to checklist items in Section 2. |
| **Gate** | All critical and high findings must be resolved or risk-accepted (with CTO sign-off) before launch proceeds |
### If pentest is delayed or unavailable
If the external pentest cannot be scheduled before the desired launch date, the following compensating controls must be in place:
1. Complete internal code review of all payment, auth, and webhook code paths by someone other than the original author.
2. Automated security test suite covering: unauthenticated access denial on all financial endpoints, webhook signature rejection, dispute-escrow hold enforcement, Web3 verification with wrong recipient/amount, Socket.IO unauthorized room join.
3. Documented risk acceptance signed by CTO acknowledging that external validation was not performed.
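
As an illustration of what the automated suite in item 2 would exercise for "Web3 verification with wrong recipient/amount", the sketch below validates a single ERC-20 `Transfer` log against the expected token contract, recipient, and amount. The `RawLog` and `ExpectedPayment` shapes and the exact-amount rule are assumptions. It does not cover transaction status, confirmations, or replay of an already-credited transaction hash, which a complete check would also need.

```typescript
// keccak256("Transfer(address,address,uint256)"), the standard ERC-20 Transfer topic.
const TRANSFER_TOPIC =
  "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef";

// Hypothetical shape of one log entry from a transaction receipt.
interface RawLog {
  address: string;  // contract that emitted the event
  topics: string[]; // topic0 = event signature, then indexed arguments
  data: string;     // ABI-encoded non-indexed arguments (here: the amount)
}

interface ExpectedPayment {
  tokenContract: string;
  recipient: string;
  amount: bigint; // in the token's smallest unit
}

// True for a 0x-prefixed 32-byte hex word.
function isWord(value: string): boolean {
  return /^0x[0-9a-fA-F]{64}$/.test(value);
}

// An indexed address is left-padded to 32 bytes; keep the last 20 bytes.
function topicToAddress(topic: string): string {
  return "0x" + topic.slice(-40).toLowerCase();
}

function isExpectedTransfer(log: RawLog, expected: ExpectedPayment): boolean {
  if (log.topics.length !== 3) return false;
  if (!log.topics.every(isWord) || !isWord(log.data)) return false;
  if (log.topics[0].toLowerCase() !== TRANSFER_TOPIC) return false;
  // The event must come from the expected token contract, not any contract.
  if (log.address.toLowerCase() !== expected.tokenContract.toLowerCase()) return false;
  // topics[2] is the indexed "to" address.
  if (topicToAddress(log.topics[2]) !== expected.recipient.toLowerCase()) return false;
  return BigInt(log.data) === expected.amount;
}
```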
---
## 5. Deferred Decisions Register
Every item deferred from the launch checklist, along with other decisions postponed past launch, is recorded here with an owner, risk statement, and decision deadline.
| # | Decision | Risk | Owner | Decision Deadline |
|---|---|---|---|---|
| D-1 | Move access/refresh tokens from `localStorage` to `httpOnly` cookies | XSS in any frontend dependency or user-generated content can lead to full session hijack. The 60-minute access token lifetime limits the exposure window, but the 30-day refresh token remains a high-value target. | SO (or BL if no SO) | Within 30 days post-launch |
| D-2 | Implement immutable funds ledger for new payments | Without a ledger, payment state is mutable and auditable only through application logs. Reconciliation depends on provider records. Overpayments, partial refunds, and fee calculations have no single source of truth. | BL | Phase 3 start (within 60 days post-launch) |
| D-3 | Build provider-neutral payment abstraction layer | Current SHKeeper coupling means changing providers requires modifying core business logic. Provider-specific metadata may become canonical state by accident. | BL | Phase 3 start (within 60 days post-launch) |
| D-4 | Implement real WebAuthn/passkey authentication | Passkeys remain disabled. Users limited to password + OAuth. No phishing-resistant second factor available. | BL | Within 90 days post-launch |
| D-5 | Device and session revocation | Users cannot revoke individual sessions. Compromised refresh token remains valid until natural expiry or password change. | BL | Within 60 days post-launch |
| D-6 | Admin step-up authentication for payouts and role changes | Admin with compromised session can approve payouts or escalate roles without additional verification. | CTO | Before platform processes real funds at volume |
| D-7 | Production staging pipeline (replace Watchtower auto-deploy on `latest`) | Unvalidated images promoted to production. No health check gate, no rollback automation. | DI | Within 30 days post-launch |
| D-8 | Frontend Docker image runtime configuration injection | Same image cannot be promoted across environments without rebuild. Increases risk of configuration drift or misbuilt production images. | FL | Within 45 days post-launch |
| D-9 | Webhook dead-letter queue and structured failure alerting | Failed webhooks are silently swallowed. Reconciliation depends on manual monitoring or provider retry behavior. | BL | Within 30 days post-launch |
| D-10 | Backend-core stack migration decision (Go, Kotlin, or remain TypeScript) | Continued npm supply-chain exposure for payment core. Express flexibility allows route-level exceptions to accumulate. Decision delayed until payment layer is stable and team capacity is assessed. | CTO | After Phase 3 stability milestone (target: 120 days post-launch) |
| D-11 | Append-only audit log for payment/payout/role-change operations | Payment actions are logged via ad-hoc logger calls, not a tamper-evident audit trail. Required for dispute resolution and regulatory confidence. | BL | Within 45 days post-launch |
| D-12 | ClamAV or equivalent virus scanning on user-uploaded files | Uploaded dispute evidence and attachments served to other users without content scanning. | DI | Within 60 days post-launch |
### Governance
- The accountable owner for each deferred item is responsible for tracking progress and raising blockers.
- Items past their decision deadline without resolution escalate to CTO.
- This register is reviewed at each engineering standup or weekly review until all items are resolved or reassigned.
---
## Cross-references
- [[Backend Stack Security and Refactor Assessment - 2026-05-24]] -- primary audit, open questions 9 and 10
- [[Platform Logical Audit - 2026-05-24]] -- detailed findings referenced in checklist items
- [[Security Architecture]] -- current security architecture and pre-launch hardening checklist
- [[PRD - Platform Audit Remediation Plan (2026-05-24)]] -- tactical remediation plan (if available)