diff --git a/09 - Audits/Backend Stack Security and Refactor Assessment - 2026-05-24.md b/09 - Audits/Backend Stack Security and Refactor Assessment - 2026-05-24.md
new file mode 100644
index 0000000..175e2eb
--- /dev/null
+++ b/09 - Audits/Backend Stack Security and Refactor Assessment - 2026-05-24.md
@@ -0,0 +1,560 @@
+---
+title: Backend Stack Security and Refactor Assessment
+tags: [audit, security, backend, architecture, payments, refactor]
+created: 2026-05-24
+status: advisory
+---
+
+# Backend Stack Security and Refactor Assessment
+
+## Purpose
+
+This document records an advisory assessment of whether Amanat should keep the current Node/Express backend, harden it in place, or migrate at least the security-critical backend surface to another technology stack.
+
+The conclusion is intentionally strategic rather than implementation-heavy. It should be used as input for architecture review, security planning, and refactor scoping.
+
+## Executive summary
+
+Amanat is not a normal CRUD marketplace. It is a financial escrow platform with authentication, realtime communication, crypto payment intake, payout/release flows, provider webhooks, and dispute-sensitive fund movement.
+
+The main security risk is not simply "Node is insecure." The larger issue is that the current backend mixes high-risk financial state transitions, webhook handling, realtime room membership, admin operations, test/demo endpoints, and ordinary marketplace APIs in one Express application.
+
+Moving away from Node/Express may reduce npm supply-chain exposure and improve long-term auditability, but it will not automatically fix the most important risks. The immediate priority should be to define and enforce the correct security architecture:
+
+- A canonical funds ledger.
+- A strict escrow/payment/dispute state machine.
+- Centralized authorization and ownership checks.
+- Signed webhook handling with idempotency.
+- Server-derived realtime authorization.
+- Secure session handling.
+- A provider-neutral payment abstraction.
+
+Recommended approach:
+
+1. Harden the existing backend immediately.
+2. Define the target payment, ledger, and auth architecture in documentation.
+3. Extract or rewrite only the security-critical backend core if the team can support the new stack.
+4. Keep lower-risk marketplace, chat, notification, and dashboard APIs in TypeScript until the core is stable.
+
+Default recommendation: do not rewrite the entire backend at once. If a rewrite is chosen, start with payment/auth/escrow core services, preferably in Go or Kotlin/Java, while preserving current product behavior behind stable API contracts.
+
+## Current system profile
+
+Observed architecture:
+
+- Frontend: Next.js, React, MUI, Web3, Socket.IO client.
+- Backend: Express 5, TypeScript, Mongoose, Socket.IO, SHKeeper, Web3 transaction verification, SMTP, OpenAI integration.
+- Storage: MongoDB and Redis, though Redis is not consistently used as a shared state authority for all security-sensitive flows.
+- Realtime: Socket.IO rooms for user, buyer, seller, chat, and purchase-request updates.
+- Payments: SHKeeper pay-in, SHKeeper payout, decentralized/Web3 payment verification, manual/admin payout paths.
+- Docs: existing logical audit and remediation documents already identify several critical flaws.
+
+The backend currently acts as:
+
+- API server.
+- Realtime server.
+- Payment orchestrator.
+- Webhook processor.
+- Background-job runner.
+- File upload server.
+- Auth/session issuer.
+- Admin operations surface.
+
+That is too much responsibility in one process for a financial platform unless the architecture is very tightly controlled.
+
+## Code-backed security observations
+
+These findings are consistent with the existing audit docs and representative source review.
+
+### Payment and funds risks
+
+- Payment state is largely represented by mutable `Payment.status` and `escrowState` fields rather than an immutable funds ledger.
+- Pay-in, manual confirmation, wallet monitoring, webhook handling, and payout flows can converge on the same records through different paths.
+- Release/refund eligibility is not fully centralized around ledger invariants.
+- The existing docs identify a dispute/escrow race: disputes do not reliably create an enforceable hold before release.
+- `Payment` uses mixed/string-compatible references for some core links, reducing referential integrity and query safety.
+- Some payment mutation/history routes were exposed without sufficient authentication or ownership enforcement.
+- Web3 verification has been documented as relying primarily on transaction receipt success rather than strict token, recipient, and amount verification.
+
+Security implication: a backend stack change alone will not fix this. The platform needs a funds ledger and state machine first.
+
+### Authentication and session risks
+
+- Browser tokens are stored in `localStorage`, increasing impact from XSS.
+- Passkey/WebAuthn behavior is described in the audit docs as stubbed/incomplete, and challenge storage is process-local.
+- Refresh-token behavior differs between auth paths.
+- Admin-sensitive routes need explicit role enforcement, not just authentication.
+
+Security implication: migration should include a session architecture decision, not just a framework change.
+
+### Realtime risks
+
+- Socket.IO room joins are client-driven by IDs such as `join-user-room`, `join-buyer-room`, and `join-seller-room`.
+- The server should derive room membership from authenticated socket identity, not trust client-supplied user IDs.
+
+Security implication: realtime authorization needs to be treated like API authorization.
+
+### Rate limiting and abuse controls
+
+- Global rate limiting is explicitly disabled in the Express app.
+- Sensitive paths need tiered limits: auth, verification, file upload, AI, payment, webhook, chat.
+- AI endpoints and email endpoints can create cost or abuse exposure if not authenticated and rate-limited.
+
+Security implication: this is an immediate hardening task regardless of backend stack.
+
+### Webhook and provider risks
+
+- Webhooks must be verified using raw-body signatures, not reconstructed JSON when signatures depend on raw bytes.
+- Webhook delivery must be idempotent.
+- Unknown, duplicate, malformed, and failed webhooks should be visible in structured records or dead-letter storage.
+- Provider callbacks should create reconciliation events, not directly release funds.
+
+Security implication: payment provider integration should be isolated behind a provider-neutral service contract.
+
+### Supply-chain risks
+
+The Node/npm ecosystem has real and recurring supply-chain risk. For this codebase, that risk matters because both frontend and backend depend heavily on npm packages.
+
+Relevant 2026 context:
+
+- Express published February 2026 security releases, including high-severity Multer issues affecting versions before 2.1.0. The backend manifest currently specifies `multer: ^2.0.2`, so the resolved lockfile version should be reviewed and updated if necessary.
+- Node.js published March 2026 security releases across active release lines.
+- Microsoft reported an Axios npm supply-chain compromise in March 2026. This project uses Axios on frontend and backend.
+- TanStack published a May 2026 npm compromise postmortem. This project uses `@tanstack/react-query`.
+
+References:
+
+- Express security release, 2026-02-27: https://expressjs.com/2026/02/27/security-releases.html
+- Node.js March 2026 security releases: https://nodejs.org/en/blog/vulnerability/march-2026-security-releases
+- Microsoft on Axios npm supply-chain compromise: https://www.microsoft.com/en-us/security/blog/2026/04/01/mitigating-the-axios-npm-supply-chain-compromise/
+- TanStack npm supply-chain compromise postmortem: https://tanstack.com/blog/npm-supply-chain-compromise-postmortem
+
+Security implication: npm supply-chain controls are required even if the backend is rewritten, because the frontend remains npm-based.
+
+## Should the backend move away from Node/Express?
+
+### Reasons to keep and harden first
+
+- The product already exists and has working business flows.
+- A full rewrite risks reintroducing escrow/payment bugs.
+- The most dangerous issues are domain/state/authorization issues, not syntax or framework issues.
+- Hardening can reduce immediate exposure faster than a rewrite.
+- The team may currently be more productive in TypeScript.
+
+### Reasons to migrate at least the backend core
+
+- Financial backend code benefits from a smaller, stricter dependency footprint.
+- Payment, ledger, webhook, and payout flows need strong invariants and auditability.
+- Express makes it easy to accumulate route-level exceptions, test endpoints, and inconsistent middleware.
+- Node/npm supply-chain exposure is material and recurring.
+- TypeScript runtime enforcement is limited unless paired with strict schema validation everywhere.
+- A separate payment core can be more easily audited, threat-modeled, tested, and locked down.
+
+### Balanced conclusion
+
+From a security standpoint, it is reasonable to move the highest-risk backend core away from Node/Express, but only after the target security model is specified.
+
+Do not begin with a full product rewrite. Begin with a security-critical core extraction:
+
+- Auth/session/token authority.
+- Payment intent creation.
+- Provider webhook processing.
+- Funds ledger and reconciliation.
+- Release/refund/dispute-hold enforcement.
+- Admin payout approval and audit logging.
+
+Keep lower-risk modules in the current stack until the core is stable:
+
+- Marketplace browsing/listing.
+- Request templates.
+- Chat and notifications, after socket authorization fixes.
+- Admin dashboard reads.
+- File upload, after hardening or moving to object storage.
+
+## Stack options
+
+### Go
+
+Best fit if the team wants a smaller, operationally simple, security-focused payment core.
+
+Strengths:
+
+- Small binaries and deployment footprint.
+- Lower dependency surface than typical Node services.
+- Strong standard library for HTTP, crypto, JSON, and concurrency.
+- Good fit for webhook receivers, ledger services, workers, and reconciliation jobs.
+- Easy to run static analysis and produce reproducible builds.
+
+Weaknesses:
+
+- Less ergonomic than TypeScript for rapid product iteration.
+- Requires team comfort with Go idioms.
+- API/schema generation must be designed deliberately.
+
+Assessment: recommended first choice for a payment/ledger/auth core if the team can maintain Go.
+
+### Kotlin/Java with Spring Boot
+
+Best fit if the team wants enterprise-grade structure, mature auth patterns, and strong ecosystem support.
+
+Strengths:
+
+- Mature security and validation ecosystem.
+- Strong typing and tooling.
+- Good for complex domain services and audit-heavy systems.
+- Well-understood operational patterns.
+
+Weaknesses:
+
+- Heavier runtime and framework footprint.
+- More ceremony.
+- Slower iteration for a small team.
+
+Assessment: strong choice for a larger engineering team or enterprise-style compliance needs.
+
+### Rust
+
+Best fit if maximum memory safety and correctness are worth slower delivery.
+
+Strengths:
+
+- Strong compile-time safety.
+- Good for cryptographic and high-assurance components.
+- Very low runtime footprint.
+
+Weaknesses:
+
+- Higher implementation cost.
+- Smaller hiring pool.
+- Web API development may be slower.
+
+Assessment: attractive for narrow cryptographic or transaction-verification components, but probably too costly for the whole backend unless the team is already strong in Rust.
+
+### Python/FastAPI
+
+Best fit if rapid backend development and clean API typing are more important than strict compile-time guarantees.
+
+Strengths:
+
+- Fast development.
+- Good validation with Pydantic.
+- Good for admin tools and internal services.
+
+Weaknesses:
+
+- Supply-chain risk remains.
+- Runtime typing and async behavior require discipline.
+- Less compelling than Go/Kotlin for a financial core.
+
+Assessment: acceptable for internal services, not the preferred payment-core target.
+
+### Continue TypeScript/Node with stronger architecture
+
+Best fit if team capacity cannot support another backend language yet.
+
+Required conditions:
+
+- Strict route registration policy.
+- Runtime validation on every boundary.
+- No test/demo routes in production builds.
+- Full lockfile and package provenance controls.
+- Centralized auth, ownership, and role guards.
+- Ledger-first payment architecture.
+- Secure cookies or a documented token-storage risk acceptance.
+- Socket auth middleware.
+- Redis-backed challenge/idempotency/rate-limit storage.
+
+Assessment: viable short term, but the security bar must be raised significantly.
+
+## Recommended target architecture
+
+### Phase 0: Immediate containment
+
+Goal: reduce current high-risk exposure without broad redesign.
+
+Actions:
+
+- Disable or protect test/demo payment and email endpoints in production.
+- Require authentication and ownership checks on all payment, notification, AI, and file routes.
+- Re-enable rate limiting with stricter limits on auth, payment, AI, file upload, and webhook paths.
+- Add admin role checks to admin routes.
+- Stop accepting arbitrary `userId` from clients for private data.
+- Validate all payment mutations through centralized service methods.
+- Lock Socket.IO room membership to server-verified identity.
+- Review and update lockfiles for known vulnerable packages.
+- Rotate any committed or publicly visible secrets.
+
+### Phase 1: Architecture specification
+
+Goal: define the new security model before implementation.
+
+Documents to produce are listed in the "Required documentation before refactor" section below.
+
+### Phase 2: Payment and ledger extraction
+
+Goal: move funds logic behind a provider-neutral service.
+
+Introduce:
+
+- `FundsAccount`
+- `LedgerEntry`
+- `FundsBalance`
+- `PaymentIntent`
+- `PaymentProviderEvent`
+- `ReleaseInstruction`
+- `RefundInstruction`
+- `DisputeHold`
+
+Key rule: provider webhooks do not directly release funds. They create verified events and ledger entries.
+
+### Phase 3: Backend-core rewrite or service split
+
+Goal: decide whether the extracted core remains TypeScript or moves to Go/Kotlin.
+
+Recommended split:
+
+- `core-payments`: payment intent, webhook, ledger, release/refund, reconciliation.
+- `core-auth`: sessions, passkeys, OAuth, token issuance, session revocation.
+- `marketplace-api`: purchase requests, offers, categories, templates.
+- `realtime-api`: chat, notifications, socket rooms.
+
+The split can be logical first, physical later.
+
+### Phase 4: Full migration only if justified
+
+Goal: avoid rewriting stable lower-risk product surfaces prematurely.
+
+Only consider full backend migration after:
+
+- Payment core is stable.
+- Auth/session model is stable.
+- API contracts are documented and tested.
+- Legacy payment records are migrated or safely read-only.
+- Team has demonstrated production maintenance ability in the new stack.
+
+## Required documentation before refactor
+
+### 1. Threat Model
+
+Purpose: identify what must be protected and how it can be attacked.
+
+Should include:
+
+- Assets: user accounts, admin accounts, wallet addresses, payment records, funds, webhook secrets, API keys, private notifications.
+- Actors: buyer, seller, admin, support, unauthenticated attacker, compromised user, compromised admin, provider, malicious webhook sender.
+- Trust boundaries: browser, backend, database, Redis, provider APIs, wallet/RPC, admin UI, Socket.IO.
+- Abuse cases: fake payment proof, replayed webhook, arbitrary room join, stolen token, double payout, dispute bypass, email/AI abuse.
+
+### 2. Funds Ledger Specification
+
+Purpose: make money movement auditable and provider-independent.
+
+Should define:
+
+- Account model per purchase request/order.
+- Immutable ledger entry types.
+- Derived balance model.
+- Gross amount, provider fees, platform fees, held amount, disputed amount, releasable amount, released amount, refunded amount.
+- Idempotency keys.
+- Reconciliation behavior.
+
+### 3. Escrow State Machine
+
+Purpose: define legal transitions once.
+
+Should include:
+
+- Purchase request states.
+- Payment states.
+- Escrow/funds states.
+- Dispute states.
+- Valid transitions and forbidden transitions.
+- Who or what can trigger each transition.
+- Required preconditions for release, refund, cancellation, dispute hold, and admin override.
+
+### 4. Authorization Matrix
+
+Purpose: remove route-by-route ambiguity.
+
+Should map every endpoint and socket event to:
+
+- Public, authenticated, owner, seller, buyer, admin, support, or service role.
+- Required ownership checks.
+- Required object state.
+- Rate-limit tier.
+- Audit-log requirement.
+
+### 5. Payment Provider Adapter Spec
+
+Purpose: decouple business logic from SHKeeper, Request Network, manual wallet flow, and future providers.
+
+Should define:
+
+- `createPayInIntent`
+- `getPayInStatus`
+- `handleProviderWebhook`
+- `createHostedPaymentLink`
+- `createReleaseInstruction`
+- `createRefundInstruction`
+- `getPayoutStatus`
+- `searchProviderPayments`
+
+Provider-specific metadata should be namespaced and never become the canonical funds state.
+
+### 6. Webhook Security Spec
+
+Purpose: prevent forged, replayed, or silently failed provider events.
+
+Should define:
+
+- Raw-body signature verification.
+- Accepted headers and algorithms.
+- Replay prevention.
+- Delivery ID/idempotency handling.
+- Unknown payment behavior.
+- Duplicate event behavior.
+- Retry semantics.
+- Dead-letter/replay storage.
+- Alerting thresholds.
+
+### 7. Session and Auth Architecture
+
+Purpose: decide how browser sessions should work for a financial platform.
+
+Should define:
+
+- Access token lifetime.
+- Refresh token lifetime and rotation.
+- Whether tokens move from `localStorage` to `httpOnly` cookies.
+- CSRF strategy if cookies are used.
+- Passkey/WebAuthn implementation requirements.
+- OAuth requirements.
+- Device/session revocation.
+- Admin step-up authentication for payouts or role changes.
+
+### 8. Realtime Authorization Spec
+
+Purpose: make Socket.IO events subject to the same security model as REST.
+
+Should define:
+
+- Socket handshake authentication.
+- Server-derived room membership.
+- Which rooms exist.
+- Who may join each room.
+- Whether room membership changes with request/payment/dispute state.
+- Event payload privacy rules.
+
+### 9. Migration Plan
+
+Purpose: avoid breaking current payments and historical records.
+
+Should include:
+
+- SHKeeper legacy read path.
+- New provider feature flag.
+- Ledger backfill strategy.
+- Data validation report before enforcement.
+- Rollback criteria.
+- Cutover date for old webhook routes.
+- Operator manual reconciliation workflow.
+
+### 10. Secure Build and Supply-Chain Policy
+
+Purpose: reduce npm and dependency compromise risk.
+
+Should define:
+
+- Package manager and lockfile policy.
+- CI install mode.
+- Dependency update cadence.
+- Security advisory monitoring.
+- npm provenance/signature policy where available.
+- Secrets handling.
+- Production build reproducibility.
+- Separation of frontend npm risk from backend core risk.
+
+### 11. Operational Runbooks
+
+Purpose: make security incidents and payment failures survivable.
+
+Should include:
+
+- Failed webhook.
+- Duplicate payment.
+- Missing payment.
+- Stuck release.
+- Disputed release attempt.
+- Compromised admin.
+- Leaked API key.
+- Provider outage.
+- Chain/RPC outage.
+- Suspicious payment proof.
+- npm/package compromise.
+
+## Decision framework
+
+Use the following questions before choosing a rewrite:
+
+- Is the current goal safe launch, or long-term platform rebuild?
+- Is the team willing to delay feature work for a payment-core redesign?
+- Can the team maintain Go/Kotlin/Rust in production?
+- Is the biggest current risk supply chain, or incorrect money movement?
+- Are admin actions trusted, or should high-risk actions require step-up approval?
+- Should Amanat custody funds, or should the provider/payment network hold or route them?
+- Are disputes central to the product, or rare manual exceptions?
+- Is auditability a regulatory/business requirement or only an internal safety goal?
+
+## Recommended decision
+
+Near term:
+
+- Harden the current Express backend.
+- Disable unsafe production routes.
+- Add centralized authorization and rate limiting.
+- Fix Web3 verification.
+- Fix Socket.IO authorization.
+- Disable passkeys unless implemented with real WebAuthn.
+- Begin ledger/state-machine documentation immediately.
+
+Medium term:
+
+- Build a provider-neutral payment and funds layer.
+- Add immutable ledger entries.
+- Move release/refund/dispute-hold checks into the central payment/funds service.
+- Keep SHKeeper compatibility read-only for legacy records.
+- Add Request Network or another provider behind the adapter if desired.
+
+Long term:
+
+- Rewrite the payment/auth/escrow core in Go or Kotlin/Java if the team can support it.
+- Do not rewrite the entire backend until the core is proven.
+- Keep lower-risk modules in TypeScript until there is a business or operational reason to migrate them.
+
+## Open questions for leadership and engineering
+
+1. Is the launch timeline more important than a full payment/funds redesign?
+2. Should passkeys be removed from launch scope until production-grade WebAuthn is implemented?
+3. Should browser auth move to `httpOnly` cookies even if that requires CSRF work and frontend changes?
+4. Should every payout require admin step-up authentication or two-person approval?
+5. Should Amanat keep funds in a platform-controlled escrow wallet, or should provider-mediated payment pages become the default?
+6. Is Request Network a desired provider migration, or just one option being explored?
+7. What new backend stack can the team realistically operate for the next two years?
+8. What is the acceptable level of temporary dual-stack complexity during migration?
+9. Do we need formal external penetration testing before public launch?
+10. Who owns security decisions: product, backend, DevOps, or a dedicated security owner?
+
+## Relationship to existing docs
+
+This assessment complements:
+
+- [[Platform Logical Audit - 2026-05-24]]
+- [[PRD - Platform Audit Remediation Plan (2026-05-24)]]
+- [[PRD - Request Network Migration and Funds Management]]
+- [[Security Architecture]]
+- [[Payment Flow - SHKeeper]]
+- [[Payment Flow - DePay & Web3]]
+- [[Escrow Flow]]
+- [[Dispute Flow]]
+
+The existing remediation PRD is the tactical hardening plan. This document is the strategic backend-stack and refactor assessment.
diff --git a/README.md b/README.md
index 1579f63..838cd74 100644
--- a/README.md
+++ b/README.md
@@ -152,6 +152,7 @@ For engineers / SREs running the system in production.
 |---|---|
 | **Payments** | [[Payment Flow - SHKeeper]] → [[Payment API]] → [[Payment]] → [[Payout Flow]] |
 | **Auth** | [[Authentication Flow]] → [[Authentication API]] → [[Security Architecture]] |
+| **Backend security / refactor** | [[Backend Stack Security and Refactor Assessment - 2026-05-24]] → [[Platform Logical Audit - 2026-05-24]] → [[PRD - Platform Audit Remediation Plan (2026-05-24)]] |
 | **Real-time** | [[Real-time Layer]] → [[Socket Events]] → [[Chat Flow]] / [[Notification Flow]] |
 | **Disputes** | [[Dispute Flow]] → [[Dispute API]] → [[Dispute]] → [[Admin Guide]] §5 |
 | **Web3** | [[Payment Flow - DePay & Web3]] → [[Frontend Architecture]] §9 |