docs: add backend security refactor assessment

Siavash Sameni
2026-05-24 08:43:01 +04:00
parent fbc13c5128
commit 10a6c2fa53
2 changed files with 561 additions and 0 deletions

@@ -0,0 +1,560 @@
---
title: Backend Stack Security and Refactor Assessment
tags: [audit, security, backend, architecture, payments, refactor]
created: 2026-05-24
status: advisory
---
# Backend Stack Security and Refactor Assessment
## Purpose
This document records an advisory assessment of whether Amanat should keep the current Node/Express backend, harden it in place, or migrate at least the security-critical backend surface to another technology stack.
The conclusion is intentionally strategic rather than implementation-heavy. It should be used as input for architecture review, security planning, and refactor scoping.
## Executive summary
Amanat is not a normal CRUD marketplace. It is a financial escrow platform with authentication, realtime communication, crypto payment intake, payout/release flows, provider webhooks, and dispute-sensitive fund movement.
The main security risk is not simply "Node is insecure." The larger issue is that the current backend mixes high-risk financial state transitions, webhook handling, realtime room membership, admin operations, test/demo endpoints, and ordinary marketplace APIs in one Express application.
Moving away from Node/Express may reduce npm supply-chain exposure and improve long-term auditability, but it will not automatically fix the most important risks. The immediate priority should be to define and enforce the correct security architecture:
- A canonical funds ledger.
- A strict escrow/payment/dispute state machine.
- Centralized authorization and ownership checks.
- Signed webhook handling with idempotency.
- Server-derived realtime authorization.
- Secure session handling.
- A provider-neutral payment abstraction.
Recommended approach:
1. Harden the existing backend immediately.
2. Define the target payment, ledger, and auth architecture in documentation.
3. Extract or rewrite only the security-critical backend core if the team can support the new stack.
4. Keep lower-risk marketplace, chat, notification, and dashboard APIs in TypeScript until the core is stable.
Default recommendation: do not rewrite the entire backend at once. If a rewrite is chosen, start with payment/auth/escrow core services, preferably in Go or Kotlin/Java, while preserving current product behavior behind stable API contracts.
## Current system profile
Observed architecture:
- Frontend: Next.js, React, MUI, Web3, Socket.IO client.
- Backend: Express 5, TypeScript, Mongoose, Socket.IO, SHKeeper, Web3 transaction verification, SMTP, OpenAI integration.
- Storage: MongoDB and Redis, though Redis is not consistently used as a shared state authority for all security-sensitive flows.
- Realtime: Socket.IO rooms for user, buyer, seller, chat, and purchase-request updates.
- Payments: SHKeeper pay-in, SHKeeper payout, decentralized/Web3 payment verification, manual/admin payout paths.
- Docs: existing logical audit and remediation documents already identify several critical flaws.
The backend currently acts as:
- API server.
- Realtime server.
- Payment orchestrator.
- Webhook processor.
- Background-job runner.
- File upload server.
- Auth/session issuer.
- Admin operations surface.
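That is too much responsibility for one process on a financial platform: a flaw in any one surface, such as file upload or a test route, runs with the same access as the payment logic. It is acceptable only if the architecture is very tightly controlled.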
## Code-backed security observations
These findings are consistent with the existing audit docs and with a representative review of the source.
### Payment and funds risks
- Payment state is largely represented by mutable `Payment.status` and `escrowState` fields rather than an immutable funds ledger.
- Pay-in, manual confirmation, wallet monitoring, webhook handling, and payout flows can converge on the same records through different paths.
- Release/refund eligibility is not fully centralized around ledger invariants.
- The existing docs identify a dispute/escrow race: disputes do not reliably create an enforceable hold before release.
- `Payment` uses mixed/string-compatible references for some core links, reducing referential integrity and query safety.
- Some payment mutation/history routes were exposed without sufficient authentication or ownership enforcement.
- Web3 verification has been documented as relying primarily on transaction receipt success rather than strict token, recipient, and amount verification.
Security implication: a backend stack change alone will not fix this. The platform needs a funds ledger and state machine first.
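The Web3 observation in the list above can be made concrete. The following is a minimal sketch, assuming the payment is an ERC-20 transfer and that the RPC client's receipt has been mapped into the hypothetical `TxReceipt` shape defined here. The names `verifyErc20Payment`, `TxReceipt`, and `ExpectedPayment` are illustrative and not taken from the codebase. It does not cover confirmation depth or reuse of an already-credited transaction hash, which a real verifier would also need.

```typescript
// keccak256("Transfer(address,address,uint256)"): the standard ERC-20 Transfer event signature.
const ERC20_TRANSFER_TOPIC =
  "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef";

// Hypothetical, library-neutral receipt shape (assumption, not a real client type).
interface ReceiptLog {
  address: string;  // contract that emitted the log
  topics: string[]; // [0] = event signature, [1] = from, [2] = to
  data: string;     // 0x-prefixed 32-byte amount for Transfer
}

interface TxReceipt {
  succeeded: boolean;
  logs: ReceiptLog[];
}

interface ExpectedPayment {
  token: string;     // expected token contract address
  recipient: string; // expected escrow/recipient address
  minAmount: bigint; // minimum amount in the token's smallest unit
}

// Indexed address topics are left-padded to 32 bytes; keep the last 20 bytes.
function topicToAddress(topic: string): string {
  return "0x" + topic.slice(-40).toLowerCase();
}

function verifyErc20Payment(receipt: TxReceipt, expected: ExpectedPayment): boolean {
  // Receipt success is necessary but not sufficient.
  if (!receipt.succeeded) return false;
  let received = BigInt(0);
  for (const log of receipt.logs) {
    const [signature, , toTopic] = log.topics;
    if (log.topics.length !== 3 || signature === undefined || toTopic === undefined) continue;
    if (signature.toLowerCase() !== ERC20_TRANSFER_TOPIC) continue;
    // Only count transfers emitted by the expected token contract...
    if (log.address.toLowerCase() !== expected.token.toLowerCase()) continue;
    // ...that pay the expected recipient...
    if (topicToAddress(toTopic) !== expected.recipient.toLowerCase()) continue;
    // ...with a well-formed 32-byte amount.
    if (!/^0x[0-9a-fA-F]{64}$/.test(log.data)) continue;
    received += BigInt(log.data);
  }
  return received >= expected.minAmount;
}
```

A successful receipt for a transfer of the wrong token, or to a different recipient, is rejected here, which is the gap the audit docs describe.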
### Authentication and session risks
- Browser tokens are stored in `localStorage`, increasing impact from XSS.
- Passkey/WebAuthn behavior is described in the audit docs as stubbed/incomplete, and challenge storage is process-local, so challenges do not survive a restart and are not shared across instances.
- Refresh-token behavior differs between auth paths.
- Admin-sensitive routes need explicit role enforcement, not just authentication.
Security implication: migration should include a session architecture decision, not just a framework change.
### Realtime risks
- Socket.IO room joins are client-driven by IDs such as `join-user-room`, `join-buyer-room`, and `join-seller-room`.
- The server should derive room membership from authenticated socket identity, not trust client-supplied user IDs.
Security implication: realtime authorization needs to be treated like API authorization.
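As a rough illustration of server-derived membership, the sketch below computes the rooms a socket may occupy from its authenticated identity alone. The room-name format (`user:<id>` and so on) and the `SocketIdentity` shape are assumptions made for this example. In Socket.IO, logic like this would typically run after a connection middleware has verified the token, rather than in response to a client-supplied ID.

```typescript
// Hypothetical identity attached to a socket after the handshake token is verified.
interface SocketIdentity {
  userId: string;
  roles: ReadonlyArray<"buyer" | "seller" | "admin">;
}

// Rooms are derived from the verified identity, never from event payloads.
function roomsFor(identity: SocketIdentity): string[] {
  const rooms = [`user:${identity.userId}`];
  if (identity.roles.includes("buyer")) rooms.push(`buyer:${identity.userId}`);
  if (identity.roles.includes("seller")) rooms.push(`seller:${identity.userId}`);
  return rooms;
}

// If a client-driven join event is kept for compatibility, it can only
// succeed for rooms the server would have assigned anyway.
function mayJoin(identity: SocketIdentity, requestedRoom: string): boolean {
  return roomsFor(identity).includes(requestedRoom);
}
```

With this shape, a request to join `user:someone-else` fails regardless of what ID the client sends, because the requested room is compared against the server's own derivation.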
### Rate limiting and abuse controls
- Global rate limiting is explicitly disabled in the Express app.
- Sensitive paths need tiered limits: auth, verification, file upload, AI, payment, webhook, chat.
- AI endpoints and email endpoints can create cost or abuse exposure if not authenticated and rate-limited.
Security implication: this is an immediate hardening task regardless of backend stack.
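One way to picture tiered limits is a small fixed-window counter keyed by tier and caller. This is a sketch only: the tier names follow the list above, but the numeric limits are placeholders, and the in-memory `Map` is an assumption made to keep the example self-contained. A multi-instance deployment would need a shared store such as Redis instead.

```typescript
type Tier = "auth" | "payment" | "ai" | "upload" | "webhook" | "chat";

interface TierLimit {
  windowMs: number;
  max: number;
}

// Placeholder numbers for illustration; real values need product and abuse analysis.
const TIER_LIMITS: Record<Tier, TierLimit> = {
  auth: { windowMs: 60000, max: 5 },
  payment: { windowMs: 60000, max: 10 },
  ai: { windowMs: 60000, max: 3 },
  upload: { windowMs: 60000, max: 10 },
  webhook: { windowMs: 1000, max: 50 },
  chat: { windowMs: 10000, max: 30 },
};

class FixedWindowLimiter {
  private readonly limits: Record<Tier, TierLimit>;
  // Process-local state (assumption for the sketch; not suitable for multiple instances).
  private readonly windows = new Map<string, { startedAt: number; count: number }>();

  constructor(limits: Record<Tier, TierLimit>) {
    this.limits = limits;
  }

  // `key` is whatever identifies the caller for this tier (user ID, IP, API key).
  allow(tier: Tier, key: string, nowMs: number): boolean {
    const { windowMs, max } = this.limits[tier];
    const id = `${tier}:${key}`;
    const current = this.windows.get(id);
    // Start a fresh window if none exists or the previous one has elapsed.
    if (!current || nowMs - current.startedAt >= windowMs) {
      this.windows.set(id, { startedAt: nowMs, count: 1 });
      return true;
    }
    if (current.count >= max) return false;
    current.count += 1;
    return true;
  }
}
```

Passing the clock in as `nowMs` keeps the limiter deterministic under test. Each tier and key pair is counted independently, so exhausting the `auth` budget does not affect `payment` calls.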
### Webhook and provider risks
- Webhook signatures must be verified against the raw request body, not against JSON re-serialized after parsing, because re-serialization can change the bytes the signature covers.
- Webhook delivery must be idempotent.
- Unknown, duplicate, malformed, and failed webhooks should be visible in structured records or dead-letter storage.
- Provider callbacks should create reconciliation events, not directly release funds.
Security implication: payment provider integration should be isolated behind a provider-neutral service contract.
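The raw-body point can be shown with a short sketch. It assumes a provider that signs the request body with HMAC-SHA256 and sends the hex digest in a header. That scheme is an assumption for illustration and is not a statement about how SHKeeper or any other provider actually signs callbacks. `verifySignature` is a hypothetical helper name.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed scheme: hex(HMAC-SHA256(secret, rawBody)). Check the provider's docs for the real one.
function verifySignature(rawBody: Buffer, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on unequal lengths, so reject those first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

The function takes the bytes exactly as received. If the body is parsed and then re-serialized with `JSON.stringify` before verification, whitespace and key formatting can differ from what the provider signed, and a valid delivery would fail the check. In Express this usually means capturing the raw buffer on the webhook route before any JSON body parser runs.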
### Supply-chain risks
The Node/npm ecosystem has real and recurring supply-chain risk. For this codebase, that risk matters because both frontend and backend depend heavily on npm packages.
Relevant 2026 context:
- Express published February 2026 security releases, including high-severity Multer issues affecting versions before 2.1.0. The backend manifest currently specifies `multer: ^2.0.2`, a caret range that permits 2.1.0 but does not guarantee it, so the resolved lockfile version should be checked and updated if it is below 2.1.0.
- Node.js published March 2026 security releases across active release lines.
- Microsoft reported an Axios npm supply-chain compromise in March 2026. This project uses Axios on frontend and backend.
- TanStack published a May 2026 npm compromise postmortem. This project uses `@tanstack/react-query`.
References:
- Express security release, 2026-02-27: https://expressjs.com/2026/02/27/security-releases.html
- Node.js March 2026 security releases: https://nodejs.org/en/blog/vulnerability/march-2026-security-releases
- Microsoft on Axios npm supply-chain compromise: https://www.microsoft.com/en-us/security/blog/2026/04/01/mitigating-the-axios-npm-supply-chain-compromise/
- TanStack npm supply-chain compromise postmortem: https://tanstack.com/blog/npm-supply-chain-compromise-postmortem
Security implication: npm supply-chain controls are required even if the backend is rewritten, because the frontend remains npm-based.
## Should the backend move away from Node/Express?
### Reasons to keep and harden first
- The product already exists and has working business flows.
- A full rewrite risks reintroducing escrow/payment bugs.
- The most dangerous issues are domain/state/authorization issues, not syntax or framework issues.
- Hardening can reduce immediate exposure faster than a rewrite.
- The team may currently be more productive in TypeScript.
### Reasons to migrate at least the backend core
- Financial backend code benefits from a smaller, stricter dependency footprint.
- Payment, ledger, webhook, and payout flows need strong invariants and auditability.
- Express makes it easy to accumulate route-level exceptions, test endpoints, and inconsistent middleware.
- Node/npm supply-chain exposure is material and recurring.
- TypeScript runtime enforcement is limited unless paired with strict schema validation everywhere.
- A separate payment core can be more easily audited, threat-modeled, tested, and locked down.
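The point above about TypeScript runtime enforcement is worth a small example: static types vanish at runtime, so a request body typed as an interface can still carry any shape. The sketch below validates one boundary by hand. The field names, the 24-character hex ID format, and the currency pattern are all assumptions for illustration. A schema library could do the same job.

```typescript
interface CreatePaymentIntentInput {
  purchaseRequestId: string;
  amountMinor: number;
  currency: string;
}

type ParseResult<T> = { ok: true; value: T } | { ok: false; error: string };

// Accepts `unknown` on purpose: nothing about the body is trusted until checked.
function parseCreatePaymentIntent(body: unknown): ParseResult<CreatePaymentIntentInput> {
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    return { ok: false, error: "body must be an object" };
  }
  const { purchaseRequestId, amountMinor, currency } = body as Record<string, unknown>;
  // Assumed ID format: 24 lowercase hex characters.
  if (typeof purchaseRequestId !== "string" || !/^[a-f0-9]{24}$/.test(purchaseRequestId)) {
    return { ok: false, error: "purchaseRequestId must be a 24-character hex string" };
  }
  if (typeof amountMinor !== "number" || !Number.isSafeInteger(amountMinor) || amountMinor <= 0) {
    return { ok: false, error: "amountMinor must be a positive integer" };
  }
  if (typeof currency !== "string" || !/^[A-Z]{3,10}$/.test(currency)) {
    return { ok: false, error: "currency must be an uppercase code" };
  }
  // Only whitelisted fields are copied; extra properties are dropped.
  return { ok: true, value: { purchaseRequestId, amountMinor, currency } };
}
```

Requiring a plain string for the ID also rejects object-valued input such as `{ "$gt": "" }`, which matters when IDs are later passed into database queries.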
### Balanced conclusion
From a security standpoint, it is reasonable to move the highest-risk backend core away from Node/Express, but only after the target security model is specified.
Do not begin with a full product rewrite. Begin with a security-critical core extraction:
- Auth/session/token authority.
- Payment intent creation.
- Provider webhook processing.
- Funds ledger and reconciliation.
- Release/refund/dispute-hold enforcement.
- Admin payout approval and audit logging.
Keep lower-risk modules in the current stack until the core is stable:
- Marketplace browsing/listing.
- Request templates.
- Chat and notifications, after socket authorization fixes.
- Admin dashboard reads.
- File upload, after hardening or moving to object storage.
## Stack options
### Go
Best fit if the team wants a smaller, operationally simple, security-focused payment core.
Strengths:
- Small binaries and deployment footprint.
- Lower dependency surface than typical Node services.
- Strong standard library for HTTP, crypto, JSON, and concurrency.
- Good fit for webhook receivers, ledger services, workers, and reconciliation jobs.
- Easy to run static analysis and produce reproducible builds.
Weaknesses:
- Less ergonomic than TypeScript for rapid product iteration.
- Requires team comfort with Go idioms.
- API/schema generation must be designed deliberately.
Assessment: recommended first choice for a payment/ledger/auth core if the team can maintain Go.
### Kotlin/Java with Spring Boot
Best fit if the team wants enterprise-grade structure, mature auth patterns, and strong ecosystem support.
Strengths:
- Mature security and validation ecosystem.
- Strong typing and tooling.
- Good for complex domain services and audit-heavy systems.
- Well-understood operational patterns.
Weaknesses:
- Heavier runtime and framework footprint.
- More ceremony.
- Slower iteration for a small team.
Assessment: strong choice for a larger engineering team or enterprise-style compliance needs.
### Rust
Best fit if maximum memory safety and correctness are worth slower delivery.
Strengths:
- Strong compile-time safety.
- Good for cryptographic and high-assurance components.
- Very low runtime footprint.
Weaknesses:
- Higher implementation cost.
- Smaller hiring pool.
- Web API development may be slower.
Assessment: attractive for narrow cryptographic or transaction-verification components, but probably too costly for the whole backend unless the team is already strong in Rust.
### Python/FastAPI
Best fit if rapid backend development and clean API typing are more important than strict compile-time guarantees.
Strengths:
- Fast development.
- Good validation with Pydantic.
- Good for admin tools and internal services.
Weaknesses:
- Supply-chain risk remains.
- Runtime typing and async behavior require discipline.
- Less compelling than Go/Kotlin for a financial core.
Assessment: acceptable for internal services, not the preferred payment-core target.
### Continue TypeScript/Node with stronger architecture
Best fit if team capacity cannot support another backend language yet.
Required conditions:
- Strict route registration policy.
- Runtime validation on every boundary.
- No test/demo routes in production builds.
- Full lockfile and package provenance controls.
- Centralized auth, ownership, and role guards.
- Ledger-first payment architecture.
- Secure cookies or a documented token-storage risk acceptance.
- Socket auth middleware.
- Redis-backed challenge/idempotency/rate-limit storage.
Assessment: viable short term, but the security bar must be raised significantly.
## Recommended target architecture
### Phase 0: Immediate containment
Goal: reduce current high-risk exposure without broad redesign.
Actions:
- Disable or protect test/demo payment and email endpoints in production.
- Require authentication and ownership checks on all payment, notification, AI, and file routes.
- Re-enable rate limiting with stricter limits on auth, payment, AI, file upload, and webhook paths.
- Add admin role checks to admin routes.
- Stop accepting arbitrary `userId` from clients for private data.
- Validate all payment mutations through centralized service methods.
- Lock Socket.IO room membership to server-verified identity.
- Review and update lockfiles for known vulnerable packages.
- Rotate any committed or publicly visible secrets.
### Phase 1: Architecture specification
Goal: define the new security model before implementation.
Documents to produce are listed in the "Required documentation before refactor" section below.
### Phase 2: Payment and ledger extraction
Goal: move funds logic behind a provider-neutral service.
Introduce:
- `FundsAccount`
- `LedgerEntry`
- `FundsBalance`
- `PaymentIntent`
- `PaymentProviderEvent`
- `ReleaseInstruction`
- `RefundInstruction`
- `DisputeHold`
Key rule: provider webhooks do not directly release funds. They create verified events and ledger entries.
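The key rule can be sketched as two separate code paths. In this illustration a provider event can only ever produce a deposit-type ledger entry, and release is a different operation gated on dispute holds. The type names reuse the entities listed above, but their fields are assumptions, as are the function names.

```typescript
// Field layouts below are hypothetical; only the entity names come from the plan.
interface PaymentProviderEvent {
  provider: string;
  eventId: string;
  accountId: string;
  type: "payin_confirmed" | "payin_failed";
  amountMinor: number;
  verified: boolean; // set only after signature verification succeeded
}

interface LedgerEntry {
  idempotencyKey: string;
  accountId: string;
  kind: "deposit";
  amountMinor: number;
}

interface DisputeHold {
  accountId: string;
  active: boolean;
}

// Webhook path: a verified pay-in becomes a ledger entry. There is deliberately
// no branch here that can produce a release or payout.
function ledgerEntryFromEvent(event: PaymentProviderEvent): LedgerEntry | null {
  if (!event.verified || event.type !== "payin_confirmed") return null;
  return {
    idempotencyKey: `${event.provider}:${event.eventId}`,
    accountId: event.accountId,
    kind: "deposit",
    amountMinor: event.amountMinor,
  };
}

// Release path: checked separately, and blocked while any hold is active.
function mayCreateReleaseInstruction(accountId: string, holds: readonly DisputeHold[]): boolean {
  return !holds.some((hold) => hold.accountId === accountId && hold.active);
}
```

Deriving the idempotency key from the provider name and event ID means a redelivered event maps to the same ledger entry rather than a second deposit.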
### Phase 3: Backend-core rewrite or service split
Goal: decide whether the extracted core remains TypeScript or moves to Go/Kotlin.
Recommended split:
- `core-payments`: payment intent, webhook, ledger, release/refund, reconciliation.
- `core-auth`: sessions, passkeys, OAuth, token issuance, session revocation.
- `marketplace-api`: purchase requests, offers, categories, templates.
- `realtime-api`: chat, notifications, socket rooms.
The split can be logical first, physical later.
### Phase 4: Full migration only if justified
Goal: avoid rewriting stable lower-risk product surfaces prematurely.
Only consider full backend migration after:
- Payment core is stable.
- Auth/session model is stable.
- API contracts are documented and tested.
- Legacy payment records are migrated or safely read-only.
- Team has demonstrated production maintenance ability in the new stack.
## Required documentation before refactor
### 1. Threat Model
Purpose: identify what must be protected and how it can be attacked.
Should include:
- Assets: user accounts, admin accounts, wallet addresses, payment records, funds, webhook secrets, API keys, private notifications.
- Actors: buyer, seller, admin, support, unauthenticated attacker, compromised user, compromised admin, provider, malicious webhook sender.
- Trust boundaries: browser, backend, database, Redis, provider APIs, wallet/RPC, admin UI, Socket.IO.
- Abuse cases: fake payment proof, replayed webhook, arbitrary room join, stolen token, double payout, dispute bypass, email/AI abuse.
### 2. Funds Ledger Specification
Purpose: make money movement auditable and provider-independent.
Should define:
- Account model per purchase request/order.
- Immutable ledger entry types.
- Derived balance model.
- Gross amount, provider fees, platform fees, held amount, disputed amount, releasable amount, released amount, refunded amount.
- Idempotency keys.
- Reconciliation behavior.
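To make the "immutable entries, derived balance" idea concrete, here is a minimal sketch. The entry kinds and the releasable formula are assumptions that the specification itself would need to settle. Amounts use `number` in minor units for brevity, whereas on-chain amounts would likely need `bigint` or a decimal type.

```typescript
// Hypothetical entry kinds; the real set belongs in the specification.
type EntryKind = "deposit" | "fee" | "hold" | "hold_lifted" | "release" | "refund";

interface LedgerEntry {
  readonly idempotencyKey: string;
  readonly accountId: string;
  readonly kind: EntryKind;
  readonly amountMinor: number; // always positive; the kind carries the direction
}

interface Balance {
  gross: number;
  fees: number;
  held: number;
  released: number;
  refunded: number;
  releasable: number;
}

// Append-only: returns a new ledger, or the same one if the key was already recorded.
function appendEntry(ledger: readonly LedgerEntry[], entry: LedgerEntry): readonly LedgerEntry[] {
  if (!Number.isSafeInteger(entry.amountMinor) || entry.amountMinor <= 0) {
    throw new Error("amountMinor must be a positive integer");
  }
  if (ledger.some((e) => e.idempotencyKey === entry.idempotencyKey)) return ledger;
  return [...ledger, Object.freeze({ ...entry })];
}

// The balance is never stored; it is recomputed from the entries.
function deriveBalance(ledger: readonly LedgerEntry[], accountId: string): Balance {
  const b: Balance = { gross: 0, fees: 0, held: 0, released: 0, refunded: 0, releasable: 0 };
  for (const e of ledger) {
    if (e.accountId !== accountId) continue;
    switch (e.kind) {
      case "deposit": b.gross += e.amountMinor; break;
      case "fee": b.fees += e.amountMinor; break;
      case "hold": b.held += e.amountMinor; break;
      case "hold_lifted": b.held -= e.amountMinor; break;
      case "release": b.released += e.amountMinor; break;
      case "refund": b.refunded += e.amountMinor; break;
    }
  }
  // Assumed formula: what remains after fees, holds, and prior outflows.
  b.releasable = Math.max(0, b.gross - b.fees - b.held - b.released - b.refunded);
  return b;
}
```

Because corrections are new entries rather than edits, the history of an order's funds can be replayed at any time and compared against provider records during reconciliation.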
### 3. Escrow State Machine
Purpose: define the legal transitions once, in a single place.
Should include:
- Purchase request states.
- Payment states.
- Escrow/funds states.
- Dispute states.
- Valid transitions and forbidden transitions.
- Who or what can trigger each transition.
- Required preconditions for release, refund, cancellation, dispute hold, and admin override.
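A transition table is one compact way to express "valid transitions, and who may trigger them". The states, actors, and permitted edges below are invented for illustration and are not the product's actual model. The point is the shape: anything not listed is forbidden.

```typescript
// Hypothetical states and actors; the real ones belong in the specification.
type EscrowState = "awaiting_payment" | "funded" | "disputed" | "released" | "refunded" | "cancelled";
type Actor = "buyer" | "seller" | "admin" | "system";

// from -> to -> actors allowed to trigger that transition.
const TRANSITIONS: Record<EscrowState, Partial<Record<EscrowState, readonly Actor[]>>> = {
  awaiting_payment: { funded: ["system"], cancelled: ["buyer", "admin"] },
  funded: { released: ["buyer", "admin"], disputed: ["buyer", "seller"], refunded: ["admin"] },
  disputed: { released: ["admin"], refunded: ["admin"] },
  released: {},
  refunded: {},
  cancelled: {},
};

// Single choke point: every state change goes through this function.
function transition(from: EscrowState, to: EscrowState, actor: Actor): EscrowState {
  const allowedActors = TRANSITIONS[from][to];
  if (!allowedActors || !allowedActors.includes(actor)) {
    throw new Error(`forbidden transition ${from} -> ${to} by ${actor}`);
  }
  return to;
}
```

In this example table, once an order is `disputed` only an admin can move it onward, so a buyer-triggered release racing a dispute is rejected instead of silently succeeding. Preconditions such as ledger balance checks would sit alongside the actor check.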
### 4. Authorization Matrix
Purpose: remove route-by-route ambiguity.
Should map every endpoint and socket event to:
- Public, authenticated, owner, seller, buyer, admin, support, or service role.
- Required ownership checks.
- Required object state.
- Rate-limit tier.
- Audit-log requirement.
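The matrix can live as data that a single guard consults, rather than as per-route conditionals. The sketch below assumes hypothetical route keys and a simplified set of access levels. The real matrix would cover every endpoint and socket event and would include the required-object-state column as well.

```typescript
type Access = "public" | "authenticated" | "owner" | "admin";

interface MatrixRule {
  access: Access;
  rateTier: string;
  audit: boolean;
}

// Route keys are illustrative, not the application's real paths.
const AUTH_MATRIX: Record<string, MatrixRule> = {
  "GET /categories": { access: "public", rateTier: "default", audit: false },
  "GET /payments/:id": { access: "owner", rateTier: "payment", audit: false },
  "POST /admin/payouts": { access: "admin", rateTier: "payment", audit: true },
  "socket:chat-message": { access: "authenticated", rateTier: "chat", audit: false },
};

interface Caller {
  userId: string;
  role: "user" | "admin";
}

// `ownerIds` comes from the loaded resource (e.g. buyer and seller of a payment),
// never from the request body.
function authorize(routeKey: string, caller: Caller | null, ownerIds: readonly string[]): boolean {
  const rule = AUTH_MATRIX[routeKey];
  // Deny by default: a route missing from the matrix is not reachable.
  if (!rule) return false;
  switch (rule.access) {
    case "public": return true;
    case "authenticated": return caller !== null;
    case "owner": return caller !== null && ownerIds.includes(caller.userId);
    case "admin": return caller !== null && caller.role === "admin";
  }
}
```

The deny-by-default branch is the design choice that matters most here: a newly added route stays closed until someone gives it an explicit row in the matrix.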
### 5. Payment Provider Adapter Spec
Purpose: decouple business logic from SHKeeper, Request Network, manual wallet flow, and future providers.
Should define:
- `createPayInIntent`
- `getPayInStatus`
- `handleProviderWebhook`
- `createHostedPaymentLink`
- `createReleaseInstruction`
- `createRefundInstruction`
- `getPayoutStatus`
- `searchProviderPayments`
Provider-specific metadata should be namespaced and never become the canonical funds state.
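A partial sketch of what such an adapter contract might look like follows. It covers only `createPayInIntent` and `getPayInStatus`. The parameter and return shapes are assumptions, and the remaining methods from the list would be specified in the same style. The in-memory fake and the registry are hypothetical helpers for illustration, not existing code.

```typescript
// Canonical statuses are owned by the platform, not by any provider.
type PayInStatus = "pending" | "confirmed" | "failed";

interface PayInIntent {
  intentId: string;
  provider: string;
  status: PayInStatus;
  amountMinor: number;
  // Namespaced by provider name; informational only, never the funds state.
  providerMeta: Record<string, unknown>;
}

// Assumed signatures for two of the listed methods.
interface PaymentProviderAdapter {
  readonly name: string;
  createPayInIntent(input: { accountId: string; amountMinor: number }): Promise<PayInIntent>;
  getPayInStatus(intentId: string): Promise<PayInStatus>;
}

// Test double standing in for a real provider integration.
class InMemoryFakeAdapter implements PaymentProviderAdapter {
  readonly name = "fake";
  private readonly intents = new Map<string, PayInIntent>();

  async createPayInIntent(input: { accountId: string; amountMinor: number }): Promise<PayInIntent> {
    const intent: PayInIntent = {
      intentId: `fake-${this.intents.size + 1}`,
      provider: this.name,
      status: "pending",
      amountMinor: input.amountMinor,
      providerMeta: { fake: { accountId: input.accountId } },
    };
    this.intents.set(intent.intentId, intent);
    return intent;
  }

  async getPayInStatus(intentId: string): Promise<PayInStatus> {
    const intent = this.intents.get(intentId);
    if (!intent) throw new Error(`unknown intent ${intentId}`);
    return intent.status;
  }
}

// Business logic asks the registry for an adapter by name and sees only the interface.
class AdapterRegistry {
  private readonly adapters = new Map<string, PaymentProviderAdapter>();

  register(adapter: PaymentProviderAdapter): void {
    this.adapters.set(adapter.name, adapter);
  }

  get(name: string): PaymentProviderAdapter {
    const adapter = this.adapters.get(name);
    if (!adapter) throw new Error(`no adapter registered for ${name}`);
    return adapter;
  }
}
```

A fake adapter like this also lets the ledger and state-machine logic be tested without calling any external provider.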
### 6. Webhook Security Spec
Purpose: prevent forged, replayed, or silently failed provider events.
Should define:
- Raw-body signature verification.
- Accepted headers and algorithms.
- Replay prevention.
- Delivery ID/idempotency handling.
- Unknown payment behavior.
- Duplicate event behavior.
- Retry semantics.
- Dead-letter/replay storage.
- Alerting thresholds.
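Several items in this list (replay prevention, delivery ID handling, unknown-payment behavior, dead-letter storage) can be pictured as one intake step that classifies each delivery. The sketch assumes the signature has already been verified and that the provider includes a signed timestamp and a unique delivery ID. Both are assumptions about the provider. The in-memory sets stand in for durable shared storage.

```typescript
type IntakeOutcome = "accepted" | "duplicate" | "stale" | "unknown_payment";

// Hypothetical normalized delivery, built after signature verification.
interface Delivery {
  deliveryId: string;
  paymentId: string;
  timestampMs: number;
}

class WebhookIntake {
  private readonly seen = new Set<string>();
  private readonly knownPayments: ReadonlySet<string>;
  private readonly toleranceMs: number;
  // Rejected deliveries are kept for inspection and replay, not dropped.
  readonly deadLetters: Array<{ delivery: Delivery; reason: IntakeOutcome }> = [];

  constructor(knownPayments: ReadonlySet<string>, toleranceMs: number) {
    this.knownPayments = knownPayments;
    this.toleranceMs = toleranceMs;
  }

  receive(delivery: Delivery, nowMs: number): IntakeOutcome {
    // Replay prevention: reject deliveries outside the accepted clock window.
    if (Math.abs(nowMs - delivery.timestampMs) > this.toleranceMs) {
      return this.deadLetter(delivery, "stale");
    }
    // Idempotency: acknowledge a redelivery without processing it again.
    if (this.seen.has(delivery.deliveryId)) return "duplicate";
    this.seen.add(delivery.deliveryId);
    if (!this.knownPayments.has(delivery.paymentId)) {
      return this.deadLetter(delivery, "unknown_payment");
    }
    return "accepted";
  }

  private deadLetter(delivery: Delivery, reason: IntakeOutcome): IntakeOutcome {
    this.deadLetters.push({ delivery, reason });
    return reason;
  }
}
```

Returning a named outcome for every delivery, instead of throwing or ignoring, gives the alerting thresholds in the list something to count.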
### 7. Session and Auth Architecture
Purpose: decide how browser sessions should work for a financial platform.
Should define:
- Access token lifetime.
- Refresh token lifetime and rotation.
- Whether tokens move from `localStorage` to `httpOnly` cookies.
- CSRF strategy if cookies are used.
- Passkey/WebAuthn implementation requirements.
- OAuth requirements.
- Device/session revocation.
- Admin step-up authentication for payouts or role changes.
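For the refresh-token item, one commonly used pattern is rotation with reuse detection: each refresh token is single-use, and presenting an already-used token revokes the whole token family. The sketch below illustrates that pattern only and is not a description of the current implementation. `RefreshTokenStore` is a hypothetical name, and a real store would be durable and would keep a hash of each token rather than the token itself.

```typescript
import { randomBytes } from "node:crypto";

interface RefreshRecord {
  userId: string;
  familyId: string; // all tokens descended from one login share a family
  used: boolean;
}

type RotateResult =
  | { ok: true; refreshToken: string }
  | { ok: false; reason: "unknown" | "revoked" | "reuse_detected" };

class RefreshTokenStore {
  // In-memory and keyed by raw token for brevity (assumption for the sketch).
  private readonly records = new Map<string, RefreshRecord>();
  private readonly revokedFamilies = new Set<string>();

  // Starts a new family at login, or continues one during rotation.
  issue(userId: string, familyId: string = randomBytes(16).toString("hex")): string {
    const token = randomBytes(32).toString("hex");
    this.records.set(token, { userId, familyId, used: false });
    return token;
  }

  rotate(presented: string): RotateResult {
    const record = this.records.get(presented);
    if (!record) return { ok: false, reason: "unknown" };
    if (this.revokedFamilies.has(record.familyId)) return { ok: false, reason: "revoked" };
    if (record.used) {
      // A used token showing up again suggests theft: revoke every descendant.
      this.revokedFamilies.add(record.familyId);
      return { ok: false, reason: "reuse_detected" };
    }
    record.used = true;
    return { ok: true, refreshToken: this.issue(record.userId, record.familyId) };
  }
}
```

The family revocation is what supports the device/session revocation item as well: revoking a session becomes adding its family ID to the revoked set.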
### 8. Realtime Authorization Spec
Purpose: make Socket.IO events subject to the same security model as REST.
Should define:
- Socket handshake authentication.
- Server-derived room membership.
- Which rooms exist.
- Who may join each room.
- Whether room membership changes with request/payment/dispute state.
- Event payload privacy rules.
### 9. Migration Plan
Purpose: avoid breaking current payments and historical records.
Should include:
- SHKeeper legacy read path.
- New provider feature flag.
- Ledger backfill strategy.
- Data validation report before enforcement.
- Rollback criteria.
- Cutover date for old webhook routes.
- Operator manual reconciliation workflow.
### 10. Secure Build and Supply-Chain Policy
Purpose: reduce npm and dependency compromise risk.
Should define:
- Package manager and lockfile policy.
- CI install mode.
- Dependency update cadence.
- Security advisory monitoring.
- npm provenance/signature policy where available.
- Secrets handling.
- Production build reproducibility.
- Separation of frontend npm risk from backend core risk.
### 11. Operational Runbooks
Purpose: make security incidents and payment failures survivable.
Should include:
- Failed webhook.
- Duplicate payment.
- Missing payment.
- Stuck release.
- Disputed release attempt.
- Compromised admin.
- Leaked API key.
- Provider outage.
- Chain/RPC outage.
- Suspicious payment proof.
- npm/package compromise.
## Decision framework
Use the following questions before choosing a rewrite:
- Is the current goal a safe launch or a long-term platform rebuild?
- Is the team willing to delay feature work for a payment-core redesign?
- Can the team maintain Go/Kotlin/Rust in production?
- Is the biggest current risk supply chain, or incorrect money movement?
- Are admin actions trusted, or should high-risk actions require step-up approval?
- Should Amanat custody funds, or should the provider/payment network hold or route them?
- Are disputes central to the product, or rare manual exceptions?
- Is auditability a regulatory/business requirement or only an internal safety goal?
## Recommended decision
Near term:
- Harden the current Express backend.
- Disable unsafe production routes.
- Add centralized authorization and rate limiting.
- Fix Web3 verification.
- Fix Socket.IO authorization.
- Disable passkeys unless implemented with real WebAuthn.
- Begin ledger/state-machine documentation immediately.
Medium term:
- Build a provider-neutral payment and funds layer.
- Add immutable ledger entries.
- Move release/refund/dispute-hold checks into the central payment/funds service.
- Keep SHKeeper compatibility read-only for legacy records.
- Add Request Network or another provider behind the adapter if desired.
Long term:
- Rewrite the payment/auth/escrow core in Go or Kotlin/Java if the team can support it.
- Do not rewrite the entire backend until the core is proven.
- Keep lower-risk modules in TypeScript until there is a business or operational reason to migrate them.
## Open questions for leadership and engineering
1. Is the launch timeline more important than a full payment/funds redesign?
2. Should passkeys be removed from launch scope until production-grade WebAuthn is implemented?
3. Should browser auth move to `httpOnly` cookies even if that requires CSRF work and frontend changes?
4. Should every payout require admin step-up authentication or two-person approval?
5. Should Amanat keep funds in a platform-controlled escrow wallet, or should provider-mediated payment pages become the default?
6. Is Request Network a desired provider migration, or just one option being explored?
7. What new backend stack can the team realistically operate for the next two years?
8. What is the acceptable level of temporary dual-stack complexity during migration?
9. Do we need formal external penetration testing before public launch?
10. Who owns security decisions: product, backend, DevOps, or a dedicated security owner?
## Relationship to existing docs
This assessment complements:
- [[Platform Logical Audit - 2026-05-24]]
- [[PRD - Platform Audit Remediation Plan (2026-05-24)]]
- [[PRD - Request Network Migration and Funds Management]]
- [[Security Architecture]]
- [[Payment Flow - SHKeeper]]
- [[Payment Flow - DePay & Web3]]
- [[Escrow Flow]]
- [[Dispute Flow]]
The existing remediation PRD is the tactical hardening plan. This document is the strategic backend-stack and refactor assessment.