docs: add backend security refactor assessment

2026-05-24 08:43:01 +04:00
parent fbc13c5128
commit 10a6c2fa53
2 changed files with 561 additions and 0 deletions
--- a/Audits/Backend
+++ b/Audits/Backend
@@ -0,0 +1,560 @@
+---
+title: Backend Stack Security and Refactor Assessment
+tags: [audit, security, backend, architecture, payments, refactor]
+created: 2026-05-24
+status: advisory
+---
+
+# Backend Stack Security and Refactor Assessment
+
+## Purpose
+
+This document records an advisory assessment of whether Amanat should keep the current Node/Express backend, harden it in place, or migrate at least the security-critical backend surface to another technology stack.
+
+The conclusion is intentionally strategic rather than implementation-heavy. It should be used as input for architecture review, security planning, and refactor scoping.
+
+## Executive summary
+
+Amanat is not a normal CRUD marketplace. It is a financial escrow platform with authentication, realtime communication, crypto payment intake, payout/release flows, provider webhooks, and dispute-sensitive fund movement.
+
+The main security risk is not simply "Node is insecure." The larger issue is that the current backend mixes high-risk financial state transitions, webhook handling, realtime room membership, admin operations, test/demo endpoints, and ordinary marketplace APIs in one Express application.
+
+Moving away from Node/Express may reduce npm supply-chain exposure and improve long-term auditability, but it will not automatically fix the most important risks. The immediate priority should be to define and enforce the correct security architecture:
+
+- A canonical funds ledger.
+- A strict escrow/payment/dispute state machine.
+- Centralized authorization and ownership checks.
+- Signed webhook handling with idempotency.
+- Server-derived realtime authorization.
+- Secure session handling.
+- A provider-neutral payment abstraction.
+
+Recommended approach:
+
+1. Harden the existing backend immediately.
+2. Define the target payment, ledger, and auth architecture in documentation.
+3. Extract or rewrite only the security-critical backend core if the team can support the new stack.
+4. Keep lower-risk marketplace, chat, notification, and dashboard APIs in TypeScript until the core is stable.
+
+Default recommendation: do not rewrite the entire backend at once. If a rewrite is chosen, start with payment/auth/escrow core services, preferably in Go or Kotlin/Java, while preserving current product behavior behind stable API contracts.
+
+## Current system profile
+
+Observed architecture:
+
+- Frontend: Next.js, React, MUI, Web3, Socket.IO client.
+- Backend: Express 5, TypeScript, Mongoose, Socket.IO, SHKeeper, Web3 transaction verification, SMTP, OpenAI integration.
+- Storage: MongoDB and Redis, though Redis is not consistently used as a shared state authority for all security-sensitive flows.
+- Realtime: Socket.IO rooms for user, buyer, seller, chat, and purchase-request updates.
+- Payments: SHKeeper pay-in, SHKeeper payout, decentralized/Web3 payment verification, manual/admin payout paths.
+- Docs: existing logical audit and remediation documents already identify several critical flaws.
+
+The backend currently acts as:
+
+- API server.
+- Realtime server.
+- Payment orchestrator.
+- Webhook processor.
+- Background-job runner.
+- File upload server.
+- Auth/session issuer.
+- Admin operations surface.
+
+That is too much responsibility in one process for a financial platform unless the architecture is very tightly controlled.
+
+## Code-backed security observations
+
+These findings are consistent with the existing audit docs and with a representative review of the source.
+
+### Payment and funds risks
+
+- Payment state is largely represented by mutable `Payment.status` and `escrowState` fields rather than an immutable funds ledger.
+- Pay-in, manual confirmation, wallet monitoring, webhook handling, and payout flows can converge on the same records through different paths.
+- Release/refund eligibility is not fully centralized around ledger invariants.
+- The existing docs identify a dispute/escrow race: disputes do not reliably create an enforceable hold before release.
+- `Payment` uses mixed/string-compatible references for some core links, reducing referential integrity and query safety.
+- Some payment mutation/history routes were exposed without sufficient authentication or ownership enforcement.
+- Web3 verification has been documented as relying primarily on transaction receipt success rather than strict token, recipient, and amount verification.
+
+Security implication: a backend stack change alone will not fix this. The platform needs a funds ledger and state machine first.
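
A minimal sketch of what strict token, recipient, and amount verification could look like, assuming the ERC-20 `Transfer` log has already been decoded elsewhere. The `DecodedTransfer` and `ExpectedPayment` shapes, the `verifyTransfer` function, and the confirmation threshold are all hypothetical names introduced for illustration, not part of the current codebase.

```typescript
// Hypothetical shape: a decoded ERC-20 Transfer log plus receipt metadata.
interface DecodedTransfer {
  receiptSucceeded: boolean; // transaction receipt status
  tokenContract: string;     // address of the contract that emitted the log
  to: string;                // recipient of the transfer
  amount: bigint;            // amount in the token's smallest unit
  confirmations: number;
}

// Hypothetical shape: what the platform expects for one payment.
interface ExpectedPayment {
  tokenContract: string;
  recipient: string;
  amount: bigint;
  minConfirmations: number;
}

type VerificationResult = { ok: true } | { ok: false; reason: string };

// Hex addresses are compared case-insensitively.
const sameAddress = (a: string, b: string): boolean =>
  a.toLowerCase() === b.toLowerCase();

// Receipt success is only the first check; token, recipient, amount,
// and confirmation depth must all match before a payment is accepted.
function verifyTransfer(t: DecodedTransfer, e: ExpectedPayment): VerificationResult {
  if (!t.receiptSucceeded) return { ok: false, reason: "receipt-failed" };
  if (!sameAddress(t.tokenContract, e.tokenContract)) return { ok: false, reason: "wrong-token" };
  if (!sameAddress(t.to, e.recipient)) return { ok: false, reason: "wrong-recipient" };
  if (t.amount < e.amount) return { ok: false, reason: "underpaid" };
  if (t.confirmations < e.minConfirmations) return { ok: false, reason: "insufficient-confirmations" };
  return { ok: true };
}
```

Whether an overpayment should be accepted, refunded, or flagged is a product decision this sketch leaves open; it only rejects underpayment.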
+
+### Authentication and session risks
+
+- Browser tokens are stored in `localStorage`, increasing impact from XSS.
+- Passkey/WebAuthn behavior is described in the audit docs as stubbed/incomplete, and challenge storage is process-local.
+- Refresh-token behavior differs between auth paths.
+- Admin-sensitive routes need explicit role enforcement, not just authentication.
+
+Security implication: migration should include a session architecture decision, not just a framework change.
+
+### Realtime risks
+
+- Socket.IO room joins are client-driven: events such as `join-user-room`, `join-buyer-room`, and `join-seller-room` accept IDs supplied by the client.
+- The server should derive room membership from authenticated socket identity, not trust client-supplied user IDs.
+
+Security implication: realtime authorization needs to be treated like API authorization.
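
One way server-derived membership could work, sketched as pure functions so it does not depend on any Socket.IO API. The `SocketIdentity` shape, the room-name format, and the `deriveRooms` and `mayJoin` helpers are assumptions made for illustration.

```typescript
// Hypothetical identity attached to a socket after handshake authentication.
interface SocketIdentity {
  userId: string;
  roles: ReadonlyArray<"buyer" | "seller" | "admin">;
}

// Rooms are computed from the verified identity only;
// nothing from the client payload is consulted.
function deriveRooms(identity: SocketIdentity): string[] {
  const rooms = [`user:${identity.userId}`];
  if (identity.roles.includes("buyer")) rooms.push(`buyer:${identity.userId}`);
  if (identity.roles.includes("seller")) rooms.push(`seller:${identity.userId}`);
  return rooms;
}

// A join request is honoured only if the requested room is one the server derived.
function mayJoin(identity: SocketIdentity, requestedRoom: string): boolean {
  return deriveRooms(identity).includes(requestedRoom);
}
```

In a Socket.IO server this would typically sit behind handshake middleware that verifies the token, after which the server joins the socket to the derived rooms itself instead of waiting for a client join event.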
+
+### Rate limiting and abuse controls
+
+- Global rate limiting is explicitly disabled in the Express app.
+- Sensitive paths need tiered limits: auth, verification, file upload, AI, payment, webhook, chat.
+- AI endpoints and email endpoints can create cost or abuse exposure if not authenticated and rate-limited.
+
+Security implication: this is an immediate hardening task regardless of backend stack.
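
A minimal sketch of tiered limits using a fixed-window counter. The tier names follow the list above, but the numbers are illustrative only, and the in-memory `Map` is a stand-in: with more than one backend process the counters would need to live in shared storage such as Redis.

```typescript
type Tier = "auth" | "verification" | "upload" | "ai" | "payment" | "webhook" | "chat";

interface TierLimit {
  max: number;      // requests allowed per window
  windowMs: number; // window length in milliseconds
}

// Illustrative numbers only; real limits should come from traffic analysis.
const LIMITS: Record<Tier, TierLimit> = {
  auth: { max: 5, windowMs: 60_000 },
  verification: { max: 5, windowMs: 60_000 },
  upload: { max: 10, windowMs: 60_000 },
  ai: { max: 20, windowMs: 60_000 },
  payment: { max: 10, windowMs: 60_000 },
  webhook: { max: 300, windowMs: 60_000 },
  chat: { max: 120, windowMs: 60_000 },
};

class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limits: Record<Tier, TierLimit>;

  constructor(limits: Record<Tier, TierLimit>) {
    this.limits = limits;
  }

  // Returns true if the request identified by (tier, key) is allowed at time `now`.
  allow(tier: Tier, key: string, now: number = Date.now()): boolean {
    const limit = this.limits[tier];
    const id = `${tier}:${key}`;
    const current = this.windows.get(id);
    if (!current || now - current.start >= limit.windowMs) {
      // First request, or the previous window has expired: start a new window.
      this.windows.set(id, { start: now, count: 1 });
      return true;
    }
    if (current.count >= limit.max) return false;
    current.count += 1;
    return true;
  }
}
```

The key would usually combine the authenticated user ID with the client IP, so that unauthenticated paths are still limited per source.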
+
+### Webhook and provider risks
+
+- When a provider signs the raw request bytes, webhook signatures must be verified against the raw body, not against re-serialized JSON.
+- Webhook delivery must be idempotent.
+- Unknown, duplicate, malformed, and failed webhooks should be visible in structured records or dead-letter storage.
+- Provider callbacks should create reconciliation events, not directly release funds.
+
+Security implication: payment provider integration should be isolated behind a provider-neutral service contract.
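
The sketch below combines raw-body signature verification with delivery-ID idempotency. It assumes a hex-encoded HMAC-SHA256 over the raw body, which is a common scheme but is not confirmed here for SHKeeper or any other specific provider. `WebhookIntake` and its in-memory `Set` are hypothetical stand-ins for a durable store.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: the provider sends a hex-encoded HMAC-SHA256 of the raw request body.
function verifySignature(rawBody: Buffer, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so lengths are compared first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

type WebhookOutcome = "accepted" | "duplicate" | "rejected";

class WebhookIntake {
  // Stand-in for a durable store with a unique index on delivery ID.
  private readonly seenDeliveryIds = new Set<string>();
  private readonly secret: string;

  constructor(secret: string) {
    this.secret = secret;
  }

  handle(deliveryId: string, rawBody: Buffer, signatureHex: string): WebhookOutcome {
    // Verify the bytes exactly as received, before any JSON parsing.
    if (!verifySignature(rawBody, signatureHex, this.secret)) return "rejected";
    if (this.seenDeliveryIds.has(deliveryId)) return "duplicate";
    this.seenDeliveryIds.add(deliveryId);
    // A real handler would record a reconciliation event here.
    // Funds are never released from the webhook path.
    return "accepted";
  }
}
```

In Express, the raw bytes can be preserved by mounting `express.raw({ type: "application/json" })` on the webhook route ahead of any JSON body parser. Rejected and duplicate outcomes should still be written to structured records so they remain visible to operators.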
+
+### Supply-chain risks
+
+The Node/npm ecosystem has real and recurring supply-chain risk. For this codebase, that risk matters because both frontend and backend depend heavily on npm packages.
+
+Relevant 2026 context:
+
+- Express published February 2026 security releases, including high-severity Multer issues affecting versions before 2.1.0. The backend manifest currently specifies `multer: ^2.0.2`, so the resolved lockfile version should be reviewed and updated if necessary.
+- Node.js published March 2026 security releases across active release lines.
+- Microsoft reported an Axios npm supply-chain compromise in March 2026. This project uses Axios on frontend and backend.
+- TanStack published a May 2026 npm compromise postmortem. This project uses `@tanstack/react-query`.
+
+References:
+
+- Express security release, 2026-02-27: https://expressjs.com/2026/02/27/security-releases.html
+- Node.js March 2026 security releases: https://nodejs.org/en/blog/vulnerability/march-2026-security-releases
+- Microsoft on Axios npm supply-chain compromise: https://www.microsoft.com/en-us/security/blog/2026/04/01/mitigating-the-axios-npm-supply-chain-compromise/
+- TanStack npm supply-chain compromise postmortem: https://tanstack.com/blog/npm-supply-chain-compromise-postmortem
+
+Security implication: npm supply-chain controls are required even if the backend is rewritten, because the frontend remains npm-based.
+
+## Should the backend move away from Node/Express?
+
+### Reasons to keep and harden first
+
+- The product already exists and has working business flows.
+- A full rewrite risks reintroducing escrow/payment bugs.
+- The most dangerous issues are domain/state/authorization issues, not syntax or framework issues.
+- Hardening can reduce immediate exposure faster than a rewrite.
+- The team may currently be more productive in TypeScript.
+
+### Reasons to migrate at least the backend core
+
+- Financial backend code benefits from a smaller, stricter dependency footprint.
+- Payment, ledger, webhook, and payout flows need strong invariants and auditability.
+- Express makes it easy to accumulate route-level exceptions, test endpoints, and inconsistent middleware.
+- Node/npm supply-chain exposure is material and recurring.
+- TypeScript runtime enforcement is limited unless paired with strict schema validation everywhere.
+- A separate payment core can be more easily audited, threat-modeled, tested, and locked down.
+
+### Balanced conclusion
+
+From a security standpoint, it is reasonable to move the highest-risk backend core away from Node/Express, but only after the target security model is specified.
+
+Do not begin with a full product rewrite. Begin with a security-critical core extraction:
+
+- Auth/session/token authority.
+- Payment intent creation.
+- Provider webhook processing.
+- Funds ledger and reconciliation.
+- Release/refund/dispute-hold enforcement.
+- Admin payout approval and audit logging.
+
+Keep lower-risk modules in the current stack until the core is stable:
+
+- Marketplace browsing/listing.
+- Request templates.
+- Chat and notifications, after socket authorization fixes.
+- Admin dashboard reads.
+- File upload, after hardening or moving to object storage.
+
+## Stack options
+
+### Go
+
+Best fit if the team wants a smaller, operationally simple, security-focused payment core.
+
+Strengths:
+
+- Small binaries and deployment footprint.
+- Lower dependency surface than typical Node services.
+- Strong standard library for HTTP, crypto, JSON, and concurrency.
+- Good fit for webhook receivers, ledger services, workers, and reconciliation jobs.
+- Easy to run static analysis and produce reproducible builds.
+
+Weaknesses:
+
+- Less ergonomic than TypeScript for rapid product iteration.
+- Requires team comfort with Go idioms.
+- API/schema generation must be designed deliberately.
+
+Assessment: recommended first choice for a payment/ledger/auth core if the team can maintain Go.
+
+### Kotlin/Java with Spring Boot
+
+Best fit if the team wants enterprise-grade structure, mature auth patterns, and strong ecosystem support.
+
+Strengths:
+
+- Mature security and validation ecosystem.
+- Strong typing and tooling.
+- Good for complex domain services and audit-heavy systems.
+- Well-understood operational patterns.
+
+Weaknesses:
+
+- Heavier runtime and framework footprint.
+- More ceremony.
+- Slower iteration for a small team.
+
+Assessment: strong choice for a larger engineering team or enterprise-style compliance needs.
+
+### Rust
+
+Best fit if maximum memory safety and correctness are worth slower delivery.
+
+Strengths:
+
+- Strong compile-time safety.
+- Good for cryptographic and high-assurance components.
+- Very low runtime footprint.
+
+Weaknesses:
+
+- Higher implementation cost.
+- Smaller hiring pool.
+- Web API development may be slower.
+
+Assessment: attractive for narrow cryptographic or transaction-verification components, but probably too costly for the whole backend unless the team is already strong in Rust.
+
+### Python/FastAPI
+
+Best fit if rapid backend development and clean API typing are more important than strict compile-time guarantees.
+
+Strengths:
+
+- Fast development.
+- Good validation with Pydantic.
+- Good for admin tools and internal services.
+
+Weaknesses:
+
+- Supply-chain risk remains.
+- Runtime typing and async behavior require discipline.
+- Less compelling than Go/Kotlin for a financial core.
+
+Assessment: acceptable for internal services, not the preferred payment-core target.
+
+### Continue TypeScript/Node with stronger architecture
+
+Best fit if team capacity cannot support another backend language yet.
+
+Required conditions:
+
+- Strict route registration policy.
+- Runtime validation on every boundary.
+- No test/demo routes in production builds.
+- Full lockfile and package provenance controls.
+- Centralized auth, ownership, and role guards.
+- Ledger-first payment architecture.
+- Secure cookies or a documented token-storage risk acceptance.
+- Socket auth middleware.
+- Redis-backed challenge/idempotency/rate-limit storage.
+
+Assessment: viable short term, but the security bar must be raised significantly.
+
+## Recommended target architecture
+
+### Phase 0: Immediate containment
+
+Goal: reduce current high-risk exposure without broad redesign.
+
+Actions:
+
+- Disable or protect test/demo payment and email endpoints in production.
+- Require authentication and ownership checks on all payment, notification, AI, and file routes.
+- Re-enable rate limiting with stricter limits on auth, payment, AI, file upload, and webhook paths.
+- Add admin role checks to admin routes.
+- Stop accepting arbitrary `userId` from clients for private data.
+- Validate all payment mutations through centralized service methods.
+- Lock Socket.IO room membership to server-verified identity.
+- Review and update lockfiles for known vulnerable packages.
+- Rotate any committed or publicly visible secrets.
+
+### Phase 1: Architecture specification
+
+Goal: define the new security model before implementation.
+
+Documents to produce are listed in the "Required documentation before refactor" section below.
+
+### Phase 2: Payment and ledger extraction
+
+Goal: move funds logic behind a provider-neutral service.
+
+Introduce:
+
+- `FundsAccount`
+- `LedgerEntry`
+- `FundsBalance`
+- `PaymentIntent`
+- `PaymentProviderEvent`
+- `ReleaseInstruction`
+- `RefundInstruction`
+- `DisputeHold`
+
+Key rule: provider webhooks do not directly release funds. They create verified events and ledger entries.
+
+### Phase 3: Backend-core rewrite or service split
+
+Goal: decide whether the extracted core remains in TypeScript or moves to Go/Kotlin.
+
+Recommended split:
+
+- `core-payments`: payment intent, webhook, ledger, release/refund, reconciliation.
+- `core-auth`: sessions, passkeys, OAuth, token issuance, session revocation.
+- `marketplace-api`: purchase requests, offers, categories, templates.
+- `realtime-api`: chat, notifications, socket rooms.
+
+The split can be logical first, physical later.
+
+### Phase 4: Full migration only if justified
+
+Goal: avoid rewriting stable lower-risk product surfaces prematurely.
+
+Only consider full backend migration after:
+
+- Payment core is stable.
+- Auth/session model is stable.
+- API contracts are documented and tested.
+- Legacy payment records are migrated or safely read-only.
+- Team has demonstrated production maintenance ability in the new stack.
+
+## Required documentation before refactor
+
+### 1. Threat Model
+
+Purpose: identify what must be protected and how it can be attacked.
+
+Should include:
+
+- Assets: user accounts, admin accounts, wallet addresses, payment records, funds, webhook secrets, API keys, private notifications.
+- Actors: buyer, seller, admin, support, unauthenticated attacker, compromised user, compromised admin, provider, malicious webhook sender.
+- Trust boundaries: browser, backend, database, Redis, provider APIs, wallet/RPC, admin UI, Socket.IO.
+- Abuse cases: fake payment proof, replayed webhook, arbitrary room join, stolen token, double payout, dispute bypass, email/AI abuse.
+
+### 2. Funds Ledger Specification
+
+Purpose: make money movement auditable and provider-independent.
+
+Should define:
+
+- Account model per purchase request/order.
+- Immutable ledger entry types.
+- Derived balance model.
+- Gross amount, provider fees, platform fees, held amount, disputed amount, releasable amount, released amount, refunded amount.
+- Idempotency keys.
+- Reconciliation behavior.
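
A minimal sketch of an append-only ledger with a derived balance, to make the items above concrete. `LedgerEntry` and `FundsBalance` reuse names from Phase 2, but their fields, the entry types, and the `releasable` formula are assumptions that the real specification would need to define.

```typescript
type EntryType =
  | "deposit" | "provider_fee" | "platform_fee"
  | "dispute_hold" | "hold_release" | "release" | "refund";

interface LedgerEntry {
  readonly idempotencyKey: string;
  readonly accountId: string; // one account per purchase request/order
  readonly type: EntryType;
  readonly amount: bigint;    // always positive, in the smallest currency unit
}

interface FundsBalance {
  gross: bigint;
  fees: bigint;
  disputed: bigint;
  released: bigint;
  refunded: bigint;
  releasable: bigint;
}

class Ledger {
  private readonly entries: LedgerEntry[] = [];
  private readonly keys = new Set<string>();

  // Appends an entry once; a repeated idempotency key is a no-op.
  // Returns whether the entry was appended.
  append(entry: LedgerEntry): boolean {
    if (entry.amount <= 0n) throw new Error("amount must be positive");
    if (this.keys.has(entry.idempotencyKey)) return false;
    this.keys.add(entry.idempotencyKey);
    this.entries.push(Object.freeze({ ...entry }));
    return true;
  }

  // The balance is never stored; it is recomputed from the entries.
  balance(accountId: string): FundsBalance {
    let gross = 0n, fees = 0n, disputed = 0n, released = 0n, refunded = 0n;
    for (const e of this.entries) {
      if (e.accountId !== accountId) continue;
      switch (e.type) {
        case "deposit": gross += e.amount; break;
        case "provider_fee":
        case "platform_fee": fees += e.amount; break;
        case "dispute_hold": disputed += e.amount; break;
        case "hold_release": disputed -= e.amount; break;
        case "release": released += e.amount; break;
        case "refund": refunded += e.amount; break;
      }
    }
    const releasable = gross - fees - disputed - released - refunded;
    return { gross, fees, disputed, released, refunded, releasable };
  }
}
```

Because corrections are new entries and never updates, a release check reduces to comparing the requested amount against `releasable` at decision time.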
+
+### 3. Escrow State Machine
+
+Purpose: define legal transitions once.
+
+Should include:
+
+- Purchase request states.
+- Payment states.
+- Escrow/funds states.
+- Dispute states.
+- Valid transitions and forbidden transitions.
+- Who or what can trigger each transition.
+- Required preconditions for release, refund, cancellation, dispute hold, and admin override.
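
A transition table is one way to define legal transitions once. The states, actors, and rules below are placeholders chosen for illustration; the real lists are exactly what this document should produce.

```typescript
// Hypothetical states and actors; the specification would define the real ones.
type EscrowState =
  | "awaiting_payment" | "funded" | "disputed"
  | "released" | "refunded" | "cancelled";
type Actor = "buyer" | "seller" | "admin" | "system";

interface Transition {
  from: EscrowState;
  to: EscrowState;
  actors: ReadonlyArray<Actor>; // who or what may trigger this transition
}

// Anything not listed here is forbidden, including every exit from a terminal state.
const TRANSITIONS: ReadonlyArray<Transition> = [
  { from: "awaiting_payment", to: "funded", actors: ["system"] },
  { from: "awaiting_payment", to: "cancelled", actors: ["buyer", "admin"] },
  { from: "funded", to: "disputed", actors: ["buyer", "seller"] },
  { from: "funded", to: "released", actors: ["buyer", "admin"] },
  { from: "funded", to: "refunded", actors: ["admin"] },
  { from: "disputed", to: "released", actors: ["admin"] },
  { from: "disputed", to: "refunded", actors: ["admin"] },
];

// Returns the next state, or throws if the transition is not in the table
// or the actor is not permitted to trigger it.
function transition(current: EscrowState, next: EscrowState, actor: Actor): EscrowState {
  const rule = TRANSITIONS.find((t) => t.from === current && t.to === next);
  if (!rule) throw new Error(`forbidden transition ${current} -> ${next}`);
  if (!rule.actors.includes(actor)) {
    throw new Error(`${actor} may not trigger ${current} -> ${next}`);
  }
  return next;
}
```

In this sketch a buyer can release from `funded` but not from `disputed`, which is the kind of enforceable hold the dispute/escrow race finding calls for. Preconditions such as ledger balances would be checked alongside the table, not in place of it.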
+
+### 4. Authorization Matrix
+
+Purpose: remove route-by-route ambiguity.
+
+Should map every endpoint and socket event to:
+
+- Public, authenticated, owner, seller, buyer, admin, support, or service role.
+- Required ownership checks.
+- Required object state.
+- Rate-limit tier.
+- Audit-log requirement.
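
The matrix can be kept as data and evaluated by a single guard, so that an unlisted route is denied by default. The route keys, policy fields, and the rule that admins bypass ownership checks are all assumptions made for this sketch.

```typescript
type Role = "public" | "authenticated" | "buyer" | "seller" | "admin" | "support" | "service";

interface RoutePolicy {
  roles: ReadonlyArray<Role>; // any one of these roles is sufficient
  ownership: boolean;         // caller must own the target object
  rateLimitTier: string;
  audit: boolean;
}

interface Caller {
  userId: string | null;
  roles: ReadonlyArray<Role>;
}

// Hypothetical entries; the real matrix would cover every endpoint and socket event.
const MATRIX = new Map<string, RoutePolicy>([
  ["GET /categories", { roles: ["public"], ownership: false, rateLimitTier: "default", audit: false }],
  ["GET /payments/:id", { roles: ["buyer", "seller", "admin"], ownership: true, rateLimitTier: "payment", audit: false }],
  ["POST /admin/payouts/:id/approve", { roles: ["admin"], ownership: false, rateLimitTier: "payment", audit: true }],
]);

function authorize(routeKey: string, caller: Caller, resourceOwnerId: string | null): boolean {
  const policy = MATRIX.get(routeKey);
  if (!policy) return false; // deny by default: unlisted routes are not reachable
  const roleOk =
    policy.roles.includes("public") ||
    policy.roles.some((r) => caller.roles.includes(r));
  if (!roleOk) return false;
  // Assumption for this sketch: admins bypass ownership; everyone else must own the object.
  if (policy.ownership && !caller.roles.includes("admin")) {
    return caller.userId !== null && caller.userId === resourceOwnerId;
  }
  return true;
}
```

Keeping the policies in one structure also makes it possible to test that every registered route has an entry, which addresses the route-by-route ambiguity directly.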
+
+### 5. Payment Provider Adapter Spec
+
+Purpose: decouple business logic from SHKeeper, Request Network, manual wallet flow, and future providers.
+
+Should define:
+
+- `createPayInIntent`
+- `getPayInStatus`
+- `handleProviderWebhook`
+- `createHostedPaymentLink`
+- `createReleaseInstruction`
+- `createRefundInstruction`
+- `getPayoutStatus`
+- `searchProviderPayments`
+
+Provider-specific metadata should be namespaced and never become the canonical funds state.
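
A partial sketch of the adapter contract, covering three of the methods listed above. Parameter and return shapes are assumptions, and `DEMO_STATUS_TABLE` describes an imaginary provider, not SHKeeper or Request Network.

```typescript
type CanonicalPayInStatus = "pending" | "confirmed" | "failed";

// Provider data is stored under the provider's own key, e.g. { demo: { ... } }.
type NamespacedMetadata = Record<string, Record<string, unknown>>;

interface PayInIntent {
  intentId: string;
  amount: bigint;
  currency: string;
  status: CanonicalPayInStatus;
  providerMetadata: NamespacedMetadata; // never read by business logic
}

interface ProviderEvent {
  provider: string;
  deliveryId: string;
  intentId: string;
  status: CanonicalPayInStatus;
}

// Hypothetical signatures for a subset of the methods named in this section.
interface PaymentProviderAdapter {
  readonly providerName: string;
  createPayInIntent(input: { orderId: string; amount: bigint; currency: string }): Promise<PayInIntent>;
  getPayInStatus(intentId: string): Promise<CanonicalPayInStatus>;
  handleProviderWebhook(rawBody: Uint8Array, headers: Record<string, string>): Promise<ProviderEvent>;
}

// Maps a raw provider status to the canonical one.
// Unknown values fail closed to "pending" so they can never trigger a release.
function mapProviderStatus(
  table: ReadonlyMap<string, CanonicalPayInStatus>,
  raw: string,
): CanonicalPayInStatus {
  return table.get(raw.toUpperCase()) ?? "pending";
}

// Wraps raw provider fields under the provider's name.
function namespaceMetadata(provider: string, raw: Record<string, unknown>): NamespacedMetadata {
  return { [provider]: { ...raw } };
}

// Status table for an imaginary provider, for illustration only.
const DEMO_STATUS_TABLE = new Map<string, CanonicalPayInStatus>([
  ["PAID", "confirmed"],
  ["EXPIRED", "failed"],
]);
```

Each adapter would own its status table, while the ledger and state machine see only canonical statuses. This keeps provider vocabulary from leaking into funds state.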
+
+### 6. Webhook Security Spec
+
+Purpose: prevent forged, replayed, or silently failed provider events.
+
+Should define:
+
+- Raw-body signature verification.
+- Accepted headers and algorithms.
+- Replay prevention.
+- Delivery ID/idempotency handling.
+- Unknown payment behavior.
+- Duplicate event behavior.
+- Retry semantics.
+- Dead-letter/replay storage.
+- Alerting thresholds.
+
+### 7. Session and Auth Architecture
+
+Purpose: decide how browser sessions should work for a financial platform.
+
+Should define:
+
+- Access token lifetime.
+- Refresh token lifetime and rotation.
+- Whether tokens move from `localStorage` to `httpOnly` cookies.
+- CSRF strategy if cookies are used.
+- Passkey/WebAuthn implementation requirements.
+- OAuth requirements.
+- Device/session revocation.
+- Admin step-up authentication for payouts or role changes.
+
+### 8. Realtime Authorization Spec
+
+Purpose: make Socket.IO events subject to the same security model as REST.
+
+Should define:
+
+- Socket handshake authentication.
+- Server-derived room membership.
+- Which rooms exist.
+- Who may join each room.
+- Whether room membership changes with request/payment/dispute state.
+- Event payload privacy rules.
+
+### 9. Migration Plan
+
+Purpose: avoid breaking current payments and historical records.
+
+Should include:
+
+- SHKeeper legacy read path.
+- New provider feature flag.
+- Ledger backfill strategy.
+- Data validation report before enforcement.
+- Rollback criteria.
+- Cutover date for old webhook routes.
+- Manual reconciliation workflow for operators.
+
+### 10. Secure Build and Supply-Chain Policy
+
+Purpose: reduce npm and dependency compromise risk.
+
+Should define:
+
+- Package manager and lockfile policy.
+- CI install mode.
+- Dependency update cadence.
+- Security advisory monitoring.
+- npm provenance/signature policy where available.
+- Secrets handling.
+- Production build reproducibility.
+- Separation of frontend npm risk from backend core risk.
+
+### 11. Operational Runbooks
+
+Purpose: make security incidents and payment failures survivable.
+
+Should include:
+
+- Failed webhook.
+- Duplicate payment.
+- Missing payment.
+- Stuck release.
+- Disputed release attempt.
+- Compromised admin.
+- Leaked API key.
+- Provider outage.
+- Chain/RPC outage.
+- Suspicious payment proof.
+- npm/package compromise.
+
+## Decision framework
+
+Use the following questions before choosing a rewrite:
+
+- Is the current goal safe launch, or long-term platform rebuild?
+- Is the team willing to delay feature work for a payment-core redesign?
+- Can the team maintain Go/Kotlin/Rust in production?
+- Is the biggest current risk supply chain, or incorrect money movement?
+- Are admin actions trusted, or should high-risk actions require step-up approval?
+- Should Amanat custody funds, or should the provider/payment network hold or route them?
+- Are disputes central to the product, or rare manual exceptions?
+- Is auditability a regulatory/business requirement or only an internal safety goal?
+
+## Recommended decision
+
+Near term:
+
+- Harden the current Express backend.
+- Disable unsafe production routes.
+- Add centralized authorization and rate limiting.
+- Fix Web3 verification.
+- Fix Socket.IO authorization.
+- Disable passkeys unless implemented with real WebAuthn.
+- Begin ledger/state-machine documentation immediately.
+
+Medium term:
+
+- Build a provider-neutral payment and funds layer.
+- Add immutable ledger entries.
+- Move release/refund/dispute-hold checks into the central payment/funds service.
+- Keep SHKeeper compatibility read-only for legacy records.
+- Add Request Network or another provider behind the adapter if desired.
+
+Long term:
+
+- Rewrite the payment/auth/escrow core in Go or Kotlin/Java if the team can support it.
+- Do not rewrite the entire backend until the core is proven.
+- Keep lower-risk modules in TypeScript until there is a business or operational reason to migrate them.
+
+## Open questions for leadership and engineering
+
+1. Is launch timeline more important than a full payment/funds redesign?
+2. Should passkeys be removed from launch scope until production-grade WebAuthn is implemented?
+3. Should browser auth move to `httpOnly` cookies even if that requires CSRF work and frontend changes?
+4. Should every payout require admin step-up authentication or two-person approval?
+5. Should Amanat keep funds in a platform-controlled escrow wallet, or should provider-mediated payment pages become the default?
+6. Is Request Network a desired provider migration, or just one option being explored?
+7. What new backend stack can the team realistically operate for the next two years?
+8. What is the acceptable level of temporary dual-stack complexity during migration?
+9. Do we need formal external penetration testing before public launch?
+10. Who owns security decisions: product, backend, DevOps, or a dedicated security owner?
+
+## Relationship to existing docs
+
+This assessment complements:
+
+- [[Platform Logical Audit - 2026-05-24]]
+- [[PRD - Platform Audit Remediation Plan (2026-05-24)]]
+- [[PRD - Request Network Migration and Funds Management]]
+- [[Security Architecture]]
+- [[Payment Flow - SHKeeper]]
+- [[Payment Flow - DePay & Web3]]
+- [[Escrow Flow]]
+- [[Dispute Flow]]
+
+The existing remediation PRD is the tactical hardening plan. This document is the strategic backend-stack and refactor assessment.