audit: 2026-05-30 full-codebase audit — report, issues, docs, runbooks

Full-codebase-audit 2026-05-30 outputs:
- Audit report: 09 - Audits/Full Codebase Audit - 2026-05-30.md
- 81 issue files ISSUE-055..135 (decisions + 1 skipped no-brainer).
- Scanner docs from scratch (was zero): architecture, data model, API ref, payment flow, operations runbook + repo README.
- Doc-sync updates across API reference, data models, flows, design system.
- Secret Rotation Runbook (08 - Operations) for the exposed credentials.
- Reusable workflow guide (07 - Development) + .claude/workflows/full-codebase-audit.js.

Issues remain status:open intentionally: the code fixes are uncommitted-then-committed working-tree changes per repo and aren't "resolved" until merged/deployed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 18:41:44 +04:00
parent eab1d77582
commit dceaf82934
153 changed files with 6276 additions and 179 deletions
--- a/Operations/Monitoring.md
+++ b/Operations/Monitoring.md
@@ -11,24 +11,21 @@ What's instrumented today and what to watch. Today's stack is intentionally lean

 ## 1. Health endpoint

-Path: `GET /health` (backend, port `5001`).
+Two paths are registered (both are public, rate-limited, not auth-gated):

-Defined in `backend/src/app.ts`:
+- `GET /health` — simple ping used by Docker healthchecks. Returns `200 { success, message, timestamp, environment, version }`. Does **not** probe MongoDB or Redis.
+- `GET /api/health` — deep health check added in commit `44579d6` (backend v2.6.49). Calls `runHealthChecks` from `backend/src/services/health/healthCheckService.ts`. Probes MongoDB and Redis, collects memory/uptime stats, and returns a structured report. Returns `503` when `report.status === 'down'`.

-```ts
-app.get("/health", (req, res) => {
-  res.json({
-    success: true,
-    message: "Marketplace Backend API is running",
-    timestamp: new Date().toISOString(),
-    environment: config.nodeEnv,
-    version: packageJson.version,
-  });
-});
+`GET /api/health` response shape (from `healthCheckService`):
+```json
+{
+  "status": "ok",
+  "version": "2.6.xx",
+  "timestamp": "...",
+  "checks": { "mongodb": "ok", "redis": "ok", "uptime": 3600, "memoryMB": 120 }
+}
 ```

-Returns `200` with a JSON envelope as soon as Express is up. Does **not** currently probe MongoDB or Redis — they are checked via separate Docker healthchecks. If you want deep health, extend the endpoint to ping both data stores and return `503` on failure.
-
 Public URL behind Nginx: `https://amn.gg/api/health`.

 ---
--- a/Operations/Scanner
+++ b/Operations/Scanner
@@ -0,0 +1,220 @@
+---
+title: Scanner Operations
+tags: [operations, scanner, deployment]
+created: 2026-05-30
+---
+
+# Scanner Operations
+
+Runbook for deploying, configuring, monitoring, and troubleshooting the AMN Pay Scanner microservice.
+
+---
+
+## 1. Configuration reference
+
+All configuration is via environment variables. See `.env.example` in the scanner repo.
+
+| Variable | Default | Required | Description |
+|---|---|---|---|
+| `PORT` | `8080` | no | HTTP listen port |
+| `DB_PATH` | `./scanner.db` | no | SQLite database path |
+| `CHAINS_JSON_PATH` | `./supported-chains.json` | no | Supported chains config |
+| `TOKENS_JSON_PATH` | `./tokens.json` | no | Token registry |
+| `SCANNER_API_KEY` | _(none)_ | **yes (prod)** | Bearer token for all non-health endpoints. Generate with `openssl rand -hex 32` |
+| `POLL_INTERVAL_SEC` | `15` | no | Chain poll interval in seconds |
+| `INTENT_TTL_HOURS` | `24` | no | Pending/confirming intents older than this are expired (0 = disabled) |
+| `WEBHOOK_RETRY_HOURS` | `6` | no | Interval between automatic webhook_failed re-delivery passes (0 = disabled) |
+| `TRONGRID_API_KEY` | _(none)_ | recommended | TronGrid API key; without it rate limits are very low |
+| `TONCENTER_API_KEY` | _(none)_ | recommended | TonCenter API key |
+| `RPC_BSC` | _(chain config)_ | no | Override BSC RPC URL (chain 56) |
+| `RPC_ARB` | _(chain config)_ | no | Override Arbitrum RPC URL (chain 42161) |
+| `RPC_ETH` | _(chain config)_ | no | Override Ethereum RPC URL (chain 1) |
+| `RPC_POLYGON` | _(chain config)_ | no | Override Polygon RPC URL (chain 137) |
+| `RPC_BASE` | _(chain config)_ | no | Override Base RPC URL (chain 8453) |
+
+> [!warning]
+> If `SCANNER_API_KEY` is not set, the scanner logs a warning and accepts all requests. Never run this way in production.
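+
+A preflight check can catch a missing or weak key before the container starts. The following is a minimal sketch, not part of the scanner repo: `check_scanner_api_key` is a hypothetical helper, and the 64-lowercase-hex rule simply assumes the key was generated with `openssl rand -hex 32` as suggested in the table above.
+
+```bash
+# Hypothetical preflight helper (assumption: key came from `openssl rand -hex 32`,
+# which prints 64 lowercase hex characters). Returns 0 only for such a key.
+check_scanner_api_key() {
+  key="${SCANNER_API_KEY:-}"
+  if [ -z "$key" ]; then
+    echo "SCANNER_API_KEY is not set; the scanner would accept all requests" >&2
+    return 1
+  fi
+  if [ "${#key}" -ne 64 ]; then
+    echo "SCANNER_API_KEY should be 64 characters long" >&2
+    return 1
+  fi
+  case "$key" in
+    *[!0-9a-f]*)
+      echo "SCANNER_API_KEY should contain only lowercase hex characters" >&2
+      return 1
+      ;;
+  esac
+  return 0
+}
+```
+
+Calling it in a deploy wrapper before `docker run` makes a missing key fail loudly instead of silently starting an open scanner.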
+
+---
+
+## 2. Docker deployment
+
+The scanner ships as a single Docker image. The Dockerfile uses a two-stage build (Go 1.25 builder → Alpine 3.21 runtime).
+
+### Quick start (dev)
+
+```bash
+cd scanner/
+cp .env.example .env
+# edit .env — set SCANNER_API_KEY, RPC overrides, etc.
+
+docker build -t amn-scanner:dev .
+docker run -d \
+  --name amn-scanner \
+  -p 8080:8080 \
+  -v "$(pwd)/data:/data" \
+  --env-file .env \
+  amn-scanner:dev
+```
+
+### Production (via arcane-cli / Watchtower)
+
+The scanner is deployed manually via `arcane-cli` (not GitOps), and Watchtower does **not** manage it automatically. After pushing a new image, redeploy with:
+
+```bash
+arcane-cli project redeploy --json <project-id>
+```
+
+The SQLite database is stored on a named Docker volume mounted at `/data`. Do not recreate the volume between deploys: it holds the checkpoint and intent state.
+
+---
+
+## 3. Health check
+
+```bash
+curl http://localhost:8080/health
+# {"status":"ok","time":"2026-05-30T12:00:00Z"}
+```
+
+Docker `HEALTHCHECK` is already configured in the Dockerfile (30 s interval, 5 s timeout, 3 retries).
+
+---
+
+## 4. Monitoring
+
+### Scanner status endpoint
+
+```bash
+curl -H "Authorization: Bearer $SCANNER_API_KEY" \
+     http://localhost:8080/scanner/status | jq .
+```
+
+Check:
+- `lag` — should be near 0 for healthy chains (blocks behind for EVM, seconds for TON)
+- `pendingIntents` — number of unresolved intents per chain
+- `lastScannedBlock` — should advance each poll
+
+### Logs
+
+The scanner uses Go's `log/slog` structured logger with level prefixes. Key log patterns:
+
+| Pattern | Meaning |
+|---|---|
+| `[scanner] worker started` | Worker goroutine began for this chain |
+| `[evm] intent confirming` | EVM tx seen, waiting for confirmations |
+| `[evm] intent confirmed` | EVM: N confirmations reached |
+| `[tron] MATCH` / `[ton] MATCH` | Transfer matched, going to confirmed |
+| `[webhook] delivered` | Webhook POST succeeded |
+| `[webhook] non-2xx response` | Backend returned error (will retry) |
+| `[webhook] all retries exhausted` | Intent moved to webhook_failed |
+| `[scanner] reconciling confirmed intents` | Startup crash recovery in progress |
+| `[evm] scanner lag` | Chain lag > 100 blocks (investigate RPC) |
+
+---
+
+## 5. Adding / modifying chains
+
+Edit `supported-chains.json`. Fields:
+
+| Field | Notes |
+|---|---|
+| `chainId` | Numeric EIP-155 chain ID (arbitrary int for Tron/TON) |
+| `chainType` | `"evm"` (default) / `"tron"` / `"ton"` |
+| `rpcUrl` | Primary RPC endpoint |
+| `publicRpcUrl` | Fallback RPC (EVM only) |
+| `proxyAddress` | ERC20FeeProxy address (EVM); USDT contract (Tron); USDT Jetton master (TON) |
+| `confirmationThreshold` | Blocks required (EVM); ignored for Tron/TON |
+| `verified` | `true` to activate the worker; `false` to disable without deleting |
+
+> [!important]
+> Changing `proxyAddress` for an EVM chain only affects new scans. Existing pending intents will still be matched against the old address until they expire or are confirmed.
+
+After editing, restart the scanner container to pick up the new config.
+
+---
+
+## 6. Adding tokens to the registry
+
+Edit `tokens.json`. Each entry:
+
+```json
+{ "chainId": 56, "address": "0x...", "symbol": "USDC", "decimals": 18, "name": "USD Coin" }
+```
+
+The token registry is used only for populating `tokenSymbol` and `decimals` in the `checkoutBlock` response. Omitting a token does not break scanning; it just leaves those fields empty.
+
+---
+
+## 7. Manual webhook retry
+
+Force immediate re-delivery of all `webhook_failed` intents:
+
+```bash
+curl -X POST -H "Authorization: Bearer $SCANNER_API_KEY" \
+     http://localhost:8080/admin/webhooks/retry
+# {"queued": N}
+```
+
+---
+
+## 8. Database inspection
+
+The SQLite database (`/data/scanner.db`) can be inspected with the `sqlite3` CLI inside the container:
+
+```bash
+docker exec -it amn-scanner sqlite3 /data/scanner.db
+
+-- The statements below are SQL; run them at the sqlite> prompt.
+-- Check stuck intents
+SELECT intent_id, chain_id, status, created_at, webhook_delivered_at
+FROM intents
+WHERE status NOT IN ('confirmed', 'expired')
+ORDER BY created_at DESC;
+
+-- Check chain checkpoints
+SELECT chain_id, last_scanned_block, updated_at FROM checkpoints;
+
+-- Count by status
+SELECT status, count(*) FROM intents GROUP BY status;
+```
+
+---
+
+## 9. Troubleshooting
+
+### Intent stuck in `pending`
+
+1. Check `/scanner/status`: is the chain worker running and is `lastScannedBlock` advancing? A `lag` that stays above 0 for a long time usually points to an RPC issue.
+2. Check that `chainId` and `tokenAddress` match exactly what is in `supported-chains.json` and `tokens.json`.
+3. For EVM: verify the `proxyAddress` matches the contract the buyer is calling.
+4. For Tron: confirm the destination address is stored in EVM-hex (0x) format in the DB.
+5. Check scanner logs for `REJECT` messages around the expected tx time.
+
+### Webhook never received by backend
+
+1. Check `webhook_delivered_at` in the DB — if not null, the scanner delivered successfully and the backend side is the issue.
+2. If null and status is `webhook_failed`: check backend logs for the incoming POST and verify the backend's `X-AMN-Signature` validation code.
+3. If status is `confirmed` but `webhook_delivered_at` is null: startup reconciliation may re-deliver on next restart.
+4. Use `POST /admin/webhooks/retry` to trigger immediate retry.
+
+### High lag on EVM chain
+
+1. Check RPC endpoint availability and rate limits.
+2. Consider setting an `RPC_*` env override to a premium RPC (Alchemy, Infura, QuickNode).
+3. The scanner falls back to `publicRpcUrl` if the primary fails, but public nodes have lower rate limits.
+
+### Intent confirmed but amount looks wrong
+
+The scanner accepts any amount **>=** `intent.Amount`. Overpayments are not flagged. Underpayments result in the intent staying pending until TTL expiry.
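+
+When an amount looks suspicious, the rule above can be checked by hand against the on-chain value. The sketch below is a hypothetical helper, not scanner code. It assumes both amounts are non-negative integers in token base units, written as decimal strings, and it compares them as strings because 18-decimal amounts can exceed 64-bit shell arithmetic.
+
+```bash
+# Hypothetical helpers (not scanner code). Inputs are assumed to be
+# digit-only decimal strings; no validation is performed here.
+
+# Strip leading zeros so that string length reflects magnitude.
+normalize_amount() {
+  n=$(printf '%s' "$1" | sed 's/^0*//')
+  printf '%s' "${n:-0}"
+}
+
+# amount_covers PAID EXPECTED: succeeds when PAID >= EXPECTED.
+amount_covers() {
+  paid=$(normalize_amount "$1")
+  expected=$(normalize_amount "$2")
+  if [ "${#paid}" -ne "${#expected}" ]; then
+    # With leading zeros removed, more digits means a larger number.
+    [ "${#paid}" -gt "${#expected}" ]
+    return
+  fi
+  # Same length: byte-wise lexical order matches numeric order.
+  smallest=$(printf '%s\n%s\n' "$paid" "$expected" | LC_ALL=C sort | head -n 1)
+  [ "$smallest" = "$expected" ]
+}
+```
+
+For example, `amount_covers 1500000000000000000 1000000000000000000` succeeds (an overpayment, which the scanner accepts without flagging), while swapping the two arguments fails (an underpayment, which leaves the intent pending).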
+
+---
+
+## 10. CI/CD notes
+
+- Woodpecker CI pipeline is in `.woodpecker/`.
+- Telegram notify steps were removed (no TG secrets configured).
+- Deploy step was removed — the scanner is deployed manually via `arcane-cli`.
+- The CI pipeline builds and pushes the Docker image to the Gitea registry.
+- Image tag format: `dev-<VERSION>` (from the `VERSION` file).
+
+> [!tip]
+> After CI completes, verify the image is in the registry before redeploying. Silent CI failures can leave a stale image tagged. Check the registry tag timestamp, not just the CI green light.
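+
+To know which tag to look for in the registry, the expected tag can be derived from the `VERSION` file. This is a minimal sketch under the tag format stated above; `expected_image_tag` is a hypothetical helper name, and the registry lookup itself is left out because the registry host is not given here.
+
+```bash
+# Hypothetical helper: prints the tag CI is expected to push, i.e.
+# "dev-<VERSION>", using the contents of the VERSION file.
+expected_image_tag() {
+  version_file="${1:-VERSION}"
+  if [ ! -r "$version_file" ]; then
+    echo "cannot read $version_file" >&2
+    return 1
+  fi
+  # Drop the trailing newline and any stray whitespace.
+  version=$(tr -d '[:space:]' < "$version_file")
+  if [ -z "$version" ]; then
+    echo "$version_file is empty" >&2
+    return 1
+  fi
+  printf 'dev-%s\n' "$version"
+}
+```
+
+Run it from the scanner repo root (or pass the path to `VERSION`) and compare the printed tag, along with its push timestamp, against what the registry shows before running the redeploy.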
--- a/Operations/Secret
+++ b/Operations/Secret
@@ -0,0 +1,105 @@
+---
+title: Secret Rotation Runbook — 2026-05-30
+tags: [operations, security, secrets, incident]
+created: 2026-05-30
+status: action-required
+source: Full Codebase Audit - 2026-05-30
+---
+
+# Secret Rotation Runbook — 2026-05-30
+
+The 2026-05-30 full codebase audit found live credentials committed to the repos and, in
+some cases, baked into container images. The audit's no-brainer fixes **replaced the
+committed values with placeholders in the working tree**, but the *real* credentials are
+still valid and must be **rotated by a human** — replacing a string in git does not
+invalidate a leaked key.
+
+> Treat every credential below as **compromised**. Anyone with repo (or image) access has
+> had these values. Rotate first, then scrub history.
+
+Related issues: ISSUE-074, ISSUE-075, ISSUE-079, ISSUE-115 and decisions DEC-49, DEC-50,
+DEC-56, DEC-74, DEC-75, DEC-78.
+
+---
+
+## Order of operations (per credential)
+
+1. **Rotate** — generate a new value at the provider.
+2. **Inject at runtime** — put the new value in the deployment secret store (Arcane env /
+   compose secrets), **never** back into a committed file.
+3. **Deploy** — roll the new value out and confirm the service is healthy.
+4. **Revoke** — invalidate the old value at the provider.
+5. **Scrub** — remove the secret from git history (see "History scrub" at the bottom).
+
+Do these one credential at a time and verify the dependent service after each.
+
+---
+
+## Credentials to rotate
+
+| # | Credential | Where it leaked | Blast radius | How to rotate |
+|---|-----------|-----------------|--------------|---------------|
+| 1 | **Telegram bot token** | `backend/.env.development`, `backend/.env.example`, `frontend/.gitleaks.toml` | Full control of the bot: read/send messages, hijack the login widget, phish users | BotFather → `/revoke` → new token. Update `TELEGRAM_BOT_TOKEN`. |
+| 2 | **Resend SMTP / API key** | `backend/.env.development`, `backend/.env.example` | Send email as the platform (phishing, OTP spoofing), read sending logs | Resend dashboard → API Keys → delete + create. Update `RESEND_API_KEY` / SMTP creds. |
+| 3 | **JWT signing secret** | `backend/.env.example` | Forge **any** user/admin session token — critical | Generate 32+ random bytes (`openssl rand -hex 32`). Update `JWT_SECRET`. **Rotating invalidates all sessions** (users re-login). Consider also adding a separate `REFRESH_TOKEN_SECRET` (see DEC-26). |
+| 4 | **Admin bootstrap password** | `backend/.env.example`, was also a hardcoded fallback in `init-admin.ts` (removed by NB-20) | Direct admin login | Set a strong `ADMIN_PASSWORD` secret; change the admin account password in-app; confirm `init-admin` no longer has a fallback. |
+| 5 | **Request Network API key** | `backend/.env.example` | Act against the RN account; manipulate payment intents | RN dashboard → rotate key. Update `REQUEST_NETWORK_API_KEY`. |
+| 6 | **Request Network webhook secret** | `backend/.env.example` | Forge RN webhooks → mark payments paid (this is the HMAC secret the backend verifies) | Rotate at RN; update `REQUEST_NETWORK_WEBHOOK_SECRET`. |
+| 7 | **Telegram webhook secret token** | `backend/.env.example` | Forge Telegram webhook calls | Reset via `setWebhook` with a new `secret_token`; update the env var. |
+| 8 | **Google OAuth client secret** | `backend/.env.example` | Impersonate the OAuth app | Google Cloud Console → Credentials → reset client secret. Update `GOOGLE_CLIENT_SECRET`. |
+| 9 | **Alchemy API key(s)** | `frontend/Dockerfile` ARG defaults (removed by NB-10) | Quota theft / RPC abuse on your account | Alchemy dashboard → rotate app key. Supply via CI build-arg / runtime, not a default. |
+| 10 | **TG_NOTIFY_BOT_TOKEN** (ops alert bot) | backend startup notification (committed env) | Spoof ops alerts; spam the ops channel | BotFather → revoke → new token. Update `TG_NOTIFY_BOT_TOKEN`. See [[telegram_notify_no_parse_mode]]. |
+| 11 | **Frontend test account password** (`Moji6364`) | `frontend/scripts/show-credentials.sh` (DEC-75) | Login as that test user if it exists in any real env | Delete the script (or env-prompt it); rotate the account password if real. |
+
+### Public-by-design (lower priority, but make explicit)
+- **WalletConnect project ID**, **Google OAuth *client ID*** — `frontend/Dockerfile` ARG
+  defaults (DEC-74). These are public values, but remove the baked defaults and pass them
+  via CI build-args so forks don't reuse the production IDs.
+
+---
+
+## Stop re-leaking (pairs with rotation)
+
+These are the structural fixes (tracked as decisions) that stop the secrets coming back:
+
+- **DEC-50 / ISSUE-075** — `backend/.dockerignore` whitelists `.env.development` *into the
+  prod image*. Remove the `!.env.development` line so no env file is ever copied into an
+  image; inject secrets at runtime.
+- **DEC-49 / ISSUE-101** — `backend/src/shared/config/index.ts` loads `.env.development`
+  unconditionally. Load `.env.<NODE_ENV>` (or nothing in production) and never fall back to
+  the dev file.
+- **DEC-56 / ISSUE-074** — untrack `backend/.env.development` entirely (`git rm --cached`)
+  and add it to `.gitignore`.
+- **DEC-78 / ISSUE-079** — `frontend/.gitleaks.toml` allowlists the bot token *by value*.
+  Switch to a path/fingerprint-based allowlist after scrubbing, so gitleaks stops
+  "approving" the secret. See the `handle-gitleaks` skill.
+
+Runtime injection point for this stack: the **Arcane** env / project config (see
+[[arcane_dev_stack]], [[arcane_cli_usage]]) for dev, and the production secret store for
+prod. After changing any backend secret, remember the dev redeploy caveat:
+restart `nickDev-nginx` (see [[devEscrow_nginx_after_redeploy]]).
+
+---
+
+## History scrub (after rotation + revocation)
+
+Only after the old values are revoked, purge them from history so they can't be mined from
+old commits:
+
+1. Use `git filter-repo` (preferred) or BFG to remove the affected files/blobs from each
+   repo's history: `backend/.env.development`, the historical `backend/.env.example`,
+   `frontend/.gitleaks.toml` values, `frontend/scripts/show-credentials.sh`.
+2. Force-push the rewritten history and have all collaborators re-clone. **Coordinate** —
+   per [[parallel_agents_on_escrow]] another agent pushes to these branches; a history
+   rewrite mid-flight will conflict badly. Pick a quiet window.
+3. Re-run gitleaks to confirm the working tree and history are clean.
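+
+As a quick cross-check alongside gitleaks in step 3, a literal search for the revoked value can confirm the working tree no longer contains it. This is a minimal sketch, not a gitleaks replacement: `find_leaked_value` and the `OLD_SECRET` variable are hypothetical names, and the search covers files on disk only, not history.
+
+```bash
+# Hypothetical helper: lists files under DIR that still contain the revoked
+# value. The value is read from OLD_SECRET rather than typed as an argument,
+# which keeps it out of shell history.
+# Exit status: 0 = not found, 1 = still present, 2 = usage or grep error.
+find_leaked_value() {
+  dir="${1:-.}"
+  if [ -z "${OLD_SECRET:-}" ]; then
+    echo "set OLD_SECRET to the revoked value first" >&2
+    return 2
+  fi
+  status=0
+  # -r recurse, -I skip binary files, -l list file names, -F literal match
+  grep -rIlF --exclude-dir=.git -- "$OLD_SECRET" "$dir" || status=$?
+  case "$status" in
+    0) return 1 ;;  # grep matched: the value is still on disk
+    1) return 0 ;;  # grep found nothing
+    *) return 2 ;;  # grep itself failed
+  esac
+}
+```
+
+For the rewritten history, `git log --all --oneline -S"$OLD_SECRET"` lists commits whose changes add or remove the string; an empty result after the force-push is the expected outcome.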
+
+---
+
+## Verification checklist
+
+- [ ] Each credential rotated at the provider and old value **revoked**.
+- [ ] New values present only in the runtime secret store (no committed file holds a real value).
+- [ ] Backend boots; `/api/health` green; login, email send, Telegram login, and an RN webhook all succeed with new secrets.
+- [ ] `.env.development` untracked; `.dockerignore` no longer whitelists it; config no longer loads it in prod.
+- [ ] gitleaks passes on working tree; history scrubbed and force-pushed in a coordinated window.