audit: 2026-05-30 full-codebase audit — report, issues, docs, runbooks

Full-codebase-audit 2026-05-30 outputs:
- Audit report: 09 - Audits/Full Codebase Audit - 2026-05-30.md
- 81 issue files ISSUE-055..135 (decisions + 1 skipped no-brainer).
- Scanner docs from scratch (was zero): architecture, data model, API ref, payment
  flow, operations runbook + repo README.
- Doc-sync updates across API reference, data models, flows, design system.
- Secret Rotation Runbook (08 - Operations) for the exposed credentials.
- Reusable workflow guide (07 - Development) + .claude/workflows/full-codebase-audit.js.

Issues remain status:open intentionally — the code fixes are uncommitted-then-committed
working-tree changes per repo and aren't "resolved" until merged/deployed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Siavash Sameni
2026-05-30 18:41:44 +04:00
parent eab1d77582
commit dceaf82934
153 changed files with 6276 additions and 179 deletions


@@ -11,24 +11,21 @@ What's instrumented today and what to watch. Today's stack is intentionally lean
## 1. Health endpoint
Path: `GET /health` (backend, port `5001`).
Two paths are registered (both are public, rate-limited, not auth-gated):
Defined in `backend/src/app.ts`:
- `GET /health` — simple ping used by Docker healthchecks. Returns `200 { success, message, timestamp, environment, version }`. Does **not** probe MongoDB or Redis.
- `GET /api/health` — deep health check added in commit `44579d6` (backend v2.6.49). Calls `runHealthChecks` from `backend/src/services/health/healthCheckService.ts`. Probes MongoDB and Redis, collects memory/uptime stats, and returns a structured report. Returns `503` when `report.status === 'down'`.
```ts
app.get("/health", (req, res) => {
res.json({
success: true,
message: "Marketplace Backend API is running",
timestamp: new Date().toISOString(),
environment: config.nodeEnv,
version: packageJson.version,
});
});
`GET /api/health` response shape (from `healthCheckService`):
```json
{
"status": "ok",
"version": "2.6.xx",
"timestamp": "...",
"checks": { "mongodb": "ok", "redis": "ok", "uptime": 3600, "memoryMB": 120 }
}
```
Returns `200` with a JSON envelope as soon as Express is up. Does **not** currently probe MongoDB or Redis — they are checked via separate Docker healthchecks. If you want deep health, extend the endpoint to ping both data stores and return `503` on failure.
Public URL behind Nginx: `https://amn.gg/api/health`.
---


@@ -0,0 +1,220 @@
---
title: Scanner Operations
tags: [operations, scanner, deployment]
created: 2026-05-30
---
# Scanner Operations
Runbook for deploying, configuring, monitoring, and troubleshooting the AMN Pay Scanner microservice.
---
## 1. Configuration reference
All configuration is done through environment variables. See `.env.example` in the scanner repo.
| Variable | Default | Required | Description |
|---|---|---|---|
| `PORT` | `8080` | no | HTTP listen port |
| `DB_PATH` | `./scanner.db` | no | SQLite database path |
| `CHAINS_JSON_PATH` | `./supported-chains.json` | no | Supported chains config |
| `TOKENS_JSON_PATH` | `./tokens.json` | no | Token registry |
| `SCANNER_API_KEY` | _(none)_ | **yes (prod)** | Bearer token for all non-health endpoints. Generate with `openssl rand -hex 32` |
| `POLL_INTERVAL_SEC` | `15` | no | Chain poll interval in seconds |
| `INTENT_TTL_HOURS` | `24` | no | Pending/confirming intents older than this are expired (0 = disabled) |
| `WEBHOOK_RETRY_HOURS` | `6` | no | Interval between automatic webhook_failed re-delivery passes (0 = disabled) |
| `TRONGRID_API_KEY` | _(none)_ | recommended | TronGrid API key; without it rate limits are very low |
| `TONCENTER_API_KEY` | _(none)_ | recommended | TonCenter API key |
| `RPC_BSC` | _(chain config)_ | no | Override BSC RPC URL (chain 56) |
| `RPC_ARB` | _(chain config)_ | no | Override Arbitrum RPC URL (chain 42161) |
| `RPC_ETH` | _(chain config)_ | no | Override Ethereum RPC URL (chain 1) |
| `RPC_POLYGON` | _(chain config)_ | no | Override Polygon RPC URL (chain 137) |
| `RPC_BASE` | _(chain config)_ | no | Override Base RPC URL (chain 8453) |
> [!warning]
> If `SCANNER_API_KEY` is not set, the scanner logs a warning and accepts all requests. Never run this way in production.
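
The defaults and the missing-key warning above can be summarised as a small parsing routine. This is a minimal TypeScript sketch for illustration only: the scanner itself is a Go service, and `ScannerConfig`, `intOr`, and `loadScannerConfig` are hypothetical names, not part of its code. Only the variable names and default values are taken from the table.

```ts
// Hypothetical loader: the real scanner is a Go service and its parsing code is not
// shown in this runbook. Only the variable names and defaults come from the table above.
interface ScannerConfig {
  port: number;
  dbPath: string;
  apiKey: string | null;
  pollIntervalSec: number;
  intentTtlHours: number; // 0 disables intent expiry
  webhookRetryHours: number; // 0 disables automatic re-delivery
}

type Env = Record<string, string | undefined>;

// Parse a non-negative integer, falling back to the documented default when unset.
function intOr(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`expected a non-negative integer, got "${raw}"`);
  }
  return n;
}

function loadScannerConfig(env: Env): { config: ScannerConfig; warnings: string[] } {
  const warnings: string[] = [];
  const apiKey = env["SCANNER_API_KEY"]?.trim() || null;
  if (apiKey === null) {
    // Mirrors the documented behaviour: warn, but keep running.
    warnings.push("SCANNER_API_KEY is not set: all requests will be accepted");
  }
  const config: ScannerConfig = {
    port: intOr(env["PORT"], 8080),
    dbPath: env["DB_PATH"] || "./scanner.db",
    apiKey,
    pollIntervalSec: intOr(env["POLL_INTERVAL_SEC"], 15),
    intentTtlHours: intOr(env["INTENT_TTL_HOURS"], 24),
    webhookRetryHours: intOr(env["WEBHOOK_RETRY_HOURS"], 6),
  };
  return { config, warnings };
}
```

A pre-deploy check could call `loadScannerConfig` on the intended environment and refuse to continue when `warnings` is non-empty for a production target.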
---
## 2. Docker deployment
The scanner ships as a single Docker image. The Dockerfile uses a two-stage build (Go 1.25 builder → Alpine 3.21 runtime).
### Quick start (dev)
```bash
cd scanner/
cp .env.example .env
# edit .env — set SCANNER_API_KEY, RPC overrides, etc.
docker build -t amn-scanner:dev .
docker run -d \
  --name amn-scanner \
  -p 8080:8080 \
  -v "$(pwd)/data:/data" \
  --env-file .env \
  amn-scanner:dev
```
### Production (manual redeploy via arcane-cli, not Watchtower)
The scanner is deployed manually via `arcane-cli` (not gitops). Watchtower does NOT manage it automatically. After pushing a new image, redeploy with:
```bash
arcane-cli project redeploy --json <project-id>
```
The SQLite database is stored on a named Docker volume mounted at `/data`. Do not recreate the volume between deploys; it holds the checkpoint and intent state.
---
## 3. Health check
```bash
curl http://localhost:8080/health
# {"status":"ok","time":"2026-05-30T12:00:00Z"}
```
Docker `HEALTHCHECK` is already configured in the Dockerfile (30 s interval, 5 s timeout, 3 retries).
---
## 4. Monitoring
### Scanner status endpoint
```bash
curl -H "Authorization: Bearer $SCANNER_API_KEY" \
http://localhost:8080/scanner/status | jq .
```
Check:
- `lag` — should be near 0 for healthy chains (blocks behind for EVM, seconds for TON)
- `pendingIntents` — number of unresolved intents per chain
- `lastScannedBlock` — should advance each poll
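The checks above can be automated by comparing two status snapshots taken at least one poll interval apart. The sketch below is illustrative: the runbook names the fields `lag`, `pendingIntents`, and `lastScannedBlock`, but the surrounding response shape (a list of per-chain entries with a `chainId`) is an assumption, and `findUnhealthyChains` is a hypothetical helper. The default threshold of 100 follows the `[evm] scanner lag` log pattern; for TON, `lag` is in seconds, so a different threshold may be appropriate.

```ts
// Assumed shape of one per-chain entry in the /scanner/status response.
// Only lag, pendingIntents and lastScannedBlock are named in the runbook.
interface ChainStatus {
  chainId: number;
  lag: number;
  pendingIntents: number;
  lastScannedBlock: number;
}

// Compare two snapshots and describe every chain that looks unhealthy.
function findUnhealthyChains(
  previous: ChainStatus[],
  current: ChainStatus[],
  maxLag = 100,
): string[] {
  const before = new Map<number, number>();
  for (const p of previous) before.set(p.chainId, p.lastScannedBlock);

  const problems: string[] = [];
  for (const c of current) {
    if (c.lag > maxLag) {
      problems.push(`chain ${c.chainId}: lag ${c.lag} exceeds ${maxLag}`);
    }
    const prevBlock = before.get(c.chainId);
    if (prevBlock !== undefined && c.lastScannedBlock <= prevBlock) {
      problems.push(`chain ${c.chainId}: lastScannedBlock did not advance`);
    }
  }
  return problems;
}
```

An empty result means every chain advanced and stayed under the lag threshold between the two snapshots.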
### Logs
The scanner uses Go's `log/slog` structured logger with level prefixes. Key log patterns:
| Pattern | Meaning |
|---|---|
| `[scanner] worker started` | Worker goroutine began for this chain |
| `[evm] intent confirming` | EVM tx seen, waiting for confirmations |
| `[evm] intent confirmed` | EVM: N confirmations reached |
| `[tron] MATCH` / `[ton] MATCH` | Transfer matched, going to confirmed |
| `[webhook] delivered` | Webhook POST succeeded |
| `[webhook] non-2xx response` | Backend returned error (will retry) |
| `[webhook] all retries exhausted` | Intent moved to webhook_failed |
| `[scanner] reconciling confirmed intents` | Startup crash recovery in progress |
| `[evm] scanner lag` | Chain lag > 100 blocks (investigate RPC) |
---
## 5. Adding / modifying chains
Edit `supported-chains.json`. Fields:
| Field | Notes |
|---|---|
| `chainId` | Numeric EIP-155 chain ID (arbitrary int for Tron/TON) |
| `chainType` | `"evm"` (default) / `"tron"` / `"ton"` |
| `rpcUrl` | Primary RPC endpoint |
| `publicRpcUrl` | Fallback RPC (EVM only) |
| `proxyAddress` | ERC20FeeProxy address (EVM); USDT contract (Tron); USDT Jetton master (TON) |
| `confirmationThreshold` | Blocks required (EVM); ignored for Tron/TON |
| `verified` | `true` to activate the worker; `false` to disable without deleting |
> [!important]
> Changing `proxyAddress` for an EVM chain only affects new scans. Existing pending intents will still be matched against the old address until they expire or are confirmed.
After editing, restart the scanner container to pick up the new config.
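Because a config mistake only surfaces after the restart, it can help to validate an entry before editing the live file. The following is a sketch under stated assumptions: the field names come from the table above, but the rules (a positive `chainId`, a 40-hex-digit `proxyAddress` and a threshold of at least 1 for EVM chains) are assumptions, and `ChainEntry` and `validateChainEntry` are hypothetical names rather than scanner code.

```ts
type ChainType = "evm" | "tron" | "ton";

// Fields as listed in the table above; optionality is an assumption.
interface ChainEntry {
  chainId: number;
  chainType?: ChainType; // defaults to "evm"
  rpcUrl: string;
  publicRpcUrl?: string; // fallback RPC, EVM only
  proxyAddress: string;
  confirmationThreshold?: number; // ignored for Tron/TON
  verified: boolean;
}

// Return a list of human-readable problems; an empty list means the entry looks sane.
function validateChainEntry(entry: ChainEntry): string[] {
  const errors: string[] = [];
  const chainType = entry.chainType ?? "evm";

  if (!Number.isInteger(entry.chainId) || entry.chainId <= 0) {
    errors.push("chainId must be a positive integer");
  }
  if (!entry.rpcUrl) errors.push("rpcUrl is required");

  if (!entry.proxyAddress) {
    errors.push("proxyAddress is required");
  } else if (chainType === "evm" && !/^0x[0-9a-fA-F]{40}$/.test(entry.proxyAddress)) {
    errors.push("proxyAddress must be a 0x-prefixed 20-byte hex address for EVM chains");
  }

  if (chainType === "evm") {
    const t = entry.confirmationThreshold;
    if (t === undefined || !Number.isInteger(t) || t < 1) {
      errors.push("confirmationThreshold must be an integer >= 1 for EVM chains");
    }
  }
  return errors;
}
```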
---
## 6. Adding tokens to the registry
Edit `tokens.json`. Each entry:
```json
{ "chainId": 56, "address": "0x...", "symbol": "USDC", "decimals": 18, "name": "USD Coin" }
```
The token registry is used only to populate `tokenSymbol` and `decimals` in the `checkoutBlock` response. Omitting a token does not break scanning; it just leaves those fields empty.
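The lookup behaviour described above can be sketched as follows. This is an illustration, not the scanner's implementation: `lookupToken` is a hypothetical helper, case-insensitive address matching is an assumption, and representing "empty" as an empty string and `null` is also an assumption.

```ts
// One tokens.json entry, as shown in the example above.
interface TokenEntry {
  chainId: number;
  address: string;
  symbol: string;
  decimals: number;
  name: string;
}

interface TokenInfo {
  tokenSymbol: string; // "" when the token is not in the registry
  decimals: number | null; // null when the token is not in the registry
}

// Resolve display metadata for a token; a miss is not an error.
function lookupToken(registry: TokenEntry[], chainId: number, address: string): TokenInfo {
  const wanted = address.toLowerCase();
  const hit = registry.find(
    (t) => t.chainId === chainId && t.address.toLowerCase() === wanted,
  );
  return hit
    ? { tokenSymbol: hit.symbol, decimals: hit.decimals }
    : { tokenSymbol: "", decimals: null };
}
```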
---
## 7. Manual webhook retry
Force immediate re-delivery of all `webhook_failed` intents:
```bash
curl -X POST -H "Authorization: Bearer $SCANNER_API_KEY" \
http://localhost:8080/admin/webhooks/retry
# {"queued": N}
```
---
## 8. Database inspection
The SQLite database (`/data/scanner.db`) can be inspected with the `sqlite3` CLI inside the container:
```bash
docker exec -it amn-scanner sqlite3 /data/scanner.db
-- Run the queries below at the sqlite> prompt (SQL comments start with --, not #).
-- Check stuck intents
SELECT intent_id, chain_id, status, created_at, webhook_delivered_at
FROM intents
WHERE status NOT IN ('confirmed', 'expired')
ORDER BY created_at DESC;
-- Check chain checkpoints
SELECT chain_id, last_scanned_block, updated_at FROM checkpoints;
-- Count by status
SELECT status, count(*) FROM intents GROUP BY status;
```
---
## 9. Troubleshooting
### Intent stuck in `pending`
1. Check `/scanner/status`: is the chain worker running, and is `lastScannedBlock` advancing? A `lag` that stays above 0 for a long time usually points to an RPC issue.
2. Check that `chainId` and `tokenAddress` match exactly what is in `supported-chains.json` and `tokens.json`.
3. For EVM: verify the `proxyAddress` matches the contract the buyer is calling.
4. For Tron: confirm the destination address is stored in EVM-hex (0x) format in the DB.
5. Check scanner logs for `REJECT` messages around the expected tx time.
### Webhook never received by backend
1. Check `webhook_delivered_at` in the DB — if not null, the scanner delivered successfully and the backend side is the issue.
2. If null and status is `webhook_failed`: check the backend logs for the incoming POST and verify the `X-AMN-Signature` validation code.
3. If status is `confirmed` but `webhook_delivered_at` is null: startup reconciliation may re-deliver on next restart.
4. Use `POST /admin/webhooks/retry` to trigger immediate retry.
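When reviewing the backend's `X-AMN-Signature` validation, a reference implementation to compare against can be useful. The sketch below assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body; the runbook does not state the actual scheme, so confirm the algorithm, encoding, and signed payload against the scanner source first. `verifySignature` is a hypothetical helper name.

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

// ASSUMPTION: signature = hex(HMAC-SHA256(secret, rawBody)). Not confirmed by the runbook.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws when lengths differ, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Two common causes of false rejections are worth checking in the real handler: verifying against a re-serialised JSON body instead of the raw bytes, and comparing with `===` on strings that differ in case or encoding.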
### High lag on EVM chain
1. Check RPC endpoint availability and rate limits.
2. Consider setting an `RPC_*` env override to a premium RPC (Alchemy, Infura, QuickNode).
3. The scanner falls back to `publicRpcUrl` if the primary fails, but public nodes have lower rate limits.
### Intent confirmed but amount looks wrong
The scanner accepts any amount **>=** `intent.Amount`. Overpayments are not flagged. Underpayments result in the intent staying pending until TTL expiry.
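The rule above, combined with `INTENT_TTL_HOURS` from the configuration reference, can be sketched as a small decision function. This is a simplification for illustration: it ignores the intermediate `confirming` state and webhook delivery, amounts are assumed to be integer base units, and `evaluateIntent` is a hypothetical name.

```ts
type IntentOutcome = "confirm" | "pending" | "expired";

// received and intentAmount are token base units (integers), hence bigint.
function evaluateIntent(
  received: bigint,
  intentAmount: bigint,
  ageHours: number,
  ttlHours: number, // 0 disables expiry, as with INTENT_TTL_HOURS
): IntentOutcome {
  // Exact payments and overpayments are both accepted; overpayment is not flagged.
  if (received >= intentAmount) return "confirm";
  // Underpayment: the intent stays pending until the TTL (if enabled) expires it.
  if (ttlHours > 0 && ageHours > ttlHours) return "expired";
  return "pending";
}
```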
---
## 10. CI/CD notes
- Woodpecker CI pipeline is in `.woodpecker/`.
- Telegram notify steps were removed (no TG secrets configured).
- Deploy step was removed — the scanner is deployed manually via `arcane-cli`.
- The CI pipeline builds and pushes the Docker image to the Gitea registry.
- Image tag format: `dev-<VERSION>` (from the `VERSION` file).
> [!tip]
> After CI completes, verify the image is in the registry before redeploying. Silent CI failures can leave a stale image tagged. Check the registry tag timestamp, not just the CI green light.


@@ -0,0 +1,105 @@
---
title: Secret Rotation Runbook — 2026-05-30
tags: [operations, security, secrets, incident]
created: 2026-05-30
status: action-required
source: Full Codebase Audit - 2026-05-30
---
# Secret Rotation Runbook — 2026-05-30
The 2026-05-30 full codebase audit found live credentials committed to the repos and, in
some cases, baked into container images. The audit's no-brainer fixes **replaced the
committed values with placeholders in the working tree**, but the *real* credentials are
still valid and must be **rotated by a human** — replacing a string in git does not
invalidate a leaked key.
> Treat every credential below as **compromised**. Anyone with repo (or image) access has
> had these values. Rotate first, then scrub history.
Related issues: ISSUE-074, ISSUE-075, ISSUE-079, ISSUE-115 and decisions DEC-49, DEC-50,
DEC-56, DEC-74, DEC-75, DEC-78.
---
## Order of operations (per credential)
1. **Rotate** — generate a new value at the provider.
2. **Inject at runtime** — put the new value in the deployment secret store (Arcane env /
compose secrets), **never** back into a committed file.
3. **Deploy** — roll the new value out and confirm the service is healthy.
4. **Revoke** — invalidate the old value at the provider.
5. **Scrub** — remove the secret from git history (see "History scrub" at the bottom).
Do these one credential at a time and verify the dependent service after each.
---
## Credentials to rotate
| # | Credential | Where it leaked | Blast radius | How to rotate |
|---|-----------|-----------------|--------------|---------------|
| 1 | **Telegram bot token** | `backend/.env.development`, `backend/.env.example`, `frontend/.gitleaks.toml` | Full control of the bot: read/send messages, hijack the login widget, phish users | BotFather → `/revoke` → new token. Update `TELEGRAM_BOT_TOKEN`. |
| 2 | **Resend SMTP / API key** | `backend/.env.development`, `backend/.env.example` | Send email as the platform (phishing, OTP spoofing), read sending logs | Resend dashboard → API Keys → delete + create. Update `RESEND_API_KEY` / SMTP creds. |
| 3 | **JWT signing secret** | `backend/.env.example` | Forge **any** user/admin session token — critical | Generate 32+ random bytes (`openssl rand -hex 32`). Update `JWT_SECRET`. **Rotating invalidates all sessions** (users re-login). Consider also adding a separate `REFRESH_TOKEN_SECRET` (see DEC-26). |
| 4 | **Admin bootstrap password** | `backend/.env.example`, was also a hardcoded fallback in `init-admin.ts` (removed by NB-20) | Direct admin login | Set a strong `ADMIN_PASSWORD` secret; change the admin account password in-app; confirm `init-admin` no longer has a fallback. |
| 5 | **Request Network API key** | `backend/.env.example` | Act against the RN account; manipulate payment intents | RN dashboard → rotate key. Update `REQUEST_NETWORK_API_KEY`. |
| 6 | **Request Network webhook secret** | `backend/.env.example` | Forge RN webhooks → mark payments paid (this is the HMAC secret the backend verifies) | Rotate at RN; update `REQUEST_NETWORK_WEBHOOK_SECRET`. |
| 7 | **Telegram webhook secret token** | `backend/.env.example` | Forge Telegram webhook calls | Reset via `setWebhook` with a new `secret_token`; update the env var. |
| 8 | **Google OAuth client secret** | `backend/.env.example` | Impersonate the OAuth app | Google Cloud Console → Credentials → reset client secret. Update `GOOGLE_CLIENT_SECRET`. |
| 9 | **Alchemy API key(s)** | `frontend/Dockerfile` ARG defaults (removed by NB-10) | Quota theft / RPC abuse on your account | Alchemy dashboard → rotate app key. Supply via CI build-arg / runtime, not a default. |
| 10 | **TG_NOTIFY_BOT_TOKEN** (ops alert bot) | backend startup notification (committed env) | Spoof ops alerts; spam the ops channel | BotFather → revoke → new token. Update `TG_NOTIFY_BOT_TOKEN`. See [[telegram_notify_no_parse_mode]]. |
| 11 | **Frontend test account password** (`Moji6364`) | `frontend/scripts/show-credentials.sh` (DEC-75) | Login as that test user if it exists in any real env | Delete the script (or env-prompt it); rotate the account password if real. |
### Public-by-design (lower priority, but make explicit)
- **WalletConnect project ID**, **Google OAuth *client ID*** — `frontend/Dockerfile` ARG
defaults (DEC-74). These are public values, but remove the baked defaults and pass them
via CI build-args so forks don't reuse the production IDs.
---
## Stop re-leaking (pairs with rotation)
These are the structural fixes (tracked as decisions) that stop the secrets coming back:
- **DEC-50 / ISSUE-075** — `backend/.dockerignore` whitelists `.env.development` *into the
prod image*. Remove the `!.env.development` line so no env file is ever copied into an
image; inject secrets at runtime.
- **DEC-49 / ISSUE-101** — `backend/src/shared/config/index.ts` loads `.env.development`
unconditionally. Load `.env.<NODE_ENV>` (or nothing in production) and never fall back to
the dev file.
- **DEC-56 / ISSUE-074** — untrack `backend/.env.development` entirely (`git rm --cached`)
and add it to `.gitignore`.
- **DEC-78 / ISSUE-079** — `frontend/.gitleaks.toml` allowlists the bot token *by value*.
Switch to a path/fingerprint-based allowlist after scrubbing, so gitleaks stops
"approving" the secret. See the `handle-gitleaks` skill.
Runtime injection point for this stack: the **Arcane** env / project config (see
[[arcane_dev_stack]], [[arcane_cli_usage]]) for dev, and the production secret store for
prod. After changing any backend secret, remember the dev redeploy caveat:
restart `nickDev-nginx` (see [[devEscrow_nginx_after_redeploy]]).
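The DEC-49 rule listed above (load `.env.<NODE_ENV>`, nothing in production, and never fall back to the dev file) can be expressed as a small selection function. This is a sketch only: `envFileFor` is a hypothetical helper, and how `backend/src/shared/config/index.ts` would actually consume it is not shown here.

```ts
// Decide which env file, if any, the config module may load.
// Returns null when no file should be read and secrets must come from the runtime.
function envFileFor(nodeEnv: string | undefined): string | null {
  const env = nodeEnv?.trim();
  // Unset NODE_ENV: do not guess, and in particular do not default to the dev file.
  if (!env) return null;
  // Production: runtime-injected secrets only.
  if (env === "production") return null;
  return `.env.${env}`;
}
```

With this shape, a production container that accidentally contains `.env.development` still ignores it, which complements removing the `!.env.development` whitelist line rather than replacing that fix.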
---
## History scrub (after rotation + revocation)
Only after the old values are revoked, purge them from history so they can't be mined from
old commits:
1. Use `git filter-repo` (preferred) or BFG to remove the affected files/blobs from each
repo's history: `backend/.env.development`, the historical `backend/.env.example`,
`frontend/.gitleaks.toml` values, `frontend/scripts/show-credentials.sh`.
2. Force-push the rewritten history and have all collaborators re-clone. **Coordinate**
per [[parallel_agents_on_escrow]]: another agent pushes to these branches, and a history
rewrite mid-flight will conflict badly. Pick a quiet window.
3. Re-run gitleaks to confirm the working tree and history are clean.
---
## Verification checklist
- [ ] Each credential rotated at the provider and old value **revoked**.
- [ ] New values present only in the runtime secret store (no committed file holds a real value).
- [ ] Backend boots; `/api/health` green; login, email send, Telegram login, and an RN webhook all succeed with new secrets.
- [ ] `.env.development` untracked; `.dockerignore` no longer whitelists it; config no longer loads it in prod.
- [ ] gitleaks passes on working tree; history scrubbed and force-pushed in a coordinated window.