audit: 2026-05-30 full-codebase audit — report, issues, docs, runbooks
Full-codebase-audit 2026-05-30 outputs:
- Audit report: 09 - Audits/Full Codebase Audit - 2026-05-30.md
- 81 issue files ISSUE-055..135 (decisions + 1 skipped no-brainer).
- Scanner docs from scratch (was zero): architecture, data model, API ref, payment flow, operations runbook + repo README.
- Doc-sync updates across API reference, data models, flows, design system.
- Secret Rotation Runbook (08 - Operations) for the exposed credentials.
- Reusable workflow guide (07 - Development) + .claude/workflows/full-codebase-audit.js.

Issues remain status:open intentionally; the code fixes are uncommitted-then-committed working-tree changes per repo and aren't "resolved" until merged/deployed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -11,24 +11,21 @@ What's instrumented today and what to watch. Today's stack is intentionally lean

## 1. Health endpoint

Path: `GET /health` (backend, port `5001`).
Two paths are registered (both are public, rate-limited, not auth-gated):

Defined in `backend/src/app.ts`:
- `GET /health`: simple ping used by Docker healthchecks. Returns `200 { success, message, timestamp, environment, version }`. Does **not** probe MongoDB or Redis.
- `GET /api/health`: deep health check added in commit `44579d6` (backend v2.6.49). Calls `runHealthChecks` from `backend/src/services/health/healthCheckService.ts`. Probes MongoDB and Redis, collects memory/uptime stats, and returns a structured report. Returns `503` when `report.status === 'down'`.

```ts
app.get("/health", (req, res) => {
  res.json({
    success: true,
    message: "Marketplace Backend API is running",
    timestamp: new Date().toISOString(),
    environment: config.nodeEnv,
    version: packageJson.version,
  });
});
```
`GET /api/health` response shape (from `healthCheckService`):
```json
{
  "status": "ok",
  "version": "2.6.xx",
  "timestamp": "...",
  "checks": { "mongodb": "ok", "redis": "ok", "uptime": 3600, "memoryMB": 120 }
}
```

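The mapping from report to HTTP status described above can be sketched as follows. This is a minimal illustration, not the code in `healthCheckService.ts`: the `HealthReport` type, `buildReport`, and `httpStatusFor` are hypothetical names, and the rule that the report is `down` as soon as either data store is down is an assumption. Only the `503` on `status === 'down'` comes from the text.

```ts
type CheckState = "ok" | "down";

// Hypothetical, reduced view of the report (the real one also carries
// version, timestamp, uptime and memory stats).
interface HealthReport {
  status: "ok" | "down";
  checks: { mongodb: CheckState; redis: CheckState };
}

// Assumption: the report is "down" as soon as either data store is down.
function buildReport(mongodb: CheckState, redis: CheckState): HealthReport {
  const status = mongodb === "ok" && redis === "ok" ? "ok" : "down";
  return { status, checks: { mongodb, redis } };
}

// From the text: respond 503 when report.status === 'down', otherwise 200.
function httpStatusFor(report: HealthReport): number {
  return report.status === "down" ? 503 : 200;
}
```
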
`GET /health` returns `200` with a JSON envelope as soon as Express is up. It does **not** probe MongoDB or Redis; those are checked via separate Docker healthchecks. For deep health, use `GET /api/health`, which probes both data stores and returns `503` when the report status is `down`.

Public URL behind Nginx: `https://amn.gg/api/health`.

---

220
08 - Operations/Scanner Operations.md
Normal file
@@ -0,0 +1,220 @@
---
title: Scanner Operations
tags: [operations, scanner, deployment]
created: 2026-05-30
---

# Scanner Operations

Runbook for deploying, configuring, monitoring, and troubleshooting the AMN Pay Scanner microservice.

---

## 1. Configuration reference

All configuration is supplied via environment variables. See `.env.example` in the scanner repo.

| Variable | Default | Required | Description |
|---|---|---|---|
| `PORT` | `8080` | no | HTTP listen port |
| `DB_PATH` | `./scanner.db` | no | SQLite database path |
| `CHAINS_JSON_PATH` | `./supported-chains.json` | no | Supported chains config |
| `TOKENS_JSON_PATH` | `./tokens.json` | no | Token registry |
| `SCANNER_API_KEY` | _(none)_ | **yes (prod)** | Bearer token for all non-health endpoints. Generate with `openssl rand -hex 32` |
| `POLL_INTERVAL_SEC` | `15` | no | Chain poll interval in seconds |
| `INTENT_TTL_HOURS` | `24` | no | Pending/confirming intents older than this are expired (0 = disabled) |
| `WEBHOOK_RETRY_HOURS` | `6` | no | Interval in hours between automatic `webhook_failed` re-delivery passes (0 = disabled) |
| `TRONGRID_API_KEY` | _(none)_ | recommended | TronGrid API key; without it rate limits are very low |
| `TONCENTER_API_KEY` | _(none)_ | recommended | TonCenter API key |
| `RPC_BSC` | _(chain config)_ | no | Override BSC RPC URL (chain 56) |
| `RPC_ARB` | _(chain config)_ | no | Override Arbitrum RPC URL (chain 42161) |
| `RPC_ETH` | _(chain config)_ | no | Override Ethereum RPC URL (chain 1) |
| `RPC_POLYGON` | _(chain config)_ | no | Override Polygon RPC URL (chain 137) |
| `RPC_BASE` | _(chain config)_ | no | Override Base RPC URL (chain 8453) |

> [!warning]
> If `SCANNER_API_KEY` is not set, the scanner logs a warning and accepts all requests. Never run this way in production.

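Because the scanner itself only warns when `SCANNER_API_KEY` is missing, a deploy-time preflight check can catch the problem earlier. The sketch below is hypothetical tooling, not part of the scanner (which is written in Go): `loadScannerConfig`, `intFromEnv`, and the `production` flag are assumed names. The variable names and defaults are taken from the table above.

```ts
type Env = Record<string, string | undefined>;

interface ScannerConfig {
  port: number;
  pollIntervalSec: number;
  intentTtlHours: number;
  webhookRetryHours: number;
  apiKey: string | null;
}

// Parse a non-negative integer variable, falling back to the documented default.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Stricter than the scanner: refuse to continue without an API key in production.
function loadScannerConfig(env: Env, production: boolean): ScannerConfig {
  const apiKey = env["SCANNER_API_KEY"]?.trim() || null;
  if (production && apiKey === null) {
    throw new Error("SCANNER_API_KEY is required in production");
  }
  return {
    port: intFromEnv(env, "PORT", 8080),
    pollIntervalSec: intFromEnv(env, "POLL_INTERVAL_SEC", 15),
    intentTtlHours: intFromEnv(env, "INTENT_TTL_HOURS", 24),
    webhookRetryHours: intFromEnv(env, "WEBHOOK_RETRY_HOURS", 6),
    apiKey,
  };
}
```
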
---

## 2. Docker deployment

The scanner ships as a single Docker image. The Dockerfile uses a two-stage build (Go 1.25 builder → Alpine 3.21 runtime).

### Quick start (dev)

```bash
cd scanner/
cp .env.example .env
# edit .env: set SCANNER_API_KEY, RPC overrides, etc.

docker build -t amn-scanner:dev .
docker run -d \
  --name amn-scanner \
  -p 8080:8080 \
  -v "$(pwd)/data:/data" \
  --env-file .env \
  amn-scanner:dev
```

### Production (via arcane-cli / Watchtower)

The scanner is deployed manually via `arcane-cli` (not gitops). Watchtower does NOT manage it automatically. After pushing a new image, redeploy with:

```bash
arcane-cli project redeploy --json <project-id>
```

The SQLite database is stored on a named Docker volume mounted at `/data`. Do not recreate the volume between deploys: it holds the checkpoint and intent state.

---

## 3. Health check

```bash
curl http://localhost:8080/health
# {"status":"ok","time":"2026-05-30T12:00:00Z"}
```

Docker `HEALTHCHECK` is already configured in the Dockerfile (30 s interval, 5 s timeout, 3 retries).

---

## 4. Monitoring

### Scanner status endpoint

```bash
curl -H "Authorization: Bearer $SCANNER_API_KEY" \
  http://localhost:8080/scanner/status | jq .
```

Check:
- `lag`: should be near 0 for healthy chains (blocks behind for EVM, seconds for TON)
- `pendingIntents`: number of unresolved intents per chain
- `lastScannedBlock`: should advance each poll

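The checks above can be automated by comparing two consecutive status polls. The sketch below is an assumption-heavy illustration: the `ChainStatus` shape (one flat entry per chain with a `chainId`) is a guess at the payload layout, and `findUnhealthyChains` is a hypothetical helper. The field names `lag`, `pendingIntents`, and `lastScannedBlock` come from the list above, and the default threshold of 100 mirrors the `[evm] scanner lag` log pattern.

```ts
// Assumed shape of one per-chain entry in the /scanner/status payload.
interface ChainStatus {
  chainId: number;
  lag: number;
  pendingIntents: number;
  lastScannedBlock: number;
}

// Returns the chain IDs that look unhealthy between two polls:
// either lag exceeds maxLag, or the chain is behind and did not advance.
function findUnhealthyChains(
  previous: ChainStatus[],
  current: ChainStatus[],
  maxLag = 100,
): number[] {
  const lastSeen = new Map<number, number>();
  for (const p of previous) lastSeen.set(p.chainId, p.lastScannedBlock);

  return current
    .filter((c) => {
      const before = lastSeen.get(c.chainId);
      const stalled =
        before !== undefined && c.lag > 0 && c.lastScannedBlock <= before;
      return c.lag > maxLag || stalled;
    })
    .map((c) => c.chainId);
}
```
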
### Logs

The scanner uses Go's `log/slog` structured logger with level prefixes. Key log patterns:

| Pattern | Meaning |
|---|---|
| `[scanner] worker started` | Worker goroutine began for this chain |
| `[evm] intent confirming` | EVM tx seen, waiting for confirmations |
| `[evm] intent confirmed` | EVM: N confirmations reached |
| `[tron] MATCH` / `[ton] MATCH` | Transfer matched; intent moves to `confirmed` |
| `[webhook] delivered` | Webhook POST succeeded |
| `[webhook] non-2xx response` | Backend returned error (will retry) |
| `[webhook] all retries exhausted` | Intent moved to `webhook_failed` |
| `[scanner] reconciling confirmed intents` | Startup crash recovery in progress |
| `[evm] scanner lag` | Chain lag > 100 blocks (investigate RPC) |

---

## 5. Adding / modifying chains

Edit `supported-chains.json`. Fields:

| Field | Notes |
|---|---|
| `chainId` | Numeric EIP-155 chain ID (arbitrary int for Tron/TON) |
| `chainType` | `"evm"` (default) / `"tron"` / `"ton"` |
| `rpcUrl` | Primary RPC endpoint |
| `publicRpcUrl` | Fallback RPC (EVM only) |
| `proxyAddress` | ERC20FeeProxy address (EVM); USDT contract (Tron); USDT Jetton master (TON) |
| `confirmationThreshold` | Blocks required (EVM); ignored for Tron/TON |
| `verified` | `true` to activate the worker; `false` to disable without deleting |

> [!important]
> Changing `proxyAddress` for an EVM chain only affects new scans. Existing pending intents will still be matched against the old address until they expire or are confirmed.

After editing, restart the scanner container to pick up the new config.

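Since a bad entry is only noticed after a restart, it can help to lint `supported-chains.json` before deploying. The sketch below is a hypothetical pre-deploy check, not scanner code: `validateChainEntry` and the specific rules (http(s) URL, 20-byte hex address for EVM, threshold of at least 1) are assumptions layered on the field notes in the table above.

```ts
type ChainType = "evm" | "tron" | "ton";

interface ChainEntry {
  chainId: number;
  chainType?: ChainType; // defaults to "evm" per the table
  rpcUrl: string;
  publicRpcUrl?: string; // EVM only
  proxyAddress: string;
  confirmationThreshold?: number; // EVM only
  verified: boolean;
}

// Returns a list of human-readable problems; an empty list means the entry passed.
function validateChainEntry(entry: ChainEntry): string[] {
  const problems: string[] = [];
  const chainType = entry.chainType ?? "evm";

  if (!Number.isInteger(entry.chainId)) {
    problems.push("chainId must be an integer");
  }
  if (!/^https?:\/\//.test(entry.rpcUrl)) {
    problems.push("rpcUrl must be an http(s) URL"); // assumed rule
  }
  if (chainType === "evm") {
    if (!/^0x[0-9a-fA-F]{40}$/.test(entry.proxyAddress)) {
      problems.push("proxyAddress must be a 20-byte hex address on EVM chains");
    }
    if (
      entry.confirmationThreshold === undefined ||
      !Number.isInteger(entry.confirmationThreshold) ||
      entry.confirmationThreshold < 1
    ) {
      problems.push("confirmationThreshold must be an integer >= 1 on EVM chains");
    }
  } else if (entry.publicRpcUrl !== undefined) {
    problems.push("publicRpcUrl is only used on EVM chains");
  }
  return problems;
}
```
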
---

## 6. Adding tokens to the registry

Edit `tokens.json`. Each entry:

```json
{ "chainId": 56, "address": "0x...", "symbol": "USDC", "decimals": 18, "name": "USD Coin" }
```

The token registry is used only for populating `tokenSymbol` and `decimals` in the `checkoutBlock` response. Omitting a token does not break scanning; it just leaves those fields empty.

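The lookup behaviour described above can be sketched as follows. This is an illustration only: `lookupToken` is a hypothetical helper, the exact-match comparison on `address` follows the "match exactly" advice in the troubleshooting section rather than any confirmed scanner behaviour, and representing "empty" as `""` and `null` is an assumption.

```ts
interface TokenEntry {
  chainId: number;
  address: string;
  symbol: string;
  decimals: number;
  name: string;
}

interface TokenMeta {
  tokenSymbol: string;
  decimals: number | null;
}

// Unknown tokens do not raise an error; the metadata fields simply stay empty.
function lookupToken(
  registry: TokenEntry[],
  chainId: number,
  address: string,
): TokenMeta {
  const hit = registry.find(
    (t) => t.chainId === chainId && t.address === address,
  );
  return hit
    ? { tokenSymbol: hit.symbol, decimals: hit.decimals }
    : { tokenSymbol: "", decimals: null };
}
```
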
---

## 7. Manual webhook retry

Force immediate re-delivery of all `webhook_failed` intents:

```bash
curl -X POST -H "Authorization: Bearer $SCANNER_API_KEY" \
  http://localhost:8080/admin/webhooks/retry
# {"queued": N}
```

---

## 8. Database inspection

The SQLite database (`/data/scanner.db`) can be inspected with the `sqlite3` CLI inside the container:

```bash
docker exec -it amn-scanner sqlite3 /data/scanner.db

-- The statements below are typed at the sqlite3 prompt (SQL comments start with --).

-- Check stuck intents
SELECT intent_id, chain_id, status, created_at, webhook_delivered_at
FROM intents
WHERE status NOT IN ('confirmed', 'expired')
ORDER BY created_at DESC;

-- Check chain checkpoints
SELECT chain_id, last_scanned_block, updated_at FROM checkpoints;

-- Count by status
SELECT status, count(*) FROM intents GROUP BY status;
```

---

## 9. Troubleshooting

### Intent stuck in `pending`

1. Check `/scanner/status`: is the chain worker running and advancing? A `lag` that stays above 0 for a long time points to an RPC issue.
2. Check that `chainId` and `tokenAddress` match exactly what is in `supported-chains.json` and `tokens.json`.
3. For EVM: verify the `proxyAddress` matches the contract the buyer is calling.
4. For Tron: confirm the destination address is stored in EVM-hex (0x) format in the DB.
5. Check scanner logs for `REJECT` messages around the expected tx time.

### Webhook never received by backend

1. Check `webhook_delivered_at` in the DB. If it is not null, the scanner delivered successfully and the backend side is the issue.
2. If it is null and the status is `webhook_failed`: check backend logs for the incoming POST and verify the `X-AMN-Signature` validation code.
3. If the status is `confirmed` but `webhook_delivered_at` is null: startup reconciliation may re-deliver on the next restart.
4. Use `POST /admin/webhooks/retry` to trigger an immediate retry.

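The webhook triage steps above amount to a small decision table over two columns of the `intents` row. A minimal sketch, assuming the status values named in this runbook (`pending`, `confirming`, `confirmed`, `webhook_failed`, `expired`); `diagnoseWebhook` and the `Diagnosis` labels are hypothetical names introduced for illustration.

```ts
type IntentStatus =
  | "pending"
  | "confirming"
  | "confirmed"
  | "webhook_failed"
  | "expired";

type Diagnosis =
  | "backend-side" // scanner delivered; look at the backend handler
  | "retry-delivery" // check backend logs and signature validation, then retry
  | "await-reconciliation" // startup reconciliation may re-deliver
  | "not-confirmed-yet"; // no webhook is expected at this point

function diagnoseWebhook(
  status: IntentStatus,
  webhookDeliveredAt: string | null,
): Diagnosis {
  if (webhookDeliveredAt !== null) return "backend-side";
  if (status === "webhook_failed") return "retry-delivery";
  if (status === "confirmed") return "await-reconciliation";
  return "not-confirmed-yet";
}
```
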
### High lag on EVM chain

1. Check RPC endpoint availability and rate limits.
2. Consider setting an `RPC_*` env override to a premium RPC (Alchemy, Infura, QuickNode).
3. The scanner falls back to `publicRpcUrl` if the primary fails, but public nodes have lower limits.

### Intent confirmed but amount looks wrong

The scanner accepts any amount **>=** `intent.Amount`. Overpayments are not flagged. Underpayments result in the intent staying pending until TTL expiry.

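The amount rule can be illustrated with a comparison on base-unit amounts. This is a sketch of the rule as stated, not the scanner's Go code: `compareAmounts` and `classifyPayment` are hypothetical, amounts are assumed to be non-negative integer decimal strings (token base units often exceed what a JavaScript `number` can hold exactly), and the separate `confirm-overpaid` outcome is an addition showing how a consumer could flag overpayments that the scanner itself does not.

```ts
// Compare two non-negative integer amounts given as decimal strings.
// Returns -1, 0 or 1, like a sort comparator.
function compareAmounts(a: string, b: string): number {
  if (!/^\d+$/.test(a) || !/^\d+$/.test(b)) {
    throw new Error("amounts must be non-negative integer strings");
  }
  const x = a.replace(/^0+(?=\d)/, ""); // strip leading zeros, keep a lone "0"
  const y = b.replace(/^0+(?=\d)/, "");
  if (x.length !== y.length) return x.length < y.length ? -1 : 1;
  return x < y ? -1 : x > y ? 1 : 0;
}

type AmountOutcome = "confirm" | "confirm-overpaid" | "stay-pending";

// paid >= expected confirms the intent; anything less leaves it pending.
function classifyPayment(paid: string, expected: string): AmountOutcome {
  const cmp = compareAmounts(paid, expected);
  if (cmp < 0) return "stay-pending";
  return cmp === 0 ? "confirm" : "confirm-overpaid";
}
```
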
---

## 10. CI/CD notes

- Woodpecker CI pipeline is in `.woodpecker/`.
- Telegram notify steps were removed (no TG secrets configured).
- Deploy step was removed; the scanner is deployed manually via `arcane-cli`.
- The CI pipeline builds and pushes the Docker image to the Gitea registry.
- Image tag format: `dev-<VERSION>` (from the `VERSION` file).

> [!tip]
> After CI completes, verify the image is in the registry before redeploying. Silent CI failures can leave a stale image tagged. Check the registry tag timestamp, not just the CI green light.
105
08 - Operations/Secret Rotation Runbook - 2026-05-30.md
Normal file
@@ -0,0 +1,105 @@
---
title: Secret Rotation Runbook - 2026-05-30
tags: [operations, security, secrets, incident]
created: 2026-05-30
status: action-required
source: Full Codebase Audit - 2026-05-30
---

# Secret Rotation Runbook - 2026-05-30

The 2026-05-30 full codebase audit found live credentials committed to the repos and, in
some cases, baked into container images. The audit's no-brainer fixes **replaced the
committed values with placeholders in the working tree**, but the *real* credentials are
still valid and must be **rotated by a human**. Replacing a string in git does not
invalidate a leaked key.

> Treat every credential below as **compromised**. Anyone with repo (or image) access has
> had these values. Rotate first, then scrub history.

Related issues: ISSUE-074, ISSUE-075, ISSUE-079, ISSUE-115 and decisions DEC-49, DEC-50,
DEC-56, DEC-74, DEC-75, DEC-78.

---

## Order of operations (per credential)

1. **Rotate**: generate a new value at the provider.
2. **Inject at runtime**: put the new value in the deployment secret store (Arcane env /
   compose secrets), **never** back into a committed file.
3. **Deploy**: roll the new value out and confirm the service is healthy.
4. **Revoke**: invalidate the old value at the provider.
5. **Scrub**: remove the secret from git history (see "History scrub" below).

Do these one credential at a time and verify the dependent service after each.

---

## Credentials to rotate

| # | Credential | Where it leaked | Blast radius | How to rotate |
|---|-----------|-----------------|--------------|---------------|
| 1 | **Telegram bot token** | `backend/.env.development`, `backend/.env.example`, `frontend/.gitleaks.toml` | Full control of the bot: read/send messages, hijack the login widget, phish users | BotFather → `/revoke` → new token. Update `TELEGRAM_BOT_TOKEN`. |
| 2 | **Resend SMTP / API key** | `backend/.env.development`, `backend/.env.example` | Send email as the platform (phishing, OTP spoofing), read sending logs | Resend dashboard → API Keys → delete + create. Update `RESEND_API_KEY` / SMTP creds. |
| 3 | **JWT signing secret** | `backend/.env.example` | Forge **any** user/admin session token (critical) | Generate 32+ random bytes (`openssl rand -hex 32`). Update `JWT_SECRET`. **Rotating invalidates all sessions** (users re-login). Consider also adding a separate `REFRESH_TOKEN_SECRET` (see DEC-26). |
| 4 | **Admin bootstrap password** | `backend/.env.example`; was also a hardcoded fallback in `init-admin.ts` (removed by NB-20) | Direct admin login | Set a strong `ADMIN_PASSWORD` secret; change the admin account password in-app; confirm `init-admin` no longer has a fallback. |
| 5 | **Request Network API key** | `backend/.env.example` | Act against the RN account; manipulate payment intents | RN dashboard → rotate key. Update `REQUEST_NETWORK_API_KEY`. |
| 6 | **Request Network webhook secret** | `backend/.env.example` | Forge RN webhooks → mark payments paid (this is the HMAC secret the backend verifies) | Rotate at RN; update `REQUEST_NETWORK_WEBHOOK_SECRET`. |
| 7 | **Telegram webhook secret token** | `backend/.env.example` | Forge Telegram webhook calls | Reset via `setWebhook` with a new `secret_token`; update the env var. |
| 8 | **Google OAuth client secret** | `backend/.env.example` | Impersonate the OAuth app | Google Cloud Console → Credentials → reset client secret. Update `GOOGLE_CLIENT_SECRET`. |
| 9 | **Alchemy API key(s)** | `frontend/Dockerfile` ARG defaults (removed by NB-10) | Quota theft / RPC abuse on your account | Alchemy dashboard → rotate app key. Supply via CI build-arg / runtime, not a default. |
| 10 | **TG_NOTIFY_BOT_TOKEN** (ops alert bot) | backend startup notification (committed env) | Spoof ops alerts; spam the ops channel | BotFather → revoke → new token. Update `TG_NOTIFY_BOT_TOKEN`. See [[telegram_notify_no_parse_mode]]. |
| 11 | **Frontend test account password** (value redacted here) | `frontend/scripts/show-credentials.sh` (DEC-75) | Login as that test user if it exists in any real env | Delete the script (or env-prompt it); rotate the account password if real. |

### Public-by-design (lower priority, but make explicit)
- **WalletConnect project ID**, **Google OAuth *client ID***: `frontend/Dockerfile` ARG
  defaults (DEC-74). These are public values, but remove the baked defaults and pass them
  via CI build-args so forks don't reuse the production IDs.

---

## Stop re-leaking (pairs with rotation)

These are the structural fixes (tracked as decisions) that stop the secrets coming back:

- **DEC-50 / ISSUE-075**: `backend/.dockerignore` whitelists `.env.development` *into the
  prod image*. Remove the `!.env.development` line so no env file is ever copied into an
  image; inject secrets at runtime.
- **DEC-49 / ISSUE-101**: `backend/src/shared/config/index.ts` loads `.env.development`
  unconditionally. Load `.env.<NODE_ENV>` (or nothing in production) and never fall back to
  the dev file.
- **DEC-56 / ISSUE-074**: untrack `backend/.env.development` entirely (`git rm --cached`)
  and add it to `.gitignore`.
- **DEC-78 / ISSUE-079**: `frontend/.gitleaks.toml` allowlists the bot token *by value*.
  Switch to a path/fingerprint-based allowlist after scrubbing, so gitleaks stops
  "approving" the secret. See the `handle-gitleaks` skill.

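The DEC-49 rule (load `.env.<NODE_ENV>`, or nothing in production, and never fall back to the dev file) can be sketched as a small pure function. This is an illustration of the rule, not the contents of `backend/src/shared/config/index.ts`: `envFileFor` is a hypothetical name, and restricting `NODE_ENV` to lowercase letters is an added assumption to keep unexpected values out of a file path.

```ts
// Decide which env file (if any) the config loader may read.
// Returns null when nothing should be loaded from disk.
function envFileFor(nodeEnv: string | undefined): string | null {
  // Unset or production: rely on runtime-injected secrets only.
  if (nodeEnv === undefined || nodeEnv === "" || nodeEnv === "production") {
    return null;
  }
  // Assumed guard: only plain lowercase names such as "development" or "test".
  if (!/^[a-z]+$/.test(nodeEnv)) {
    throw new Error(`unexpected NODE_ENV value: "${nodeEnv}"`);
  }
  return `.env.${nodeEnv}`;
}
```
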
Runtime injection point for this stack: the **Arcane** env / project config (see
[[arcane_dev_stack]], [[arcane_cli_usage]]) for dev, and the production secret store for
prod. After changing any backend secret, remember the dev redeploy caveat:
restart `nickDev-nginx` (see [[devEscrow_nginx_after_redeploy]]).

---

## History scrub (after rotation + revocation)

Only after the old values are revoked, purge them from history so they can't be mined from
old commits:

1. Use `git filter-repo` (preferred) or BFG to remove the affected files/blobs from each
   repo's history: `backend/.env.development`, the historical `backend/.env.example`,
   `frontend/.gitleaks.toml` values, `frontend/scripts/show-credentials.sh`.
2. Force-push the rewritten history and have all collaborators re-clone. **Coordinate**:
   per [[parallel_agents_on_escrow]] another agent pushes to these branches; a history
   rewrite mid-flight will conflict badly. Pick a quiet window.
3. Re-run gitleaks to confirm the working tree and history are clean.

---

## Verification checklist

- [ ] Each credential rotated at the provider and old value **revoked**.
- [ ] New values present only in the runtime secret store (no committed file holds a real value).
- [ ] Backend boots; `/api/health` green; login, email send, Telegram login, and an RN webhook all succeed with new secrets.
- [ ] `.env.development` untracked; `.dockerignore` no longer whitelists it; config no longer loads it in prod.
- [ ] gitleaks passes on working tree; history scrubbed and force-pushed in a coordinated window.