---
title: Scanner Operations
tags: [operations, scanner, deployment]
created: 2026-05-30
---
# Scanner Operations
Runbook for deploying, configuring, monitoring, and troubleshooting the AMN Pay Scanner microservice.

---
## 1. Configuration reference
All configuration is supplied through environment variables. See `.env.example` in the scanner repo.
| Variable | Default | Required | Description |
|---|---|---|---|
| `PORT` | `8080` | no | HTTP listen port |
| `DB_PATH` | `./scanner.db` | no | SQLite database path |
| `CHAINS_JSON_PATH` | `./supported-chains.json` | no | Supported chains config |
| `TOKENS_JSON_PATH` | `./tokens.json` | no | Token registry |
| `SCANNER_API_KEY` | _(none)_ | **yes (prod)** | Bearer token for all non-health endpoints. Generate with `openssl rand -hex 32` |
| `POLL_INTERVAL_SEC` | `15` | no | Chain poll interval in seconds |
| `INTENT_TTL_HOURS` | `24` | no | Pending/confirming intents older than this are expired (0 = disabled) |
| `WEBHOOK_RETRY_HOURS` | `6` | no | Interval in hours between automatic re-delivery passes for `webhook_failed` intents (0 = disabled) |
| `BALANCE_WATCH_TICK_SEC` | `60` | no | Scheduler tick for due direct-address balance watches |
| `BALANCE_WATCH_BATCH_SIZE` | `50` | no | Max due balance watches processed per scheduler tick |
| `TRONGRID_API_KEY` | _(none)_ | recommended | TronGrid API key; without it, requests are subject to much lower rate limits |
| `TONCENTER_API_KEY` | _(none)_ | recommended | TonCenter API key |
| `RPC_BSC` | _(chain config)_ | no | Override BSC RPC URL (chain 56) |
| `RPC_ARB` | _(chain config)_ | no | Override Arbitrum RPC URL (chain 42161) |
| `RPC_ETH` | _(chain config)_ | no | Override Ethereum RPC URL (chain 1) |
| `RPC_POLYGON` | _(chain config)_ | no | Override Polygon RPC URL (chain 137) |
| `RPC_BASE` | _(chain config)_ | no | Override Base RPC URL (chain 8453) |
> [!warning]
> If `SCANNER_API_KEY` is not set, the scanner logs a warning and accepts all requests. Never run this way in production.
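A preflight check run before the container starts can catch a missing key early. The sketch below is a hypothetical helper, not part of the scanner. It reads the variables documented in the table above and reports problems. The 64-hex-character rule assumes the key came from `openssl rand -hex 32`; a key generated another way would need a different check.

```python
import re

# Integer-valued settings from the configuration table.
INT_VARS = (
    "PORT",
    "POLL_INTERVAL_SEC",
    "INTENT_TTL_HOURS",
    "WEBHOOK_RETRY_HOURS",
    "BALANCE_WATCH_TICK_SEC",
    "BALANCE_WATCH_BATCH_SIZE",
)

def check_scanner_env(env):
    """Return a list of problems found in env (a dict of strings)."""
    problems = []
    key = env.get("SCANNER_API_KEY", "")
    if not key:
        problems.append("SCANNER_API_KEY is not set: all requests would be accepted")
    elif not re.fullmatch(r"[0-9a-f]{64}", key):
        # Assumption: keys are generated with `openssl rand -hex 32`.
        problems.append("SCANNER_API_KEY does not look like `openssl rand -hex 32` output")
    for name in INT_VARS:
        raw = env.get(name)
        if raw is None:
            continue  # unset: the documented default applies
        if not re.fullmatch(r"[0-9]+", raw):
            problems.append(f"{name} must be a non-negative integer, got {raw!r}")
    return problems

if __name__ == "__main__":
    import os
    for problem in check_scanner_env(dict(os.environ)):
        print("WARN:", problem)
```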
---
## 2. Docker deployment
The scanner ships as a single Docker image. The Dockerfile uses a two-stage build (Go 1.25 builder → Alpine 3.21 runtime).
### Quick start (dev)
```bash
cd scanner/
cp .env.example .env
# edit .env — set SCANNER_API_KEY, RPC overrides, etc.
docker build -t amn-scanner:dev .
docker run -d \
  --name amn-scanner \
  -p 8080:8080 \
  -v "$(pwd)/data:/data" \
  --env-file .env \
  amn-scanner:dev
```
### Production (via arcane-cli / Watchtower)
The scanner is deployed manually via `arcane-cli`, not through GitOps, and Watchtower does NOT manage it automatically. After pushing a new image, redeploy with:
```bash
arcane-cli project redeploy --json <project-id>
```
The SQLite database lives on a named Docker volume mounted at `/data`. Do not recreate the volume between deploys: it holds the checkpoint and intent state.

---
## 3. Health check
```bash
curl http://localhost:8080/health
# {"status":"ok","time":"2026-05-30T12:00:00Z"}
```
Docker `HEALTHCHECK` is already configured in the Dockerfile (30 s interval, 5 s timeout, 3 retries).
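After a redeploy it can be useful to block until the health endpoint answers before moving on. The following is a minimal sketch of such a wait loop, not something shipped with the scanner. The function names, attempt count, and delay are illustrative assumptions; only the URL and the `status` field come from the example above.

```python
import json
import time
import urllib.request

def fetch_health(url="http://localhost:8080/health", timeout=5):
    """GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def wait_until_healthy(fetch=fetch_health, attempts=10, delay=3.0):
    """Poll until the body reports status "ok"; return True on success."""
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "ok":
                return True
        except (OSError, ValueError):
            pass  # connection refused, timeout, or a non-JSON body: retry
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```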
---
## 4. Monitoring
### Scanner status endpoint
```bash
curl -H "Authorization: Bearer $SCANNER_API_KEY" \
  http://localhost:8080/scanner/status | jq .
```
Check:
- `lag` — should be near 0 for healthy chains (blocks behind for EVM, seconds for TON)
- `pendingIntents` — number of unresolved intents per chain
- `activeBalanceWatches` — number of direct-address watches in `watching` status per chain
- `lastScannedBlock` — should advance each poll
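The checks above can be automated by comparing two status snapshots taken at least one poll interval apart. This is a sketch under assumptions: it treats each snapshot as a list of per-chain objects keyed by a `chainId` field, which this runbook does not confirm, so adjust the parsing to the real response layout. The default threshold of 100 mirrors the `[evm] scanner lag` log condition; because TON lag is measured in seconds, a different limit may suit that chain.

```python
def find_unhealthy_chains(previous, current, max_lag=100):
    """Return {chainId: [reasons]} for chains that look unhealthy.

    previous/current: lists of per-chain dicts (assumed layout) carrying
    the `lag` and `lastScannedBlock` fields described above.
    """
    last_seen = {c["chainId"]: c.get("lastScannedBlock") for c in previous}
    report = {}
    for chain in current:
        reasons = []
        if chain.get("lag", 0) > max_lag:
            reasons.append(f"lag {chain['lag']} exceeds {max_lag}")
        before = last_seen.get(chain["chainId"])
        if before is not None and chain.get("lastScannedBlock") == before:
            reasons.append("lastScannedBlock did not advance")
        if reasons:
            report[chain["chainId"]] = reasons
    return report
```

A chain that reports no progress between snapshots is not necessarily broken; treat the result as a prompt to look at the logs.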
### Logs
The scanner uses Go's `log/slog` structured logger with level prefixes. Key log patterns:
| Pattern | Meaning |
|---|---|
| `[scanner] worker started` | Worker goroutine began for this chain |
| `[evm] intent confirming` | EVM tx seen, waiting for confirmations |
| `[evm] intent confirmed` | EVM: N confirmations reached |
| `[tron] MATCH` / `[ton] MATCH` | Transfer matched; intent moves to `confirmed` |
| `[webhook] delivered` | Webhook POST succeeded |
| `[webhook] non-2xx response` | Backend returned a non-2xx status (scanner will retry) |
| `[webhook] all retries exhausted` | Intent moved to `webhook_failed` |
| `[scanner] reconciling confirmed intents` | Startup crash recovery in progress |
| `[evm] scanner lag` | Chain lag > 100 blocks (investigate RPC) |
| `[scanner] balance watch scheduler started` | Balance watch polling loop started |
| `[api] balance watch created` | Backend registered a direct-address watch |
| `[balance-watch] balance read error` | RPC failed while reading a watched balance |
| `[balance-watch-webhook] delivered` | Changed-balance webhook POST succeeded |
| `[balance-watch-webhook] non-2xx response` | Backend rejected changed-balance webhook; scanner will retry the change later |
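For a quick triage of a log dump, the patterns that usually call for action can be counted with a few lines of Python. This is an illustrative sketch that assumes the patterns appear verbatim somewhere in each log line; which patterns count as alerts is a judgment call, not something the scanner defines.

```python
from collections import Counter

# Patterns from the table above that usually deserve attention.
ALERT_PATTERNS = (
    "[webhook] non-2xx response",
    "[webhook] all retries exhausted",
    "[evm] scanner lag",
    "[balance-watch] balance read error",
    "[balance-watch-webhook] non-2xx response",
)

def count_alerts(lines):
    """Count how many log lines contain each alert pattern."""
    counts = Counter()
    for line in lines:
        for pattern in ALERT_PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts

# Example: count_alerts(open("scanner.log")) on a saved `docker logs` dump.
```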
---
## 5. Adding / modifying chains
Edit `supported-chains.json`. Fields:
| Field | Notes |
|---|---|
| `chainId` | Numeric EIP-155 chain ID (arbitrary int for Tron/TON) |
| `chainType` | `"evm"` (default) / `"tron"` / `"ton"` |
| `rpcUrl` | Primary RPC endpoint |
| `publicRpcUrl` | Fallback RPC (EVM only) |
| `proxyAddress` | ERC20FeeProxy address (EVM); USDT contract (Tron); USDT Jetton master (TON) |
| `confirmationThreshold` | Chain acceptance floor. EVM workers wait this many blocks; Tron/TON use it as the accepted confirmation count reported to backend |
| `verified` | `true` to activate the worker; `false` to disable without deleting |
> [!important]
> Changing `proxyAddress` for an EVM chain only affects new scans. Existing pending intents will still be matched against the old address until they expire or are confirmed.
After editing, restart the scanner container to pick up the new config.
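Validating the file before the restart avoids discovering a typo only after the worker fails to start. The sketch below checks the fields from the table. It assumes the file is a top-level JSON array of chain objects and that `rpcUrl`, `proxyAddress`, `confirmationThreshold`, and `verified` are all required; neither assumption is confirmed here, so relax the rules to match the real schema.

```python
import json

CHAIN_TYPES = {"evm", "tron", "ton"}

def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)

def validate_chain(entry):
    """Return a list of problems for one chain entry (a dict)."""
    problems = []
    if not _is_int(entry.get("chainId")):
        problems.append("chainId must be an integer")
    chain_type = entry.get("chainType", "evm")  # "evm" is the documented default
    if chain_type not in CHAIN_TYPES:
        problems.append(f"unknown chainType {chain_type!r}")
    for field in ("rpcUrl", "proxyAddress"):
        if not entry.get(field):
            problems.append(f"{field} is missing")
    threshold = entry.get("confirmationThreshold")
    if not _is_int(threshold) or threshold < 0:
        problems.append("confirmationThreshold must be a non-negative integer")
    if not isinstance(entry.get("verified"), bool):
        problems.append("verified must be true or false")
    return problems

def validate_chains_file(path="supported-chains.json"):
    """Return {chainId: [problems]} for every entry that has problems."""
    with open(path) as f:
        chains = json.load(f)
    report = {}
    for chain in chains:
        problems = validate_chain(chain)
        if problems:
            report[chain.get("chainId")] = problems
    return report
```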
---
## 6. Adding tokens to the registry
Edit `tokens.json`. Each entry:
```json
{ "chainId": 56, "address": "0x...", "symbol": "USDC", "decimals": 18, "name": "USD Coin" }
```
The token registry is used only to populate `tokenSymbol` and `decimals` in the `checkoutBlock` response. Omitting a token does not break scanning; those fields are simply left empty.
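The lookup behaviour described above can be illustrated with a small sketch. It assumes `tokens.json` is a top-level JSON array and that addresses are compared exactly as written; whether the scanner normalizes address case is not stated here. The function names are hypothetical.

```python
import json

def build_token_index(entries):
    """Index registry entries by (chainId, address)."""
    return {(e["chainId"], e["address"]): e for e in entries}

def load_token_index(path="tokens.json"):
    with open(path) as f:
        return build_token_index(json.load(f))

def describe_token(index, chain_id, address):
    """Return (symbol, decimals); both are None for an unregistered token."""
    entry = index.get((chain_id, address))
    if entry is None:
        return None, None  # scanning still works, the fields stay empty
    return entry["symbol"], entry["decimals"]
```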
---
## 7. Manual webhook retry
Force immediate re-delivery of all `webhook_failed` intents:
```bash
curl -X POST -H "Authorization: Bearer $SCANNER_API_KEY" \
  http://localhost:8080/admin/webhooks/retry
# {"queued": N}
```
---
## 8. Database inspection
The SQLite database (`/data/scanner.db`) can be inspected with the `sqlite3` CLI inside the container:
```bash
# Open a SQLite shell inside the running container
docker exec -it amn-scanner sqlite3 /data/scanner.db

-- Everything below is SQL; run it at the sqlite> prompt.
-- Check stuck intents
SELECT intent_id, chain_id, status, created_at, webhook_delivered_at
FROM intents
WHERE status NOT IN ('confirmed', 'expired')
ORDER BY created_at DESC;
-- Check chain checkpoints
SELECT chain_id, last_scanned_block, updated_at FROM checkpoints;
-- Count by status
SELECT status, count(*) FROM intents GROUP BY status;
-- Check active direct-address watches
SELECT watch_id, chain_id, token_symbol, address, current_balance, next_check_at, expires_at
FROM balance_watches
WHERE status = 'watching'
ORDER BY next_check_at ASC;
-- Count watches by status
SELECT status, count(*) FROM balance_watches GROUP BY status;
```
---
## 9. Troubleshooting
### Intent stuck in `pending`
1. Check `/scanner/status`: is the chain worker running and advancing? A `lag` that stays above 0 for a long time usually points to an RPC issue.
2. Check that `chainId` and `tokenAddress` match exactly what is in `supported-chains.json` and `tokens.json`.
3. For EVM: verify the `proxyAddress` matches the contract the buyer is calling.
4. For Tron: confirm the destination address is stored in EVM-hex (0x) format in the DB.
5. Check scanner logs for `REJECT` messages around the expected tx time.
### Webhook never received by backend
1. Check `webhook_delivered_at` in the DB. If it is not null, the scanner delivered successfully and the problem is on the backend side.
2. If it is null and the status is `webhook_failed`: check the backend logs for the incoming POST and verify the backend's `X-AMN-Signature` validation code.
3. If the status is `confirmed` but `webhook_delivered_at` is null: startup reconciliation may re-deliver on the next restart.
4. Use `POST /admin/webhooks/retry` to trigger immediate retry.
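The decision steps above can be expressed as a small lookup against the database. This is a hedged sketch: the table and column names come from the queries in section 8, while the function name and the returned messages are made up for illustration.

```python
import sqlite3

def triage_webhook(conn, intent_id):
    """Map one intent row to the next troubleshooting step."""
    row = conn.execute(
        "SELECT status, webhook_delivered_at FROM intents WHERE intent_id = ?",
        (intent_id,),
    ).fetchone()
    if row is None:
        return "intent not found"
    status, delivered_at = row
    if delivered_at is not None:
        return "delivered: investigate the backend"
    if status == "webhook_failed":
        return "failed: check backend logs and signature validation, then retry"
    if status == "confirmed":
        return "confirmed but undelivered: reconciliation may re-deliver on restart"
    return f"no webhook expected yet (status {status})"
```

To avoid touching live state, the database can be opened read-only with `sqlite3.connect("file:/data/scanner.db?mode=ro", uri=True)`.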
### High lag on EVM chain
1. Check RPC endpoint availability and rate limits.
2. Consider setting an `RPC_*` env override to a premium RPC (Alchemy, Infura, QuickNode).
3. The scanner falls back to `publicRpcUrl` if the primary fails, but public nodes have lower rate limits.
### Intent confirmed but amount looks wrong
The scanner accepts any amount **>=** `intent.Amount`. Overpayments are not flagged. Underpayments result in the intent staying pending until TTL expiry.
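Because the scanner does not flag overpayments, a backend that cares about them has to compare amounts itself. The sketch below shows one way to do that. It is not scanner code, and it assumes amounts are compared as integers in the token's smallest unit, which this runbook does not spell out.

```python
from decimal import Decimal

def to_base_units(amount, decimals):
    """Convert a decimal string such as "12.5" to integer base units."""
    scaled = Decimal(amount).scaleb(decimals)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has more than {decimals} decimal places")
    return int(scaled)

def classify_payment(paid, expected):
    """Compare two integer amounts expressed in base units."""
    if paid < expected:
        return "underpaid"  # the intent stays pending until TTL expiry
    if paid > expected:
        return "overpaid"   # accepted by the scanner, flagged only here
    return "exact"
```

Integers avoid the rounding errors that floats would introduce for 18-decimal tokens.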
### Direct balance watch is not firing
1. Confirm the target chain is EVM. Scanner `0.1.8` direct balance checks use ERC-20 `balanceOf(address)` and do not yet support Tron/TON balance reads.
2. Check `/scanner/status` for `activeBalanceWatches` on the expected chain.
3. Inspect `balance_watches.next_check_at`; if it is in the future, the scheduler is waiting according to the decay cadence.
4. Check logs for `[balance-watch] balance read error`; RPC failures reschedule the watch without notifying the backend.
5. Confirm that `callbackUrl` points at the backend's scanner webhook route and that `callbackSecret` matches the backend's `AMN_SCANNER_WEBHOOK_SECRET`.
6. If `[balance-watch-webhook] non-2xx response` appears, inspect backend logs for the AMN scanner webhook route. The scanner keeps `current_balance` unchanged and retries the same balance change on the next due check.
### Direct balance watch should stop
Use either stop form:
```bash
# Option 1: DELETE the watch
curl -X DELETE -H "Authorization: Bearer $SCANNER_API_KEY" \
  "http://localhost:8080/balance-watches/<watchId>"
# Option 2: POST to the stop route
curl -X POST -H "Authorization: Bearer $SCANNER_API_KEY" \
  "http://localhost:8080/balance-watches/<watchId>/stop"
```
The backend should stop a watch after payment acceptance, cancellation, manual resolution, or when the payment is no longer payable.

---
## 10. CI/CD notes
- Woodpecker CI pipeline is in `.woodpecker/`.
- Telegram notify steps were removed (no Telegram secrets configured).
- Deploy step was removed — the scanner is deployed manually via `arcane-cli`.
- The CI pipeline builds and pushes the Docker image to the Gitea registry.
- Image tag format: `dev-<VERSION>` (from the `VERSION` file).
> [!tip]
> After CI completes, verify the image is in the registry before redeploying. Silent CI failures can leave a stale image tagged. Check the registry tag timestamp, not just the CI green light.
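The freshness check in the tip can be made mechanical once the two timestamps are known. This is a minimal sketch: the tag format follows the `dev-<VERSION>` rule above, while the function names and the 10-minute tolerance are arbitrary assumptions. The timestamps themselves still have to be read from the registry and the CI run.

```python
from datetime import datetime, timedelta

def expected_image_tag(version_text):
    """Build the image tag from the contents of the VERSION file."""
    return "dev-" + version_text.strip()

def image_is_fresh(tag_pushed_at, ci_finished_at, tolerance=timedelta(minutes=10)):
    """True when the registry timestamp is within `tolerance` of the CI run's end."""
    return abs(ci_finished_at - tag_pushed_at) <= tolerance
```

A tag whose registry timestamp predates the CI run by hours or days is a sign that the push step failed silently.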