Captures the runtime-monitoring side of the 2026-05-28 silent-empty-registry incident retrospective. Pairs with backend commit 28b17f2 (CI typecheck gate). Defines the proposed Gatus probe set, the /api/health endpoint that has to land first, and a follow-up issue list. Includes a retrospective table showing what this would have caught across recent incidents. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Gatus Monitoring — Proposed Config
Status: Draft / proposal. Not deployed yet.
Owner: nick + claude
Author date: 2026-05-28
Related: Handoff - Request Network In-House Checkout - 2026-05-28, memory entries woodpecker_silent_build_fail and feedback-json-assets-copy-to-dist.
Why
On 2026-05-28 dev.amn.gg silently regressed: every BSC checkout returned unsupported_chain:56 for hours before a user reported it. The cause was a build-pipeline bug (woodpecker_silent_build_fail) compounded by an in-process empty chain registry that the backend served happily because the load failure was swallowed by console.error.
The CI typecheck gate added in backend commit 28b17f2 closes the build side. The runtime side (is the deployed thing currently healthy?) is Gatus's job. A Gatus probe hitting the registry endpoint would have paged within roughly 90 seconds of today's regression (three consecutive failed 30s probes), instead of waiting for a user to notice.
Gatus also catches drift that CI cannot:
- A configuration file edited live on the server.
- A dependency that quietly fails to load on container restart.
- A database connection that drops mid-day.
- Stale upstream IPs after devEscrow_nginx_after_redeploy.
What we should monitor
Each endpoint serves one of three purposes: liveness (is the container up?), structural invariants (is the data the container needs actually loaded?), or integration health (can it reach the things it depends on?).
Backend — dev.amn.gg
| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| GET /api/version | Liveness | 60s | [STATUS] == 200, [BODY].version != "" |
| GET /api/admin/rn/networks (auth-gated) | Chain registry not empty | 60s | [STATUS] == 200, [BODY].chains[0].chainId != null |
| GET /api/health (does not exist yet; see "Required backend work" below) | DB + Redis + registry health all in one | 30s | [STATUS] == 200, [BODY].status == "ok" |
The most valuable probe is the chain-registry one. The current /api/admin/rn/networks requires admin auth, so either:
- Gatus posts an admin token per request (cheapest now, leaks an admin token into the monitoring config).
- We add a GET /api/health endpoint that exposes a subset of invariants (counts only, no addresses) without auth. Recommended.
Backend — prod.amn.gg
Identical probes, separate Gatus group so dev incidents don't drown out prod ones.
Frontend — dev.amn.gg / prod.amn.gg
| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| GET / | Page renders | 60s | [STATUS] == 200, [RESPONSE_TIME] < 3000 (ms) |
| GET /api/health (Next.js route, proxy to backend) | End-to-end reachability | 60s | [STATUS] == 200 |
External dependencies
| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| https://api.request.network/... | RN API reachable | 5m | [STATUS] == any(200, 401) (401 is fine: it means the API answered) |
| https://public.chainalysis.com/api/v1/address/... | AML provider reachable | 5m | [STATUS] == any(200, 404) |
| https://bsc-rpc.publicnode.com (eth_chainId) | RPC liveness | 2m | [BODY].result == "0x38" |
If RN's API goes down, in-house checkout still works (we already have the cached intent), but new payment creation fails. Gatus catching this lets us flip to the hosted-page fallback proactively.
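The RPC liveness condition compares the eth_chainId result with "0x38", which is hexadecimal for 56, the chain ID that appears in the unsupported_chain:56 error. A minimal TypeScript sketch of the same check, useful for reproducing the probe by hand. The helper names buildChainIdRequest and isExpectedChain are hypothetical and not part of the codebase:

```typescript
// Hypothetical helpers mirroring the BSC RPC probe; names are illustrative.
interface JsonRpcResponse {
  jsonrpc: string;
  id: number;
  result?: string;
}

// Build the JSON-RPC body that the probe POSTs to the RPC endpoint.
function buildChainIdRequest(id = 1): string {
  return JSON.stringify({ jsonrpc: "2.0", method: "eth_chainId", params: [], id });
}

// True when the node reports the expected chain ID.
// The RPC returns a hex string; it is compared as a decimal number.
function isExpectedChain(res: JsonRpcResponse, expectedChainId: number): boolean {
  if (typeof res.result !== "string") return false;
  return Number.parseInt(res.result, 16) === expectedChainId;
}
```

A response without a result field (for example a JSON-RPC error object) is treated as a failed probe rather than as a match.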
Required backend work (before Gatus can be useful)
The probe set above presumes a GET /api/health endpoint that exposes invariants without admin auth. It does NOT exist today.
Shape of the endpoint:
// GET /api/health (public, rate-limited but not auth-gated)
{
  "status": "ok" | "degraded" | "down",
  "version": "2.6.48",
  "uptimeSec": 12345,
  "checks": {
    "db": { "ok": true, "latencyMs": 4 },
    "redis": { "ok": true, "latencyMs": 1 },
    "rnChainRegistry": { "ok": true, "chainCount": 5 },
    "rnTokenRegistry": { "ok": true, "tokenCount": 10 },
    "rnApi": { "ok": true, "latencyMs": 134 }
  }
}
Each checks.*.ok must reflect the actual current state, not a cached one. If any check fails, status flips to degraded. If db.ok === false, status flips to down.
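The status rules above can be sketched as a small pure function. This is an illustration under assumptions, not the backend's implementation: the type names, deriveStatus, and chainRegistryCheck are hypothetical, and treating an empty registry as not ok is an assumption drawn from the 2026-05-28 incident.

```typescript
type Status = "ok" | "degraded" | "down";

interface CheckResult {
  ok: boolean;
  [detail: string]: boolean | number; // e.g. latencyMs, chainCount
}

// Rules from the text: a failed db check means "down",
// any other failed check means "degraded", otherwise "ok".
function deriveStatus(checks: Record<string, CheckResult>): Status {
  if (checks.db && !checks.db.ok) return "down";
  const allOk = Object.values(checks).every((c) => c.ok);
  return allOk ? "ok" : "degraded";
}

// Assumption: a registry is healthy only if it actually holds entries.
function chainRegistryCheck(chainCount: number): CheckResult {
  return { ok: chainCount > 0, chainCount };
}
```

With this shape, an empty chain registry alone yields "degraded", while a database failure yields "down" regardless of the other checks.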
Why this shape rather than per-check endpoints:
- One probe, all invariants — cheaper for Gatus and clearer in the dashboard.
- The structure lets us add invariants later (e.g. walletMonitor.ok, paymentRedisService.queueDepth) without changing the URL.
- Public exposure of counts (not addresses, not balances) is low-risk.
Estimated work: half a day of backend work, including a unit test asserting that every check's ok flag flips to false when the underlying dependency is mocked to fail.
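One way to make each check testable is to wrap every dependency ping in a helper that runs it fresh on each call and converts a thrown error into ok: false. The sketch below is hypothetical (runCheck, TimedCheck, and demo are illustrative names), and the mocked pings stand in for real DB or Redis calls:

```typescript
interface TimedCheck {
  ok: boolean;
  latencyMs: number;
}

// Run one dependency ping on every call (no caching) and never throw:
// a rejected ping becomes ok: false instead of being swallowed silently.
async function runCheck(ping: () => Promise<unknown>): Promise<TimedCheck> {
  const started = Date.now();
  try {
    await ping();
    return { ok: true, latencyMs: Date.now() - started };
  } catch {
    return { ok: false, latencyMs: Date.now() - started };
  }
}

// Usage with mocked dependencies, as a unit test might do.
async function demo(): Promise<[TimedCheck, TimedCheck]> {
  const healthy = await runCheck(async () => "pong");
  const failed = await runCheck(async () => {
    throw new Error("connection refused");
  });
  return [healthy, failed];
}
```

Because the ping is injected, the unit test can swap in a failing mock per dependency and assert that only that check's ok flag changes.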
Proposed Gatus config
Once /api/health exists, the config is straightforward. Drop this into wherever the homelab Gatus instance reads its config from. Adjust group names and Slack/Telegram webhook references to whatever the existing Gatus setup uses for other services.
# gatus.amanat.yaml
#
# Amanat escrow monitoring. Four groups: backend-dev, backend-prod,
# frontend (dev + prod), external (RN + Chainalysis + RPCs).
#
# Alerting: piggyback the existing CI Telegram channel so notifications
# show up next to the same channel where deploy notifications already go.
alerting:
  telegram:
    token: "${TG_TOKEN}"
    id: "${TG_GATUS_CHAT_ID}"
    default-alert:
      enabled: true
      send-on-resolved: true
      failure-threshold: 3 # 3 consecutive failures before paging
      success-threshold: 2
endpoints:
  # ── Backend (dev) ───────────────────────────────────────────────────
  - name: backend-dev-version
    group: backend-dev
    url: https://dev.amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].version != \"\""
    alerts:
      - type: telegram
  - name: backend-dev-health
    group: backend-dev
    url: https://dev.amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.ok == true"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram
  # ── Backend (prod) ──────────────────────────────────────────────────
  - name: backend-prod-version
    group: backend-prod
    url: https://amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].version != \"\""
    alerts:
      - type: telegram
        failure-threshold: 2 # tighter on prod
  - name: backend-prod-health
    group: backend-prod
    url: https://amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram
        failure-threshold: 2
  # ── Frontend ────────────────────────────────────────────────────────
  - name: frontend-dev
    group: frontend
    url: https://dev.amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram
  - name: frontend-prod
    group: frontend
    url: https://amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram
        failure-threshold: 2
  # ── External dependencies ───────────────────────────────────────────
  - name: rn-api-reachable
    group: external
    url: https://api.request.network/v2/health
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 401, 404)" # any answer = up
    alerts:
      - type: telegram
  - name: chainalysis-public-api
    group: external
    url: https://public.chainalysis.com/api/v1/address/0x0000000000000000000000000000000000000000
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 404)"
    alerts:
      - type: telegram
  - name: bsc-rpc-publicnode
    group: external
    method: POST
    url: https://bsc-rpc.publicnode.com
    headers:
      Content-Type: application/json
    body: '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}'
    interval: 2m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].result == \"0x38\""
    alerts:
      - type: telegram
Notes on the config
- failure-threshold: 3 on dev, 2 on prod: dev tolerates one more consecutive failure than prod before paging, so prod pages faster.
- Telegram alerts only, matching the rest of the CI/ops channel. If a separate ops channel is wanted, give Gatus its own bot + chat ID.
- External-dependency probes accept a set of response codes, not a strict 200: RN's API answering with 401 still means it's reachable; what we care about is "is the upstream alive."
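The interaction between interval and failure-threshold determines how quickly an alert fires. As a rough worked example, the worst-case delay is about interval × failure-threshold, since each consecutive failure takes one probe cycle. The helper names below are hypothetical, and the parser only handles the two units this config uses:

```typescript
// Parse a duration such as "30s" or "5m" into seconds.
// Only the units used in this config (s, m) are handled.
function parseIntervalSec(interval: string): number {
  const match = /^(\d+)(s|m)$/.exec(interval);
  if (!match) throw new Error(`unsupported interval: ${interval}`);
  const value = Number(match[1]);
  return match[2] === "m" ? value * 60 : value;
}

// Approximate worst-case time from the start of an outage to the alert:
// failureThreshold consecutive failures, one per probe interval.
function approxSecondsToAlert(interval: string, failureThreshold: number): number {
  return parseIntervalSec(interval) * failureThreshold;
}

approxSecondsToAlert("30s", 3); // 90  (backend-dev-health)
approxSecondsToAlert("60s", 3); // 180 (backend-dev-version)
approxSecondsToAlert("5m", 3);  // 900 (rn-api-reachable, 15 min)
```

The real delay can be up to one interval shorter, depending on where in the probe cycle the outage begins.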
Required follow-up issues
| # | What | Where |
|---|---|---|
| 1 | Backend: add GET /api/health endpoint exposing the structured check object | backend (escrow-backend) |
| 2 | Frontend: add /api/health Next.js route that fetches backend health and surfaces it (optional, for end-to-end check) | frontend (escrow-frontend) |
| 3 | Ops: deploy Gatus config to the homelab Gatus instance, wire to Telegram | wherever the existing Gatus lives |
| 4 | Ops: document the runbook for each alert (what to check when "backend-dev-health" fires) | this doc, or a sibling runbook file |
What this would have caught (incident retrospective)
| Incident | Probe that would have fired | How long until alert |
|---|---|---|
| 2026-05-28 BSC unsupported_chain:56 | backend-dev-health.checks.rnChainRegistry.chainCount >= 1 | ~90s (3× 30s probes) |
| Stale image after silent-build-fail (Tasks #9/#10) | backend-dev-version.[BODY].version not matching expected post-deploy version | ~3 min |
| devEscrow_nginx_after_redeploy (stale upstream 502s) | backend-dev-version returning 502 | ~3 min |
| Mongo password rotation breaking connections | backend-dev-health.checks.db.ok | ~90s |
| RN API outage on payment-intent creation | rn-api-reachable | ~15 min (3× 5m probes); slower but acceptable for an upstream we don't control |
Net: every major silent-mode incident in this project would now have an alert. The cost is one new endpoint, one Gatus config file, and one Telegram chat ID.