# Gatus Monitoring: Proposed Config

**Status:** Backend endpoint shipped in 2.6.49 (`backend@6c01a30`). Gatus config ready for deployment. Frontend `/api/health` proxy and ops deployment still pending.
**Owner:** nick + claude
**Author date:** 2026-05-28
**Related:** [[Handoff - Request Network In-House Checkout - 2026-05-28]], memory entries `woodpecker_silent_build_fail` and `feedback-json-assets-copy-to-dist`.

---

## Why

On 2026-05-28 dev.amn.gg silently regressed: every BSC checkout returned `unsupported_chain:56` for hours before a user reported it. The cause was a build-pipeline bug ([[woodpecker_silent_build_fail]]) compounded by an in-process empty chain registry that the backend served happily because the load failure was swallowed by `console.error`.

The CI typecheck gate added in backend commit `28b17f2` closes the build side. The runtime side (*is the deployed thing currently healthy?*) is Gatus's job. A Gatus probe hitting the registry endpoint would have paged within about 90 seconds of today's regression (three consecutive failed 30s probes), instead of waiting for a user to notice.

Gatus also catches drift that CI cannot:

- A configuration file edited live on the server.
- A dependency that quietly fails to load on container restart.
- A database connection that drops mid-day.
- Stale upstream IPs after [[devEscrow_nginx_after_redeploy]].

---

## What we should monitor

Each endpoint serves one of three purposes: **liveness** (is the container up?), **structural invariants** (is the data the container needs actually loaded?), or **integration health** (can it reach the things it depends on?).

### Backend: dev.amn.gg

| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| `GET /api/version` | Liveness | 60s | `[STATUS] == 200`, `[BODY].version != ""` |
| `GET /api/admin/rn/networks` (auth-gated) | Chain registry not empty | 60s | `[STATUS] == 200`, `[BODY].chains[0].chainId != null` |
| `GET /api/health` (shipped in 2.6.49; see "Required backend work" below) | DB + Redis + registry health all in one | 30s | `[STATUS] == 200`, `[BODY].status == "ok"` |

The most valuable probe is the chain-registry one. The current `/api/admin/rn/networks` requires admin auth, so there were two options:

1. Gatus sends an admin token with every request (cheapest in the short term, but it leaks an admin token into the monitoring config).
2. We add a `GET /api/health` endpoint that exposes a *subset* of invariants (counts only, no addresses) without auth. **Recommended**, and the option that shipped in 2.6.49.

### Backend: prod.amn.gg

Identical probes, in a separate Gatus group so dev incidents don't drown out prod ones.

### Frontend: dev.amn.gg / prod.amn.gg

| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| `GET /` | Page renders | 60s | `[STATUS] == 200`, `[RESPONSE_TIME] < 3000` (ms) |
| `GET /api/health` (Next.js route, proxy to backend) | End-to-end reachability | 60s | `[STATUS] == 200` |

### External dependencies

| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| `https://api.request.network/...` | RN API reachable | 5m | `[STATUS] == any(200, 401, 404)` (a 401 or 404 is fine: it means the API answered) |
| `https://public.chainalysis.com/api/v1/address/...` | AML provider reachable | 5m | `[STATUS] == any(200, 404)` |
| `https://bsc-rpc.publicnode.com` (eth_chainId) | RPC liveness | 2m | `[BODY].result == "0x38"` |

If RN's API goes down, in-house checkout still works (we already have the cached intent), but new payment creation fails. Gatus catching this lets us flip to the hosted-page fallback proactively.

---

## Required backend work (before Gatus can be useful)

The `GET /api/health` endpoint was shipped in backend 2.6.49.
It is public, exempt from the rate limiter, and returns the structured check object below.

**Shape of the endpoint:**

```ts
// GET /api/health (public: no auth, skipped by the rate limiter)
{
  "status": "ok" | "degraded" | "down",
  "version": "2.6.49",
  "uptimeSec": 12345,
  "checks": {
    "db":              { "ok": true, "latencyMs": 4 },
    "redis":           { "ok": true, "latencyMs": 1 },
    "rnChainRegistry": { "ok": true, "chainCount": 5 },
    "rnTokenRegistry": { "ok": true, "tokenCount": 10 },
    "rnApi":           { "ok": true, "latencyMs": 134 }
  }
}
```

Each `checks.*.ok` must reflect the actual current state, not a cached one. If any check fails, `status` flips to `degraded`. If `db.ok === false`, `status` flips to `down`.

**Why this shape rather than per-check endpoints:**

- One probe, all invariants: cheaper for Gatus and clearer in the dashboard.
- The structure lets us add invariants later (e.g. `walletMonitor.ok`, `paymentRedisService.queueDepth`) without changing the URL.
- Public exposure of counts (not addresses, not balances) is low-risk.

**Backend work:** ✅ Complete (2.6.49). Includes `healthCheckService` with 5 checks, route wired in `app.ts`, rate-limiter + logging skip, and 5 route-level unit tests.

---

## Proposed Gatus config

With `/api/health` in place, the config is straightforward. Drop this into wherever the homelab Gatus instance reads its config from. Adjust group names and Slack/Telegram webhook references to whatever the existing Gatus setup uses for other services.

```yaml
# gatus.amanat.yaml
#
# Amanat escrow monitoring. Four groups: backend-dev, backend-prod,
# frontend (dev + prod), external (RN + Chainalysis + RPCs).
#
# Alerting: piggyback the existing CI Telegram channel so alerts show up
# in the same channel where deploy notifications already go.

alerting:
  telegram:
    token: "${TG_TOKEN}"
    id: "${TG_GATUS_CHAT_ID}"
    default-alert:
      enabled: true
      send-on-resolved: true
      failure-threshold: 3   # 3 consecutive failures before paging
      success-threshold: 2

endpoints:
  # ── Backend (dev) ───────────────────────────────────────────────────
  - name: backend-dev-version
    group: backend-dev
    url: https://dev.amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].version != \"\""
    alerts:
      - type: telegram

  - name: backend-dev-health
    group: backend-dev
    url: https://dev.amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.ok == true"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram

  # ── Backend (prod) ──────────────────────────────────────────────────
  - name: backend-prod-version
    group: backend-prod
    url: https://amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].version != \"\""
    alerts:
      - type: telegram
        failure-threshold: 2   # tighter on prod

  - name: backend-prod-health
    group: backend-prod
    url: https://amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram
        failure-threshold: 2

  # ── Frontend ────────────────────────────────────────────────────────
  - name: frontend-dev
    group: frontend
    url: https://dev.amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram

  - name: frontend-prod
    group: frontend
    url: https://amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram
        failure-threshold: 2

  # ── External dependencies ───────────────────────────────────────────
  - name: rn-api-reachable
    group: external
    url: https://api.request.network/v2/health
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 401, 404)"   # any answer = up
    alerts:
      - type: telegram

  - name: chainalysis-public-api
    group: external
    url: https://public.chainalysis.com/api/v1/address/0x0000000000000000000000000000000000000000
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 404)"
    alerts:
      - type: telegram

  - name: bsc-rpc-publicnode
    group: external
    method: POST
    url: https://bsc-rpc.publicnode.com
    headers:
      Content-Type: application/json
    body: '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}'
    interval: 2m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].result == \"0x38\""
    alerts:
      - type: telegram
```

### Notes on the config

- **`failure-threshold: 3` on dev, `2` on prod**: dev is allowed to flap once before paging; prod pages faster.
- **Telegram alerts only**, matching the rest of the CI/ops channel. If a separate ops channel is wanted, give Gatus its own bot + chat ID.
- **External-dependency probes accept a set of response codes**, not strict 200s. RN's API answering with 401 still means it's reachable; what we care about is "is the upstream alive."

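
The threshold choices above translate into a rough time-to-alert per probe. The sketch below makes that arithmetic explicit; `ProbeTiming` and `secondsToAlert` are illustrative names rather than anything Gatus provides, and the figure is an approximate upper bound that ignores request timeouts and Telegram delivery time.

```typescript
// Sketch only: ProbeTiming and secondsToAlert are hypothetical helpers,
// not part of Gatus or the backend.
interface ProbeTiming {
  name: string;
  intervalSec: number;      // probe interval from the config
  failureThreshold: number; // consecutive failures before an alert is sent
}

// Approximate upper bound: if the regression starts just after a successful
// probe, the next `failureThreshold` probes must all fail before paging.
function secondsToAlert(p: ProbeTiming): number {
  return p.intervalSec * p.failureThreshold;
}

const probes: ProbeTiming[] = [
  { name: "backend-dev-health", intervalSec: 30, failureThreshold: 3 },
  { name: "backend-prod-health", intervalSec: 30, failureThreshold: 2 },
  { name: "rn-api-reachable", intervalSec: 300, failureThreshold: 3 },
];

for (const p of probes) {
  console.log(`${p.name}: alert within ~${secondsToAlert(p)}s`);
}
```

Under these assumptions the dev health probe pages within roughly 90 seconds, the prod one within roughly 60, and the RN reachability probe within roughly 15 minutes.
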
---

## Required follow-up issues

| # | What | Where | Status |
|---|------|-------|--------|
| 1 | Backend: add `GET /api/health` endpoint exposing the structured check object | `backend` (escrow-backend) | ✅ Shipped in 2.6.49 |
| 2 | Frontend: add `/api/health` Next.js route that fetches backend health and surfaces it (optional, for end-to-end check) | `frontend` (escrow-frontend) | ⏳ Pending |
| 3 | Ops: deploy Gatus config to the homelab Gatus instance, wire to Telegram | `deployment` repo (`deployment/gatus/config.yaml`) | ✅ Config committed; needs `docker-compose up -d gatus` on server |
| 4 | Ops: document the runbook for each alert (what to check when "backend-dev-health" fires) | this doc, or a sibling runbook file | ⏳ Pending |

---

## What this would have caught (incident retrospective)

| Incident | Probe that would have fired | How long until alert |
|---|---|---|
| 2026-05-28 BSC `unsupported_chain:56` | `backend-dev-health`, on `[BODY].checks.rnChainRegistry.chainCount >= 1` | ~90s (3× 30s probes) |
| Stale image after silent-build-fail (Tasks #9/#10) | `backend-dev-version`, if its condition pins the expected post-deploy version (the config above only checks that `version` is non-empty) | ~3 min |
| [[devEscrow_nginx_after_redeploy]] (stale upstream 502s) | `backend-dev-version` returning 502 | ~3 min |
| Mongo password rotation breaking connections | `backend-dev-health`, on `[BODY].checks.db.ok` | ~90s |
| RN API outage on payment-intent creation | `rn-api-reachable` | ~15 min (3× 5m probes); slower, but acceptable for an upstream we don't control |

Net: every major silent-mode incident in this project would now have an alert. The cost is one new endpoint, one Gatus config file, and one Telegram chat ID.
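
To make the first row of the retrospective concrete, here is a minimal sketch that applies the `backend-dev-health` conditions to a response body in plain TypeScript. `HealthBody` and `failedConditions` are hypothetical names introduced for illustration, and the incident payload is an assumed reconstruction of what the endpoint would have returned with an empty chain registry, not a captured response.

```typescript
// Sketch only: mirrors the backend-dev-health conditions from the Gatus config.
// HealthBody and failedConditions are hypothetical names, not backend code.
interface HealthBody {
  status: "ok" | "degraded" | "down";
  checks: {
    db: { ok: boolean };
    redis: { ok: boolean };
    rnChainRegistry: { ok: boolean; chainCount: number };
    rnTokenRegistry: { ok: boolean; tokenCount: number };
  };
}

// Returns the labels of every condition that does not hold.
function failedConditions(httpStatus: number, body: HealthBody): string[] {
  const c = body.checks;
  const conditions: Array<[string, boolean]> = [
    ["[STATUS] == 200", httpStatus === 200],
    ["[BODY].status == ok", body.status === "ok"],
    ["[BODY].checks.db.ok == true", c.db.ok],
    ["[BODY].checks.redis.ok == true", c.redis.ok],
    ["[BODY].checks.rnChainRegistry.ok == true", c.rnChainRegistry.ok],
    ["[BODY].checks.rnChainRegistry.chainCount >= 1", c.rnChainRegistry.chainCount >= 1],
    ["[BODY].checks.rnTokenRegistry.ok == true", c.rnTokenRegistry.ok],
    ["[BODY].checks.rnTokenRegistry.tokenCount >= 1", c.rnTokenRegistry.tokenCount >= 1],
  ];
  return conditions.filter(([, pass]) => !pass).map(([label]) => label);
}

// Assumed shape of the 2026-05-28 state: container up, chain registry empty.
const incident: HealthBody = {
  status: "degraded",
  checks: {
    db: { ok: true },
    redis: { ok: true },
    rnChainRegistry: { ok: false, chainCount: 0 },
    rnTokenRegistry: { ok: true, tokenCount: 10 },
  },
};

console.log(failedConditions(200, incident));
```

A liveness-only probe would have passed here because the HTTP status is still 200; it is the structural-invariant conditions that fail, which is the argument for probing `/api/health` rather than `/api/version` alone.
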