# Gatus Monitoring — Proposed Config
**Status:** Backend endpoint shipped in 2.6.49 (`backend@6c01a30`). Gatus config ready for deployment. Frontend `/api/health` proxy and ops deployment still pending.
**Owner:** nick + claude
**Author date:** 2026-05-28
**Related:** [[Handoff - Request Network In-House Checkout - 2026-05-28]], memory entries `woodpecker_silent_build_fail` and `feedback-json-assets-copy-to-dist`.
---
## Why
On 2026-05-28 dev.amn.gg silently regressed: every BSC checkout returned `unsupported_chain:56` for hours before a user reported it. The cause was a build-pipeline bug ([[woodpecker_silent_build_fail]]) compounded by an in-process empty chain registry that the backend served happily because the load failure was swallowed by `console.error`.
The CI typecheck gate added in backend commit `28b17f2` closes the build side. The runtime side (*is the deployed thing currently healthy?*) is Gatus's job. A Gatus probe hitting the registry endpoint would have paged within roughly 90 seconds of today's regression (three consecutive failed 30s probes), instead of waiting for a user to notice.
Gatus also catches drift that CI cannot:
- A configuration file edited live on the server.
- A dependency that quietly fails to load on container restart.
- A database connection that drops mid-day.
- Stale upstream IPs after [[devEscrow_nginx_after_redeploy]].
---
## What we should monitor
Each endpoint serves one of three purposes: **liveness** (is the container up?), **structural invariants** (is the data the container needs actually loaded?), or **integration health** (can it reach the things it depends on?).
### Backend — dev.amn.gg
| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| `GET /api/version` | Liveness | 60s | `[STATUS] == 200`, `[BODY].version != ""` |
| `GET /api/admin/rn/networks` (auth-gated) | Chain registry not empty | 60s | `[STATUS] == 200`, `[BODY].chains[0].chainId != null` |
| `GET /api/health` (shipped in 2.6.49; see "Required backend work" below) | DB + Redis + registry health all in one | 30s | `[STATUS] == 200`, `[BODY].status == "ok"` |
The most valuable probe is the chain-registry one. The current `/api/admin/rn/networks` requires admin auth, so the options were:
1. Gatus sends an admin token with every request (cheapest to set up, but leaks an admin token into the monitoring config).
2. We add a `GET /api/health` endpoint that exposes a *subset* of invariants (counts only, no addresses) without auth. **Recommended**, and shipped in 2.6.49.
### Backend — prod.amn.gg
Identical probes, separate Gatus group so dev incidents don't drown out prod ones.
### Frontend — dev.amn.gg / prod.amn.gg
| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| `GET /` | Page renders | 60s | `[STATUS] == 200`, `[RESPONSE_TIME] < 3000` (ms) |
| `GET /api/health` (Next.js route, proxy to backend) | End-to-end reachability | 60s | `[STATUS] == 200` |
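
The frontend `/api/health` proxy is still pending, so the following is only a minimal sketch of what its core could look like. The function name `proxyBackendHealth`, the `FetchLike` type, and the 502 fallback body are assumptions made for illustration, not the project's actual code. The idea is that any failure to reach the backend must surface as a non-200 status so the Gatus `[STATUS] == 200` condition fails.

```typescript
// Minimal structural type for the parts of fetch this sketch uses,
// so the logic can be exercised with a stub instead of a real network call.
type FetchLike = (url: string) => Promise<{ status: number; json(): Promise<unknown> }>;

// Fetch backend health and map the outcome to what the frontend route would return.
// A network error or a non-JSON body becomes a 502 (an assumed convention),
// so an unreachable backend can never look healthy from the outside.
async function proxyBackendHealth(
  fetchImpl: FetchLike,
  backendUrl: string, // hypothetical: wherever the frontend reaches the backend's /api/health
): Promise<{ status: number; body: unknown }> {
  try {
    const res = await fetchImpl(backendUrl);
    const body = await res.json();
    // Pass the backend's own status code through unchanged.
    return { status: res.status, body };
  } catch {
    return { status: 502, body: { status: "down", error: "backend_unreachable" } };
  }
}
```

In a Next.js App Router project, a `GET` handler in a `route.ts` file could call this with the global `fetch` and return `Response.json(result.body, { status: result.status })`; the file location and backend URL depend on how the frontend is laid out.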
### External dependencies
| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| `https://api.request.network/...` | RN API reachable | 5m | `[STATUS] == any(200, 401)` (401 is fine: it means the API answered) |
| `https://public.chainalysis.com/api/v1/address/...` | AML provider reachable | 5m | `[STATUS] == any(200, 404)` |
| `https://bsc-rpc.publicnode.com` (eth_chainId) | RPC liveness | 2m | `[BODY].result == "0x38"` |
If RN's API goes down, in-house checkout still works (we already have the cached intent), but new payment creation fails. Gatus catching this lets us flip to the hosted-page fallback proactively.
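
For the RPC liveness probe in the table above, it may help to see why `"0x38"` is the expected value: `eth_chainId` returns the chain ID as a hex quantity, and `0x38` is 56, the same chain ID that appeared in the `unsupported_chain:56` error. The helper names below are hypothetical and only illustrate the request and response handling.

```typescript
// Build the JSON-RPC request body for an eth_chainId call.
function chainIdRequest(id = 1): string {
  return JSON.stringify({ jsonrpc: "2.0", method: "eth_chainId", params: [], id });
}

// Interpret an eth_chainId response. The result is a hex string such as "0x38";
// anything else (missing result, wrong type, malformed hex) yields null.
function chainIdFromResponse(body: { result?: unknown }): number | null {
  if (typeof body.result !== "string" || !/^0x[0-9a-fA-F]+$/.test(body.result)) {
    return null;
  }
  return parseInt(body.result, 16);
}

// BSC mainnet: "0x38" decodes to 56.
const bscChainId = chainIdFromResponse({ result: "0x38" });
```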
---
## Required backend work (before Gatus can be useful)
The `GET /api/health` endpoint was shipped in backend 2.6.49. It is public, exempt from the rate limiter, and returns the structured check object below.
**Shape of the endpoint:**
```ts
// GET /api/health (public: no auth, skipped by the rate limiter)
{
  "status": "ok" | "degraded" | "down",
  "version": "2.6.49",
  "uptimeSec": 12345,
  "checks": {
    "db": { "ok": true, "latencyMs": 4 },
    "redis": { "ok": true, "latencyMs": 1 },
    "rnChainRegistry": { "ok": true, "chainCount": 5 },
    "rnTokenRegistry": { "ok": true, "tokenCount": 10 },
    "rnApi": { "ok": true, "latencyMs": 134 }
  }
}
```
Each `checks.*.ok` must reflect the actual current state, not a cached one. If any check fails, `status` flips to `degraded`; if the failing check is the database (`db.ok === false`), `status` flips to `down` instead.
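
The status rule can be written as a small pure function. This is a sketch of the rule as stated here, not the shipped `healthCheckService`; the type and function names are illustrative assumptions.

```typescript
// Hypothetical types mirroring the response shape above.
type CheckResult = { ok: boolean; latencyMs?: number; chainCount?: number; tokenCount?: number };
type HealthStatus = "ok" | "degraded" | "down";

// Derive the top-level status from the individual checks:
// a failing db check means "down"; any other failing check means "degraded".
function deriveStatus(checks: Record<string, CheckResult>): HealthStatus {
  const db = checks["db"];
  if (db !== undefined && db.ok === false) {
    return "down";
  }
  const allOk = Object.keys(checks).every((name) => checks[name].ok);
  return allOk ? "ok" : "degraded";
}
```

Checking `db` first matters: a database failure also makes "any check fails" true, so the `down` case has to take precedence over `degraded`.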
**Why this shape rather than per-check endpoints:**
- One probe, all invariants — cheaper for Gatus and clearer in the dashboard.
- The structure lets us add invariants later (e.g. `walletMonitor.ok`, `paymentRedisService.queueDepth`) without changing the URL.
- Public exposure of counts (not addresses, not balances) is low-risk.
**Backend work:** ✅ Complete (2.6.49). Includes `healthCheckService` with 5 checks, route wired in `app.ts`, rate-limiter + logging skip, and 5 route-level unit tests.
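
As an illustration of how a single check might produce a live `{ ok, latencyMs }` pair, here is a minimal sketch. It is not the shipped `healthCheckService`; the `timedCheck` name, the `probe` callback, and the 2000 ms default timeout are assumptions. The point it shows is that a thrown error or a timeout is reported as `ok: false` rather than being logged and swallowed.

```typescript
// Result shape for one check, matching the `checks.*` entries in the response.
type TimedResult = { ok: boolean; latencyMs: number };

// Run one live probe (no caching) and report success plus elapsed time.
// `probe` is any async operation, e.g. a DB ping; `timeoutMs` is an assumed default.
async function timedCheck(
  probe: () => Promise<unknown>,
  timeoutMs = 2000,
): Promise<TimedResult> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("check timed out")), timeoutMs);
  });
  try {
    // Whichever settles first wins: the probe itself or the timeout.
    await Promise.race([probe(), timeout]);
    return { ok: true, latencyMs: Date.now() - started };
  } catch {
    // A rejected probe and a timeout both count as a failed check.
    return { ok: false, latencyMs: Date.now() - started };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```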
---
## Proposed Gatus config
With `/api/health` in place, the config is straightforward. Drop this into wherever the homelab Gatus instance reads its config from. Adjust group names and Slack/Telegram webhook references to whatever the existing Gatus setup uses for other services.
```yaml
# gatus.amanat.yaml
#
# Amanat escrow monitoring. Four groups: backend-dev, backend-prod,
# frontend (dev + prod), external (RN + Chainalysis + RPCs).
#
# Alerting: reuse the existing CI Telegram channel so alerts show up
# alongside the deploy notifications that already go there.
alerting:
  telegram:
    token: "${TG_TOKEN}"
    id: "${TG_GATUS_CHAT_ID}"
    default-alert:
      enabled: true
      send-on-resolved: true
      failure-threshold: 3   # 3 consecutive failures before paging
      success-threshold: 2

endpoints:
  # ── Backend (dev) ───────────────────────────────────────────────────
  - name: backend-dev-version
    group: backend-dev
    url: https://dev.amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "len([BODY].version) > 0"
    alerts:
      - type: telegram

  - name: backend-dev-health
    group: backend-dev
    url: https://dev.amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.ok == true"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram

  # ── Backend (prod) ──────────────────────────────────────────────────
  - name: backend-prod-version
    group: backend-prod
    url: https://amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "len([BODY].version) > 0"
    alerts:
      - type: telegram
        failure-threshold: 2   # tighter on prod

  - name: backend-prod-health
    group: backend-prod
    url: https://amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram
        failure-threshold: 2

  # ── Frontend ────────────────────────────────────────────────────────
  - name: frontend-dev
    group: frontend
    url: https://dev.amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram

  - name: frontend-prod
    group: frontend
    url: https://amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram
        failure-threshold: 2

  # ── External dependencies ───────────────────────────────────────────
  - name: rn-api-reachable
    group: external
    url: https://api.request.network/v2/health
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 401, 404)"   # any answer = up
    alerts:
      - type: telegram

  - name: chainalysis-public-api
    group: external
    url: https://public.chainalysis.com/api/v1/address/0x0000000000000000000000000000000000000000
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 404)"
    alerts:
      - type: telegram

  - name: bsc-rpc-publicnode
    group: external
    method: POST
    url: https://bsc-rpc.publicnode.com
    headers:
      Content-Type: application/json
    body: '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}'
    interval: 2m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].result == 0x38"
    alerts:
      - type: telegram
```
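
Before deploying, it can be handy to run a captured `/api/health` response through the same checks locally. The sketch below mirrors the `backend-dev-health` conditions in plain TypeScript; it is a hypothetical helper (`failedHealthConditions` is not part of Gatus or the backend) and does not reproduce Gatus's own condition parser.

```typescript
// Only the fields the backend-dev-health conditions look at.
type HealthBody = {
  status?: string;
  checks?: {
    db?: { ok?: boolean };
    redis?: { ok?: boolean };
    rnChainRegistry?: { ok?: boolean; chainCount?: number };
    rnTokenRegistry?: { ok?: boolean; tokenCount?: number };
  };
};

// Return the conditions that would fail for a given HTTP status and body.
// An empty array means the probe would pass.
function failedHealthConditions(httpStatus: number, body: HealthBody): string[] {
  const c = body.checks ?? {};
  const conditions: Array<[string, boolean]> = [
    ["[STATUS] == 200", httpStatus === 200],
    ["[BODY].status == ok", body.status === "ok"],
    ["[BODY].checks.db.ok == true", c.db?.ok === true],
    ["[BODY].checks.redis.ok == true", c.redis?.ok === true],
    ["[BODY].checks.rnChainRegistry.ok == true", c.rnChainRegistry?.ok === true],
    ["[BODY].checks.rnChainRegistry.chainCount >= 1", (c.rnChainRegistry?.chainCount ?? 0) >= 1],
    ["[BODY].checks.rnTokenRegistry.ok == true", c.rnTokenRegistry?.ok === true],
    ["[BODY].checks.rnTokenRegistry.tokenCount >= 1", (c.rnTokenRegistry?.tokenCount ?? 0) >= 1],
  ];
  return conditions.filter(([, passed]) => !passed).map(([name]) => name);
}
```

Feeding it a body with `status: "degraded"` and an empty chain registry (`ok: false`, `chainCount: 0`) returns the three conditions that the 2026-05-28 regression would have tripped: the status check and both `rnChainRegistry` checks.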
### Notes on the config
- **`failure-threshold: 3` on dev, `2` on prod**: dev tolerates two consecutive failed probes and pages on the third; prod pages on the second.
- **Telegram alerts only**, matching the rest of the CI/ops channel. If a separate ops channel is wanted, give Gatus its own bot + chat ID.
- **External-dependency probes accept a set of status codes**, not a strict 200. RN's API answering with 401 still means it is reachable; what we care about is whether the upstream is alive.
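
To make the threshold behaviour concrete, here is a simplified model of consecutive-failure alerting. It is an illustration of how `failure-threshold` and `success-threshold` interact, not Gatus's actual implementation, and the `simulateAlerts` name is made up for this sketch.

```typescript
type AlertEvent = { probe: number; kind: "triggered" | "resolved" };

// Replay a sequence of probe outcomes (true = success) and report when an
// alert would trigger and resolve under consecutive-count thresholds.
function simulateAlerts(
  results: boolean[],
  failureThreshold: number,
  successThreshold: number,
): AlertEvent[] {
  const events: AlertEvent[] = [];
  let failures = 0;
  let successes = 0;
  let firing = false;
  results.forEach((ok, index) => {
    // Each outcome extends one streak and resets the other.
    if (ok) {
      successes += 1;
      failures = 0;
    } else {
      failures += 1;
      successes = 0;
    }
    if (!firing && failures >= failureThreshold) {
      firing = true;
      events.push({ probe: index + 1, kind: "triggered" });
    } else if (firing && successes >= successThreshold) {
      firing = false;
      events.push({ probe: index + 1, kind: "resolved" });
    }
  });
  return events;
}

// One isolated failure (probe 2) does not page; three in a row (probes 4-6) do.
const devEvents = simulateAlerts([true, false, true, false, false, false, true, true], 3, 2);
```

With the dev thresholds (3 and 2), `devEvents` contains a `triggered` event at probe 6 and a `resolved` event at probe 8. Running the same sequence with the prod failure threshold of 2 triggers one probe earlier, at probe 5.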
---
## Required follow-up issues
| # | What | Where | Status |
|---|------|-------|--------|
| 1 | Backend: add `GET /api/health` endpoint exposing the structured check object | `backend` (escrow-backend) | ✅ Shipped in 2.6.49 |
| 2 | Frontend: add `/api/health` Next.js route that fetches backend health and surfaces it (optional, for end-to-end check) | `frontend` (escrow-frontend) | ⏳ Pending |
| 3 | Ops: deploy Gatus config to the homelab Gatus instance, wire to Telegram | `deployment` repo (`deployment/gatus/config.yaml`) | ✅ Config committed; needs `docker-compose up -d gatus` on server |
| 4 | Ops: document the runbook for each alert (what to check when "backend-dev-health" fires) | this doc, or a sibling runbook file | ⏳ Pending |
---
## What this would have caught (incident retrospective)
| Incident | Probe that would have fired | How long until alert |
|---|---|---|
| 2026-05-28 BSC `unsupported_chain:56` | `backend-dev-health.checks.rnChainRegistry.chainCount >= 1` | ~90s (3× 30s probes) |
| Stale image after silent-build-fail (Tasks #9/#10) | `backend-dev-version`, once a `[BODY].version` condition pinned to the expected post-deploy version is added (the proposed config only checks that the version is non-empty) | ~3 min |
| [[devEscrow_nginx_after_redeploy]] (stale upstream 502s) | `backend-dev-version` returning 502 | ~3 min |
| Mongo password rotation breaking connections | `backend-dev-health.checks.db.ok` | ~90s |
| RN API outage on payment-intent creation | `rn-api-reachable` | ~15 min (3× 5m probes); slower, but acceptable for an upstream we don't control |
Net: every major silent-mode incident in this project would now have an alert. The cost is one new endpoint, one Gatus config file, and one Telegram chat ID.
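
The "how long until alert" figures in the table follow from interval × failure threshold. A small sketch of that arithmetic, under the simplifying assumption that probe duration and timeouts are negligible (the function name is illustrative):

```typescript
// Bounds on the time between an outage starting and the alert firing.
// The first failing probe lands between 0 and one interval after the outage
// begins, and the alert fires on the failureThreshold-th consecutive failure.
function alertDelayBoundsSec(
  intervalSec: number,
  failureThreshold: number,
): { min: number; max: number } {
  return {
    min: intervalSec * (failureThreshold - 1),
    max: intervalSec * failureThreshold,
  };
}

const devHealth = alertDelayBoundsSec(30, 3);   // backend-dev-health: 60 to 90 s
const devVersion = alertDelayBoundsSec(60, 3);  // backend-dev-version: 120 to 180 s
const rnApi = alertDelayBoundsSec(300, 3);      // rn-api-reachable: 600 to 900 s
```

The table quotes the upper bound in each case: ~90s, ~3 min, and ~15 min.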