ops: draft Gatus monitoring proposal + /api/health endpoint shape
Captures the runtime-monitoring side of the 2026-05-28 silent-empty-registry incident retrospective. Pairs with backend commit 28b17f2 (CI typecheck gate). Defines the proposed Gatus probe set, the /api/health endpoint that has to land first, and a follow-up issue list. Includes a retrospective table showing what this would have caught across recent incidents.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
260
08 - Operations/Gatus Monitoring - Proposed Config.md
Normal file
@@ -0,0 +1,260 @@
# Gatus Monitoring — Proposed Config

**Status:** Draft / proposal. Not deployed yet.
**Owner:** nick + claude
**Author date:** 2026-05-28
**Related:** [[Handoff - Request Network In-House Checkout - 2026-05-28]], memory entries `woodpecker_silent_build_fail` and `feedback-json-assets-copy-to-dist`.

---

## Why

On 2026-05-28 dev.amn.gg silently regressed: every BSC checkout returned `unsupported_chain:56` for hours before a user reported it. The cause was a build-pipeline bug ([[woodpecker_silent_build_fail]]) compounded by an empty in-process chain registry, which the backend kept serving because the load failure was logged with `console.error` and then swallowed.

The CI typecheck gate added in backend commit `28b17f2` closes the build side. The runtime side (*is the deployed thing currently healthy?*) is Gatus's job. A Gatus probe checking the chain registry would have paged within roughly 90 seconds of today's regression (three consecutive failed 30s probes), instead of waiting for a user to notice.

Gatus also catches drift that CI cannot:

- A configuration file edited live on the server.
- A dependency that quietly fails to load on container restart.
- A database connection that drops mid-day.
- Stale upstream IPs after [[devEscrow_nginx_after_redeploy]].

---

## What we should monitor

Each endpoint serves one of three purposes: **liveness** (is the container up?), **structural invariants** (is the data the container needs actually loaded?), or **integration health** (can it reach the things it depends on?).

### Backend — dev.amn.gg

| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| `GET /api/version` | Liveness | 60s | `[STATUS] == 200`, `len([BODY].version) > 0` |
| `GET /api/admin/rn/networks` (auth-gated) | Chain registry not empty | 60s | `[STATUS] == 200`, `len([BODY].chains) > 0` |
| `GET /api/health` (does not exist yet; see "Required backend work" below) | DB + Redis + registry health all in one | 30s | `[STATUS] == 200`, `[BODY].status == ok` |

The most valuable probe is the chain-registry one. The current `/api/admin/rn/networks` requires admin auth, so either:

1. Gatus sends an admin token with each request (cheapest now, but it leaks an admin token into the monitoring config).
2. We add a `GET /api/health` endpoint that exposes a *subset* of invariants (counts only, no addresses) without auth. **Recommended.**

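
A minimal sketch of the counts-only idea in option 2. The `ChainEntry` shape and the helper name `summarizeChainRegistry` are assumptions for illustration; the real registry type is not shown in this doc. The helper reduces the in-process registry to an `{ ok, chainCount }` pair, so an empty registry reports `ok: false` and no address ever reaches the public response:

```typescript
// Hypothetical registry entry; the real backend type may differ.
interface ChainEntry {
  chainId: number;
  name: string;
  contractAddress: string; // never copied into the health payload
}

interface RegistryCheck {
  ok: boolean;
  chainCount: number;
}

// Reduce the registry to a count so the public endpoint exposes no addresses.
function summarizeChainRegistry(chains: readonly ChainEntry[]): RegistryCheck {
  const chainCount = chains.length;
  return { ok: chainCount > 0, chainCount };
}
```

An empty array, which is what the backend was serving during the incident, maps to `{ ok: false, chainCount: 0 }`.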
### Backend — prod.amn.gg

Identical probes, in a separate Gatus group so dev incidents don't drown out prod ones.

### Frontend — dev.amn.gg / prod.amn.gg

| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| `GET /` | Page renders | 60s | `[STATUS] == 200`, `[RESPONSE_TIME] < 3000` (milliseconds) |
| `GET /api/health` (Next.js route, proxy to backend) | End-to-end reachability | 60s | `[STATUS] == 200` |

### External dependencies

| Endpoint | Purpose | Interval | Condition |
|---|---|---|---|
| `https://api.request.network/...` | RN API reachable | 5m | `[STATUS] == any(200, 401)` (401 is fine: it means the API answered) |
| `https://public.chainalysis.com/api/v1/address/...` | AML provider reachable | 5m | `[STATUS] == any(200, 404)` |
| `https://bsc-rpc.publicnode.com` (`eth_chainId`) | RPC liveness | 2m | `[BODY].result == 0x38` |

If RN's API goes down, in-house checkout still works (we already have the cached intent), but new payment creation fails. Gatus catching this lets us flip to the hosted-page fallback proactively.

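
The `0x38` expected by the RPC probe is the hex encoding of chain ID 56, the same BSC chain ID that appeared in the `unsupported_chain:56` error. A small sketch of that conversion, using a hypothetical helper `parseChainId` that is not part of the backend:

```typescript
// Decode the hex quantity returned by eth_chainId into a decimal chain ID.
function parseChainId(hexQuantity: string): number {
  if (!/^0x[0-9a-fA-F]+$/.test(hexQuantity)) {
    throw new Error(`not a hex quantity: ${hexQuantity}`);
  }
  // parseInt with radix 16 accepts the leading "0x" prefix.
  return Number.parseInt(hexQuantity, 16);
}

const bscChainId = parseChainId("0x38"); // 56
```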

---

## Required backend work (before Gatus can be useful)

The probe set above presumes a `GET /api/health` endpoint that exposes invariants without admin auth. It does NOT exist today.

**Shape of the endpoint:**

```ts
// GET /api/health (public, rate-limited but not auth-gated)
{
  "status": "ok" | "degraded" | "down",
  "version": "2.6.48",
  "uptimeSec": 12345,
  "checks": {
    "db": { "ok": true, "latencyMs": 4 },
    "redis": { "ok": true, "latencyMs": 1 },
    "rnChainRegistry": { "ok": true, "chainCount": 5 },
    "rnTokenRegistry": { "ok": true, "tokenCount": 10 },
    "rnApi": { "ok": true, "latencyMs": 134 }
  }
}
```

Each `checks.*.ok` must reflect the actual current state, not a cached one. If any check fails, `status` becomes `degraded`. If the failing check is the database (`checks.db.ok === false`), `status` becomes `down` instead.

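
The status rule above can be sketched as a pure function. This is an illustration under the stated rule only; `deriveStatus` and the `CheckResult` type are hypothetical names, and treating a failed `db` check as taking precedence over every other failure is an assumption drawn from the wording above:

```typescript
type Status = "ok" | "degraded" | "down";

interface CheckResult {
  ok: boolean;
}

// Assumed rule: a failed db check means "down"; any other failed
// check means "degraded"; otherwise the service is "ok".
function deriveStatus(checks: Record<string, CheckResult>): Status {
  const db = checks["db"];
  if (db !== undefined && !db.ok) {
    return "down";
  }
  const anyFailed = Object.values(checks).some((check) => !check.ok);
  return anyFailed ? "degraded" : "ok";
}
```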

**Why this shape rather than per-check endpoints:**

- One probe, all invariants: cheaper for Gatus and clearer in the dashboard.
- The structure lets us add invariants later (e.g. `walletMonitor.ok`, `paymentRedisService.queueDepth`) without changing the URL.
- Public exposure of counts (not addresses, not balances) is low-risk.

**Estimated work:** half a day of backend work, including a unit test asserting that each check's `ok` flag flips to `false` when its underlying dependency is mocked to fail.

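
One way the per-check wrapper and its mocked-failure test could look. The names `runCheck`, `healthyPing`, and `failingPing` are hypothetical, and the sketch omits a timeout, which a real check would likely need so that a hung dependency cannot stall the endpoint:

```typescript
interface TimedCheck {
  ok: boolean;
  latencyMs: number;
}

// Run one dependency probe and turn a thrown error or rejection into ok: false.
async function runCheck(probe: () => Promise<void>): Promise<TimedCheck> {
  const startedAt = Date.now();
  try {
    await probe();
    return { ok: true, latencyMs: Date.now() - startedAt };
  } catch {
    return { ok: false, latencyMs: Date.now() - startedAt };
  }
}

// Mocked dependencies standing in for real DB or Redis pings.
const healthyPing = async (): Promise<void> => {};
const failingPing = async (): Promise<void> => {
  throw new Error("connection refused");
};
```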

---

## Proposed Gatus config

Once `/api/health` exists, the config is straightforward. Drop this into wherever the homelab Gatus instance reads its config from. Adjust group names and Slack/Telegram webhook references to whatever the existing Gatus setup uses for other services.

```yaml
# gatus.amanat.yaml
#
# Amanat escrow monitoring. Four groups: backend-dev, backend-prod,
# frontend (dev + prod), external (RN + Chainalysis + RPCs).
#
# Alerting: reuse the existing CI Telegram channel so Gatus alerts show up
# in the same channel where deploy notifications already go.

alerting:
  telegram:
    token: "${TG_TOKEN}"
    id: "${TG_GATUS_CHAT_ID}"
    default-alert:
      enabled: true
      send-on-resolved: true
      failure-threshold: 3  # 3 consecutive failures before paging
      success-threshold: 2

endpoints:

  # ── Backend (dev) ───────────────────────────────────────────────────
  - name: backend-dev-version
    group: backend-dev
    url: https://dev.amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "len([BODY].version) > 0"  # version string must be non-empty
    alerts:
      - type: telegram

  - name: backend-dev-health
    group: backend-dev
    url: https://dev.amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.ok == true"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram

  # ── Backend (prod) ──────────────────────────────────────────────────
  - name: backend-prod-version
    group: backend-prod
    url: https://amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "len([BODY].version) > 0"
    alerts:
      - type: telegram
        failure-threshold: 2  # tighter on prod

  - name: backend-prod-health
    group: backend-prod
    url: https://amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram
        failure-threshold: 2

  # ── Frontend ────────────────────────────────────────────────────────
  - name: frontend-dev
    group: frontend
    url: https://dev.amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"  # milliseconds
    alerts:
      - type: telegram

  - name: frontend-prod
    group: frontend
    url: https://amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram
        failure-threshold: 2

  # ── External dependencies ───────────────────────────────────────────
  - name: rn-api-reachable
    group: external
    url: https://api.request.network/v2/health
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 401, 404)"  # any answer = up
    alerts:
      - type: telegram

  - name: chainalysis-public-api
    group: external
    url: https://public.chainalysis.com/api/v1/address/0x0000000000000000000000000000000000000000
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 404)"
    alerts:
      - type: telegram

  - name: bsc-rpc-publicnode
    group: external
    method: POST
    url: https://bsc-rpc.publicnode.com
    headers:
      Content-Type: application/json
    body: '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}'
    interval: 2m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].result == 0x38"  # 0x38 = chain ID 56 (BSC); unquoted like the other string conditions
    alerts:
      - type: telegram
```

### Notes on the config

- **`failure-threshold: 3` by default, `2` on prod**: dev and external probes tolerate two consecutive failed probes and page on the third; prod pages on the second.
- **Telegram alerts only**, matching the rest of the CI/ops channel. If a separate ops channel is wanted, give Gatus its own bot + chat ID.
- **External-dependency probes accept a set of status codes**, not a strict 200. RN's API answering with 401 still means it's reachable; what we care about is "is the upstream alive."

---

## Required follow-up issues

| # | What | Where |
|---|------|-------|
| 1 | Backend: add `GET /api/health` endpoint exposing the structured check object | `backend` (escrow-backend) |
| 2 | Frontend: add `/api/health` Next.js route that fetches backend health and surfaces it (optional, for end-to-end check) | `frontend` (escrow-frontend) |
| 3 | Ops: deploy Gatus config to the homelab Gatus instance, wire to Telegram | wherever the existing Gatus lives |
| 4 | Ops: document the runbook for each alert (what to check when "backend-dev-health" fires) | this doc, or a sibling runbook file |

---

## What this would have caught (incident retrospective)

| Incident | Probe that would have fired | How long until alert |
|---|---|---|
| 2026-05-28 BSC `unsupported_chain:56` | `backend-dev-health`: `[BODY].checks.rnChainRegistry.chainCount >= 1` | ~90s (3 × 30s probes) |
| Stale image after silent-build-fail (Tasks #9/#10) | `backend-dev-version`: `[BODY].version` not matching the expected post-deploy version (the proposed config only checks that the version is non-empty, so this needs an added expected-version condition) | ~3 min |
| [[devEscrow_nginx_after_redeploy]] (stale upstream 502s) | `backend-dev-version` returning 502 | ~3 min |
| Mongo password rotation breaking connections | `backend-dev-health`: `[BODY].checks.db.ok == true` | ~90s |
| RN API outage on payment-intent creation | `rn-api-reachable` | ~15 min (3 × 5m probes); slower, but acceptable for an upstream we don't control |

Net: every major silent-mode incident in this project would now have an alert. The cost is one new endpoint, one Gatus config file, and one Telegram chat ID.
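
The alert times in the table follow from interval × failure threshold. A short sketch of that arithmetic, with a hypothetical helper `worstCaseSecondsToAlert`; it ignores probe timeouts and Telegram delivery delay, so treat the results as approximate upper bounds:

```typescript
// Upper bound on detection time: a fault can start just after a probe,
// so it takes up to failureThreshold full intervals to accumulate failures.
function worstCaseSecondsToAlert(intervalSec: number, failureThreshold: number): number {
  return intervalSec * failureThreshold;
}

const retrospective = [
  { probe: "backend-dev-health", intervalSec: 30, failureThreshold: 3 },
  { probe: "backend-dev-version", intervalSec: 60, failureThreshold: 3 },
  { probe: "rn-api-reachable", intervalSec: 300, failureThreshold: 3 },
];

for (const row of retrospective) {
  const seconds = worstCaseSecondsToAlert(row.intervalSec, row.failureThreshold);
  console.log(`${row.probe}: up to ${seconds}s`);
}
```

This gives 90s, 180s, and 900s, matching the ~90s, ~3 min, and ~15 min figures above. With prod's `failure-threshold: 2`, the same probes would alert after at most two intervals.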