Siavash Sameni 8a9e562ced ops: draft Gatus monitoring proposal + /api/health endpoint shape

Captures the runtime-monitoring side of the 2026-05-28 silent-empty-
registry incident retrospective. Pairs with backend commit 28b17f2
(CI typecheck gate). Defines the proposed Gatus probe set, the
/api/health endpoint that has to land first, and a follow-up issue
list. Includes a retrospective table showing what this would have
caught across recent incidents.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 21:33:33 +04:00

Gatus Monitoring — Proposed Config

Status: Draft / proposal. Not deployed yet.
Owner: nick + claude
Author date: 2026-05-28
Related: Handoff - Request Network In-House Checkout - 2026-05-28, memory entries woodpecker_silent_build_fail and feedback-json-assets-copy-to-dist.

Why

On 2026-05-28 dev.amn.gg silently regressed: every BSC checkout returned unsupported_chain:56 for hours before a user reported it. The cause was a build-pipeline bug (woodpecker_silent_build_fail) compounded by an in-process empty chain registry that the backend served happily because the load failure was swallowed by console.error.

The CI typecheck gate added in backend commit 28b17f2 closes the build side. The runtime side (is the deployed thing currently healthy?) is Gatus's job. A Gatus probe checking the chain registry every 30 seconds with a failure threshold of 3 would have paged within roughly 90 seconds of today's regression, instead of waiting for a user to notice.

Gatus also catches drift that CI cannot:

A configuration file edited live on the server.
A dependency that quietly fails to load on container restart.
A database connection that drops mid-day.
Stale upstream IPs after devEscrow_nginx_after_redeploy.

What we should monitor

Each probe serves one of three purposes: liveness (is the container up?), structural invariants (is the data the container needs actually loaded?), or integration health (can it reach the things it depends on?).

Backend — dev.amn.gg

Endpoint	Purpose	Interval	Condition
`GET /api/version`	Liveness	60s	`[STATUS] == 200`, `len([BODY].version) > 0`
`GET /api/admin/rn/networks` (auth-gated)	Chain registry not empty	60s	`[STATUS] == 200`, `len([BODY].chains) > 0`
`GET /api/health` (does not exist yet; see "Required backend work" below)	DB + Redis + registry health all in one	30s	`[STATUS] == 200`, `[BODY].status == ok`

The most valuable probe is the chain-registry one. The current /api/admin/rn/networks requires admin auth, so either:

Gatus posts an admin token per request (cheapest now, leaks an admin token into the monitoring config).
We add a GET /api/health endpoint that exposes a subset of invariants (counts only, no addresses) without auth. Recommended.
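To illustrate the "counts only, no addresses" idea, here is a minimal TypeScript sketch of how a registry could be summarised for a public endpoint. The `ChainEntry` shape and the `chainRegistryCheck` name are assumptions made for illustration; the real registry fields are not shown in this document.

```typescript
// Hypothetical registry entry shape; the real fields may differ.
type ChainEntry = { chainId: number; contractAddress: string };

// Summarise the registry for public exposure: a flag and a count, no addresses.
// An empty registry (the 2026-05-28 failure mode) reports ok: false.
function chainRegistryCheck(chains: ChainEntry[]): { ok: boolean; chainCount: number } {
  return { ok: chains.length > 0, chainCount: chains.length };
}

const loaded: ChainEntry[] = [{ chainId: 56, contractAddress: "0x0000000000000000000000000000000000000000" }];
const healthy = chainRegistryCheck(loaded); // { ok: true, chainCount: 1 }
const empty = chainRegistryCheck([]);       // { ok: false, chainCount: 0 }
```

The summary never copies `contractAddress` into its output, so the public response carries no more than the count.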

Backend — prod.amn.gg

Identical probes, separate Gatus group so dev incidents don't drown out prod ones.

Frontend — dev.amn.gg / prod.amn.gg

Endpoint	Purpose	Interval	Condition
`GET /`	Page renders	60s	`[STATUS] == 200`, `[RESPONSE_TIME] < 3000` (milliseconds)
`GET /api/health` (Next.js route, proxy to backend)	End-to-end reachability	60s	`[STATUS] == 200`
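The frontend proxy probe could be backed by logic along these lines. This is a sketch under stated assumptions: the `FetchLike` type, the `proxyHealth` name, the example URL, and the choice to map an unreachable backend to 503 are all hypothetical, and wiring the function into an actual Next.js route handler is left out.

```typescript
// Minimal stand-ins for the parts of fetch() this sketch needs (assumed types).
type BackendReply = { status: number; json(): Promise<unknown> };
type FetchLike = (url: string) => Promise<BackendReply>;
type ProxyResult = { status: number; body: unknown };

async function proxyHealth(fetchImpl: FetchLike, backendHealthUrl: string): Promise<ProxyResult> {
  try {
    const reply = await fetchImpl(backendHealthUrl);
    // Pass the backend's status code and JSON body through unchanged.
    return { status: reply.status, body: await reply.json() };
  } catch {
    // Backend unreachable, or it answered with a non-JSON body:
    // report that explicitly instead of hiding it behind a 200.
    return { status: 503, body: { status: "down", reason: "backend_unreachable" } };
  }
}

// Fake fetch implementations for demonstration only.
const okFetch: FetchLike = async () => ({ status: 200, json: async () => ({ status: "ok" }) });
const deadFetch: FetchLike = async () => {
  throw new Error("ECONNREFUSED");
};
```

Because the backend status code is forwarded as-is, the `[STATUS] == 200` condition on the frontend probe only passes when the whole path from frontend to backend is working.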

External dependencies

Endpoint	Purpose	Interval	Condition
`https://api.request.network/...`	RN API reachable	5m	`[STATUS] == any(200, 401)` (401 is fine: it means the API answered)
`https://public.chainalysis.com/api/v1/address/...`	AML provider reachable	5m	`[STATUS] == any(200, 404)`
`https://bsc-rpc.publicnode.com` (eth_chainId)	RPC liveness	2m	`[BODY].result == "0x38"`
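The RPC liveness condition relies on `eth_chainId` returning the chain ID as a hex string, so BSC (chain 56, the chain in `unsupported_chain:56`) answers `0x38`. A small sketch of the same check in TypeScript, with hypothetical helper names:

```typescript
// JSON-RPC request body for the probe (same payload shape as an eth_chainId call).
const chainIdRequest = { jsonrpc: "2.0", method: "eth_chainId", params: [], id: 1 };
const requestBody = JSON.stringify(chainIdRequest);

// Convert the hex string returned by eth_chainId into a number.
function parseChainId(hexResult: string): number {
  return Number.parseInt(hexResult, 16);
}

// True only when the node answered and reports the chain we expect.
function isExpectedChain(rpcResponse: { result?: string }, expectedChainId: number): boolean {
  return rpcResponse.result !== undefined && parseChainId(rpcResponse.result) === expectedChainId;
}

const bscChainId = parseChainId("0x38"); // 56
```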

If RN's API goes down, in-house checkout still works (we already have the cached intent), but new payment creation fails. Gatus catching this lets us flip to the hosted-page fallback proactively.

Required backend work (before Gatus can be useful)

The probe set above presumes a GET /api/health endpoint that exposes invariants without admin auth. It does NOT exist today.

Shape of the endpoint:

// GET /api/health  (public, rate-limited but not auth-gated)
{
  "status": "ok" | "degraded" | "down",
  "version": "2.6.48",
  "uptimeSec": 12345,
  "checks": {
    "db": { "ok": true, "latencyMs": 4 },
    "redis": { "ok": true, "latencyMs": 1 },
    "rnChainRegistry": { "ok": true, "chainCount": 5 },
    "rnTokenRegistry": { "ok": true, "tokenCount": 10 },
    "rnApi": { "ok": true, "latencyMs": 134 }
  }
}

Each checks.*.ok must reflect the actual current state, not a cached one. If any check fails, status flips to degraded. A failed db check takes precedence: if db.ok === false, status is down regardless of the other checks.
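The status rule can be sketched in TypeScript as follows. The type and function names are assumptions made for illustration, not the backend's actual code.

```typescript
// Assumed shapes, mirroring the JSON above.
type CheckResult = { ok: boolean; latencyMs?: number; chainCount?: number; tokenCount?: number };
type HealthStatus = "ok" | "degraded" | "down";

// Derive the top-level status from the individual checks.
// A failed db check reports "down"; any other failed check reports "degraded".
function deriveStatus(checks: Record<string, CheckResult>): HealthStatus {
  const db = checks["db"];
  if (db !== undefined && !db.ok) return "down";
  const allOk = Object.values(checks).every((check) => check.ok);
  return allOk ? "ok" : "degraded";
}

const allGood = deriveStatus({ db: { ok: true }, redis: { ok: true } });                           // "ok"
const emptyRegistry = deriveStatus({ db: { ok: true }, rnChainRegistry: { ok: false, chainCount: 0 } }); // "degraded"
const dbDown = deriveStatus({ db: { ok: false }, redis: { ok: false } });                          // "down"
```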

Why this shape rather than per-check endpoints:

One probe, all invariants — cheaper for Gatus and clearer in the dashboard.
The structure lets us add invariants later (e.g. walletMonitor.ok, paymentRedisService.queueDepth) without changing the URL.
Public exposure of counts (not addresses, not balances) is low-risk.

Estimated work: half a day of backend work, including a unit test asserting that every check's ok flag flips to false when the underlying dependency is mocked to fail.
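One way to make each check testable is to wrap every dependency probe in a helper that turns a thrown error into ok: false. The sketch below assumes a hypothetical `runCheck` helper and uses mock functions in place of the real DB client.

```typescript
type ProbeResult = { ok: boolean; latencyMs: number };

// Run one dependency probe, measuring latency. A thrown error becomes
// ok: false rather than being swallowed or crashing the handler.
async function runCheck(probe: () => Promise<void>): Promise<ProbeResult> {
  const started = Date.now();
  try {
    await probe();
    return { ok: true, latencyMs: Date.now() - started };
  } catch {
    return { ok: false, latencyMs: Date.now() - started };
  }
}

// Mocked dependencies standing in for the real DB client.
const healthyDb = async (): Promise<void> => {};
const brokenDb = async (): Promise<void> => {
  throw new Error("connection refused");
};
```

A unit test can then call `runCheck(healthyDb)` and `runCheck(brokenDb)` and assert that `ok` is true for the first and false for the second.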

Proposed Gatus config

Once /api/health exists, the config is straightforward. Drop this into wherever the homelab Gatus instance reads its config from. Adjust group names and Slack/Telegram webhook references to whatever the existing Gatus setup uses for other services.

# gatus.amanat.yaml
#
# Amanat escrow monitoring. Four groups: backend-dev, backend-prod,
# frontend (dev + prod), external (RN + Chainalysis + RPCs).
#
# Alerting: reuse the existing CI Telegram channel so probe alerts
# show up alongside the deploy notifications that already go there.

alerting:
  telegram:
    token: "${TG_TOKEN}"
    id: "${TG_GATUS_CHAT_ID}"
    default-alert:
      enabled: true
      send-on-resolved: true
      failure-threshold: 3   # 3 consecutive failures before paging
      success-threshold: 2

endpoints:

  # ── Backend (dev) ───────────────────────────────────────────────────
  - name: backend-dev-version
    group: backend-dev
    url: https://dev.amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "len([BODY].version) > 0"   # version string is present and non-empty
    alerts:
      - type: telegram

  - name: backend-dev-health
    group: backend-dev
    url: https://dev.amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.ok == true"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram

  # ── Backend (prod) ──────────────────────────────────────────────────
  - name: backend-prod-version
    group: backend-prod
    url: https://amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "len([BODY].version) > 0"   # version string is present and non-empty
    alerts:
      - type: telegram
        failure-threshold: 2   # tighter on prod

  - name: backend-prod-health
    group: backend-prod
    url: https://amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram
        failure-threshold: 2

  # ── Frontend ────────────────────────────────────────────────────────
  - name: frontend-dev
    group: frontend
    url: https://dev.amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram

  - name: frontend-prod
    group: frontend
    url: https://amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram
        failure-threshold: 2

  # ── External dependencies ───────────────────────────────────────────
  - name: rn-api-reachable
    group: external
    url: https://api.request.network/v2/health
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 401, 404)"   # any answer = up
    alerts:
      - type: telegram

  - name: chainalysis-public-api
    group: external
    url: https://public.chainalysis.com/api/v1/address/0x0000000000000000000000000000000000000000
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 404)"
    alerts:
      - type: telegram

  - name: bsc-rpc-publicnode
    group: external
    method: POST
    url: https://bsc-rpc.publicnode.com
    headers:
      Content-Type: application/json
    body: '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}'
    interval: 2m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].result == 0x38"   # 0x38 = 56 (BSC); unquoted, as in Gatus's documented string comparisons
    alerts:
      - type: telegram

Notes on the config

failure-threshold: 3 on dev, 2 on prod. Dev tolerates two consecutive failed probes before paging; prod tolerates one.
Telegram alerts only, matching the rest of the CI/ops channel. If a separate ops channel is wanted, give Gatus its own bot + chat ID.
External-dependency probes accept a set of response codes rather than a strict 200. An RN API answer of 401 still means it is reachable; the question these probes answer is whether the upstream is alive.

Required follow-up issues

#	What	Where
1	Backend: add `GET /api/health` endpoint exposing the structured check object	`backend` (escrow-backend)
2	Frontend: add `/api/health` Next.js route that fetches backend health and surfaces it (optional, for end-to-end check)	`frontend` (escrow-frontend)
3	Ops: deploy Gatus config to the homelab Gatus instance, wire to Telegram	wherever the existing Gatus lives
4	Ops: document the runbook for each alert (what to check when "backend-dev-health" fires)	this doc, or a sibling runbook file

What this would have caught (incident retrospective)

Incident	Probe that would have fired	How long until alert
2026-05-28 BSC `unsupported_chain:56`	`backend-dev-health.checks.rnChainRegistry.chainCount >= 1`	~90s (3× 30s probes)
Stale image after silent-build-fail (Tasks #9/#10)	`backend-dev-version`, once its condition pins `[BODY].version` to the expected post-deploy version (the proposed config only checks that it is non-empty)	~3 min
devEscrow_nginx_after_redeploy (stale upstream 502s)	`backend-dev-version` returning 502	~3 min
Mongo password rotation breaking connections	`backend-dev-health.checks.db.ok`	~90s
RN API outage on payment-intent creation	`rn-api-reachable`	~15 min (3 × 5m probes); slower but acceptable for an upstream we don't control
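The alert delays in the table follow from multiplying the probe interval by the failure threshold. A short sketch of that arithmetic, assuming Gatus alerts on the Nth consecutive failed probe and that the failure starts just after a successful probe (the worst case); the function name is hypothetical:

```typescript
// Worst-case seconds between a failure starting and the alert firing.
// The failure begins right after a passing probe, so N full intervals
// elapse before the Nth consecutive failure is observed.
function worstCaseSecondsUntilAlert(intervalSec: number, failureThreshold: number): number {
  return intervalSec * failureThreshold;
}

const devHealth = worstCaseSecondsUntilAlert(30, 3);    // 90  (backend-dev-health)
const devVersion = worstCaseSecondsUntilAlert(60, 3);   // 180 (backend-dev-version, ~3 min)
const rnApi = worstCaseSecondsUntilAlert(300, 3);       // 900 (rn-api-reachable, ~15 min)
const prodHealth = worstCaseSecondsUntilAlert(30, 2);   // 60  (backend-prod-health)
```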

Net: each of the silent failures listed above would have triggered an alert. The cost is one new endpoint, one Gatus config file, and one Telegram chat ID.
