nick-doc/08 - Operations/Gatus Monitoring - Proposed Config.md
2026-06-01 21:40:42 +04:00

Gatus Monitoring — Proposed Config

Status: Backend endpoint shipped in 2.6.49 (backend@6c01a30). Gatus config ready for deployment. Frontend /api/health proxy and ops deployment still pending.
Owner: nick + claude
Author date: 2026-05-28
Related: Handoff - Request Network In-House Checkout - 2026-05-28, memory entries woodpecker_silent_build_fail and feedback-json-assets-copy-to-dist.


Why

On 2026-05-28 dev.amn.gg silently regressed: every BSC checkout returned unsupported_chain:56 for hours before a user reported it. The cause was a build-pipeline bug (woodpecker_silent_build_fail) compounded by an in-process empty chain registry that the backend served happily because the load failure was swallowed by console.error.

The CI typecheck gate added in backend commit 28b17f2 closes the build side. The runtime side (is the deployed thing currently healthy?) is Gatus's job. A Gatus probe hitting the registry endpoint would have alerted within about 90 seconds of the regression (three consecutive failed 30-second probes), instead of waiting for a user to notice.

Gatus also catches drift that CI cannot:

  • A configuration file edited live on the server.
  • A dependency that quietly fails to load on container restart.
  • A database connection that drops mid-day.
  • Stale upstream IPs after devEscrow_nginx_after_redeploy.

What we should monitor

Each endpoint serves one of three purposes: liveness (is the container up?), structural invariants (is the data the container needs actually loaded?), integration health (can it reach the things it depends on?).

Backend — dev.amn.gg

| Endpoint | Purpose | Interval | Condition |
| --- | --- | --- | --- |
| GET /api/version | Liveness | 60s | [STATUS] == 200, [BODY].version != "" |
| GET /api/admin/rn/networks (auth-gated) | Chain registry not empty | 60s | [STATUS] == 200, [BODY].chains[0].chainId != null |
| GET /api/health (shipped in backend 2.6.49; see "Required backend work" below) | DB + Redis + registry health all in one | 30s | [STATUS] == 200, [BODY].status == "ok" |

The most valuable probe is the chain-registry one. The current /api/admin/rn/networks requires admin auth, so either:

  1. Gatus posts an admin token per request (cheapest now, leaks an admin token into the monitoring config).
  2. We add a GET /api/health endpoint that exposes a subset of invariants (counts only, no addresses) without auth. Recommended, and shipped in backend 2.6.49.

Backend — prod.amn.gg

Identical probes, separate Gatus group so dev incidents don't drown out prod ones.

Frontend — dev.amn.gg / prod.amn.gg

| Endpoint | Purpose | Interval | Condition |
| --- | --- | --- | --- |
| GET / | Page renders | 60s | [STATUS] == 200, [RESPONSE_TIME] < 3000 (milliseconds) |
| GET /api/health (Next.js route, proxy to backend) | End-to-end reachability | 60s | [STATUS] == 200 |
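The frontend /api/health route is still pending, so the following is only a sketch of what the proxy logic might look like. It is written as a framework-agnostic function rather than an actual Next.js handler. The names `proxyBackendHealth`, `HealthFetcher`, and the `reason` field are hypothetical, and the backend URL is a parameter because the real internal address is not stated in this doc.

```typescript
// Minimal slice of the fetch API that this sketch relies on.
interface HealthResponse {
  status: number;
  json(): Promise<unknown>;
}
type HealthFetcher = (url: string) => Promise<HealthResponse>;

interface ProxiedHealth {
  status: number;
  body: unknown;
}

// Forwards the backend's health report unchanged. A network failure is mapped
// to a 502 so that a "[STATUS] == 200" probe fails when the backend cannot be
// reached from the frontend container.
async function proxyBackendHealth(
  fetcher: HealthFetcher,
  backendHealthUrl: string, // assumption: supplied by deployment config
): Promise<ProxiedHealth> {
  try {
    const res = await fetcher(backendHealthUrl);
    return { status: res.status, body: await res.json() };
  } catch {
    // Hypothetical error body; the real route may choose a different shape.
    return { status: 502, body: { status: "down", reason: "backend_unreachable" } };
  }
}
```

In a real route the global `fetch` could presumably be passed as the `fetcher`, with the returned status and body copied onto the route's response.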

External dependencies

| Endpoint | Purpose | Interval | Condition |
| --- | --- | --- | --- |
| https://api.request.network/... | RN API reachable | 5m | [STATUS] == any(200, 401, 404) (any answer means it is up) |
| https://public.chainalysis.com/api/v1/address/... | AML provider reachable | 5m | [STATUS] == any(200, 404) |
| https://bsc-rpc.publicnode.com (eth_chainId) | RPC liveness | 2m | [BODY].result == "0x38" |

If RN's API goes down, in-house checkout still works (we already have the cached intent), but new payment creation fails. Gatus catching this lets us flip to the hosted-page fallback proactively.
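The RPC probe in the table above compares the raw hex string "0x38", which is chain ID 56, the same number that appeared in the unsupported_chain:56 errors. As a minimal sketch of that check outside Gatus, the hypothetical helper `chainIdFromRpc` decodes an eth_chainId response; the interface name is also an assumption.

```typescript
// Shape of a JSON-RPC 2.0 reply to eth_chainId; `result` is absent on errors.
interface ChainIdRpcResponse {
  jsonrpc: string;
  id: number;
  result?: string;
}

// Returns the decimal chain ID, or null if the reply has no well-formed
// hex-encoded result.
function chainIdFromRpc(body: ChainIdRpcResponse): number | null {
  const result = body.result;
  if (typeof result !== "string" || !/^0x[0-9a-fA-F]+$/.test(result)) {
    return null;
  }
  return parseInt(result, 16);
}

const bscReply: ChainIdRpcResponse = { jsonrpc: "2.0", id: 1, result: "0x38" };
console.log(chainIdFromRpc(bscReply)); // prints 56
```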


Required backend work (before Gatus can be useful)

The GET /api/health endpoint shipped in backend 2.6.49. It is public, exempt from the global rate limiter, and returns the structured check object below.

Shape of the endpoint:

// GET /api/health  (public, skipped by the global rate limiter)
{
  "status": "ok" | "degraded" | "down",
  "version": "2.6.84",
  "uptimeSec": 12345,
  "checks": {
    "db": { "ok": true, "latencyMs": 4 },
    "redis": { "ok": true, "latencyMs": 1 },
    "rnChainRegistry": { "ok": true, "chainCount": 5 },
    "rnTokenRegistry": { "ok": true, "tokenCount": 10 },
    "rnApi": { "ok": true, "latencyMs": 134 }
  }
}

Each checks.*.ok reflects the current backend state, except rnApi, which is cached for 60 seconds as of backend 2.6.84 to avoid monitoring-induced upstream rate limits. rnApi.status === 429 is treated as reachable because Request Network answered; 5xx/timeouts still degrade the report. If any non-DB check fails, status flips to degraded. If db.ok === false, status flips to down.
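The roll-up rules above can be sketched as follows. This is an illustration, not the actual healthCheckService code: `rollUpStatus` and `rnApiReachable` are hypothetical names, and treating every upstream status below 500 as "answered" is an assumption (the doc only states the 429, 5xx, and timeout cases).

```typescript
type HealthStatus = "ok" | "degraded" | "down";

interface CheckResult {
  ok: boolean;
}

// `null` stands for a timeout or network error. Assumption: any HTTP status
// below 500 (including 429) means Request Network answered.
function rnApiReachable(httpStatus: number | null): boolean {
  if (httpStatus === null) return false;
  return httpStatus < 500;
}

function rollUpStatus(checks: Record<string, CheckResult>): HealthStatus {
  // A failed database check takes priority over everything else.
  const db = checks["db"];
  if (db !== undefined && !db.ok) return "down";

  // Any other failed check degrades the report without marking it down.
  for (const name of Object.keys(checks)) {
    const check = checks[name];
    if (check !== undefined && !check.ok) return "degraded";
  }
  return "ok";
}

console.log(rollUpStatus({ db: { ok: true }, rnApi: { ok: rnApiReachable(429) } })); // prints "ok"
console.log(rollUpStatus({ db: { ok: true }, rnApi: { ok: rnApiReachable(503) } })); // prints "degraded"
console.log(rollUpStatus({ db: { ok: false }, redis: { ok: true } })); // prints "down"
```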

Why this shape rather than per-check endpoints:

  • One probe, all invariants — cheaper for Gatus and clearer in the dashboard.
  • The structure lets us add invariants later (e.g. walletMonitor.ok, paymentRedisService.queueDepth) without changing the URL.
  • Public exposure of counts (not addresses, not balances) is low-risk.

Backend work: Complete (2.6.49). Includes healthCheckService with 5 checks, route wired in app.ts, rate-limiter + logging skip, and 5 route-level unit tests.

Postgres cutover monitoring: As of deployment 38cb75b, the live dev config also asserts checks.postgres.enabledStoreCount >= 7 plus the individual checks.postgres.storeModes.* == "postgres" values for auth, config, address, category, level config, shop settings, and reviews.
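For ad-hoc debugging, the same cutover assertions could be evaluated against a fetched health payload. The sketch below assumes the field names used in the Gatus conditions (enabledStoreCount, storeModes, and the seven store keys); the helper names are hypothetical.

```typescript
// Assumed slice of checks.postgres, based on the fields the probe asserts.
interface PostgresCheck {
  ok: boolean;
  enabledStoreCount: number;
  storeModes: Record<string, string>;
}

// The seven stores expected to run on Postgres after the cutover.
const REQUIRED_STORES: string[] = [
  "auth", "config", "address", "category",
  "levelConfig", "shopSettings", "review",
];

// Lists the stores whose mode is missing or is anything other than "postgres".
function storesNotOnPostgres(check: PostgresCheck): string[] {
  return REQUIRED_STORES.filter((store) => check.storeModes[store] !== "postgres");
}

function cutoverComplete(check: PostgresCheck): boolean {
  return (
    check.ok &&
    check.enabledStoreCount >= REQUIRED_STORES.length &&
    storesNotOnPostgres(check).length === 0
  );
}
```

Listing the offending stores, rather than returning a single boolean, is meant to make a failing probe quicker to diagnose.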


Proposed Gatus config

Now that /api/health exists, the config is straightforward. Drop this into wherever the homelab Gatus instance reads its config from. Adjust group names and Slack/Telegram webhook references to whatever the existing Gatus setup uses for other services.

# gatus.amanat.yaml
#
# Amanat escrow monitoring. Four groups: backend-dev, backend-prod,
# frontend (dev + prod), external (RN + Chainalysis + RPCs).
#
# Alerting: reuse the existing CI Telegram channel so alerts appear
# alongside the deploy notifications.

alerting:
  telegram:
    token: "${TG_TOKEN}"
    id: "${TG_GATUS_CHAT_ID}"
    default-alert:
      enabled: true
      send-on-resolved: true
      failure-threshold: 3   # 3 consecutive failures before paging
      success-threshold: 2

endpoints:

  # ── Backend (dev) ───────────────────────────────────────────────────
  - name: backend-dev-version
    group: backend-dev
    url: https://dev.amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].version != \"\""
    alerts:
      - type: telegram

  - name: backend-dev-health
    group: backend-dev
    url: https://dev.amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == \"ok\""
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.postgres.ok == true"
      - "[BODY].checks.postgres.configured == true"
      - "[BODY].checks.postgres.required == true"
      - "[BODY].checks.postgres.enabledStoreCount >= 7"
      - "[BODY].checks.postgres.storeModes.auth == \"postgres\""
      - "[BODY].checks.postgres.storeModes.config == \"postgres\""
      - "[BODY].checks.postgres.storeModes.address == \"postgres\""
      - "[BODY].checks.postgres.storeModes.category == \"postgres\""
      - "[BODY].checks.postgres.storeModes.levelConfig == \"postgres\""
      - "[BODY].checks.postgres.storeModes.shopSettings == \"postgres\""
      - "[BODY].checks.postgres.storeModes.review == \"postgres\""
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.ok == true"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram

  # ── Backend (prod) ──────────────────────────────────────────────────
  - name: backend-prod-version
    group: backend-prod
    url: https://amn.gg/api/version
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].version != \"\""
    alerts:
      - type: telegram
        failure-threshold: 2   # tighter on prod

  - name: backend-prod-health
    group: backend-prod
    url: https://amn.gg/api/health
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == \"ok\""
      - "[BODY].checks.db.ok == true"
      - "[BODY].checks.postgres.ok == true"
      - "[BODY].checks.redis.ok == true"
      - "[BODY].checks.rnChainRegistry.chainCount >= 1"
      - "[BODY].checks.rnTokenRegistry.tokenCount >= 1"
    alerts:
      - type: telegram
        failure-threshold: 2

  # ── Frontend ────────────────────────────────────────────────────────
  - name: frontend-dev
    group: frontend
    url: https://dev.amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram

  - name: frontend-prod
    group: frontend
    url: https://amn.gg/
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 3000"
    alerts:
      - type: telegram
        failure-threshold: 2

  # ── External dependencies ───────────────────────────────────────────
  - name: rn-api-reachable
    group: external
    url: https://api.request.network/v2/health
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 401, 404)"   # any answer = up
    alerts:
      - type: telegram

  - name: chainalysis-public-api
    group: external
    url: https://public.chainalysis.com/api/v1/address/0x0000000000000000000000000000000000000000
    interval: 5m
    conditions:
      - "[STATUS] == any(200, 404)"
    alerts:
      - type: telegram

  - name: bsc-rpc-publicnode
    group: external
    method: POST
    url: https://bsc-rpc.publicnode.com
    headers:
      Content-Type: application/json
    body: '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}'
    interval: 2m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].result == \"0x38\""
    alerts:
      - type: telegram

Notes on the config

  • failure-threshold: 3 by default (dev and external), 2 on prod. Dev tolerates two consecutive failed probes and alerts on the third; prod alerts on the second.
  • Telegram alerts only, matching the rest of the CI/ops channel. If a separate ops channel is wanted, give Gatus its own bot + chat ID.
  • External-dependency probes accept a set of status codes rather than a strict 200. RN's API answering with 401 still means it is reachable; what we care about is "is the upstream alive."
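The thresholds combine with each probe's interval to bound how long a failure can go unnoticed: roughly interval × failure-threshold. A minimal sketch of that arithmetic, with `worstCaseAlertDelaySec` as a hypothetical helper name:

```typescript
// Upper bound on seconds between a failure starting and the alert firing:
// the first failed probe lands at most one interval after onset, and the
// alert needs `failureThreshold` consecutive failures.
function worstCaseAlertDelaySec(intervalSec: number, failureThreshold: number): number {
  return intervalSec * failureThreshold;
}

console.log(worstCaseAlertDelaySec(30, 3));  // backend-dev-health: prints 90
console.log(worstCaseAlertDelaySec(30, 2));  // backend-prod-health: prints 60
console.log(worstCaseAlertDelaySec(300, 3)); // rn-api-reachable: prints 900 (15 min)
```

This ignores probe timeouts and Telegram delivery latency, so real delays may run slightly longer.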

Required follow-up issues

| # | What | Where | Status |
| --- | --- | --- | --- |
| 1 | Backend: add GET /api/health endpoint exposing the structured check object | backend (escrow-backend) | Shipped in 2.6.49 |
| 2 | Frontend: add /api/health Next.js route that fetches backend health and surfaces it (optional, for end-to-end check) | frontend (escrow-frontend) | Pending |
| 3 | Ops: deploy Gatus config to the homelab Gatus instance, wire to Telegram | deployment repo (deployment/gatus/config.yaml) | Config committed; needs docker-compose up -d gatus on server |
| 4 | Ops: document the runbook for each alert (what to check when "backend-dev-health" fires) | this doc, or a sibling runbook file | Pending |

What this would have caught (incident retrospective)

| Incident | Probe that would have fired | How long until alert |
| --- | --- | --- |
| 2026-05-28 BSC unsupported_chain:56 | backend-dev-health, condition [BODY].checks.rnChainRegistry.chainCount >= 1 | ~90s (3 × 30s probes) |
| Stale image after silent-build-fail (Tasks #9/#10) | backend-dev-version, if extended with a condition pinning [BODY].version to the expected post-deploy version (the config above only asserts a non-empty version) | ~3 min |
| devEscrow_nginx_after_redeploy (stale upstream 502s) | backend-dev-version returning 502 | ~3 min |
| Mongo password rotation breaking connections | backend-dev-health, condition [BODY].checks.db.ok == true | ~90s |
| RN API outage on payment-intent creation | rn-api-reachable | ~15 min (3 failures × 5m interval); slower, but acceptable for an upstream we don't control |

Net: every major silent-mode incident in this project would now have an alert. The cost is one new endpoint, one Gatus config file, and one Telegram chat ID.