nick-doc/08 - Operations/Monitoring.md
Siavash Sameni a5d71bcc05 docs: sync documentation with latest codebase state
- Update Activity Log with 108 missing commits (48 backend + 60 frontend)
- Update version references: backend v2.8.79, frontend v2.8.94
- Update migration count: 18 migrations (0000-0017)
- Update Telegram Mini App Flow to v2.8.94
- Update Payment Flow - Scanner to 2026-06-05
- Update all architectural and database references

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-06-05 07:34:49 +04:00

title: Monitoring
tags: operations

Monitoring

What's instrumented today and what to watch. Today's stack is intentionally lean — health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.


1. Health endpoints

Two paths are registered (both are public, rate-limited, not auth-gated):

  • GET /health — simple ping used by Docker healthchecks. Returns 200 { success, message, timestamp, environment, version }. Does not probe MongoDB or Redis.
  • GET /api/health — deep health check added in commit 44579d6 (backend v2.6.49). Calls runHealthChecks from backend/src/services/health/healthCheckService.ts. Probes MongoDB, Postgres, Redis, Request Network registry data, and Request Network API reachability. Returns 503 only when report.status === 'down'. As of backend 2.8.79, Postgres is a hard dependency only when at least one *_STORE=postgres flag is enabled; otherwise an unconfigured Postgres check is reported as skipped. The Postgres check also reports active store modes so monitoring can distinguish "PG is reachable" from "this runtime is actually using PG-backed stores". As of deployment 38cb75b, dev Gatus requires all seven PG-capable store modes to be postgres and enabledStoreCount >= 7.

GET /api/health response shape (from healthCheckService):

{
  "status": "ok",
  "version": "2.8.79",
  "uptimeSec": 662,
  "checks": {
    "db": { "ok": true, "latencyMs": 4 },
    "postgres": {
      "ok": true,
      "latencyMs": 5,
      "configured": true,
      "required": true,
      "storeModes": {
        "auth": "postgres",
        "config": "postgres",
        "address": "postgres",
        "category": "postgres",
        "levelConfig": "postgres",
        "shopSettings": "postgres",
        "review": "postgres"
      },
      "enabledStores": [
        "auth",
        "config",
        "address",
        "category",
        "levelConfig",
        "shopSettings",
        "review"
      ],
      "enabledStoreCount": 7,
      "database": "amanat_dev",
      "user": "amanat"
    },
    "redis": { "ok": true, "latencyMs": 1 },
    "rnChainRegistry": { "ok": true, "latencyMs": 0, "chainCount": 7 },
    "rnTokenRegistry": { "ok": true, "latencyMs": 0, "tokenCount": 12 },
    "rnApi": { "ok": true, "latencyMs": 134, "status": 401 }
  }
}

Public URL behind Nginx: https://amn.gg/api/health.
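
The rules above can be restated as a small TypeScript sketch. The field names follow the sample response. The helper names (`httpStatusFor`, `meetsDevStoreCondition`) are hypothetical, and returning 200 for every status other than "down" is an assumption drawn from the "503 only when down" rule. This is not the code in healthCheckService.ts.

```typescript
// Partial shape of the /api/health body, limited to the fields used below.
// Status values other than "ok" and "down" are not documented, so status is a string.
interface PostgresCheck {
  ok: boolean;
  storeModes?: Record<string, string>;
  enabledStoreCount?: number;
}

interface HealthReport {
  status: string;
  version: string;
  checks: { postgres?: PostgresCheck };
}

// Documented rule: 503 only when the overall status is "down".
// Assumption: every other status is served with 200.
function httpStatusFor(report: HealthReport): number {
  return report.status === "down" ? 503 : 200;
}

// Restates the dev Gatus condition: all seven PG-capable store modes
// are "postgres" and enabledStoreCount >= 7.
function meetsDevStoreCondition(report: HealthReport): boolean {
  const pg = report.checks.postgres;
  if (!pg || !pg.ok || !pg.storeModes) return false;
  const modes = pg.storeModes;
  const names = Object.keys(modes);
  return (
    names.length >= 7 &&
    names.every((name) => modes[name] === "postgres") &&
    (pg.enabledStoreCount ?? 0) >= 7
  );
}
```

A monitor that only checks the HTTP status cannot tell "Postgres reachable" from "Postgres-backed stores active", which is why the second helper inspects the body.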


2. Docker healthchecks

Each long-lived container has a HEALTHCHECK baked in or declared in compose.

| Container | Probe | Interval | Failure threshold |
|---|---|---|---|
| nickapp-backend | node healthcheck.js (HTTP GET /health) | 30s | 3 retries |
| nickapp-frontend | curl -f http://localhost:8083/ | 30s | 3 retries |
| mongodb | mongosh --eval "db.adminCommand('ping')" | 30s | 3 retries |
| redis | redis-cli -a $REDIS_PASSWORD ping | 30s | 3 retries |

healthcheck.js (backend) is a tiny Node script that does a local HTTP GET to /health and exits 0 / 1.

Inspect health:

docker ps --format "table {{.Names}}\t{{.Status}}"

# Detailed
docker inspect --format='{{json .State.Health}}' nickapp-backend | jq

If a container is unhealthy, Watchtower will not roll it (it expects the new container to pass healthcheck). Investigate with docker logs <container>.
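
To turn the `docker inspect` output above into a one-line summary, something like the following sketch could be used. `Status`, `FailingStreak`, and `Log[].ExitCode` / `Log[].Output` are the fields Docker emits under `.State.Health`. The function name is hypothetical, and treating a `null` payload as "no healthcheck configured" is an assumption.

```typescript
// Subset of Docker's .State.Health JSON that the summary needs.
interface HealthLogEntry {
  ExitCode: number;
  Output: string;
}

interface DockerHealthState {
  Status: string;
  FailingStreak: number;
  Log?: HealthLogEntry[] | null;
}

// Summarise the JSON printed by:
//   docker inspect --format='{{json .State.Health}}' <container>
function summarizeDockerHealth(raw: string): string {
  const state = JSON.parse(raw) as DockerHealthState | null;
  // Assumption: containers without a HEALTHCHECK print "null".
  if (state === null) return "no healthcheck configured";
  const log = state.Log ?? [];
  const last = log.length > 0 ? log[log.length - 1] : undefined;
  const detail = last ? ` (last exit ${last.ExitCode}: ${last.Output.trim()})` : "";
  return `${state.Status}, failing streak ${state.FailingStreak}${detail}`;
}
```

The last log entry usually carries the probe's stderr, which is often enough to tell a crashed process from a slow start.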


3. Sentry — error tracking

Frontend

@sentry/nextjs ^10.22.0 is wired in via three config files at the repo root:

  • sentry.client.config.ts — browser SDK (with Session Replay enabled at 10% session / 100% error rate).
  • sentry.server.config.ts — server-rendered components (no Replay).
  • sentry.edge.config.ts — edge runtime (not currently used heavily).

Common settings:

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  environment: process.env.NODE_ENV || 'development',
  enabled: process.env.NODE_ENV === 'production',
  ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError', ...],
});

Errors from localhost are filtered out — only prod errors land in the dashboard.
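
To predict whether a message will be dropped by `ignoreErrors`, the sketch below approximates the matching as documented by Sentry: string entries match as substrings and RegExp entries are tested against the message. The helper name is hypothetical, only the two patterns named above are used, and the SDK's real matching may differ in detail.

```typescript
type IgnorePattern = string | RegExp;

// Approximation of ignoreErrors matching: substring for strings, test() for RegExp.
function isIgnoredMessage(message: string, patterns: IgnorePattern[]): boolean {
  return patterns.some((pattern) =>
    typeof pattern === "string" ? message.indexOf(pattern) !== -1 : pattern.test(message)
  );
}

// Only the two entries spelled out in the config above; the rest of the list is elided there.
const knownIgnorePatterns: IgnorePattern[] = [
  "ResizeObserver loop limit exceeded",
  "ChunkLoadError",
];
```

Because string entries are partial matches, a short pattern can hide more errors than intended. Prefer the most specific message fragment that still matches.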

Backend

@sentry/node ^10.22.0 + @sentry/profiling-node ^10.22.0 are initialised first in src/app.ts (before any other import) via src/config/sentry.ts. DSN comes from SENTRY_DSN env var (see Environment Variables#sentry).

What's captured:

  • Uncaught exceptions in route handlers
  • Promise rejections inside asyncHandler-wrapped routes
  • Manually-captured errors via Sentry.captureException(err)
  • Performance traces (10% sample rate in prod)
  • Profiling samples via @sentry/profiling-node

Source maps

Frontend uploads source maps to Sentry at build time when SENTRY_AUTH_TOKEN, SENTRY_ORG, and SENTRY_PROJECT are set in the CI env. Without them the build still succeeds but Sentry traces will show minified frames.

Alerts

Configure in the Sentry dashboard (Issues → Alerts) — common alerts:

  • Any new issue in production → Slack
  • Error frequency > 50/minute → page on-call
  • Performance regression on /api/payments/* traces → email

4. Logs

Backend application logs

Routed through src/utils/logger.ts — currently a thin console.log wrapper with emoji prefixes. Output goes to stdout, captured by Docker:

# Live tail
docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend

# Search for a request
docker logs nickapp-backend 2>&1 | grep "POST /api/payments"

# Limit output to the last hour
docker logs --since 1h nickapp-backend

Notable log lines to look for:

| Prefix | Meaning |
|---|---|
| ✅ Connected to MongoDB | DB connection established |
| 🚀 Server running on port 5001 | App fully started |
| 🔌 User connected: <id> | Socket.IO connection |
| 📥 | Inbound HTTP request log |
| 💳 Request Network | Request Network webhook / API call |
| 🔐 Webhook verification | Webhook signature check result |
| ❌ Error | Manual error log (also captured by Sentry) |
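
When scanning a captured log by hand gets tedious, the prefixes above can be bucketed with a short script. This is a minimal sketch: the function names are hypothetical, and it assumes each line starts with the prefix, so it will not match if `docker logs -t` prepends timestamps.

```typescript
// Prefix-to-meaning pairs copied from the table of notable log lines.
const LOG_PREFIX_MEANINGS: Array<[string, string]> = [
  ["✅ Connected to MongoDB", "DB connection established"],
  ["🚀 Server running on port", "App fully started"],
  ["🔌 User connected:", "Socket.IO connection"],
  ["📥", "Inbound HTTP request log"],
  ["💳 Request Network", "Request Network webhook / API call"],
  ["🔐 Webhook verification", "Webhook signature check result"],
  ["❌ Error", "Manual error log (also captured by Sentry)"],
];

// Return the documented meaning of a line, or "other" if no prefix matches.
function meaningOfLogLine(line: string): string {
  const text = line.trim();
  for (const [prefix, meaning] of LOG_PREFIX_MEANINGS) {
    if (text.indexOf(prefix) === 0) return meaning;
  }
  return "other";
}

// Count how many lines fall into each bucket.
function countByMeaning(lines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of lines) {
    const meaning = meaningOfLogLine(line);
    counts[meaning] = (counts[meaning] ?? 0) + 1;
  }
  return counts;
}
```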

Nginx access + error logs

Bind-mounted to ./nginx/logs/ on the host:

tail -f /opt/backend/nginx/logs/access.log
tail -f /opt/backend/nginx/logs/error.log

Rotate these via host logrotate to avoid disk fill.

Frontend logs

Next.js logs go to the container stdout:

docker logs -f nickapp-frontend

Browser-side logs that need attention go through Sentry (above) — src/utils/logger.ts in the frontend forwards via Sentry breadcrumbs.


5. Key metrics to watch

Today these are read manually from logs / Sentry. Once Prometheus is added, encode them as alerting rules.

Application

| Metric | Where to check | Healthy | Alert |
|---|---|---|---|
| 5xx rate | Sentry, Nginx access.log | < 0.5 % | > 2 % over 5 min |
| /health p95 latency | curl + timer | < 100 ms | > 1 s |
| Login success rate | Sentry custom event | > 95 % | < 90 % |
| Socket disconnect storm | 🔌 User disconnected log frequency | < 1/s sustained | > 10/s sustained |
| OpenAI 429s | Backend log OpenAI ... 429 | 0 | any |
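
The 5xx-rate row can be checked from a slice of access.log with a sketch like the one below. It assumes Nginx's default "combined" log format, where the status code follows the quoted request line. The function names are hypothetical, and selecting the 5-minute window of lines is left to the caller.

```typescript
// Extract the HTTP status from one access-log line.
// Assumes the default "combined" format: ... "GET /path HTTP/1.1" 200 512 "-" "agent"
function statusOfAccessLine(line: string): number | null {
  const match = /" (\d{3}) /.exec(line);
  return match ? Number(match[1]) : null;
}

// Fraction of parseable lines whose status is 5xx (0 when nothing parses).
function fiveXxRate(lines: string[]): number {
  let total = 0;
  let errors = 0;
  for (const line of lines) {
    const status = statusOfAccessLine(line);
    if (status === null) continue;
    total += 1;
    if (status >= 500 && status <= 599) errors += 1;
  }
  return total === 0 ? 0 : errors / total;
}

// Thresholds from the table: healthy below 0.5 %, alert above 2 %.
function classifyFiveXxRate(rate: number): string {
  if (rate > 0.02) return "alert";
  if (rate < 0.005) return "healthy";
  return "watch";
}
```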

Payments

| Metric | Where | Healthy | Alert |
|---|---|---|---|
| Payment success rate | db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}]) | > 95 % completed of 24h-old payments | < 90 % |
| Webhook signature failures | log Webhook verification failed | 0 | > 0 |
| Request Network webhook 4xx | nginx access log /api/payment/request-network/webhook | 0 | any real provider delivery returning 4xx |
| Request Network safety-pending payments | db.payments.find({"metadata.transactionSafety.status":"pending"}) | explained/short-lived | pending > 10 min without operator note |
| Request Network API errors (5xx) | log + Sentry | 0 | > 5/min sustained |
| Payouts stuck in pending > 30 min | db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now() - 30*60*1000)}}) | empty | non-empty |
| Missing transactionHash after completed | the same query that drives fix-transaction-hashes.js | empty | non-empty |
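
The aggregate in the first row returns one `{ _id, n }` document per status. A sketch of turning that into the success rate, and of computing the 30-minute cutoff for the stuck-payout query, might look like this. The function names are hypothetical, and the status value "completed" is taken from the wording of the table rather than from the schema.

```typescript
// One group from: db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}])
interface StatusGroup {
  _id: string;
  n: number;
}

// Share of payments whose status is "completed"; null when there are no payments.
function paymentSuccessRate(groups: StatusGroup[]): number | null {
  let total = 0;
  let completed = 0;
  for (const group of groups) {
    total += group.n;
    if (group._id === "completed") completed += group.n;
  }
  return total === 0 ? null : completed / total;
}

// Cutoff for "stuck" payouts: anything created before this instant is too old.
function stuckPayoutCutoff(now: Date, minutes: number = 30): Date {
  return new Date(now.getTime() - minutes * 60 * 1000);
}
```

The aggregate as written covers all payments. To match the "24h-old payments" wording, restrict it with a `$match` on `createdAt` before grouping.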

MongoDB

db.serverStatus().connections           // active connections; alert if >1000
db.serverStatus().opcounters            // ops/sec
db.serverStatus().wiredTiger.cache      // raw cache counters; derive the hit ratio from them, aim > 95 %
db.currentOp({ secs_running: { $gte: 5 } })  // long-running queries
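
`wiredTiger.cache` exposes raw counters, not a ready-made ratio. A common approximation, sketched here, derives the hit ratio from "pages requested from the cache" and "pages read into cache". The function name is hypothetical, and the counters are assumed to have been converted to plain numbers already.

```typescript
// Counters copied from db.serverStatus().wiredTiger.cache, keyed by their display names.
type WiredTigerCacheCounters = Record<string, number>;

// Approximate hit ratio: requests that did not need a page read from disk.
// Returns null when the counters are missing or no pages were requested yet.
function wiredTigerHitRatio(cache: WiredTigerCacheCounters): number | null {
  const requested = cache["pages requested from the cache"];
  const readIn = cache["pages read into cache"];
  if (requested === undefined || readIn === undefined || requested <= 0) return null;
  return (requested - readIn) / requested;
}
```

The counters are cumulative since startup, so the ratio reflects the whole uptime. Diff two samples to see recent behaviour.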

Redis

docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys

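Alert thresholds: rejected_connections > 0, evicted_keys rising while you don't expect cache pressure, p99 command latency > 5 ms. Latency is not part of INFO stats; sample it separately, for example with redis-cli --latency-history.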
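
Because "rising" needs two samples, the sketch below parses two `INFO stats` outputs and compares them. `INFO` prints `key:value` lines with `#` section headers. The function names are hypothetical, and only the counters named above are inspected.

```typescript
type RedisInfo = Record<string, string>;

// Parse the text printed by `redis-cli INFO stats` into a key/value map.
function parseRedisInfo(text: string): RedisInfo {
  const info: RedisInfo = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue; // skip blanks and section headers
    const sep = line.indexOf(":");
    if (sep > 0) info[line.slice(0, sep)] = line.slice(sep + 1);
  }
  return info;
}

// Read a numeric counter, treating missing or non-numeric values as 0.
function redisCounter(info: RedisInfo, key: string): number {
  const value = Number(info[key]);
  return isNaN(value) ? 0 : value;
}

// keyspace_hits / (keyspace_hits + keyspace_misses); null before any lookups.
function redisHitRatio(info: RedisInfo): number | null {
  const hits = redisCounter(info, "keyspace_hits");
  const total = hits + redisCounter(info, "keyspace_misses");
  return total === 0 ? null : hits / total;
}

// Compare two samples against the thresholds listed above.
function redisAlerts(previous: RedisInfo, current: RedisInfo): string[] {
  const alerts: string[] = [];
  if (redisCounter(current, "rejected_connections") > 0) alerts.push("rejected_connections > 0");
  if (redisCounter(current, "evicted_keys") > redisCounter(previous, "evicted_keys")) {
    alerts.push("evicted_keys rising");
  }
  return alerts;
}
```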

Host

| Metric | Tool | Healthy | Alert |
|---|---|---|---|
| Disk usage on /var/lib/docker | df -h | < 80 % | > 90 % |
| /opt/backend/uploads size | du -sh | watch trend | bursty growth (>5 GB/day) |
| Memory pressure | free -h, docker stats | < 80 % | swap actively used |
| Open file descriptors | ls /proc/<pid>/fd (current count) vs cat /proc/<pid>/limits | well under hard limit | nearing limit |
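
The disk row can be automated by reading the Use% column from `df -h /var/lib/docker`. The sketch below assumes the usual six-column layout with Use% in the fifth column and a filesystem name without spaces. The function names are hypothetical.

```typescript
// Read the Use% value from the first data row of `df -h <path>` output.
function diskUsePercent(dfOutput: string): number | null {
  const lines = dfOutput.trim().split(/\r?\n/);
  const row = lines[1];
  if (row === undefined) return null; // header only, or empty output
  const columns = row.trim().split(/\s+/);
  const match = /^(\d+)%$/.exec(columns[4] ?? "");
  return match ? Number(match[1]) : null;
}

// Thresholds from the table: healthy below 80 %, alert above 90 %.
function classifyDiskUse(percent: number): string {
  if (percent > 90) return "alert";
  if (percent < 80) return "healthy";
  return "watch";
}
```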

6. Smoke tests after a deploy

Drop these in a runbook for the on-call:

# 1. API health
curl -fsS https://amn.gg/api/health | jq '.status,.version'   # expect "ok"; curl -f fails on 503

# 2. Login
curl -fsS -X POST https://amn.gg/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
  | jq '.success,.data.user.email'

# 3. Frontend HTML loads
curl -fsS https://amn.gg/ -I | head -1   # expect 200

# 4. Socket.IO handshake
curl -fsS "https://amn.gg/socket.io/?EIO=4&transport=polling" -I | head -1

# 5. Containers healthy
docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"

Any non-OK → see Incident Response.


7. Future work

  • Prometheus + Grafana with Node exporter + Mongo exporter + Redis exporter — for proper time-series.
  • OpenTelemetry spans from backend → Sentry / Jaeger.
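  • Fail the deep check on partial outages too: GET /api/health already probes MongoDB, Postgres, and Redis but returns 503 only when report.status === 'down', and Docker healthchecks still call the shallow /health.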
  • PagerDuty / OpsGenie wiring from Sentry alerts.
  • Synthetic checks (Pingdom / UptimeRobot) hitting /health from multiple regions.

For now, Sentry + Docker healthchecks + manual log checks cover the basics. See Incident Response for what to do when something fires.