Siavash Sameni 5352a78e96 docs: record postgres health store modes

2026-06-01 14:00:16 +04:00

Monitoring

What's instrumented today and what to watch. Today's stack is intentionally lean — health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.

1. Health endpoint

Two paths are registered (both are public, rate-limited, not auth-gated):

GET /health — simple ping used by Docker healthchecks. Returns 200 { success, message, timestamp, environment, version }. Does not probe MongoDB or Redis.
GET /api/health — deep health check added in commit 44579d6 (backend v2.6.49). Calls runHealthChecks from backend/src/services/health/healthCheckService.ts. Probes MongoDB, Postgres, Redis, Request Network registry data, and Request Network API reachability. Returns 503 only when report.status === 'down'. As of backend 2.8.11, Postgres is a hard dependency only when at least one *_STORE=postgres flag is enabled; otherwise an unconfigured Postgres check is reported as skipped. The Postgres check also reports active store modes so monitoring can distinguish "PG is reachable" from "this runtime is actually using PG-backed stores".

GET /api/health response shape (from healthCheckService):

{
  "status": "ok",
  "version": "2.8.11",
  "uptimeSec": 662,
  "checks": {
    "db": { "ok": true, "latencyMs": 4 },
    "postgres": {
      "ok": true,
      "latencyMs": 5,
      "configured": true,
      "required": true,
      "storeModes": {
        "auth": "postgres",
        "config": "postgres",
        "address": "postgres",
        "category": "postgres",
        "levelConfig": "postgres",
        "shopSettings": "postgres",
        "review": "postgres"
      },
      "enabledStores": [
        "auth",
        "config",
        "address",
        "category",
        "levelConfig",
        "shopSettings",
        "review"
      ],
      "enabledStoreCount": 7,
      "database": "amanat_dev",
      "user": "amanat"
    },
    "redis": { "ok": true, "latencyMs": 1 },
    "rnChainRegistry": { "ok": true, "latencyMs": 0, "chainCount": 7 },
    "rnTokenRegistry": { "ok": true, "latencyMs": 0, "tokenCount": 12 },
    "rnApi": { "ok": true, "latencyMs": 134, "status": 401 }
  }
}

Public URL behind Nginx: https://amn.gg/api/health.
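
A monitor can use the postgres check to tell "PG is reachable" apart from "this runtime is actually using PG-backed stores". The sketch below is one way to do that, assuming only the fields shown in the sample response above. The PostgresCheck interface, the classifyPostgres function, and the state names are hypothetical and not part of healthCheckService.

```typescript
// Subset of checks.postgres used by this sketch; the real object carries more keys.
interface PostgresCheck {
  ok: boolean;
  configured?: boolean;
  required?: boolean;
  enabledStores?: string[];
}

// Hypothetical state names, chosen for this example only.
type PostgresState =
  | "in-use"
  | "required-but-down"
  | "reachable-unused"
  | "not-required";

function classifyPostgres(check: PostgresCheck): PostgresState {
  const stores = check.enabledStores ?? [];
  // Postgres counts as "in use" when the check is marked required
  // or at least one store mode points at it.
  const inUse = check.required === true || stores.length > 0;
  if (inUse) {
    return check.ok ? "in-use" : "required-but-down";
  }
  return check.ok && check.configured === true
    ? "reachable-unused"
    : "not-required";
}
```

For the sample response above, classifyPostgres returns "in-use". A runtime with every store on its default backend but a reachable Postgres would come back as "reachable-unused".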

2. Docker healthchecks

Each long-lived container has a HEALTHCHECK baked in or declared in compose.

Container	Probe	Interval	Failure threshold
`nickapp-backend`	`node healthcheck.js` (HTTP GET `/health`)	30s	3 retries
`nickapp-frontend`	`curl -f http://localhost:8083/`	30s	3 retries
`mongodb`	`mongosh --eval "db.adminCommand('ping')"`	30s	3 retries
`redis`	`redis-cli -a $REDIS_PASSWORD ping`	30s	3 retries

healthcheck.js (backend) is a tiny Node script that does a local HTTP GET to /health and exits 0 / 1.
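
The probe logic is small enough to sketch. This is not the contents of healthcheck.js, only an illustration of the same idea under stated assumptions: the Getter type and both function names are hypothetical, the 3000 ms timeout is arbitrary, and port 5001 is taken from the startup log line rather than from the script itself.

```typescript
// Any 2xx status counts as healthy; anything else, or no response, fails.
function exitCodeFor(status: number | undefined): 0 | 1 {
  return status !== undefined && status >= 200 && status < 300 ? 0 : 1;
}

// Hypothetical transport: resolves with the HTTP status code, rejects on
// connection errors or timeouts. Injected so the sketch is client-agnostic.
type Getter = (url: string, timeoutMs: number) => Promise<number>;

async function runHealthcheck(
  get: Getter,
  url = "http://localhost:5001/health", // assumed port, see lead-in
): Promise<0 | 1> {
  try {
    return exitCodeFor(await get(url, 3000));
  } catch {
    return 1; // connection refused, timeout, DNS failure
  }
}
```

A real script would pass the result to process.exit, which is what Docker reads to mark the container healthy or unhealthy.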

Inspect health:

docker ps --format "table {{.Names}}\t{{.Status}}"

# Detailed
docker inspect --format='{{json .State.Health}}' nickapp-backend | jq

If a container is unhealthy, Watchtower will not roll it (it expects the new container to pass healthcheck). Investigate with docker logs <container>.
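
The JSON printed by docker inspect for .State.Health contains Status, FailingStreak, and a Log array of recent probe runs. The helper below is a hypothetical sketch (summarizeHealth is not part of the repo) that turns that JSON into a one-line summary of the most recent failing probe.

```typescript
// Shape of `docker inspect --format='{{json .State.Health}}'` output.
interface HealthLogEntry {
  Start: string;
  End: string;
  ExitCode: number;
  Output: string;
}

interface DockerHealth {
  Status: string; // "starting" | "healthy" | "unhealthy"
  FailingStreak: number;
  Log: HealthLogEntry[] | null;
}

function summarizeHealth(json: string): string {
  const health = JSON.parse(json) as DockerHealth;
  if (health.Status === "healthy") return "healthy";

  // The last log entry is the most recent probe run.
  const log = health.Log ?? [];
  const last = log.length > 0 ? log[log.length - 1] : undefined;
  const detail = last
    ? `exit ${last.ExitCode}: ${last.Output.trim()}`
    : "no probe output yet";
  return `${health.Status} (failing streak ${health.FailingStreak}, ${detail})`;
}
```

The probe output is usually the fastest hint about why a container is unhealthy, before reading the full docker logs.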

3. Sentry — error tracking

Frontend

@sentry/nextjs ^10.22.0 is wired in via three config files at the repo root:

sentry.client.config.ts — browser SDK (with Session Replay enabled at 10% session / 100% error rate).
sentry.server.config.ts — server-rendered components (no Replay).
sentry.edge.config.ts — edge runtime (not currently used heavily).

Common settings:

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  environment: process.env.NODE_ENV || 'development',
  enabled: process.env.NODE_ENV === 'production',
  ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError', ...],
});

Because enabled is true only when NODE_ENV is production, local and development builds send nothing; only production errors land in the dashboard.

Backend

@sentry/node ^10.22.0 + @sentry/profiling-node ^10.22.0 are initialised first in src/app.ts (before any other import) via src/config/sentry.ts. DSN comes from SENTRY_DSN env var (see Environment Variables#sentry).

What's captured:

Uncaught exceptions in route handlers
Promise rejections inside asyncHandler-wrapped routes
Manually-captured errors via Sentry.captureException(err)
Performance traces (10% sample rate in prod)
Profiling samples via @sentry/profiling-node

Source maps

Frontend uploads source maps to Sentry at build time when SENTRY_AUTH_TOKEN, SENTRY_ORG, and SENTRY_PROJECT are set in the CI env. Without them the build still succeeds, but Sentry stack traces will show minified frames.

Alerts

Configure in the Sentry dashboard (Issues → Alerts) — common alerts:

Any new issue in production → Slack
Error frequency > 50/minute → page on-call
Performance regression on /api/payments/* traces → email

4. Logs

Backend application logs

Routed through src/utils/logger.ts — currently a thin console.log wrapper with emoji prefixes. Output goes to stdout, captured by Docker:

# Live tail
docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend

# Search for a request
docker logs nickapp-backend 2>&1 | grep "POST /api/payments"

# Pre-filter by date
docker logs --since 1h nickapp-backend

Notable log lines to look for:

Prefix	Meaning
`✅ Connected to MongoDB`	DB connection established
`🚀 Server running on port 5001`	App fully started
`🔌 User connected: <id>`	Socket.IO connection
`📥`	Inbound HTTP request log
`💳 Request Network`	Request Network webhook / API call
`🔐 Webhook verification`	Webhook signature check result
`❌ Error`	Manual error log (also captured by Sentry)

Nginx access + error logs

Bind-mounted to ./nginx/logs/ on the host:

tail -f /opt/backend/nginx/logs/access.log
tail -f /opt/backend/nginx/logs/error.log

Rotate these via host logrotate to avoid disk fill.

Frontend logs

Next.js logs go to the container stdout:

docker logs -f nickapp-frontend

Browser-side logs that need attention go through Sentry (above) — src/utils/logger.ts in the frontend forwards via Sentry breadcrumbs.

5. Key metrics to watch

Today these are read manually from logs / Sentry. As Prometheus is added, encode them as alerting rules.

Application

Metric	Where to check	Healthy	Alert
5xx rate	Sentry, Nginx access.log	< 0.5 %	> 2 % over 5 min
`/health` p95 latency	curl + timer	< 100 ms	> 1 s
Login success rate	Sentry custom event	> 95 %	< 90 %
Socket disconnect storm	`🔌 User disconnected` log frequency	< 1/s sustained	> 10/s sustained
OpenAI 429s	Backend log `OpenAI ... 429`	0	any
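
Until a metrics pipeline exists, the 5xx rate can be computed from access.log directly. The sketch below assumes Nginx's default combined log format, which the source does not confirm, and expects the caller to pass only the lines from the window of interest (for example the last 5 minutes). The function names and the "watch" band between the healthy and alert thresholds are hypothetical.

```typescript
// Matches the start of a combined-format line and captures the status code:
// addr - user [time] "request" STATUS bytes ...
const STATUS_RE = /^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) /;

// Fraction of parseable requests that returned a 5xx status.
function fiveXxRate(lines: string[]): number {
  let total = 0;
  let errors = 0;
  for (const line of lines) {
    const match = STATUS_RE.exec(line);
    if (!match) continue; // skip lines that are not in the expected format
    total += 1;
    if ((match[1] ?? "").startsWith("5")) errors += 1;
  }
  return total === 0 ? 0 : errors / total;
}

// Thresholds from the table above: healthy < 0.5 %, alert > 2 %.
function classifyRate(rate: number): "healthy" | "watch" | "alert" {
  if (rate > 0.02) return "alert";
  return rate < 0.005 ? "healthy" : "watch";
}
```

Feeding it the output of a time-filtered grep over access.log gives a quick manual check against the table's thresholds.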

Payments

Metric	Where	Healthy	Alert
Payment success rate	`db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}])`	> 95 % completed of 24h-old payments	< 90 %
Webhook signature failures	log `Webhook verification failed`	0	> 0
Request Network webhook 4xx	nginx access log `/api/payment/request-network/webhook`	0	any real provider delivery returning 4xx
Request Network safety-pending payments	`db.payments.find({"metadata.transactionSafety.status":"pending"})`	explained/short-lived	pending > 10 min without operator note
Request Network API errors (5xx)	log + Sentry	0	> 5/min sustained
Payouts stuck in `pending` > 30 min	`db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now() - 30*60*1000)}})`	empty	non-empty
Missing `transactionHash` after `completed`	the same query that drives `fix-transaction-hashes.js`	empty	non-empty
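
The payment success rate row relies on reading the aggregate output by eye. A minimal sketch of the arithmetic follows, assuming the result rows have the { _id, n } shape produced by the $group stage in the table and that "completed" is the success status. The function names are hypothetical.

```typescript
// One row per status, as returned by the $group stage.
interface StatusCount {
  _id: string;
  n: number;
}

// Share of payments in the "completed" status, or null when there is no data.
function completedShare(groups: StatusCount[]): number | null {
  const total = groups.reduce((sum, g) => sum + g.n, 0);
  if (total === 0) return null;
  const completed = groups.find((g) => g._id === "completed")?.n ?? 0;
  return completed / total;
}

// Alert threshold from the table above: below 90 % completed.
function shouldAlert(share: number | null): boolean {
  return share !== null && share < 0.9;
}
```

Note that the aggregate in the table groups every payment in the collection; to match the "24h-old payments" wording, the input would first need a $match on createdAt.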

MongoDB

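db.serverStatus().connections           // current / available connections; alert if current > 1000
db.serverStatus().opcounters            // cumulative counters since startup; diff two samples for ops/sec
db.serverStatus().wiredTiger.cache      // raw cache counters; derive the hit ratio from pages requested vs. pages read in, aim > 95 %
db.currentOp({ secs_running: { $gte: 5 } })  // operations that have been running for 5 s or longer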
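
serverStatus does not report a cache hit ratio directly; it has to be derived from two counters in wiredTiger.cache. The sketch below shows the arithmetic. The cacheHitRatio name is hypothetical, and because both counters are cumulative since startup, the result is a lifetime average unless two samples are diffed first.

```typescript
// The two wiredTiger.cache counters needed for the ratio.
interface WiredTigerCache {
  "pages requested from the cache": number;
  "pages read into cache": number;
}

// Hit ratio = 1 - (pages that had to be read from disk / pages requested).
// Returns null when no pages have been requested yet.
function cacheHitRatio(cache: WiredTigerCache): number | null {
  const requested = cache["pages requested from the cache"];
  const readIn = cache["pages read into cache"];
  if (requested <= 0) return null;
  return 1 - readIn / requested;
}
```

A value below 0.95 suggests the working set no longer fits in the WiredTiger cache.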

Redis

docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys

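Alert thresholds: rejected_connections > 0, evicted_keys rising while you don't expect cache pressure, p99 latency > 5 ms. Latency is not part of INFO stats and has to be measured separately (for example INFO latencystats on Redis 7.0 or later, or the redis latencyMs value from GET /api/health as a rough single-sample proxy).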
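
The INFO stats output is plain key:value text, so the watched counters are easy to extract. The sketch below parses that text and derives the keyspace hit ratio. The function names and the shape of the returned object are hypothetical, and since evicted_keys is cumulative, spotting a rising trend means comparing two samples.

```typescript
// Parse Redis INFO output ("key:value" lines, "#" section headers) into a map.
function parseInfo(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const idx = line.indexOf(":");
    if (idx > 0) out[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return out;
}

interface RedisSnapshot {
  hitRatio: number | null; // null when there have been no keyspace lookups
  rejectedConnections: number;
  evictedKeys: number;
}

function redisSnapshot(infoText: string): RedisSnapshot {
  const info = parseInfo(infoText);
  // Missing keys are treated as 0.
  const num = (key: string): number => Number(info[key] ?? 0);
  const hits = num("keyspace_hits");
  const misses = num("keyspace_misses");
  return {
    hitRatio: hits + misses === 0 ? null : hits / (hits + misses),
    rejectedConnections: num("rejected_connections"),
    evictedKeys: num("evicted_keys"),
  };
}
```

Taking one snapshot per check and diffing evictedKeys against the previous one covers the "rising" condition in the thresholds above.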

Host

Metric	Tool	Healthy	Alert
Disk usage on `/var/lib/docker`	`df -h`	< 80 %	> 90 %
`/opt/backend/uploads` size	`du -sh`	watch trend	bursty growth (>5 GB/day)
Memory pressure	`free -h`, `docker stats`	< 80 %	swap actively used
Open file descriptors	`ls /proc/<pid>/fd | wc -l` (in use) vs `cat /proc/<pid>/limits` (limit)	well under hard limit	nearing limit

6. Smoke tests after a deploy

Drop these in a runbook for the on-call:

# 1. API health
curl -fsS https://amn.gg/api/health | jq '.status,.version'

# 2. Login
curl -fsS -X POST https://amn.gg/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
  | jq '.success,.data.user.email'

# 3. Frontend HTML loads
curl -fsS https://amn.gg/ -I | head -1   # expect 200

# 4. Socket.IO handshake
curl -fsS "https://amn.gg/socket.io/?EIO=4&transport=polling" -I | head -1

# 5. Containers healthy
docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"

Any non-OK → see Incident Response.

7. Future work

Prometheus + Grafana with Node exporter + Mongo exporter + Redis exporter — for proper time-series.
OpenTelemetry spans from backend → Sentry / Jaeger.
Extend GET /api/health, which already probes MongoDB, Postgres, and Redis but returns 503 only when the status is down, so that a degraded state is also surfaced to monitors.
PagerDuty / OpsGenie wiring from Sentry alerts.
Synthetic checks (Pingdom / UptimeRobot) hitting /health from multiple regions.

For now, Sentry + Docker healthchecks + manual log checks cover the basics. See Incident Response for what to do when something fires.
