Siavash Sameni 7b5dbb2683 docs: sync from backend 1757f1e - postgres cutover stores

2026-06-01 11:54:56 +04:00

Monitoring

What's instrumented today and what to watch. Today's stack is intentionally lean — health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.

1. Health endpoint

Two paths are registered (both are public, rate-limited, not auth-gated):

GET /health — simple ping used by Docker healthchecks. Returns 200 { success, message, timestamp, environment, version }. Does not probe MongoDB or Redis.
GET /api/health — deep health check added in commit 44579d6 (backend v2.6.49). Calls runHealthChecks from backend/src/services/health/healthCheckService.ts. Probes MongoDB, Postgres, Redis, Request Network registry data, and Request Network API reachability. Returns 503 only when report.status === 'down'. As of backend 2.8.9, Postgres is a hard dependency only when at least one *_STORE=postgres flag is enabled; otherwise an unconfigured Postgres check is reported as skipped.

GET /api/health response shape (from healthCheckService):

{
  "status": "ok",
  "version": "2.6.xx",
  "timestamp": "...",
  "checks": {
    "db": { "ok": true, "latencyMs": 4 },
    "postgres": {
      "ok": true,
      "latencyMs": 5,
      "configured": true,
      "required": true,
      "database": "amanat_dev",
      "user": "amanat"
    },
    "redis": { "ok": true, "latencyMs": 1 },
    "rnChainRegistry": { "ok": true, "latencyMs": 0, "chainCount": 7 },
    "rnTokenRegistry": { "ok": true, "latencyMs": 0, "tokenCount": 12 },
    "rnApi": { "ok": true, "latencyMs": 134, "status": 401 }
  }
}

Public URL behind Nginx: https://amn.gg/api/health.
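
A minimal sketch of how a client could type the report and interpret it. The field names come from the sample response above; the `degraded` status value, the 200 returned for every non-down report, and the helper names `httpStatusFor` and `failingChecks` are assumptions for illustration, not the contents of healthCheckService.ts.

```typescript
// Shapes inferred from the sample response; optional fields vary per check.
interface CheckResult {
  ok: boolean;
  latencyMs?: number;
  [extra: string]: unknown; // e.g. chainCount, configured, status
}

interface HealthReport {
  // 'ok' and 'down' appear in this document; 'degraded' is an assumed middle state.
  status: "ok" | "degraded" | "down";
  version: string;
  timestamp: string;
  checks: Record<string, CheckResult>;
}

// The endpoint is described as returning 503 only when the report is 'down'.
// Treating every other status as 200 is an assumption.
function httpStatusFor(report: HealthReport): 200 | 503 {
  return report.status === "down" ? 503 : 200;
}

// Names of checks that reported ok: false, handy for a one-line alert message.
function failingChecks(report: HealthReport): string[] {
  return Object.entries(report.checks)
    .filter(([, check]) => !check.ok)
    .map(([name]) => name);
}
```

A monitor that polls /api/health could page on `httpStatusFor(report) === 503` and attach `failingChecks(report)` to the message.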

2. Docker healthchecks

Each long-lived container has a HEALTHCHECK baked in or declared in compose.

Container	Probe	Interval	Failure threshold
`nickapp-backend`	`node healthcheck.js` (HTTP GET `/health`)	30s	3 retries
`nickapp-frontend`	`curl -f http://localhost:8083/`	30s	3 retries
`mongodb`	`mongosh --eval "db.adminCommand('ping')"`	30s	3 retries
`redis`	`redis-cli -a $REDIS_PASSWORD ping`	30s	3 retries

healthcheck.js (backend) is a tiny Node script that does a local HTTP GET to /health and exits 0 / 1.
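
A probe of this kind could look roughly like the sketch below. It is not the actual healthcheck.js: it assumes Node 18+ (global `fetch`), assumes the app listens on port 5001 as the startup log line suggests, and the names `HEALTH_URL`, `TIMEOUT_MS`, `exitCodeFor`, and `probe` are hypothetical.

```typescript
// Sketch of a container healthcheck probe (assumed URL and timeout).
const HEALTH_URL = "http://127.0.0.1:5001/health";
const TIMEOUT_MS = 5000; // keep below the Docker healthcheck timeout

// Map an HTTP status (or undefined when the request failed) to an exit code.
function exitCodeFor(status: number | undefined): 0 | 1 {
  return status !== undefined && status >= 200 && status < 300 ? 0 : 1;
}

// Perform one GET with a hard timeout and translate the outcome to 0 or 1.
async function probe(url: string = HEALTH_URL): Promise<0 | 1> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), TIMEOUT_MS);
  try {
    const res = await fetch(url, { signal: controller.signal });
    return exitCodeFor(res.status);
  } catch {
    return exitCodeFor(undefined); // connection refused, timeout, DNS failure
  } finally {
    clearTimeout(timer);
  }
}

// In a real script the result would become the process exit code:
// probe().then((code) => process.exit(code));
```

Keeping the status-to-exit-code mapping in its own function makes the decision easy to unit-test without a running server.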

Inspect health:

docker ps --format "table {{.Names}}\t{{.Status}}"

# Detailed
docker inspect --format='{{json .State.Health}}' nickapp-backend | jq

If a container is unhealthy, Watchtower will not roll it (it expects the new container to pass healthcheck). Investigate with docker logs <container>.

3. Sentry — error tracking

Frontend

@sentry/nextjs ^10.22.0 is wired in via three config files at the repo root:

sentry.client.config.ts — browser SDK (with Session Replay enabled at 10% session / 100% error rate).
sentry.server.config.ts — server-rendered components (no Replay).
sentry.edge.config.ts — edge runtime (not currently used heavily).

Common settings:

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  environment: process.env.NODE_ENV || 'development',
  enabled: process.env.NODE_ENV === 'production',
  ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError', ...],
});

Because enabled is true only when NODE_ENV is production, errors from local development never reach the dashboard; only prod errors land there.
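
To reason about which errors `ignoreErrors` drops, the following sketch approximates the matching rule: a string entry matches when it occurs anywhere in the error message, and a RegExp entry matches when it tests true. The function name `isIgnored` is hypothetical and the SDK's real matcher considers more candidate strings than this; treat it as a mental model rather than the SDK's implementation.

```typescript
// Approximation of how ignoreErrors entries are applied to an error message.
type IgnorePattern = string | RegExp;

function isIgnored(message: string, patterns: IgnorePattern[]): boolean {
  return patterns.some((p) =>
    typeof p === "string" ? message.includes(p) : p.test(message),
  );
}

// The two entries visible in the config above; the elided ones are not reproduced.
const visiblePatterns: IgnorePattern[] = [
  "ResizeObserver loop limit exceeded",
  "ChunkLoadError",
];
```

Under this model, `isIgnored("ChunkLoadError: Loading chunk 12 failed", visiblePatterns)` is true, while an unrelated `TypeError` message is not filtered.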

Backend

@sentry/node ^10.22.0 + @sentry/profiling-node ^10.22.0 are initialised first in src/app.ts (before any other import) via src/config/sentry.ts. DSN comes from SENTRY_DSN env var (see Environment Variables#sentry).

What's captured:

Uncaught exceptions in route handlers
Promise rejections inside asyncHandler-wrapped routes
Manually-captured errors via Sentry.captureException(err)
Performance traces (10% sample rate in prod)
Profiling samples via @sentry/profiling-node

Source maps

Frontend uploads source maps to Sentry at build time when SENTRY_AUTH_TOKEN, SENTRY_ORG, and SENTRY_PROJECT are set in the CI env. Without them the build still succeeds, but stack traces in Sentry will show minified frames.

Alerts

Configure in the Sentry dashboard (Issues → Alerts) — common alerts:

Any new issue in production → Slack
Error frequency > 50/minute → page on-call
Performance regression on /api/payments/* traces → email

4. Logs

Backend application logs

Routed through src/utils/logger.ts — currently a thin console.log wrapper with emoji prefixes. Output goes to stdout, captured by Docker:

# Live tail
docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend

# Search for a request
docker logs nickapp-backend 2>&1 | grep "POST /api/payments"

# Pre-filter by date
docker logs --since 1h nickapp-backend

Notable log lines to look for:

Prefix	Meaning
`✅ Connected to MongoDB`	DB connection established
`🚀 Server running on port 5001`	App fully started
`🔌 User connected: <id>`	Socket.IO connection
`📥`	Inbound HTTP request log
`💳 Request Network`	Request Network webhook / API call
`🔐 Webhook verification`	Webhook signature check result
`❌ Error`	Manual error log (also captured by Sentry)
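
When scanning a large log dump, the prefixes in the table above can be turned into a small classifier. This is a sketch only: the table contents are copied from this page, while `LOG_PREFIXES` and `classifyLogLine` are hypothetical names. It uses a substring match rather than a strict prefix match on the assumption that Docker or a timestamp may precede the emoji.

```typescript
// Prefix-to-meaning pairs copied from the table of notable log lines.
const LOG_PREFIXES: ReadonlyArray<[string, string]> = [
  ["✅ Connected to MongoDB", "DB connection established"],
  ["🚀 Server running on port", "App fully started"],
  ["🔌 User connected:", "Socket.IO connection"],
  ["📥", "Inbound HTTP request log"],
  ["💳 Request Network", "Request Network webhook / API call"],
  ["🔐 Webhook verification", "Webhook signature check result"],
  ["❌ Error", "Manual error log (also captured by Sentry)"],
];

// Return the meaning of the first known prefix found in the line, if any.
function classifyLogLine(line: string): string | undefined {
  const hit = LOG_PREFIXES.find(([prefix]) => line.includes(prefix));
  return hit?.[1];
}
```

Lines that match none of the prefixes return `undefined`, which makes it easy to count unclassified output separately.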

Nginx access + error logs

Bind-mounted to ./nginx/logs/ on the host:

tail -f /opt/backend/nginx/logs/access.log
tail -f /opt/backend/nginx/logs/error.log

Rotate these via host logrotate to avoid disk fill.

Frontend logs

Next.js logs go to the container stdout:

docker logs -f nickapp-frontend

Browser-side logs that need attention go through Sentry (above) — src/utils/logger.ts in the frontend forwards via Sentry breadcrumbs.

5. Key metrics to watch

Today these are read manually from logs / Sentry. As Prometheus is added, encode them as alerting rules.

Application

Metric	Where to check	Healthy	Alert
5xx rate	Sentry, Nginx access.log	< 0.5 %	> 2 % over 5 min
`/health` p95 latency	curl + timer	< 100 ms	> 1 s
Login success rate	Sentry custom event	> 95 %	< 90 %
Socket disconnect storm	`🔌 User disconnected` log frequency	< 1/s sustained	> 10/s sustained
OpenAI 429s	Backend log `OpenAI ... 429`	0	any
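
Until a metrics pipeline exists, the 5xx rate can be estimated from access.log lines. The sketch below assumes Nginx's default "combined" log format, where the status code follows the quoted request line, and assumes the caller passes only the lines from the last 5 minutes. The function names and the intermediate "watch" verdict are illustrative; the 0.5 % and 2 % thresholds come from the table above.

```typescript
// Extract the status code from a "combined" format line:
// ... "GET /x HTTP/1.1" 200 512 "-" "agent"
function statusOf(line: string): number | undefined {
  const match = /" (\d{3}) /.exec(line);
  return match ? Number(match[1]) : undefined;
}

// Share of parseable lines whose status is 500 or above (0 when there are none).
function fiveXxRate(lines: string[]): number {
  let total = 0;
  let errors = 0;
  for (const line of lines) {
    const status = statusOf(line);
    if (status === undefined) continue; // skip lines that do not parse
    total += 1;
    if (status >= 500) errors += 1;
  }
  return total === 0 ? 0 : errors / total;
}

type Verdict = "healthy" | "watch" | "alert";

// Thresholds from the table: healthy below 0.5 %, alert above 2 %.
function verdictFor(rate: number): Verdict {
  if (rate > 0.02) return "alert";
  if (rate < 0.005) return "healthy";
  return "watch";
}
```

Feeding it the output of a time-filtered `grep` or `awk` over access.log gives a rough manual equivalent of the alerting rule.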

Payments

Metric	Where	Healthy	Alert
Payment success rate	`db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}])`	> 95 % completed of 24h-old payments	< 90 %
Webhook signature failures	log `Webhook verification failed`	0	> 0
Request Network webhook 4xx	nginx access log `/api/payment/request-network/webhook`	0	any real provider delivery returning 4xx
Request Network safety-pending payments	`db.payments.find({"metadata.transactionSafety.status":"pending"})`	explained/short-lived	pending > 10 min without operator note
Request Network API errors (5xx)	log + Sentry	0	> 5/min sustained
Payouts stuck in `pending` > 30 min	`db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now() - 30*60*1000)}})`	empty	non-empty
Missing `transactionHash` after `completed`	the same query that drives `fix-transaction-hashes.js`	empty	non-empty
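
The status aggregation in the first row returns one `{ _id, n }` document per status. A small helper can turn that output into the success-rate verdict. This is a sketch: the `completed` status name and the 95 % / 90 % thresholds come from the table, while the function names and the "watch" band are assumptions. Restricting the aggregation to 24h-old payments would need an extra `$match` stage that is not shown here.

```typescript
// One row of the $group output: status in _id, count in n.
interface StatusCount {
  _id: string;
  n: number;
}

// Fraction of payments whose status is "completed"; undefined when there is no data.
function completedShare(groups: StatusCount[]): number | undefined {
  const total = groups.reduce((sum, g) => sum + g.n, 0);
  if (total === 0) return undefined;
  const completed = groups.find((g) => g._id === "completed")?.n ?? 0;
  return completed / total;
}

// Thresholds from the table: healthy above 95 %, alert below 90 %.
function paymentVerdict(share: number): "healthy" | "watch" | "alert" {
  if (share < 0.9) return "alert";
  if (share > 0.95) return "healthy";
  return "watch";
}
```

For example, 96 completed and 4 failed payments give a share of 0.96, which falls in the healthy band.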

MongoDB

db.serverStatus().connections           // active connections; alert if >1000
db.serverStatus().opcounters            // ops/sec
db.serverStatus().wiredTiger.cache      // raw cache counters; derive the hit ratio from them, aim > 95 %
db.currentOp({ secs_running: { $gte: 5 } })  // long-running queries
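
The wiredTiger.cache document holds raw counters rather than a ready-made ratio. A common approximation, sketched below, derives the hit ratio from the `pages read into cache` and `pages requested from the cache` counters. The function name is hypothetical, and because the counters are cumulative since server start, sampling twice and using the deltas gives a more current figure.

```typescript
// The two counters used here, as they appear in db.serverStatus().wiredTiger.cache.
interface WiredTigerCacheStats {
  "pages read into cache": number;
  "pages requested from the cache": number;
}

// Approximate hit ratio: the share of page requests that did not require a read
// from disk. Returns undefined when nothing has been requested yet.
function cacheHitRatio(cache: WiredTigerCacheStats): number | undefined {
  const requested = cache["pages requested from the cache"];
  if (requested === 0) return undefined;
  return 1 - cache["pages read into cache"] / requested;
}
```

With 50 pages read into cache for 1000 requested, the ratio is about 0.95, right at the target mentioned above.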

Redis

docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys

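Alert thresholds: rejected_connections > 0, evicted_keys rising while you don't expect cache pressure, p99 command latency > 5 ms. Latency is not part of INFO stats; on Redis 7+ per-command percentiles are reported by INFO latencystats.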
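
The INFO output is plain `key:value` text with `#` section headers, so the watched counters are easy to pull out of a captured dump. The sketch below is illustrative: `parseRedisInfo` and `summarizeStats` are hypothetical names, and the hit rate is derived from keyspace_hits and keyspace_misses.

```typescript
// Parse `redis-cli INFO` text into a flat key/value map, skipping section headers.
function parseRedisInfo(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const idx = line.indexOf(":");
    if (idx > 0) out[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return out;
}

interface RedisStatsSummary {
  rejectedConnections: number;
  evictedKeys: number;
  hitRate: number | undefined; // undefined when there were no lookups yet
}

// Pick out the counters this page recommends watching.
function summarizeStats(info: Record<string, string>): RedisStatsSummary {
  const hits = Number(info["keyspace_hits"] ?? 0);
  const misses = Number(info["keyspace_misses"] ?? 0);
  const lookups = hits + misses;
  return {
    rejectedConnections: Number(info["rejected_connections"] ?? 0),
    evictedKeys: Number(info["evicted_keys"] ?? 0),
    hitRate: lookups === 0 ? undefined : hits / lookups,
  };
}
```

Comparing two summaries taken a few minutes apart shows whether evicted_keys is rising, which a single snapshot cannot tell you.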

Host

Metric	Tool	Healthy	Alert
Disk usage on `/var/lib/docker`	`df -h`	< 80 %	> 90 %
`/opt/backend/uploads` size	`du -sh`	watch trend	bursty growth (>5 GB/day)
Memory pressure	`free -h`, `docker stats`	< 80 %	swap actively used
Open file descriptors	`ls /proc/<pid>/fd | wc -l` (current) against `cat /proc/<pid>/limits` (limit)	well under hard limit	nearing limit

6. Smoke tests after a deploy

Drop these in a runbook for the on-call:

# 1. API health
curl -fsS https://amn.gg/api/health | jq '.status,.version'

# 2. Login
curl -fsS -X POST https://amn.gg/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
  | jq '.success,.data.user.email'

# 3. Frontend HTML loads
curl -fsS https://amn.gg/ -I | head -1   # expect 200

# 4. Socket.IO handshake
curl -fsS "https://amn.gg/socket.io/?EIO=4&transport=polling" -I | head -1

# 5. Containers healthy
docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"

Any non-OK → see Incident Response.

7. Future work

Prometheus + Grafana with Node exporter + Mongo exporter + Redis exporter — for proper time-series.
OpenTelemetry spans from backend → Sentry / Jaeger.
Have GET /api/health, which already probes Mongo + Redis, return 503 when degraded as well; today it returns 503 only when the status is down.
PagerDuty / OpsGenie wiring from Sentry alerts.
Synthetic checks (Pingdom / UptimeRobot) hitting /health from multiple regions.

For now, Sentry + Docker healthchecks + manual log checks cover the basics. See Incident Response for what to do when something fires.
