nick-doc/08 - Operations/Monitoring.md
2026-05-23 20:35:34 +03:30

| title | tags |
| --- | --- |
| Monitoring | operations |

Monitoring

What's instrumented today and what to watch. Today's stack is intentionally lean — health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.


1. Health endpoint

Path: GET /health (backend, port 5001).

Defined in backend/src/app.ts:

app.get("/health", (req, res) => {
  res.json({
    success: true,
    message: "Marketplace Backend API is running",
    timestamp: new Date().toISOString(),
    environment: config.nodeEnv,
    version: packageJson.version,
  });
});

Returns 200 with a JSON envelope as soon as Express is up. Does not currently probe MongoDB or Redis — they are checked via separate Docker healthchecks. If you want deep health, extend the endpoint to ping both data stores and return 503 on failure.
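
A minimal sketch of what such a deep health check could look like, written without framework dependencies so the decision logic is easy to test. The names `deepHealth`, `summarize`, `withTimeout`, and the 2-second timeout are hypothetical and do not come from the codebase; the real probes would be whatever ping calls the Mongo and Redis clients in the backend expose.

```typescript
// Hypothetical helpers: names, shapes, and the default timeout are illustrative only.
type Probe = () => Promise<unknown>;
type CheckState = "up" | "down";

interface DeepHealth {
  status: 200 | 503;
  body: { success: boolean; checks: Record<string, CheckState> };
}

// Reject if the work takes longer than `ms`, so a hung data store cannot hang /health.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timed out after " + ms + " ms")), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Pure step: turn per-dependency results into an HTTP status and a JSON body.
function summarize(results: Record<string, boolean>): DeepHealth {
  const checks: Record<string, CheckState> = {};
  let allUp = true;
  for (const name of Object.keys(results)) {
    checks[name] = results[name] ? "up" : "down";
    if (!results[name]) allUp = false;
  }
  return { status: allUp ? 200 : 503, body: { success: allUp, checks } };
}

// Run every probe in parallel; a throw, rejection, or timeout marks that dependency as down.
async function deepHealth(probes: Record<string, Probe>, timeoutMs = 2000): Promise<DeepHealth> {
  const names = Object.keys(probes);
  const outcomes = await Promise.all(
    names.map((name) =>
      withTimeout(Promise.resolve().then(() => probes[name]()), timeoutMs).then(() => true, () => false),
    ),
  );
  const results: Record<string, boolean> = {};
  names.forEach((name, i) => { results[name] = outcomes[i]; });
  return summarize(results);
}
```

Inside the route handler, the result would be sent with `res.status(result.status).json(result.body)`. Keeping the timeout well below the Docker healthcheck timeout avoids a slow data store turning into a failed container probe for an unrelated reason.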

Public URL behind Nginx: https://amn.gg/api/health.


2. Docker healthchecks

Each long-lived container has a HEALTHCHECK baked in or declared in compose.

| Container | Probe | Interval | Failure threshold |
| --- | --- | --- | --- |
| nickapp-backend | node healthcheck.js (HTTP GET /health) | 30s | 3 retries |
| nickapp-frontend | curl -f http://localhost:8083/ | 30s | 3 retries |
| mongodb | mongosh --eval "db.adminCommand('ping')" | 30s | 3 retries |
| redis | redis-cli -a $REDIS_PASSWORD ping | 30s | 3 retries |

healthcheck.js (backend) is a tiny Node script that does a local HTTP GET to /health and exits 0 / 1.
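
The actual healthcheck.js is not reproduced here. As a rough sketch of the behaviour described above, a script of this kind might look like the following; `exitCodeFor`, `probe`, and the 5-second timeout are assumptions for illustration, and the port is the one given for the health endpoint.

```typescript
import * as http from "node:http";

// Map an HTTP status to a Docker healthcheck exit code: 0 = healthy, 1 = unhealthy.
function exitCodeFor(statusCode: number | undefined): 0 | 1 {
  return statusCode !== undefined && statusCode >= 200 && statusCode < 300 ? 0 : 1;
}

// GET the URL and resolve with an exit code; connection errors and timeouts count as unhealthy.
function probe(url: string, timeoutMs = 5000): Promise<0 | 1> {
  return new Promise((resolve) => {
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      res.resume(); // drain the body so the socket is released
      resolve(exitCodeFor(res.statusCode));
    });
    // The "timeout" event does not abort the request by itself, so destroy it explicitly.
    req.on("timeout", () => req.destroy(new Error("timeout")));
    req.on("error", () => resolve(1));
  });
}

// Entry point, left commented so the helpers can be imported without side effects:
// probe("http://localhost:5001/health").then((code) => process.exit(code));
```

Docker only looks at the exit code, so anything other than a 2xx answer within the timeout is reported as a failed probe.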

Inspect health:

docker ps --format "table {{.Names}}\t{{.Status}}"

# Detailed
docker inspect --format='{{json .State.Health}}' nickapp-backend | jq

If a container is unhealthy, Watchtower will not roll it (it expects the new container to pass healthcheck). Investigate with docker logs <container>.


3. Sentry — error tracking

Frontend

@sentry/nextjs ^10.22.0 is wired in via three config files at the repo root:

  • sentry.client.config.ts — browser SDK (with Session Replay enabled at 10% session / 100% error rate).
  • sentry.server.config.ts — server-rendered components (no Replay).
  • sentry.edge.config.ts — edge runtime (not currently used heavily).

Common settings:

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  environment: process.env.NODE_ENV || 'development',
  enabled: process.env.NODE_ENV === 'production',
  ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError', ...],
});

The SDK is enabled only when NODE_ENV is production, so errors from local development builds are never sent; only production errors land in the dashboard.

Backend

@sentry/node ^10.22.0 + @sentry/profiling-node ^10.22.0 are initialised first in src/app.ts (before any other import) via src/config/sentry.ts. DSN comes from SENTRY_DSN env var (see Environment Variables#sentry).

What's captured:

  • Uncaught exceptions in route handlers
  • Promise rejections inside asyncHandler-wrapped routes
  • Manually-captured errors via Sentry.captureException(err)
  • Performance traces (10% sample rate in prod)
  • Profiling samples via @sentry/profiling-node
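
The asyncHandler wrapper itself is not shown in this document. A common shape for such a wrapper, given here as an assumption rather than the project's implementation, forwards any thrown error or rejected promise to `next(err)` so the Express error middleware (and with it Sentry) sees it. The `Next` and `Handler` types are minimal stand-ins for the Express types so the sketch has no dependencies.

```typescript
// Minimal stand-ins for the Express types (assumption: the real code uses express' own types).
type Next = (err?: unknown) => void;
type Handler<Req, Res> = (req: Req, res: Res, next: Next) => unknown;

// Wrap a handler so a synchronous throw or a rejected promise ends up in next(err).
function asyncHandler<Req, Res>(
  fn: Handler<Req, Res>,
): (req: Req, res: Res, next: Next) => Promise<void> {
  return (req, res, next) =>
    Promise.resolve()
      .then(() => fn(req, res, next))
      .then(
        () => { /* handler finished normally; nothing to forward */ },
        (err: unknown) => { next(err); },
      );
}
```

Routes that are not wrapped this way and do not call `next(err)` themselves can lose their rejections before they reach the error middleware, which is why the list above only mentions asyncHandler-wrapped routes.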

Source maps

The frontend build uploads source maps to Sentry when SENTRY_AUTH_TOKEN, SENTRY_ORG, and SENTRY_PROJECT are set in the CI environment. Without them the build still succeeds, but stack traces in Sentry will show minified frames.

Alerts

Configure in the Sentry dashboard (Issues → Alerts) — common alerts:

  • Any new issue in production → Slack
  • Error frequency > 50/minute → page on-call
  • Performance regression on /api/payments/* traces → email

4. Logs

Backend application logs

Routed through src/utils/logger.ts — currently a thin console.log wrapper with emoji prefixes. Output goes to stdout, captured by Docker:

# Live tail
docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend

# Search for a request
docker logs nickapp-backend 2>&1 | grep "POST /api/payments"

# Limit output to the last hour
docker logs --since 1h nickapp-backend

Notable log lines to look for:

| Prefix | Meaning |
| --- | --- |
| ✅ Connected to MongoDB | DB connection established |
| 🚀 Server running on port 5001 | App fully started |
| 🔌 User connected: <id> | Socket.IO connection |
| 📥 | Inbound HTTP request log |
| 💳 SHKeeper | SHKeeper webhook / API call |
| 🔐 Webhook verification | Webhook signature check result |
| ❌ Error | Manual error log (also captured by Sentry) |

Nginx access + error logs

Bind-mounted to ./nginx/logs/ on the host:

tail -f /opt/backend/nginx/logs/access.log
tail -f /opt/backend/nginx/logs/error.log

Rotate these via host logrotate to avoid disk fill.

Frontend logs

Next.js logs go to the container stdout:

docker logs -f nickapp-frontend

Browser-side logs that need attention go through Sentry (above) — src/utils/logger.ts in the frontend forwards via Sentry breadcrumbs.


5. Key metrics to watch

Today these are read manually from logs / Sentry. As Prometheus is added, encode them as alerting rules.

Application

| Metric | Where to check | Healthy | Alert |
| --- | --- | --- | --- |
| 5xx rate | Sentry, Nginx access.log | < 0.5 % | > 2 % over 5 min |
| /health p95 latency | curl + timer | < 100 ms | > 1 s |
| Login success rate | Sentry custom event | > 95 % | < 90 % |
| Socket disconnect storm | 🔌 User disconnected log frequency | < 1/s sustained | > 10/s sustained |
| OpenAI 429s | Backend log OpenAI ... 429 | 0 | any |
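
Until a metrics pipeline exists, the 5xx rate can be derived from access-log lines. The sketch below assumes Nginx writes its default "combined" log format, where the status code follows the quoted request line; if a custom log_format is configured, the regular expression needs adjusting. The function names and the intermediate "warn" band are illustrative, and the caller is expected to pass only the lines from the window of interest (for example the last 5 minutes).

```typescript
// Extract the status code from one access-log line (assumes the nginx "combined" format).
function statusOf(line: string): number | null {
  const m = /" (\d{3}) /.exec(line);
  return m ? Number(m[1]) : null;
}

// Percentage of parseable lines whose status is 5xx; 0 when nothing could be parsed.
function fiveXxRate(lines: string[]): number {
  let total = 0;
  let errors = 0;
  for (const line of lines) {
    const status = statusOf(line);
    if (status === null) continue; // skip lines that do not look like access-log entries
    total += 1;
    if (status >= 500 && status <= 599) errors += 1;
  }
  return total === 0 ? 0 : (errors * 100) / total;
}

// Thresholds from the Application table: healthy below 0.5 %, alert above 2 %.
// The "warn" band in between is an assumption added for this sketch.
function classify(ratePct: number): "healthy" | "warn" | "alert" {
  if (ratePct > 2) return "alert";
  if (ratePct < 0.5) return "healthy";
  return "warn";
}
```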

Payments

| Metric | Where | Healthy | Alert |
| --- | --- | --- | --- |
| Payment success rate | db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}]) | > 95 % completed of 24h-old payments | < 90 % |
| Webhook signature failures | log Webhook verification failed | 0 | > 0 |
| SHKeeper API errors (5xx) | log + Sentry | 0 | > 5/min sustained |
| Payouts stuck in pending > 30 min | db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now() - 30*60*1000)}}) | empty | non-empty |
| Missing transactionHash after completed | the same query that drives fix-transaction-hashes.js | empty | non-empty |
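
The aggregation in the first row returns one count per status, not a rate. A small sketch of turning that output into the success percentage follows; it assumes the successful status is stored as "completed" (inferred from the table wording) and that the caller has already restricted the aggregation to the payments it wants to judge. `StatusCount` and `successRate` are hypothetical names.

```typescript
// One row of the $group output: { _id: <status>, n: <count> }.
interface StatusCount {
  _id: string;
  n: number;
}

// Percentage of payments in the successful status; null when there are no payments at all.
function successRate(rows: StatusCount[], okStatus = "completed"): number | null {
  let total = 0;
  let ok = 0;
  for (const row of rows) {
    total += row.n;
    if (row._id === okStatus) ok += row.n;
  }
  return total === 0 ? null : (ok * 100) / total;
}
```

Returning null for an empty result keeps "no payments in the window" distinct from "0 % succeeded", which would otherwise trigger the < 90 % alert on a quiet day.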

MongoDB

db.serverStatus().connections           // current / available connections; alert if current > 1000
db.serverStatus().opcounters            // cumulative op counters; diff two samples for ops/sec
db.serverStatus().wiredTiger.cache      // raw cache counters; derive the hit ratio from them, aim > 95 %
db.currentOp({ secs_running: { $gte: 5 } })  // long-running queries
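
The wiredTiger.cache document does not contain a ready-made hit ratio; it has to be derived from two of its counters. A sketch of that arithmetic, assuming the counters have been read as plain numbers (`cacheHitRatio` and the interface name are illustrative):

```typescript
// Subset of db.serverStatus().wiredTiger.cache used for the calculation.
interface WiredTigerCacheCounters {
  "pages requested from the cache": number;
  "pages read into cache": number;
}

// Hit ratio in percent: the share of page requests served without reading from disk.
function cacheHitRatio(cache: WiredTigerCacheCounters): number | null {
  const requested = cache["pages requested from the cache"];
  const readIn = cache["pages read into cache"];
  if (requested <= 0) return null; // no traffic yet, ratio undefined
  return ((requested - readIn) * 100) / requested;
}
```

Both counters are cumulative since the mongod process started, so for a current picture subtract an earlier sample from a later one and feed the differences into the same function.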

Redis

docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys

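Alert thresholds: rejected_connections > 0, evicted_keys rising while you don't expect cache pressure, p99 latency > 5 ms. INFO stats does not report latency, so measure it separately (redis-cli --latency gives a rough min/max/avg view).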
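
The INFO output is plain `key:value` text, so the watched fields can be checked with a few lines of parsing. The sketch below is illustrative (`parseInfo`, `keyspaceHitRate`, and `redisAlerts` are hypothetical names) and only covers checks that a single sample can answer; "evicted_keys rising" needs two samples compared over time.

```typescript
// Parse "key:value" lines from INFO output; section headers start with "#".
function parseInfo(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const i = line.indexOf(":");
    if (i > 0) out[line.slice(0, i)] = line.slice(i + 1);
  }
  return out;
}

// Keyspace hit rate in percent; null when there have been no lookups yet.
function keyspaceHitRate(info: Record<string, string>): number | null {
  const hits = Number(info["keyspace_hits"] || "0");
  const misses = Number(info["keyspace_misses"] || "0");
  return hits + misses === 0 ? null : (hits * 100) / (hits + misses);
}

// Single-sample alerts taken from the thresholds listed above.
function redisAlerts(info: Record<string, string>): string[] {
  const alerts: string[] = [];
  if (Number(info["rejected_connections"] || "0") > 0) alerts.push("rejected_connections > 0");
  return alerts;
}
```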

Host

| Metric | Tool | Healthy | Alert |
| --- | --- | --- | --- |
| Disk usage on /var/lib/docker | df -h | < 80 % | > 90 % |
| /opt/backend/uploads size | du -sh | watch trend | bursty growth (>5 GB/day) |
| Memory pressure | free -h, docker stats | < 80 % | swap actively used |
| Open file descriptors | count entries in /proc/<pid>/fd, compare with cat /proc/<pid>/limits | well under hard limit | nearing limit |

6. Smoke tests after a deploy

Drop these in a runbook for the on-call:

# 1. API health
curl -fsS https://amn.gg/api/health | jq '.success,.version,.environment'

# 2. Login
curl -fsS -X POST https://amn.gg/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
  | jq '.success,.data.user.email'

# 3. Frontend HTML loads
curl -fsS https://amn.gg/ -I | head -1   # expect 200

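# 4. Socket.IO handshake (the Engine.IO handshake must be a GET, so print the status code instead of sending a HEAD with -I)
curl -fsS -o /dev/null -w '%{http_code}\n' "https://amn.gg/socket.io/?EIO=4&transport=polling"   # expect 200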

# 5. Containers healthy
docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"

Any non-OK → see Incident Response.
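
If the smoke tests are later scripted rather than run by hand, the first check can assert on the envelope fields returned by the /health handler in section 1 instead of eyeballing the jq output. A minimal sketch, in which `checkHealth` is a hypothetical helper and "production" is an assumed value for the environment field:

```typescript
// Fields returned by the /health handler shown in section 1.
interface HealthEnvelope {
  success: boolean;
  message: string;
  timestamp: string;
  environment: string;
  version: string;
}

// Return a list of problems; an empty list means the envelope looks right.
function checkHealth(body: unknown, expectedEnv = "production"): string[] {
  if (typeof body !== "object" || body === null) return ["body is not a JSON object"];
  const b = body as Partial<HealthEnvelope>;
  const problems: string[] = [];
  if (b.success !== true) problems.push("success is not true");
  if (b.environment !== expectedEnv) problems.push("unexpected environment: " + String(b.environment));
  if (typeof b.version !== "string" || b.version === "") problems.push("version is missing");
  if (typeof b.timestamp !== "string" || isNaN(Date.parse(b.timestamp))) problems.push("timestamp is not a valid date");
  return problems;
}
```

A script would fetch https://amn.gg/api/health, pass the parsed JSON to `checkHealth`, and exit non-zero when the returned list is not empty.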


7. Future work

  • Prometheus + Grafana with Node exporter + Mongo exporter + Redis exporter — for proper time-series.
  • OpenTelemetry spans from backend → Sentry / Jaeger.
  • Healthcheck endpoint that probes Mongo + Redis and returns 503 when degraded.
  • PagerDuty / OpsGenie wiring from Sentry alerts.
  • Synthetic checks (Pingdom / UptimeRobot) hitting /health from multiple regions.

For now, Sentry + Docker healthchecks + manual log checks cover the basics. See Incident Response for what to do when something fires.