Monitoring
What's instrumented today and what to watch. Today's stack is intentionally lean — health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.
1. Health endpoint
Two paths are registered (both are public, rate-limited, not auth-gated):
- GET /health: simple ping used by Docker healthchecks. Returns 200 { success, message, timestamp, environment, version }. Does not probe MongoDB or Redis.
- GET /api/health: deep health check added in commit 44579d6 (backend v2.6.49). Calls runHealthChecks from backend/src/services/health/healthCheckService.ts. Probes MongoDB, Postgres, Redis, Request Network registry data, and Request Network API reachability. Returns 503 only when report.status === 'down'. As of backend 2.8.11, Postgres is a hard dependency only when at least one *_STORE=postgres flag is enabled; otherwise an unconfigured Postgres check is reported as skipped. The Postgres check also reports active store modes so monitoring can distinguish "PG is reachable" from "this runtime is actually using PG-backed stores".
GET /api/health response shape (from healthCheckService):
{
"status": "ok",
"version": "2.8.11",
"uptimeSec": 662,
"checks": {
"db": { "ok": true, "latencyMs": 4 },
"postgres": {
"ok": true,
"latencyMs": 5,
"configured": true,
"required": true,
"storeModes": {
"auth": "postgres",
"config": "postgres",
"address": "postgres",
"category": "postgres",
"levelConfig": "postgres",
"shopSettings": "postgres",
"review": "postgres"
},
"enabledStores": [
"auth",
"config",
"address",
"category",
"levelConfig",
"shopSettings",
"review"
],
"enabledStoreCount": 7,
"database": "amanat_dev",
"user": "amanat"
},
"redis": { "ok": true, "latencyMs": 1 },
"rnChainRegistry": { "ok": true, "latencyMs": 0, "chainCount": 7 },
"rnTokenRegistry": { "ok": true, "latencyMs": 0, "tokenCount": 12 },
"rnApi": { "ok": true, "latencyMs": 134, "status": 401 }
}
}
Public URL behind Nginx: https://amn.gg/api/health.
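A monitoring script could reduce the response above to a few facts. The following is a minimal sketch, not code from healthCheckService: the interfaces are inferred from the sample response only, and summariseHealth is a hypothetical helper name. How a skipped check reports its ok field is not shown in the sample, so the sketch treats any check with ok: false as failing.

```typescript
// Types inferred from the sample /api/health body; the real service may expose more fields.
interface CheckResult {
  ok: boolean;
  latencyMs?: number;
  enabledStores?: string[]; // present on the postgres check in the sample
}

interface HealthReport {
  status: string; // "ok" in the sample; "down" is what triggers the 503
  version: string;
  uptimeSec: number;
  checks: Record<string, CheckResult>;
}

interface HealthSummary {
  healthy: boolean;
  failingChecks: string[];
  postgresInUse: boolean;
}

// Hypothetical helper: lists failing checks and separates
// "PG is reachable" from "PG-backed stores are actually enabled".
function summariseHealth(report: HealthReport): HealthSummary {
  const failingChecks = Object.keys(report.checks).filter(
    (name) => !report.checks[name]?.ok
  );
  const pg = report.checks["postgres"];
  return {
    healthy: report.status !== "down",
    failingChecks,
    postgresInUse: (pg?.enabledStores?.length ?? 0) > 0,
  };
}
```

Feeding it the parsed body of GET /api/health gives a compact record that can be logged or compared between deploys.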
2. Docker healthchecks
Each long-lived container has a HEALTHCHECK baked in or declared in compose.
| Container | Probe | Interval | Failure threshold |
|---|---|---|---|
| nickapp-backend | node healthcheck.js (HTTP GET /health) | 30s | 3 retries |
| nickapp-frontend | curl -f http://localhost:8083/ | 30s | 3 retries |
| mongodb | mongosh --eval "db.adminCommand('ping')" | 30s | 3 retries |
| redis | redis-cli -a $REDIS_PASSWORD ping | 30s | 3 retries |
healthcheck.js (backend) is a tiny Node script that does a local HTTP GET to /health and exits 0 / 1.
Inspect health:
docker ps --format "table {{.Names}}\t{{.Status}}"
# Detailed
docker inspect --format='{{json .State.Health}}' nickapp-backend | jq
If a container is unhealthy, Watchtower will not roll it (it expects the new container to pass healthcheck). Investigate with docker logs <container>.
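The JSON printed by the docker inspect command above has a small, stable shape (Status, FailingStreak, and a Log of recent probe runs). As a rough sketch, a script could turn it into a one-line summary; describeHealth is a hypothetical helper, and the input is assumed to be the raw output of that command.

```typescript
// Shape of `docker inspect --format='{{json .State.Health}}' <container>`.
interface HealthLogEntry {
  Start: string;
  End: string;
  ExitCode: number;
  Output: string;
}

interface DockerHealth {
  Status: string; // "starting", "healthy" or "unhealthy"
  FailingStreak: number;
  Log: HealthLogEntry[] | null;
}

// Hypothetical helper: one-line summary including the last probe's exit code and output.
function describeHealth(container: string, rawJson: string): string {
  const health = JSON.parse(rawJson) as DockerHealth;
  if (health.Status === "healthy") {
    return `${container}: healthy`;
  }
  const log = health.Log ?? [];
  const last = log.length > 0 ? log[log.length - 1] : undefined;
  const detail = last
    ? `, last probe exit ${last.ExitCode}: ${last.Output.trim()}`
    : "";
  return `${container}: ${health.Status}, failing streak ${health.FailingStreak}${detail}`;
}
```

The last Log entry usually explains why the probe failed, which saves a trip through docker logs for simple cases such as a refused connection.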
3. Sentry — error tracking
Frontend
@sentry/nextjs ^10.22.0 is wired in via three config files at the repo root:
- sentry.client.config.ts: browser SDK (with Session Replay enabled at 10% session / 100% error rate).
- sentry.server.config.ts: server-rendered components (no Replay).
- sentry.edge.config.ts: edge runtime (not currently used heavily).
Common settings:
Sentry.init({
dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
environment: process.env.NODE_ENV || 'development',
enabled: process.env.NODE_ENV === 'production',
ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError', ...],
});
Errors from localhost are filtered out — only prod errors land in the dashboard.
Backend
@sentry/node ^10.22.0 + @sentry/profiling-node ^10.22.0 are initialised first in src/app.ts (before any other import) via src/config/sentry.ts. DSN comes from SENTRY_DSN env var (see Environment Variables#sentry).
What's captured:
- Uncaught exceptions in route handlers
- Promise rejections inside asyncHandler-wrapped routes
- Manually-captured errors via Sentry.captureException(err)
- Performance traces (10% sample rate in prod)
- Profiling samples via @sentry/profiling-node
Source maps
Frontend uploads source maps to Sentry at build time when SENTRY_AUTH_TOKEN, SENTRY_ORG, and SENTRY_PROJECT are set in the CI env. Without them the build still succeeds but Sentry traces will show minified frames.
Alerts
Configure in the Sentry dashboard (Issues → Alerts) — common alerts:
- Any new issue in production → Slack
- Error frequency > 50/minute → page on-call
- Performance regression on /api/payments/* traces → email
4. Logs
Backend application logs
Routed through src/utils/logger.ts — currently a thin console.log wrapper with emoji prefixes. Output goes to stdout, captured by Docker:
# Live tail
docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend
# Search for a request
docker logs nickapp-backend 2>&1 | grep "POST /api/payments"
# Pre-filter by date
docker logs --since 1h nickapp-backend
Notable log lines to look for:
| Prefix | Meaning |
|---|---|
| ✅ Connected to MongoDB | DB connection established |
| 🚀 Server running on port 5001 | App fully started |
| 🔌 User connected: <id> | Socket.IO connection |
| 📥 | Inbound HTTP request log |
| 💳 Request Network | Request Network webhook / API call |
| 🔐 Webhook verification | Webhook signature check result |
| ❌ Error | Manual error log (also captured by Sentry) |
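Because the prefixes are fixed strings, a small script can bucket log lines for a quick count (for example, how many ❌ Error lines appeared in the last hour). This is an illustrative sketch only: the label names are invented here, and it assumes each line starts with the prefix, which will not hold if docker logs is run with timestamps enabled.

```typescript
// Prefixes copied from the table above; the labels on the right are hypothetical.
const LOG_PREFIXES: ReadonlyArray<readonly [string, string]> = [
  ["✅ Connected to MongoDB", "db-connected"],
  ["🚀 Server running on port", "app-started"],
  ["🔌 User connected:", "socket-connected"],
  ["📥", "http-request"],
  ["💳 Request Network", "request-network"],
  ["🔐 Webhook verification", "webhook-verification"],
  ["❌ Error", "error"],
];

// Returns the label of the first matching prefix, or "other".
function classifyLogLine(line: string): string {
  const trimmed = line.trim();
  for (const [prefix, label] of LOG_PREFIXES) {
    if (trimmed.startsWith(prefix)) {
      return label;
    }
  }
  return "other";
}

// Counts lines per label, e.g. over the output of `docker logs --since 1h`.
function countByLabel(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    const label = classifyLogLine(line);
    counts.set(label, (counts.get(label) ?? 0) + 1);
  }
  return counts;
}
```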
Nginx access + error logs
Bind-mounted to ./nginx/logs/ on the host:
tail -f /opt/backend/nginx/logs/access.log
tail -f /opt/backend/nginx/logs/error.log
Rotate these via host logrotate to avoid disk fill.
Frontend logs
Next.js logs go to the container stdout:
docker logs -f nickapp-frontend
Browser-side logs that need attention go through Sentry (above) — src/utils/logger.ts in the frontend forwards via Sentry breadcrumbs.
5. Key metrics to watch
Today these are read manually from logs / Sentry. As Prometheus is added, encode them as alerting rules.
Application
| Metric | Where to check | Healthy | Alert |
|---|---|---|---|
| 5xx rate | Sentry, Nginx access.log | < 0.5 % | > 2 % over 5 min |
| /health p95 latency | curl + timer | < 100 ms | > 1 s |
| Login success rate | Sentry custom event | > 95 % | < 90 % |
| Socket disconnect storm | 🔌 User disconnected log frequency | < 1/s sustained | > 10/s sustained |
| OpenAI 429s | Backend log OpenAI ... 429 | 0 | any |
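Until a metrics pipeline exists, the 5xx rate can be estimated from a slice of the Nginx access log. The sketch below assumes Nginx's default combined log format (the actual log_format in this deployment is not shown here) and expects the caller to pass only the lines from the window of interest, such as the last 5 minutes. The function names and the intermediate "watch" level are illustrative.

```typescript
// Matches the status code in the default "combined" format:
// addr - user [time] "request" status bytes "referer" "user-agent"
const STATUS_RE = /^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) /;

// Fraction of parsed requests that returned 5xx; unparseable lines are skipped.
function fiveXxRate(lines: string[]): number {
  let total = 0;
  let errors = 0;
  for (const line of lines) {
    const match = STATUS_RE.exec(line);
    if (!match) continue;
    const status = match[1] ?? "";
    total += 1;
    if (status.startsWith("5")) errors += 1;
  }
  return total === 0 ? 0 : errors / total;
}

// Thresholds from the table: healthy below 0.5 %, alert above 2 %.
function classifyRate(rate: number): "healthy" | "watch" | "alert" {
  if (rate > 0.02) return "alert";
  if (rate < 0.005) return "healthy";
  return "watch";
}
```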
Payments
| Metric | Where | Healthy | Alert |
|---|---|---|---|
| Payment success rate | db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}]) | > 95 % completed of 24h-old payments | < 90 % |
| Webhook signature failures | log Webhook verification failed | 0 | > 0 |
| Request Network webhook 4xx | nginx access log /api/payment/request-network/webhook | 0 | any real provider delivery returning 4xx |
| Request Network safety-pending payments | db.payments.find({"metadata.transactionSafety.status":"pending"}) | explained/short-lived | pending > 10 min without operator note |
| Request Network API errors (5xx) | log + Sentry | 0 | > 5/min sustained |
| Payouts stuck in pending > 30 min | db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now() - 30*60*1000)}}) | empty | non-empty |
| Missing transactionHash after completed | the same query that drives fix-transaction-hashes.js | empty | non-empty |
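The aggregate in the first row returns one document per status, so the success rate still has to be computed from those counts. A minimal sketch of that arithmetic follows. It assumes the status value for a successful payment is the string "completed" and that the caller has already restricted the aggregate to the intended age window (the query as written counts all payments). The function names and the "watch" level are illustrative.

```typescript
// One row of the $group output: { _id: <status>, n: <count> }.
interface StatusCount {
  _id: string;
  n: number;
}

// Share of payments whose status is "completed"; null when there is no data.
function completedRate(groups: StatusCount[]): number | null {
  const total = groups.reduce((sum, g) => sum + g.n, 0);
  if (total === 0) return null;
  const completed = groups
    .filter((g) => g._id === "completed")
    .reduce((sum, g) => sum + g.n, 0);
  return completed / total;
}

// Thresholds from the table: healthy above 95 %, alert below 90 %.
function paymentLevel(rate: number): "healthy" | "watch" | "alert" {
  if (rate < 0.9) return "alert";
  if (rate > 0.95) return "healthy";
  return "watch";
}
```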
MongoDB
db.serverStatus().connections // active connections; alert if >1000
db.serverStatus().opcounters // ops/sec
db.serverStatus().wiredTiger.cache // cache hit ratio; aim > 95 %
db.currentOp({ secs_running: { $gte: 5 } }) // long-running queries
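The wiredTiger.cache document does not contain a ready-made hit ratio; it is usually derived from two counters, "pages requested from the cache" and "pages read into cache". The sketch below shows that calculation under the assumption that the counters have already been converted to plain numbers (the shell may return them as Long values). cacheHitRatio is a hypothetical helper name.

```typescript
// Derives the cache hit ratio from db.serverStatus().wiredTiger.cache counters.
// Returns null when the counters are missing or nothing has been requested yet.
function cacheHitRatio(
  cache: Record<string, number | undefined>
): number | null {
  const requested = cache["pages requested from the cache"];
  const readIn = cache["pages read into cache"];
  if (requested === undefined || readIn === undefined || requested <= 0) {
    return null;
  }
  // Every page read into the cache corresponds to a request that missed.
  return 1 - readIn / requested;
}
```

A result below 0.95 is the point at which the target above is missed. Both counters are cumulative since startup, so comparing two samples taken a few minutes apart gives a better picture of current behaviour than a single reading.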
Redis
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys
Alert thresholds: rejected_connections > 0, evicted_keys rising while you don't expect cache pressure, latency_ms p99 > 5ms.
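INFO output is plain key:value text with # section headers, so the first two thresholds are easy to check in a script. The following sketch parses two INFO stats samples and reports rejected connections and rising evictions. The helper names are hypothetical, and the latency threshold is left out because INFO stats does not report latency percentiles.

```typescript
// Parses `redis-cli INFO` text into a key -> value map, skipping "# Section" headers.
function parseRedisInfo(text: string): Map<string, string> {
  const fields = new Map<string, string>();
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const idx = line.indexOf(":");
    if (idx <= 0) continue;
    fields.set(line.slice(0, idx), line.slice(idx + 1));
  }
  return fields;
}

// Compares an earlier and a later sample against the thresholds above.
function redisAlerts(
  prev: Map<string, string>,
  curr: Map<string, string>
): string[] {
  const num = (m: Map<string, string>, key: string): number =>
    Number(m.get(key) ?? "0");
  const alerts: string[] = [];
  if (num(curr, "rejected_connections") > 0) {
    alerts.push("rejected_connections > 0");
  }
  if (num(curr, "evicted_keys") > num(prev, "evicted_keys")) {
    alerts.push("evicted_keys rising");
  }
  return alerts;
}
```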
Host
| Metric | Tool | Healthy | Alert |
|---|---|---|---|
| Disk usage on /var/lib/docker | df -h | < 80 % | > 90 % |
| /opt/backend/uploads size | du -sh | watch trend | bursty growth (>5 GB/day) |
| Memory pressure | free -h, docker stats | < 80 % | swap actively used |
| Open file descriptors | cat /proc/<pid>/limits | well under hard limit | nearing limit |
6. Smoke tests after a deploy
Drop these in a runbook for the on-call:
# 1. API health
curl -fsS https://amn.gg/api/health | jq '.status,.version'
# 2. Login
curl -fsS -X POST https://amn.gg/api/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
| jq '.success,.data.user.email'
# 3. Frontend HTML loads
curl -fsS https://amn.gg/ -I | head -1 # expect 200
# 4. Socket.IO handshake
curl -fsS "https://amn.gg/socket.io/?EIO=4&transport=polling" -I | head -1
# 5. Containers healthy
docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"
Any non-OK → see Incident Response.
7. Future work
- Prometheus + Grafana with Node exporter + Mongo exporter + Redis exporter — for proper time-series.
- OpenTelemetry spans from backend → Sentry / Jaeger.
- Healthcheck endpoint that probes Mongo + Redis and returns 503 when degraded.
- PagerDuty / OpsGenie wiring from Sentry alerts.
- Synthetic checks (Pingdom / UptimeRobot) hitting /health from multiple regions.
For now, Sentry + Docker healthchecks + manual log checks cover the basics. See Incident Response for what to do when something fires.