Monitoring
What's instrumented today and what to watch. Today's stack is intentionally lean — health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.
1. Health endpoint
Two paths are registered (both are public, rate-limited, not auth-gated):
- GET /health: simple ping used by Docker healthchecks. Returns 200 { success, message, timestamp, environment, version }. Does not probe MongoDB or Redis.
- GET /api/health: deep health check added in commit 44579d6 (backend v2.6.49). Calls runHealthChecks from backend/src/services/health/healthCheckService.ts. Probes MongoDB, Postgres, Redis, Request Network registry data, and Request Network API reachability. Returns 503 only when report.status === 'down'. As of backend 2.8.9, Postgres is a hard dependency only when at least one *_STORE=postgres flag is enabled; otherwise an unconfigured Postgres check is reported as skipped.
GET /api/health response shape (from healthCheckService):
{
"status": "ok",
"version": "2.6.xx",
"timestamp": "...",
"checks": {
"db": { "ok": true, "latencyMs": 4 },
"postgres": {
"ok": true,
"latencyMs": 5,
"configured": true,
"required": true,
"database": "amanat_dev",
"user": "amanat"
},
"redis": { "ok": true, "latencyMs": 1 },
"rnChainRegistry": { "ok": true, "latencyMs": 0, "chainCount": 7 },
"rnTokenRegistry": { "ok": true, "latencyMs": 0, "tokenCount": 12 },
"rnApi": { "ok": true, "latencyMs": 134, "status": 401 }
}
}
Public URL behind Nginx: https://amn.gg/api/health.
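When scripting against this payload, it can help to type it and pull out the failing checks. The sketch below infers its types from the sample response only; the real definitions live in healthCheckService.ts and may differ. `failingChecks` and the `sample` object are hypothetical names used for illustration.

```typescript
// Shapes inferred from the sample payload; not copied from healthCheckService.ts.
interface CheckResult {
  ok: boolean;
  latencyMs: number;
  [extra: string]: unknown; // e.g. chainCount, configured, status
}

interface HealthReport {
  status: string; // "ok" and "down" appear on this page; other values are not documented here
  version: string;
  timestamp: string;
  checks: Record<string, CheckResult>;
}

// Hypothetical helper: names of the checks that report ok === false.
function failingChecks(report: HealthReport): string[] {
  return Object.entries(report.checks)
    .filter(([, check]) => !check.ok)
    .map(([name]) => name);
}

// Made-up report for illustration only.
const sample: HealthReport = {
  status: "down",
  version: "2.6.xx",
  timestamp: "...",
  checks: {
    db: { ok: true, latencyMs: 4 },
    redis: { ok: false, latencyMs: 1 },
  },
};

console.log(failingChecks(sample));
```

A helper like this is mostly useful in smoke tests or a status script, where the name of the failed dependency matters more than the overall status.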
2. Docker healthchecks
Each long-lived container has a HEALTHCHECK baked in or declared in compose.
| Container | Probe | Interval | Failure threshold |
|---|---|---|---|
| nickapp-backend | node healthcheck.js (HTTP GET /health) | 30s | 3 retries |
| nickapp-frontend | curl -f http://localhost:8083/ | 30s | 3 retries |
| mongodb | mongosh --eval "db.adminCommand('ping')" | 30s | 3 retries |
| redis | redis-cli -a $REDIS_PASSWORD ping | 30s | 3 retries |
healthcheck.js (backend) is a tiny Node script that does a local HTTP GET to /health and exits 0 / 1.
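The actual healthcheck.js is not reproduced on this page. As a rough sketch of the same idea in TypeScript, the probe below maps an HTTP response to a Docker exit code. It assumes Node 18+ (global `fetch`), treats only 2xx as healthy, and uses a 3-second timeout; all three are assumptions, and `exitCodeFor` and `probe` are hypothetical names.

```typescript
// Map an HTTP status to a Docker healthcheck exit code: 0 = healthy, 1 = unhealthy.
// Treating only 2xx as healthy is an assumption of this sketch.
function exitCodeFor(status: number | undefined): 0 | 1 {
  return status !== undefined && status >= 200 && status < 300 ? 0 : 1;
}

// GET the URL and resolve to an exit code.
// Network errors and timeouts count as unhealthy.
async function probe(url: string, timeoutMs = 3000): Promise<0 | 1> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    return exitCodeFor(res.status);
  } catch {
    return 1;
  } finally {
    clearTimeout(timer);
  }
}
```

Used as a container healthcheck, the script would end with something like probe("http://localhost:5001/health").then((code) => process.exit(code)). Port 5001 is taken from the startup log line listed under Logs and may not match the container's actual configuration.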
Inspect health:
docker ps --format "table {{.Names}}\t{{.Status}}"
# Detailed
docker inspect --format='{{json .State.Health}}' nickapp-backend | jq
If a container is unhealthy, Watchtower will not roll it (it expects the new container to pass healthcheck). Investigate with docker logs <container>.
3. Sentry — error tracking
Frontend
@sentry/nextjs ^10.22.0 is wired in via three config files at the repo root:
- sentry.client.config.ts: browser SDK (with Session Replay enabled at 10% session / 100% error rate).
- sentry.server.config.ts: server-rendered components (no Replay).
- sentry.edge.config.ts: edge runtime (not currently used heavily).
Common settings:
Sentry.init({
dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
environment: process.env.NODE_ENV || 'development',
enabled: process.env.NODE_ENV === 'production',
ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError', ...],
});
Errors from localhost are filtered out — only prod errors land in the dashboard.
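To make the NODE_ENV-dependent values concrete, the sketch below mirrors the three expressions from the Sentry.init call as a pure function. `resolveSentrySettings` is a hypothetical helper for illustration and is not part of the repo or of the Sentry SDK.

```typescript
interface ResolvedSentrySettings {
  tracesSampleRate: number;
  environment: string;
  enabled: boolean;
}

// Mirrors the NODE_ENV-dependent expressions in the common settings.
function resolveSentrySettings(nodeEnv: string | undefined): ResolvedSentrySettings {
  return {
    tracesSampleRate: nodeEnv === "production" ? 0.1 : 1.0,
    environment: nodeEnv || "development",
    enabled: nodeEnv === "production",
  };
}

console.log(resolveSentrySettings("production"));
console.log(resolveSentrySettings(undefined));
```

With NODE_ENV unset, the SDK is configured but disabled, which is why local errors never reach the dashboard.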
Backend
@sentry/node ^10.22.0 + @sentry/profiling-node ^10.22.0 are initialised first in src/app.ts (before any other import) via src/config/sentry.ts. DSN comes from SENTRY_DSN env var (see Environment Variables#sentry).
What's captured:
- Uncaught exceptions in route handlers
- Promise rejections inside asyncHandler-wrapped routes
- Manually-captured errors via Sentry.captureException(err)
- Performance traces (10% sample rate in prod)
- Profiling samples via @sentry/profiling-node
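The project's asyncHandler is not shown here. A common shape for such a wrapper, sketched under that caveat, forwards a rejected promise to next(err) so the error pipeline can see it. The `Next` and `Handler` types below are minimal stand-ins for the Express types so the example stays self-contained, and the assumption that errors reach Sentry through next(err) is mine, not the source's.

```typescript
// Minimal stand-ins for Express types so the sketch is self-contained.
type Next = (err?: unknown) => void;
type Handler<Req, Res> = (req: Req, res: Res, next: Next) => Promise<unknown>;

// Wrap an async route handler so a rejected promise (or a synchronous throw)
// is passed to next(err) instead of becoming an unhandled rejection.
function asyncHandler<Req, Res>(handler: Handler<Req, Res>) {
  return (req: Req, res: Res, next: Next): Promise<void> =>
    Promise.resolve()
      .then(() => handler(req, res, next))
      .then(() => undefined)
      .catch((err: unknown) => next(err));
}
```

Routes that are not wrapped this way would need their own try/catch, or a manual Sentry.captureException(err), to be reported.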
Source maps
Frontend uploads source maps to Sentry at build time when SENTRY_AUTH_TOKEN, SENTRY_ORG, and SENTRY_PROJECT are set in the CI env. Without them the build still succeeds but Sentry traces will show minified frames.
Alerts
Configure in the Sentry dashboard (Issues → Alerts) — common alerts:
- Any new issue in production → Slack
- Error frequency > 50/minute → page on-call
- Performance regression on /api/payments/* traces → email
4. Logs
Backend application logs
Routed through src/utils/logger.ts — currently a thin console.log wrapper with emoji prefixes. Output goes to stdout, captured by Docker:
# Live tail
docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend
# Search for a request
docker logs nickapp-backend 2>&1 | grep "POST /api/payments"
# Pre-filter by date
docker logs --since 1h nickapp-backend
Notable log lines to look for:
| Prefix | Meaning |
|---|---|
| ✅ Connected to MongoDB | DB connection established |
| 🚀 Server running on port 5001 | App fully started |
| 🔌 User connected: <id> | Socket.IO connection |
| 📥 | Inbound HTTP request log |
| 💳 Request Network | Request Network webhook / API call |
| 🔐 Webhook verification | Webhook signature check result |
| ❌ Error | Manual error log (also captured by Sentry) |
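For quick triage scripts, the prefixes in the table can be turned into a lookup. The sketch below copies the prefix and meaning pairs from the table. `classifyLogLine` is a hypothetical helper, and it matches with `includes` rather than `startsWith` on the assumption that Docker may prepend a timestamp to each line.

```typescript
// Prefix-to-meaning pairs copied from the table of notable log lines.
const LOG_PREFIXES: Array<[string, string]> = [
  ["✅ Connected to MongoDB", "DB connection established"],
  ["🚀 Server running on port", "App fully started"],
  ["🔌 User connected:", "Socket.IO connection"],
  ["📥", "Inbound HTTP request log"],
  ["💳 Request Network", "Request Network webhook / API call"],
  ["🔐 Webhook verification", "Webhook signature check result"],
  ["❌ Error", "Manual error log (also captured by Sentry)"],
];

// Return the meaning of the first known prefix found in the line, if any.
function classifyLogLine(line: string): string | undefined {
  const match = LOG_PREFIXES.find(([prefix]) => line.includes(prefix));
  return match?.[1];
}

console.log(classifyLogLine("🚀 Server running on port 5001"));
```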
Nginx access + error logs
Bind-mounted to ./nginx/logs/ on the host:
tail -f /opt/backend/nginx/logs/access.log
tail -f /opt/backend/nginx/logs/error.log
Rotate these via host logrotate to avoid disk fill.
Frontend logs
Next.js logs go to the container stdout:
docker logs -f nickapp-frontend
Browser-side logs that need attention go through Sentry (above) — src/utils/logger.ts in the frontend forwards via Sentry breadcrumbs.
5. Key metrics to watch
Today these are read manually from logs / Sentry. As Prometheus is added, encode them as alerting rules.
Application
| Metric | Where to check | Healthy | Alert |
|---|---|---|---|
| 5xx rate | Sentry, Nginx access.log | < 0.5 % | > 2 % over 5 min |
| /health p95 latency | curl + timer | < 100 ms | > 1 s |
| Login success rate | Sentry custom event | > 95 % | < 90 % |
| Socket disconnect storm | 🔌 User disconnected log frequency | < 1/s sustained | > 10/s sustained |
| OpenAI 429s | Backend log OpenAI ... 429 | 0 | any |
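The 5xx rate can be computed from a slice of the Nginx access log. The sketch below assumes Nginx's default "combined" log format, where the status code follows the quoted request line; if a custom log_format is in use, the regex needs adjusting. The function names and the intermediate "watch" band between the healthy and alert thresholds are my own labels, not part of the table.

```typescript
type Verdict = "healthy" | "watch" | "alert";

// Extract the status code that follows the quoted request line, e.g.
// ... "GET /path HTTP/1.1" 200 512 ...
function statusOf(line: string): number | undefined {
  const match = /" (\d{3}) /.exec(line);
  return match ? Number(match[1]) : undefined;
}

// Share of parsed requests that returned a 5xx status (0 when nothing parsed).
function fiveXxRate(lines: string[]): number {
  const statuses = lines
    .map(statusOf)
    .filter((s): s is number => s !== undefined);
  if (statuses.length === 0) return 0;
  return statuses.filter((s) => s >= 500).length / statuses.length;
}

// Thresholds from the table: healthy below 0.5 %, alert above 2 %.
function verdict(rate: number): Verdict {
  if (rate > 0.02) return "alert";
  if (rate < 0.005) return "healthy";
  return "watch";
}
```

The caller is expected to pass only the lines from the window of interest (the table uses 5 minutes), since the function itself does not look at timestamps.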
Payments
| Metric | Where | Healthy | Alert |
|---|---|---|---|
| Payment success rate | db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}]) | > 95 % completed of 24h-old payments | < 90 % |
| Webhook signature failures | log Webhook verification failed | 0 | > 0 |
| Request Network webhook 4xx | nginx access log /api/payment/request-network/webhook | 0 | any real provider delivery returning 4xx |
| Request Network safety-pending payments | db.payments.find({"metadata.transactionSafety.status":"pending"}) | explained/short-lived | pending > 10 min without operator note |
| Request Network API errors (5xx) | log + Sentry | 0 | > 5/min sustained |
| Payouts stuck in pending > 30 min | db.payments.find({type:'payout',status:'pending',createdAt:{$lt:ISODate(30 min ago)}}) | empty | non-empty |
| Missing transactionHash after completed | the same query that drives fix-transaction-hashes.js | empty | non-empty |
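The "30 min ago" in the stuck-payout query is shorthand, so a script needs to compute the cutoff itself. A minimal sketch, using the field names from the query in the table; `stuckPayoutFilter` is a hypothetical helper and the 30-minute default comes from the metric's name.

```typescript
interface StuckPayoutFilter {
  type: "payout";
  status: "pending";
  createdAt: { $lt: Date };
}

// Build the filter for payouts that have been pending longer than `minutes`.
function stuckPayoutFilter(now: Date, minutes = 30): StuckPayoutFilter {
  const cutoff = new Date(now.getTime() - minutes * 60 * 1000);
  return { type: "payout", status: "pending", createdAt: { $lt: cutoff } };
}

console.log(stuckPayoutFilter(new Date()));
```

With the MongoDB Node driver, the result could be passed to db.collection("payments").find(...); in mongosh the equivalent cutoff is new Date(Date.now() - 30 * 60 * 1000).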
MongoDB
db.serverStatus().connections // active connections; alert if >1000
db.serverStatus().opcounters // ops/sec
db.serverStatus().wiredTiger.cache // cache hit ratio; aim > 95 %
db.currentOp({ secs_running: { $gte: 5 } }) // long-running queries
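db.serverStatus().wiredTiger.cache returns raw counters rather than a ratio. One common approximation, sketched below, derives the hit ratio from the "pages read into cache" and "pages requested from the cache" counters. Both are cumulative since the last mongod restart, so this gives a lifetime average rather than a current value. `cacheHitRatio` is a hypothetical helper.

```typescript
// Only the two counters this sketch needs from db.serverStatus().wiredTiger.cache.
interface WiredTigerCacheStats {
  "pages read into cache": number;
  "pages requested from the cache": number;
}

// Approximate hit ratio: share of page requests served without a read from disk.
function cacheHitRatio(cache: WiredTigerCacheStats): number {
  const requested = cache["pages requested from the cache"];
  if (requested === 0) return 1; // no requests yet, nothing has missed
  return 1 - cache["pages read into cache"] / requested;
}

console.log(
  cacheHitRatio({ "pages read into cache": 50, "pages requested from the cache": 1000 }),
);
```

For a current value, sample the counters twice and apply the same formula to the differences.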
Redis
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys
Alert thresholds: rejected_connections > 0, evicted_keys rising while you don't expect cache pressure, latency_ms p99 > 5ms.
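The INFO stats output is plain key:value text, so the counters listed above are easy to extract in a script. A minimal sketch, assuming the raw output of the redis-cli command has already been captured as a string; `parseRedisInfo` and `redisSignals` are hypothetical helpers. Latency is not part of INFO stats and is left out.

```typescript
// Parse "key:value" lines from INFO output; section headers start with "#".
function parseRedisInfo(raw: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const line of raw.split(/\r?\n/)) {
    if (line === "" || line.startsWith("#")) continue;
    const idx = line.indexOf(":");
    if (idx > 0) out[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return out;
}

interface RedisSignals {
  hitRatio: number | undefined; // undefined until at least one lookup happened
  rejectedConnections: number;
  evictedKeys: number;
}

// Pull out the counters this page says to watch.
function redisSignals(raw: string): RedisSignals {
  const info = parseRedisInfo(raw);
  const hits = Number(info["keyspace_hits"] ?? 0);
  const misses = Number(info["keyspace_misses"] ?? 0);
  const total = hits + misses;
  return {
    hitRatio: total > 0 ? hits / total : undefined,
    rejectedConnections: Number(info["rejected_connections"] ?? 0),
    evictedKeys: Number(info["evicted_keys"] ?? 0),
  };
}
```

Because evicted_keys is cumulative, "rising" has to be judged by comparing two samples taken some time apart.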
Host
| Metric | Tool | Healthy | Alert |
|---|---|---|---|
| Disk usage on /var/lib/docker | df -h | < 80 % | > 90 % |
| /opt/backend/uploads size | du -sh | watch trend | bursty growth (>5 GB/day) |
| Memory pressure | free -h, docker stats | < 80 % | swap actively used |
| Open file descriptors | cat /proc/<pid>/limits | well under hard limit | nearing limit |
6. Smoke tests after a deploy
Drop these in a runbook for the on-call:
# 1. API health
curl -fsS https://amn.gg/api/health | jq '.status,.version'
# 2. Login
curl -fsS -X POST https://amn.gg/api/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
| jq '.success,.data.user.email'
# 3. Frontend HTML loads
curl -fsS https://amn.gg/ -I | head -1 # expect 200
# 4. Socket.IO handshake
curl -fsS "https://amn.gg/socket.io/?EIO=4&transport=polling" -I | head -1
# 5. Containers healthy
docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"
Any non-OK → see Incident Response.
7. Future work
- Prometheus + Grafana with Node exporter + Mongo exporter + Redis exporter — for proper time-series.
- OpenTelemetry spans from backend → Sentry / Jaeger.
- Healthcheck endpoint that probes Mongo + Redis and returns 503 when degraded (GET /api/health already probes both, but returns 503 only when the status is down).
- PagerDuty / OpsGenie wiring from Sentry alerts.
- Synthetic checks (Pingdom / UptimeRobot) hitting /health from multiple regions.
For now, Sentry + Docker healthchecks + manual log checks cover the basics. See Incident Response for what to do when something fires.