---
title: Monitoring
tags: [operations]
---

# Monitoring

What's instrumented today and what to watch. Today's stack is intentionally lean: health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.

---

## 1. Health endpoint

Path: `GET /health` (backend, port `5001`).

Defined in `backend/src/app.ts`:

```ts
app.get("/health", (req, res) => {
  res.json({
    success: true,
    message: "Marketplace Backend API is running",
    timestamp: new Date().toISOString(),
    environment: config.nodeEnv,
    version: packageJson.version,
  });
});
```

Returns `200` with a JSON envelope as soon as Express is up. It does **not** currently probe MongoDB or Redis; those are checked via separate Docker healthchecks. If you want deep health, extend the endpoint to ping both data stores and return `503` on failure.

Public URL behind Nginx: `https://amn.gg/api/health`.

---

## 2. Docker healthchecks

Each long-lived container has a `HEALTHCHECK` baked in or declared in compose.

| Container | Probe | Interval | Failure threshold |
|-----------|-------|----------|-------------------|
| `nickapp-backend` | `node healthcheck.js` (HTTP GET `/health`) | 30s | 3 retries |
| `nickapp-frontend` | `curl -f http://localhost:8083/` | 30s | 3 retries |
| `mongodb` | `mongosh --eval "db.adminCommand('ping')"` | 30s | 3 retries |
| `redis` | `redis-cli -a $REDIS_PASSWORD ping` | 30s | 3 retries |

`healthcheck.js` (backend) is a tiny Node script that does a local HTTP GET to `/health` and exits with 0 on success or 1 on failure.

Inspect health:

```bash
docker ps --format "table {{.Names}}\t{{.Status}}"

# Detailed
docker inspect --format='{{json .State.Health}}' nickapp-backend | jq
```

If a container is `unhealthy`, Watchtower will **not** roll it (it expects the new container to pass healthcheck). Investigate with `docker logs <container>`.

---

## 3. Sentry: error tracking

### Frontend

`@sentry/nextjs ^10.22.0` is wired in via three config files at the repo root:

- `sentry.client.config.ts`: browser SDK (with Session Replay enabled at 10% session / 100% error rate).
- `sentry.server.config.ts`: server-rendered components (no Replay).
- `sentry.edge.config.ts`: edge runtime (not currently used heavily).

Common settings:

```ts
Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  environment: process.env.NODE_ENV || 'development',
  enabled: process.env.NODE_ENV === 'production',
  ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError', /* ... */],
});
```

Errors from `localhost` are filtered out, so only production errors land in the dashboard.

### Backend

`@sentry/node ^10.22.0` + `@sentry/profiling-node ^10.22.0` are initialised **first** in `src/app.ts` (before any other import) via `src/config/sentry.ts`. The DSN comes from the `SENTRY_DSN` env var (see [[Environment Variables#sentry]]).

What's captured:

- Uncaught exceptions in route handlers
- Promise rejections inside `asyncHandler`-wrapped routes
- Manually-captured errors via `Sentry.captureException(err)`
- Performance traces (10% sample rate in prod)
- Profiling samples via `@sentry/profiling-node`

### Source maps

The frontend build uploads source maps to Sentry when `SENTRY_AUTH_TOKEN`, `SENTRY_ORG`, and `SENTRY_PROJECT` are set in the CI env. Without them the build still succeeds, but Sentry stack traces will show minified frames.

### Alerts

Configure in the Sentry dashboard (Issues → Alerts). Common alerts:

- Any new issue in production → Slack
- Error frequency > 50/minute → page on-call
- Performance regression on `/api/payments/*` traces → email

---

## 4. Logs

### Backend application logs

Routed through `src/utils/logger.ts`, currently a thin `console.log` wrapper with emoji prefixes. Output goes to stdout, captured by Docker:

```bash
# Live tail
docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend

# Search for a request
docker logs nickapp-backend 2>&1 | grep "POST /api/payments"

# Limit to the last hour
docker logs --since 1h nickapp-backend
```

Notable log lines to look for:

| Prefix | Meaning |
|--------|---------|
| `✅ Connected to MongoDB` | DB connection established |
| `🚀 Server running on port 5001` | App fully started |
| `🔌 User connected: ` | Socket.IO connection |
| `📥` | Inbound HTTP request log |
| `💳 SHKeeper` | SHKeeper webhook / API call |
| `🔐 Webhook verification` | Webhook signature check result |
| `❌ Error` | Manual error log (also captured by Sentry) |

### Nginx access + error logs

Bind-mounted to `./nginx/logs/` on the host:

```bash
tail -f /opt/backend/nginx/logs/access.log
tail -f /opt/backend/nginx/logs/error.log
```

Rotate these via host `logrotate` so they do not fill the disk.

### Frontend logs

Next.js logs go to the container stdout:

```bash
docker logs -f nickapp-frontend
```

Browser-side logs that need attention go through Sentry (above): `src/utils/logger.ts` in the frontend forwards them via Sentry breadcrumbs.

---

## 5. Key metrics to watch

Today these are read manually from logs / Sentry. As Prometheus is added, encode them as alerting rules.

### Application

| Metric | Where to check | Healthy | Alert |
|--------|----------------|---------|-------|
| 5xx rate | Sentry, Nginx access.log | < 0.5 % | > 2 % over 5 min |
| `/health` p95 latency | curl + timer | < 100 ms | > 1 s |
| Login success rate | Sentry custom event | > 95 % | < 90 % |
| Socket disconnect storm | `🔌 User disconnected` log frequency | < 1/s sustained | > 10/s sustained |
| OpenAI 429s | Backend log `OpenAI ... 429` | 0 | any |

### Payments

| Metric | Where to check | Healthy | Alert |
|--------|----------------|---------|-------|
| Payment success rate | `db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}])` | > 95 % completed of 24h-old payments | < 90 % |
| Webhook signature failures | log `Webhook verification failed` | 0 | > 0 |
| Request Network webhook 4xx | nginx access log `/api/payment/request-network/webhook` | 0 | any real provider delivery returning 4xx |
| Request Network safety-pending payments | `db.payments.find({"metadata.transactionSafety.status":"pending"})` | explained/short-lived | pending > 10 min without operator note |
| SHKeeper API errors (5xx) | log + Sentry | 0 | > 5/min sustained |
| Payouts stuck in `pending` > 30 min | `db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now()-30*60*1000)}})` | empty | non-empty |
| Missing `transactionHash` after `completed` | the same query that drives `fix-transaction-hashes.js` | empty | non-empty |

### MongoDB

```js
db.serverStatus().connections        // active connections; alert if >1000
db.serverStatus().opcounters         // ops/sec
db.serverStatus().wiredTiger.cache   // cache hit ratio; aim > 95 %
db.currentOp({ secs_running: { $gte: 5 } })  // long-running queries
```

### Redis

```bash
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys
```

Alert thresholds: `rejected_connections > 0`, `evicted_keys` rising while you don't expect cache pressure, `latency_ms` p99 > 5ms.

### Host

| Metric | Tool | Healthy | Alert |
|--------|------|---------|-------|
| Disk usage on `/var/lib/docker` | `df -h` | < 80 % | > 90 % |
| `/opt/backend/uploads` size | `du -sh` | watch trend | bursty growth (>5 GB/day) |
| Memory pressure | `free -h`, `docker stats` | < 80 % | swap actively used |
| Open file descriptors | `cat /proc/<pid>/limits` | well under hard limit | nearing limit |

---

## 6. Smoke tests after a deploy

Drop these in a runbook for the on-call:

```bash
# 1. API health
curl -fsS https://amn.gg/api/health | jq '.success,.version,.environment'

# 2. Login
curl -fsS -X POST https://amn.gg/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@marketplace.com","password":""}' \
  | jq '.success,.data.user.email'

# 3. Frontend HTML loads
curl -fsS https://amn.gg/ -I | head -1   # expect 200

# 4. Socket.IO handshake
curl -fsS "https://amn.gg/socket.io/?EIO=4&transport=polling" -I | head -1

# 5. Containers healthy
docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"
```

Any non-OK → see [[Incident Response]].

---

## 7. Future work

- **Prometheus + Grafana** with Node exporter + Mongo exporter + Redis exporter, for proper time-series.
- **OpenTelemetry** spans from backend → Sentry / Jaeger.
- **Healthcheck endpoint** that probes Mongo + Redis and returns `503` when degraded.
- **PagerDuty / OpsGenie** wiring from Sentry alerts.
- **Synthetic checks** (Pingdom / UptimeRobot) hitting `/health` from multiple regions.

For now, Sentry + Docker healthchecks + manual log checks cover the basics. See [[Incident Response]] for what to do when something fires.
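
As a starting point for the deep healthcheck listed under Future work, here is a minimal sketch of the aggregation logic. It is not code from the repository: `DependencyProbe`, `runProbe`, `summarizeHealth`, and `deepHealth` are hypothetical names, the 2-second timeout is an arbitrary assumption, and the probes are injected as plain async functions so the sketch does not assume a particular MongoDB or Redis client.

```typescript
// Hypothetical sketch: none of these names exist in the codebase.

// A probe is any async call that rejects when the dependency is unhealthy.
type DependencyProbe = () => Promise<unknown>;

interface ProbeResult {
  name: string;
  ok: boolean;
  latencyMs: number;
  error?: string;
}

interface HealthSummary {
  statusCode: 200 | 503;
  body: { success: boolean; timestamp: string; dependencies: ProbeResult[] };
}

// Run one probe, giving up after timeoutMs so a hung data store cannot hang the endpoint.
async function runProbe(
  name: string,
  probe: DependencyProbe,
  timeoutMs = 2000, // assumed default, tune to taste
): Promise<ProbeResult> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    await Promise.race([probe(), timeout]);
    return { name, ok: true, latencyMs: Date.now() - started };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { name, ok: false, latencyMs: Date.now() - started, error: message };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Pure aggregation: 200 only when every dependency answered, otherwise 503.
function summarizeHealth(results: ProbeResult[], now: Date = new Date()): HealthSummary {
  const healthy = results.every((r) => r.ok);
  return {
    statusCode: healthy ? 200 : 503,
    body: { success: healthy, timestamp: now.toISOString(), dependencies: results },
  };
}

// Run all probes in parallel and fold the results into one response.
async function deepHealth(
  probes: Record<string, DependencyProbe>,
  timeoutMs = 2000,
): Promise<HealthSummary> {
  const results = await Promise.all(
    Object.entries(probes).map(([name, probe]) => runProbe(name, probe, timeoutMs)),
  );
  return summarizeHealth(results);
}

// Possible wiring inside an Express handler (client objects are assumptions):
//   const { statusCode, body } = await deepHealth({
//     mongodb: () => mongoClientPing(), // e.g. an admin "ping" command
//     redis: () => redisClientPing(),   // e.g. a PING call on the Redis client
//   });
//   res.status(statusCode).json(body);
```

Keeping `summarizeHealth` pure makes the 200/503 decision easy to unit-test without live data stores. If this is adopted, it is probably safer to expose it on a separate path from the existing `/health`, so that a Redis or MongoDB blip does not flip the backend container's Docker healthcheck to `unhealthy`.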