nick-doc/08 - Operations/Monitoring.md
Siavash Sameni a5d71bcc05 docs: sync documentation with latest codebase state
- Update Activity Log with 108 missing commits (48 backend + 60 frontend)
- Update version references: backend v2.8.79, frontend v2.8.94
- Update migration count: 18 migrations (0000-0017)
- Update Telegram Mini App Flow to v2.8.94
- Update Payment Flow - Scanner to 2026-06-05
- Update all architectural and database references

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-06-05 07:34:49 +04:00


---
title: Monitoring
tags: [operations]
---
# Monitoring
What's instrumented today and what to watch. Today's stack is intentionally lean — health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.
---
## 1. Health endpoint
Two paths are registered (both are public, rate-limited, not auth-gated):
- `GET /health` — simple ping used by Docker healthchecks. Returns `200 { success, message, timestamp, environment, version }`. Does **not** probe MongoDB or Redis.
- `GET /api/health` — deep health check added in commit `44579d6` (backend v2.6.49). Calls `runHealthChecks` from `backend/src/services/health/healthCheckService.ts`. Probes MongoDB, Postgres, Redis, Request Network registry data, and Request Network API reachability. Returns `503` only when `report.status === 'down'`. As of backend `2.8.79`, Postgres is a hard dependency only when at least one `*_STORE=postgres` flag is enabled; otherwise an unconfigured Postgres check is reported as skipped. The Postgres check also reports active store modes so monitoring can distinguish "PG is reachable" from "this runtime is actually using PG-backed stores". As of deployment `38cb75b`, dev Gatus requires all seven PG-capable store modes to be `postgres` and `enabledStoreCount >= 7`.
`GET /api/health` response shape (from `healthCheckService`):
```json
{
  "status": "ok",
  "version": "2.8.79",
  "uptimeSec": 662,
  "checks": {
    "db": { "ok": true, "latencyMs": 4 },
    "postgres": {
      "ok": true,
      "latencyMs": 5,
      "configured": true,
      "required": true,
      "storeModes": {
        "auth": "postgres",
        "config": "postgres",
        "address": "postgres",
        "category": "postgres",
        "levelConfig": "postgres",
        "shopSettings": "postgres",
        "review": "postgres"
      },
      "enabledStores": [
        "auth",
        "config",
        "address",
        "category",
        "levelConfig",
        "shopSettings",
        "review"
      ],
      "enabledStoreCount": 7,
      "database": "amanat_dev",
      "user": "amanat"
    },
    "redis": { "ok": true, "latencyMs": 1 },
    "rnChainRegistry": { "ok": true, "latencyMs": 0, "chainCount": 7 },
    "rnTokenRegistry": { "ok": true, "latencyMs": 0, "tokenCount": 12 },
    "rnApi": { "ok": true, "latencyMs": 134, "status": 401 }
  }
}
```
Public URL behind Nginx: `https://amn.gg/api/health`.
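
The dev Gatus rule described above (all PG-capable store modes set to `postgres` and `enabledStoreCount >= 7`) can be expressed as a small check over the report. This is a minimal sketch, not the Gatus configuration itself: the types cover only the fields shown in the sample response, and `checkPostgresStores` is a hypothetical helper name.

```typescript
// Minimal types for the parts of the /api/health report used below.
// Field names follow the sample response; everything else is omitted.
interface PostgresCheck {
  ok: boolean;
  storeModes?: Record<string, string>;
  enabledStoreCount?: number;
}

interface HealthReport {
  status: string;
  checks: { postgres?: PostgresCheck };
}

// Hypothetical helper: returns a list of problems, empty when the report
// satisfies the "all stores on postgres" rule.
function checkPostgresStores(report: HealthReport, expectedStores = 7): string[] {
  const problems: string[] = [];
  const pg = report.checks.postgres;
  if (!pg || !pg.ok) {
    problems.push('postgres check missing or failing');
    return problems;
  }
  const modes = pg.storeModes ?? {};
  for (const store of Object.keys(modes)) {
    const mode = modes[store];
    if (mode !== 'postgres') {
      problems.push(`store "${store}" is in mode "${mode}"`);
    }
  }
  const count = pg.enabledStoreCount ?? 0;
  if (count < expectedStores) {
    problems.push(`enabledStoreCount ${count} < ${expectedStores}`);
  }
  return problems;
}
```

Feeding it the parsed body of `GET /api/health` returns an empty array for the sample response above; any non-empty result names the store or count that failed the rule.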
---
## 2. Docker healthchecks
Each long-lived container has a `HEALTHCHECK` baked in or declared in compose.
| Container | Probe | Interval | Failure threshold |
|-----------|-------|----------|-------------------|
| `nickapp-backend` | `node healthcheck.js` (HTTP GET `/health`) | 30s | 3 retries |
| `nickapp-frontend` | `curl -f http://localhost:8083/` | 30s | 3 retries |
| `mongodb` | `mongosh --eval "db.adminCommand('ping')"` | 30s | 3 retries |
| `redis` | `redis-cli -a $REDIS_PASSWORD ping` | 30s | 3 retries |
`healthcheck.js` (backend) is a tiny Node script that does a local HTTP GET to `/health` and exits 0 / 1.
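
The real `healthcheck.js` is not reproduced here, so the following is only a sketch of what such a probe could look like, written in TypeScript. The port `5001` is an assumption taken from the startup log line; `probe` and `exitCodeForStatus` are hypothetical names.

```typescript
import * as http from 'node:http';

// Assumption: the backend listens on port 5001 inside the container.
// The real script may read the port from an environment variable instead.
const HEALTH_URL = 'http://127.0.0.1:5001/health';

// Map an HTTP status code to a process exit code: 0 = healthy, 1 = unhealthy.
function exitCodeForStatus(statusCode: number | undefined): number {
  return statusCode === 200 ? 0 : 1;
}

// Perform one GET and resolve with the exit code Docker should see.
// Network errors and timeouts both resolve to 1.
function probe(url: string = HEALTH_URL, timeoutMs = 5000): Promise<number> {
  return new Promise((resolve) => {
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body; only the status matters
      resolve(exitCodeForStatus(res.statusCode));
    });
    // The 'timeout' event does not abort the request by itself.
    req.on('timeout', () => req.destroy(new Error('timeout')));
    req.on('error', () => resolve(1));
  });
}
```

As an entry point, `probe().then((code) => process.exit(code))` would hand the result to Docker, which counts a non-zero exit as one failed retry.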
Inspect health:
```bash
docker ps --format "table {{.Names}}\t{{.Status}}"
# Detailed
docker inspect --format='{{json .State.Health}}' nickapp-backend | jq
```
If a container is `unhealthy`, Watchtower will **not** roll it (it expects the new container to pass healthcheck). Investigate with `docker logs <container>`.
---
## 3. Sentry — error tracking
### Frontend
`@sentry/nextjs ^10.22.0` is wired in via three config files at the repo root:
- `sentry.client.config.ts` — browser SDK (with Session Replay enabled at 10% session / 100% error rate).
- `sentry.server.config.ts` — server-rendered components (no Replay).
- `sentry.edge.config.ts` — edge runtime (not currently used heavily).
Common settings:
```ts
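import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  // Sample 10% of transactions in production, everything elsewhere.
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  environment: process.env.NODE_ENV || 'development',
  // Only send events from production builds.
  enabled: process.env.NODE_ENV === 'production',
  ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError' /* ... */],
});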
```
Errors from `localhost` are filtered out — only prod errors land in the dashboard.
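
How the `localhost` filtering is implemented is not shown in the settings above. One common way is a `beforeSend` callback, which Sentry calls with each event and which drops the event when it returns `null`. The sketch below assumes that approach; `EventLike` is a minimal stand-in for the Sentry event type and `dropLocalhostEvents` is a hypothetical name.

```typescript
// Minimal stand-in for the Sentry event shape; only the field used here.
interface EventLike {
  request?: { url?: string };
}

// Matches http(s) URLs whose host is localhost or 127.0.0.1, with optional port.
const LOCAL_URL = /^https?:\/\/(localhost|127\.0\.0\.1)(:\d+)?(\/|$)/i;

// Returns null to drop the event, or the event unchanged to send it.
// This mirrors the contract of Sentry's `beforeSend` callback.
function dropLocalhostEvents<T extends EventLike>(event: T): T | null {
  const url = event.request?.url;
  if (url !== undefined && LOCAL_URL.test(url)) {
    return null;
  }
  return event;
}
```

It would be wired in as `beforeSend: (event) => dropLocalhostEvents(event)` inside `Sentry.init`. Events without a request URL are kept, on the assumption that losing a real error is worse than an occasional local one.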
### Backend
`@sentry/node ^10.22.0` + `@sentry/profiling-node ^10.22.0` are initialised **first** in `src/app.ts` (before any other import) via `src/config/sentry.ts`. DSN comes from `SENTRY_DSN` env var (see [[Environment Variables#sentry]]).
What's captured:
- Uncaught exceptions in route handlers
- Promise rejections inside `asyncHandler`-wrapped routes
- Manually-captured errors via `Sentry.captureException(err)`
- Performance traces (10% sample rate in prod)
- Profiling samples via `@sentry/profiling-node`
### Source maps
Frontend uploads source maps to Sentry at build time when `SENTRY_AUTH_TOKEN`, `SENTRY_ORG`, and `SENTRY_PROJECT` are set in the CI env. Without them the build still succeeds, but stack traces in Sentry will show minified frames.
### Alerts
Configure in the Sentry dashboard (Issues → Alerts) — common alerts:
- Any new issue in production → Slack
- Error frequency > 50/minute → page on-call
- Performance regression on `/api/payments/*` traces → email
---
## 4. Logs
### Backend application logs
Routed through `src/utils/logger.ts` — currently a thin `console.log` wrapper with emoji prefixes. Output goes to stdout, captured by Docker:
```bash
# Live tail
docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend
# Search for a request
docker logs nickapp-backend 2>&1 | grep "POST /api/payments"
# Limit to the last hour
docker logs --since 1h nickapp-backend
```
Notable log lines to look for:
| Prefix | Meaning |
|--------|---------|
| `✅ Connected to MongoDB` | DB connection established |
| `🚀 Server running on port 5001` | App fully started |
| `🔌 User connected: <id>` | Socket.IO connection |
| `📥` | Inbound HTTP request log |
| `💳 Request Network` | Request Network webhook / API call |
| `🔐 Webhook verification` | Webhook signature check result |
| `❌ Error` | Manual error log (also captured by Sentry) |
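
The prefixes in the table can also be counted programmatically, for example when scanning a saved `docker logs` dump. The sketch below is illustrative only: the `LogKind` labels and both function names are hypothetical, and matching uses `includes` on the assumption that a timestamp or other text may precede the emoji.

```typescript
type LogKind =
  | 'db-connected'
  | 'server-started'
  | 'socket'
  | 'http-request'
  | 'request-network'
  | 'webhook-verification'
  | 'error'
  | 'other';

// Marker text from the table above, mapped to an illustrative label.
const MARKERS: Array<[string, LogKind]> = [
  ['✅ Connected to MongoDB', 'db-connected'],
  ['🚀 Server running on port', 'server-started'],
  ['🔌', 'socket'],
  ['📥', 'http-request'],
  ['💳 Request Network', 'request-network'],
  ['🔐 Webhook verification', 'webhook-verification'],
  ['❌', 'error'],
];

// Return the label of the first marker found in the line, or 'other'.
function classifyLogLine(line: string): LogKind {
  for (const [marker, kind] of MARKERS) {
    if (line.includes(marker)) return kind;
  }
  return 'other';
}

// Count how many lines fall under each label.
function countByKind(lines: string[]): Map<LogKind, number> {
  const counts = new Map<LogKind, number>();
  for (const line of lines) {
    const kind = classifyLogLine(line);
    counts.set(kind, (counts.get(kind) ?? 0) + 1);
  }
  return counts;
}
```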
### Nginx access + error logs
Bind-mounted to `./nginx/logs/` on the host:
```bash
tail -f /opt/backend/nginx/logs/access.log
tail -f /opt/backend/nginx/logs/error.log
```
Rotate these via host `logrotate` to avoid disk fill.
### Frontend logs
Next.js logs go to the container stdout:
```bash
docker logs -f nickapp-frontend
```
Browser-side logs that need attention go through Sentry (above) — `src/utils/logger.ts` in the frontend forwards via Sentry breadcrumbs.
---
## 5. Key metrics to watch
Today these are read manually from logs / Sentry. Once Prometheus is added, encode them as alerting rules.
### Application
| Metric | Where to check | Healthy | Alert |
|--------|---------------|---------|-------|
| 5xx rate | Sentry, Nginx access.log | < 0.5 % | > 2 % over 5 min |
| `/health` p95 latency | curl + timer | < 100 ms | > 1 s |
| Login success rate | Sentry custom event | > 95 % | < 90 % |
| Socket disconnect storm | `🔌 User disconnected` log frequency | < 1/s sustained | > 10/s sustained |
| OpenAI 429s | Backend log `OpenAI ... 429` | 0 | any |
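
As an illustration of the 5xx-rate row, the share of server errors can be computed from a slice of `access.log`. This is a sketch under one assumption: Nginx writes the default `combined` log format. A custom `log_format` would need a different pattern. `serverErrorRate` and `shouldAlert` are hypothetical names, and selecting the 5-minute window is left to the caller.

```typescript
// Captures the status code of a line in Nginx's default "combined" format:
// addr - user [time] "request" status bytes "referer" "user-agent"
const COMBINED_STATUS = /^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) /;

// Fraction of parseable lines whose status is 5xx; 0 when nothing parses.
function serverErrorRate(lines: string[]): number {
  let total = 0;
  let errors = 0;
  for (const line of lines) {
    const match = COMBINED_STATUS.exec(line);
    if (!match) continue; // skip lines that are not in the expected format
    total += 1;
    const status = Number(match[1]);
    if (status >= 500 && status <= 599) errors += 1;
  }
  return total === 0 ? 0 : errors / total;
}

// Alert threshold from the table: more than 2 % of requests failing.
function shouldAlert(rate: number): boolean {
  return rate > 0.02;
}
```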
### Payments
| Metric | Where | Healthy | Alert |
|--------|-------|---------|-------|
| Payment success rate | `db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}])` | > 95 % completed of 24h-old payments | < 90 % |
| Webhook signature failures | log `Webhook verification failed` | 0 | > 0 |
| Request Network webhook 4xx | nginx access log `/api/payment/request-network/webhook` | 0 | any real provider delivery returning 4xx |
| Request Network safety-pending payments | `db.payments.find({"metadata.transactionSafety.status":"pending"})` | explained/short-lived | pending > 10 min without operator note |
| Request Network API errors (5xx) | log + Sentry | 0 | > 5/min sustained |
| Payouts stuck in `pending` > 30 min | `db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now() - 30*60*1000)}})` | empty | non-empty |
| Missing `transactionHash` after `completed` | the same query that drives `fix-transaction-hashes.js` | empty | non-empty |
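
Two of the rows above reduce to simple arithmetic on query results. The sketch below assumes the `$group` output shape from the success-rate query (`_id` holding the status, `n` the count) and the `completed` status named in the table; the function names are hypothetical. Note that the aggregation as written groups all payments, so restricting it to a 24-hour window would need an additional `$match` stage on `createdAt`.

```typescript
// One row of the $group output: status in _id, count in n.
interface StatusGroup {
  _id: string;
  n: number;
}

// Share of payments whose status is "completed"; null when there are none.
function paymentSuccessRate(groups: StatusGroup[]): number | null {
  const total = groups.reduce((sum, g) => sum + g.n, 0);
  if (total === 0) return null;
  const completed = groups
    .filter((g) => g._id === 'completed')
    .reduce((sum, g) => sum + g.n, 0);
  return completed / total;
}

// Timestamp before which a still-pending payout counts as stuck.
function stuckPayoutCutoff(now: Date, minutes = 30): Date {
  return new Date(now.getTime() - minutes * 60 * 1000);
}
```

The cutoff is the value to pass as `createdAt: { $lt: ... }` in the stuck-payout query.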
### MongoDB
```js
db.serverStatus().connections // active connections; alert if >1000
db.serverStatus().opcounters // ops/sec
db.serverStatus().wiredTiger.cache // raw cache counters; derive the hit ratio from them, aim > 95 %
db.currentOp({ secs_running: { $gte: 5 } }) // long-running queries
```
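
`wiredTiger.cache` returns raw counters rather than a ratio. A common approximation derives the hit ratio from `pages read into cache` and `pages requested from the cache`; the sketch below assumes that formula and a hypothetical function name. Both counters are cumulative since server start, so the result is a lifetime average unless two samples are diffed.

```typescript
// The wiredTiger.cache document, reduced to a name-to-counter mapping.
type WiredTigerCache = Record<string, number>;

// Approximate hit ratio: 1 - (pages read from disk / pages requested).
// Returns null when the counters are missing or no pages were requested.
function cacheHitRatio(cache: WiredTigerCache): number | null {
  const requested = cache['pages requested from the cache'];
  const readIn = cache['pages read into cache'];
  if (requested === undefined || readIn === undefined || requested <= 0) {
    return null;
  }
  return 1 - readIn / requested;
}
```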
### Redis
```bash
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys
```
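Alert thresholds: `rejected_connections > 0`, `evicted_keys` rising when you don't expect cache pressure, p99 latency > 5 ms. Latency is not part of `INFO stats` and has to be measured separately (for example with `redis-cli --latency`, which reports min/max/avg).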
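
The counter-based thresholds can be checked by parsing the `INFO stats` text, which is a list of `key:value` lines with `#` section headers. The following is a minimal sketch with hypothetical function names. Because "rising" needs a baseline, it compares two samples taken some time apart.

```typescript
// Parse INFO output into a key/value map, skipping blanks and "# Section" lines.
function parseRedisInfo(text: string): Map<string, string> {
  const fields = new Map<string, string>();
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === '' || line.startsWith('#')) continue;
    const idx = line.indexOf(':');
    if (idx > 0) fields.set(line.slice(0, idx), line.slice(idx + 1));
  }
  return fields;
}

// Read a numeric field, treating a missing key as 0.
function numericField(fields: Map<string, string>, key: string): number {
  return Number(fields.get(key) ?? '0');
}

// Compare two samples and report the conditions named in the thresholds.
function redisWarnings(previous: Map<string, string>, current: Map<string, string>): string[] {
  const warnings: string[] = [];
  if (numericField(current, 'rejected_connections') > 0) {
    warnings.push('rejected_connections > 0');
  }
  if (numericField(current, 'evicted_keys') > numericField(previous, 'evicted_keys')) {
    warnings.push('evicted_keys rising');
  }
  return warnings;
}

// keyspace_hits / (keyspace_hits + keyspace_misses); null when both are 0.
function keyspaceHitRatio(fields: Map<string, string>): number | null {
  const hits = numericField(fields, 'keyspace_hits');
  const misses = numericField(fields, 'keyspace_misses');
  return hits + misses === 0 ? null : hits / (hits + misses);
}
```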
### Host
| Metric | Tool | Healthy | Alert |
|--------|------|---------|-------|
| Disk usage on `/var/lib/docker` | `df -h` | < 80 % | > 90 % |
| `/opt/backend/uploads` size | `du -sh` | watch trend | bursty growth (>5 GB/day) |
| Memory pressure | `free -h`, `docker stats` | < 80 % | swap actively used |
| Open file descriptors | `ls /proc/<pid>/fd` (open count) vs `cat /proc/<pid>/limits` (limit) | well under hard limit | nearing limit |
---
## 6. Smoke tests after a deploy
Drop these in a runbook for the on-call:
```bash
# 1. API health
curl -fsS https://amn.gg/api/health | jq '.status,.version'
# 2. Login
curl -fsS -X POST https://amn.gg/api/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
| jq '.success,.data.user.email'
# 3. Frontend HTML loads
curl -fsS https://amn.gg/ -I | head -1 # expect 200
# 4. Socket.IO handshake
curl -fsS "https://amn.gg/socket.io/?EIO=4&transport=polling" -I | head -1
# 5. Containers healthy
docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"
```
Any non-OK → see [[Incident Response]].
---
## 7. Future work
- **Prometheus + Grafana** with Node exporter + Mongo exporter + Redis exporter — for proper time-series.
- **OpenTelemetry** spans from backend → Sentry / Jaeger.
- **Healthcheck status codes**: `GET /api/health` already probes Mongo + Redis but returns `503` only when the status is `down`; returning `503` when degraded is still open.
- **PagerDuty / OpsGenie** wiring from Sentry alerts.
- **Synthetic checks** (Pingdom / UptimeRobot) hitting `/health` from multiple regions.
For now, Sentry + Docker healthchecks + manual log checks cover the basics. See [[Incident Response]] for what to do when something fires.