Initial commit: nick docs

2026-05-23 20:35:34 +03:30
commit 0da235ae27
90 changed files with 18268 additions and 0 deletions
--- a/Operations/Monitoring.md
+++ b/Operations/Monitoring.md
@@ -0,0 +1,253 @@
+---
+title: Monitoring
+tags: [operations]
+---
+
+# Monitoring
+
+This page covers what is instrumented today and what to watch. The stack is intentionally lean: a health endpoint, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.
+
+---
+
+## 1. Health endpoint
+
+Path: `GET /health` (backend, port `5001`).
+
+Defined in `backend/src/app.ts`:
+
+```ts
+app.get("/health", (req, res) => {
+  res.json({
+    success: true,
+    message: "Marketplace Backend API is running",
+    timestamp: new Date().toISOString(),
+    environment: config.nodeEnv,
+    version: packageJson.version,
+  });
+});
+```
+
+Returns `200` with a JSON envelope as soon as Express is up. It does **not** currently probe MongoDB or Redis; those are covered by their own Docker healthchecks. For a deep health check, extend the endpoint to ping both data stores and return `503` on failure.
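+
+A deep health check could be sketched roughly as follows. This is an illustration, not the current implementation: `Probe`, `checkDeepHealth`, and the 1 s default timeout are hypothetical, and the actual ping calls depend on whichever MongoDB and Redis clients the backend uses.
+
+```ts
+// Hypothetical types and helper; nothing here exists in the codebase yet.
+type Probe = () => Promise<void>;
+
+interface DeepHealth {
+  status: 200 | 503;
+  checks: Record<string, "ok" | "fail">;
+}
+
+// Runs all probes in parallel. A probe that throws, rejects, or exceeds
+// timeoutMs is marked "fail", and any failure turns the status into 503.
+async function checkDeepHealth(
+  probes: Record<string, Probe>,
+  timeoutMs = 1000,
+): Promise<DeepHealth> {
+  const checks: Record<string, "ok" | "fail"> = {};
+  await Promise.all(
+    Object.entries(probes).map(async ([name, probe]) => {
+      let timer: ReturnType<typeof setTimeout> | undefined;
+      const timeout = new Promise<never>((_, reject) => {
+        timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
+      });
+      try {
+        await Promise.race([probe(), timeout]);
+        checks[name] = "ok";
+      } catch {
+        checks[name] = "fail";
+      } finally {
+        clearTimeout(timer);
+      }
+    }),
+  );
+  const healthy = Object.values(checks).every((c) => c === "ok");
+  return { status: healthy ? 200 : 503, checks };
+}
+```
+
+Inside the route handler, the result would be sent with `res.status(result.status).json(result)`, passing one probe per data store.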
+
+Public URL behind Nginx: `https://amn.gg/api/health`.
+
+---
+
+## 2. Docker healthchecks
+
+Each long-lived container has a `HEALTHCHECK` baked in or declared in compose.
+
+| Container | Probe | Interval | Failure threshold |
+|-----------|-------|----------|-------------------|
+| `nickapp-backend` | `node healthcheck.js` (HTTP GET `/health`) | 30s | 3 retries |
+| `nickapp-frontend` | `curl -f http://localhost:8083/` | 30s | 3 retries |
+| `mongodb` | `mongosh --eval "db.adminCommand('ping')"` | 30s | 3 retries |
+| `redis` | `redis-cli -a $REDIS_PASSWORD ping` | 30s | 3 retries |
+
+`healthcheck.js` (backend) is a tiny Node script that does a local HTTP GET to `/health` and exits 0 / 1.
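+
+The real `healthcheck.js` is not reproduced here. A minimal sketch of the same idea, written in TypeScript, might look like the following; `probeHealth` and its 5 s timeout are assumptions.
+
+```ts
+import * as http from "node:http";
+
+// Hypothetical helper: resolves to the exit code a healthcheck would use
+// (0 when the endpoint answers 200, 1 on any other status, error, or timeout).
+function probeHealth(url: string, timeoutMs = 5000): Promise<0 | 1> {
+  return new Promise((resolve) => {
+    const req = http.get(url, { timeout: timeoutMs }, (res) => {
+      res.resume(); // drain the body so the socket is released
+      resolve(res.statusCode === 200 ? 0 : 1);
+    });
+    // The "timeout" event does not abort the request by itself.
+    req.on("timeout", () => req.destroy(new Error("timeout")));
+    req.on("error", () => resolve(1));
+  });
+}
+```
+
+Run as a script, it would end with `probeHealth("http://localhost:5001/health").then((code) => process.exit(code));`.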
+
+Inspect health:
+
+```bash
+docker ps --format "table {{.Names}}\t{{.Status}}"
+
+# Detailed
+docker inspect --format='{{json .State.Health}}' nickapp-backend | jq
+```
+
+If a container is `unhealthy`, Watchtower will **not** roll it (it expects the new container to pass healthcheck). Investigate with `docker logs <container>`.
+
+---
+
+## 3. Sentry — error tracking
+
+### Frontend
+
+`@sentry/nextjs ^10.22.0` is wired in via three config files at the repo root:
+
+- `sentry.client.config.ts`: browser SDK, with Session Replay sampling 10% of sessions and 100% of sessions that hit an error.
+- `sentry.server.config.ts`: server-rendered components (no Replay).
+- `sentry.edge.config.ts`: edge runtime (not currently used heavily).
+
+Common settings:
+
+```ts
+Sentry.init({
+  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
+  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
+  environment: process.env.NODE_ENV || 'development',
+  enabled: process.env.NODE_ENV === 'production',
+  ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError', ...],
+});
+```
+
+Errors from `localhost` are filtered out — only prod errors land in the dashboard.
+
+### Backend
+
+`@sentry/node ^10.22.0` + `@sentry/profiling-node ^10.22.0` are initialised **first** in `src/app.ts` (before any other import) via `src/config/sentry.ts`. DSN comes from `SENTRY_DSN` env var (see [[Environment Variables#sentry]]).
+
+What's captured:
+
+- Uncaught exceptions in route handlers
+- Promise rejections inside `asyncHandler`-wrapped routes
+- Manually-captured errors via `Sentry.captureException(err)`
+- Performance traces (10% sample rate in prod)
+- Profiling samples via `@sentry/profiling-node`
+
+### Source maps
+
+The frontend uploads source maps to Sentry at build time when `SENTRY_AUTH_TOKEN`, `SENTRY_ORG`, and `SENTRY_PROJECT` are set in the CI env. Without them the build still succeeds, but stack traces in Sentry will show minified frames.
+
+### Alerts
+
+Configure in the Sentry dashboard (Issues → Alerts) — common alerts:
+
+- Any new issue in production → Slack
+- Error frequency > 50/minute → page on-call
+- Performance regression on `/api/payments/*` traces → email
+
+---
+
+## 4. Logs
+
+### Backend application logs
+
+Routed through `src/utils/logger.ts` — currently a thin `console.log` wrapper with emoji prefixes. Output goes to stdout, captured by Docker:
+
+```bash
+# Live tail
+docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend
+
+# Search for a request
+docker logs nickapp-backend 2>&1 | grep "POST /api/payments"
+
+# Limit output to the last hour
+docker logs --since 1h nickapp-backend
+```
+
+Notable log lines to look for:
+
+| Prefix | Meaning |
+|--------|---------|
+| `✅ Connected to MongoDB` | DB connection established |
+| `🚀 Server running on port 5001` | App fully started |
+| `🔌 User connected: <id>` | Socket.IO connection |
+| `📥` | Inbound HTTP request log |
+| `💳 SHKeeper` | SHKeeper webhook / API call |
+| `🔐 Webhook verification` | Webhook signature check result |
+| `❌ Error` | Manual error log (also captured by Sentry) |
+
+### Nginx access + error logs
+
+Bind-mounted to `./nginx/logs/` on the host:
+
+```bash
+tail -f /opt/backend/nginx/logs/access.log
+tail -f /opt/backend/nginx/logs/error.log
+```
+
+Rotate these via host `logrotate` to avoid disk fill.
+
+### Frontend logs
+
+Next.js logs go to the container stdout:
+
+```bash
+docker logs -f nickapp-frontend
+```
+
+Browser-side logs that need attention go through Sentry (above) — `src/utils/logger.ts` in the frontend forwards via Sentry breadcrumbs.
+
+---
+
+## 5. Key metrics to watch
+
+Today these are read manually from logs and Sentry. Once Prometheus is added, encode them as alerting rules.
+
+### Application
+
+| Metric | Where to check | Healthy | Alert |
+|--------|---------------|---------|-------|
+| 5xx rate | Sentry, Nginx access.log | < 0.5 % | > 2 % over 5 min |
+| `/health` p95 latency | curl + timer | < 100 ms | > 1 s |
+| Login success rate | Sentry custom event | > 95 % | < 90 % |
+| Socket disconnect storm | `🔌 User disconnected` log frequency | < 1/s sustained | > 10/s sustained |
+| OpenAI 429s | Backend log `OpenAI ... 429` | 0 | any |
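+
+Until a metrics pipeline exists, the 5xx rate can be estimated from a slice of `access.log`. The sketch below assumes Nginx's default `combined` log format, where the status code directly follows the quoted request line; `fiveXxRate` is a hypothetical helper, not part of the codebase.
+
+```ts
+// Hypothetical helper: share of 5xx responses among the parseable lines,
+// or null when no line could be parsed.
+function fiveXxRate(lines: string[]): number | null {
+  let total = 0;
+  let errors = 0;
+  for (const line of lines) {
+    // In the combined format the first `" NNN ` is the status after "$request".
+    const match = /" (\d{3}) /.exec(line);
+    if (!match) continue;
+    total += 1;
+    if (match[1].startsWith("5")) errors += 1;
+  }
+  return total === 0 ? null : errors / total;
+}
+```
+
+Feed it the lines from the last five minutes and compare the result against the 0.5 % and 2 % thresholds in the table.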
+
+### Payments
+
+| Metric | Where | Healthy | Alert |
+|--------|-------|---------|-------|
+| Payment success rate | `db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}])` | > 95 % completed of 24h-old payments | < 90 % |
+| Webhook signature failures | log `Webhook verification failed` | 0 | > 0 |
+| SHKeeper API errors (5xx) | log + Sentry | 0 | > 5/min sustained |
+| Payouts stuck in `pending` > 30 min | `db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now()-30*60*1000)}})` | empty | non-empty |
+| Missing `transactionHash` after `completed` | the same query that drives `fix-transaction-hashes.js` | empty | non-empty |
+
+### MongoDB
+
+```js
+db.serverStatus().connections           // { current, available, ... }; alert if current > 1000
+db.serverStatus().opcounters            // cumulative op counts since startup; diff two samples for ops/sec
+db.serverStatus().wiredTiger.cache      // raw cache counters; derive the hit ratio from them, aim > 95 %
+db.currentOp({ secs_running: { $gte: 5 } })  // long-running queries
+```
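+
+`wiredTiger.cache` returns raw counters rather than a ready-made ratio. One common approximation derives the hit ratio from two of them, as sketched below; `cacheHitRatio` is a hypothetical helper, and because the counters are cumulative since startup the result is a lifetime average.
+
+```ts
+// Shape of db.serverStatus().wiredTiger.cache, reduced to what is used here.
+type WiredTigerCache = Record<string, number>;
+
+// Hypothetical helper: fraction of page requests served without a disk read,
+// or null when no requests have been recorded yet.
+function cacheHitRatio(cache: WiredTigerCache): number | null {
+  const requested = cache["pages requested from the cache"] ?? 0;
+  const readIn = cache["pages read into cache"] ?? 0;
+  if (requested === 0) return null;
+  return 1 - readIn / requested;
+}
+```
+
+For a recent window rather than a lifetime figure, sample the counters twice and apply the same formula to the differences.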
+
+### Redis
+
+```bash
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
+# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys
+```
+
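+Alert thresholds: `rejected_connections > 0`, `evicted_keys` rising while you don't expect cache pressure, p99 command latency > 5 ms. `INFO stats` does not report latency; measure it separately, for example with `redis-cli --latency`.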
+
+### Host
+
+| Metric | Tool | Healthy | Alert |
+|--------|------|---------|-------|
+| Disk usage on `/var/lib/docker` | `df -h` | < 80 % | > 90 % |
+| `/opt/backend/uploads` size | `du -sh` | watch trend | bursty growth (>5 GB/day) |
+| Memory pressure | `free -h`, `docker stats` | < 80 % | swap actively used |
+| Open file descriptors | `ls /proc/<pid>/fd` (count entries) vs. `cat /proc/<pid>/limits` | well under hard limit | nearing limit |
+
+---
+
+## 6. Smoke tests after a deploy
+
+Drop these in a runbook for the on-call:
+
+```bash
+# 1. API health
+curl -fsS https://amn.gg/api/health | jq '.success,.version,.environment'
+
+# 2. Login
+curl -fsS -X POST https://amn.gg/api/auth/login \
+  -H "Content-Type: application/json" \
+  -d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
+  | jq '.success,.data.user.email'
+
+# 3. Frontend HTML loads
+curl -fsS https://amn.gg/ -I | head -1   # expect 200
+
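+# 4. Socket.IO handshake (plain GET; a HEAD request via -I may be rejected by the polling transport)
+curl -fsS -o /dev/null -w '%{http_code}\n' "https://amn.gg/socket.io/?EIO=4&transport=polling"   # expect 200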
+
+# 5. Containers healthy
+docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"
+```
+
+Any non-OK → see [[Incident Response]].
+
+---
+
+## 7. Future work
+
+- **Prometheus + Grafana** with Node exporter + Mongo exporter + Redis exporter — for proper time-series.
+- **OpenTelemetry** spans from backend → Sentry / Jaeger.
+- **Healthcheck endpoint** that probes Mongo + Redis and returns `503` when degraded.
+- **PagerDuty / OpsGenie** wiring from Sentry alerts.
+- **Synthetic checks** (Pingdom / UptimeRobot) hitting `/health` from multiple regions.
+
+For now, Sentry + Docker healthchecks + manual log checks cover the basics. See [[Incident Response]] for what to do when something fires.