---
title: Monitoring
tags: [operations]
---

# Monitoring

What's instrumented today and what to watch. Today's stack is intentionally lean: health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.

---

## 1. Health endpoint

Two paths are registered (both are public, rate-limited, not auth-gated):

- `GET /health`: simple ping used by Docker healthchecks. Returns `200 { success, message, timestamp, environment, version }`. Does **not** probe MongoDB or Redis.
- `GET /api/health`: deep health check added in commit `44579d6` (backend v2.6.49). Calls `runHealthChecks` from `backend/src/services/health/healthCheckService.ts`, which probes MongoDB and Redis, collects memory/uptime stats, and returns a structured report. Returns `503` when `report.status === 'down'`.

`GET /api/health` response shape (from `healthCheckService`):

```json
{
  "status": "ok",
  "version": "2.6.xx",
  "timestamp": "...",
  "checks": { "mongodb": "ok", "redis": "ok", "uptime": 3600, "memoryMB": 120 }
}
```

Public URL behind Nginx: `https://amn.gg/api/health`.

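The `503`-when-down rule can be sketched as a pure function over the documented response shape. This is an illustration only: the `HealthReport` type is inferred from the JSON sample, and `httpStatusFor` and `failingChecks` are hypothetical helper names, not functions from `healthCheckService.ts`.

```ts
// Sketch only: the type is inferred from the documented JSON sample.
interface HealthReport {
  status: string; // documented values: 'ok' and 'down'; any others are unconfirmed
  version: string;
  timestamp: string;
  checks: { mongodb: string; redis: string; uptime: number; memoryMB: number };
}

// Documented rule: respond 503 only when report.status === 'down'.
function httpStatusFor(report: HealthReport): number {
  return report.status === 'down' ? 503 : 200;
}

// Hypothetical helper: list the dependencies whose check is not 'ok'.
function failingChecks(report: HealthReport): string[] {
  const failing: string[] = [];
  if (report.checks.mongodb !== 'ok') failing.push('mongodb');
  if (report.checks.redis !== 'ok') failing.push('redis');
  return failing;
}
```

A monitoring script could call `failingChecks` on the parsed body to say which dependency caused a `503`, rather than relying on the status code alone.
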
---

## 2. Docker healthchecks

Each long-lived container has a `HEALTHCHECK` baked into its image or declared in compose.

| Container | Probe | Interval | Failure threshold |
|-----------|-------|----------|-------------------|
| `nickapp-backend` | `node healthcheck.js` (HTTP GET `/health`) | 30s | 3 retries |
| `nickapp-frontend` | `curl -f http://localhost:8083/` | 30s | 3 retries |
| `mongodb` | `mongosh --eval "db.adminCommand('ping')"` | 30s | 3 retries |
| `redis` | `redis-cli -a $REDIS_PASSWORD ping` | 30s | 3 retries |

`healthcheck.js` (backend) is a tiny Node script that does a local HTTP GET to `/health` and exits with 0 (healthy) or 1 (unhealthy).

Inspect health:

```bash
docker ps --format "table {{.Names}}\t{{.Status}}"

# Detailed
docker inspect --format='{{json .State.Health}}' nickapp-backend | jq
```

If a container is `unhealthy`, Watchtower will **not** roll it (it expects the new container to pass healthcheck). Investigate with `docker logs <container>`.

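The JSON printed by `docker inspect --format='{{json .State.Health}}'` has `Status`, `FailingStreak`, and `Log` fields. A minimal sketch of turning it into a one-line summary follows; `summarizeHealth` is a hypothetical helper for illustration, not part of the repo.

```ts
// Field names follow Docker's .State.Health object.
interface HealthLogEntry { Start: string; End: string; ExitCode: number; Output: string; }
interface ContainerHealth { Status: string; FailingStreak: number; Log: HealthLogEntry[] | null; }

// Takes the raw JSON string printed by `docker inspect` and returns a one-line summary.
function summarizeHealth(json: string): string {
  const health = JSON.parse(json) as ContainerHealth | null;
  // Containers without a HEALTHCHECK print `null` for .State.Health.
  if (health === null) return 'no healthcheck configured';
  const log = health.Log ?? [];
  const last = log.length > 0 ? log[log.length - 1] : undefined;
  const detail = last ? `last exit ${last.ExitCode}: ${last.Output.trim()}` : 'no probe run yet';
  return `${health.Status} (failing streak ${health.FailingStreak}; ${detail})`;
}
```

The last `Log` entry carries the probe's own output, which is usually the quickest hint about why a container went `unhealthy`.
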
---

## 3. Sentry — error tracking

### Frontend

`@sentry/nextjs ^10.22.0` is wired in via three config files at the repo root:

- `sentry.client.config.ts`: browser SDK (Session Replay sampled at 10% of sessions and 100% of sessions with an error).
- `sentry.server.config.ts`: server-rendered components (no Replay).
- `sentry.edge.config.ts`: edge runtime (not currently used heavily).

Common settings:

```ts
import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  // Sample 10% of traces in production, everything elsewhere.
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  environment: process.env.NODE_ENV || 'development',
  // The SDK only sends events in production builds.
  enabled: process.env.NODE_ENV === 'production',
  ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError' /* ... */],
});
```

Errors from `localhost` are filtered out, so only production errors land in the dashboard.

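The source does not say how `localhost` errors are filtered. One common approach is Sentry's `beforeSend` hook, which drops an event when it returns `null`. The sketch below assumes that mechanism; `isLocalEvent`, `dropLocalEvents`, and the `EventLike` type are hypothetical names, and the hostname regex ignores URLs that carry credentials.

```ts
// Minimal structural type; the real Sentry event has many more fields.
interface EventLike { request?: { url?: string } }

const LOCAL_HOSTS = ['localhost', '127.0.0.1'];

// Extracts the hostname from an absolute URL such as http://localhost:3000/x.
function hostnameOf(url: string): string | null {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^/:?#]+)/i.exec(url);
  return match ? (match[1] ?? '').toLowerCase() : null;
}

function isLocalEvent(event: EventLike): boolean {
  const url = event.request?.url;
  if (!url) return false; // no URL recorded: keep the event
  const host = hostnameOf(url);
  return host !== null && LOCAL_HOSTS.indexOf(host) !== -1;
}

// Same contract as Sentry's `beforeSend` option: return null to drop the event.
function dropLocalEvents<E extends EventLike>(event: E): E | null {
  return isLocalEvent(event) ? null : event;
}
```

If this were the mechanism in use, it would be passed as `beforeSend: dropLocalEvents` inside `Sentry.init`.
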
### Backend

`@sentry/node ^10.22.0` + `@sentry/profiling-node ^10.22.0` are initialised **first** in `src/app.ts` (before any other import) via `src/config/sentry.ts`. The DSN comes from the `SENTRY_DSN` env var (see [[Environment Variables#sentry]]).

What's captured:

- Uncaught exceptions in route handlers
- Promise rejections inside `asyncHandler`-wrapped routes
- Manually captured errors via `Sentry.captureException(err)`
- Performance traces (10% sample rate in prod)
- Profiling samples via `@sentry/profiling-node`

### Source maps

The frontend uploads source maps to Sentry at build time when `SENTRY_AUTH_TOKEN`, `SENTRY_ORG`, and `SENTRY_PROJECT` are set in the CI env. Without them the build still succeeds, but Sentry stack traces will show minified frames.

### Alerts

Configure these in the Sentry dashboard (Issues → Alerts). Common alerts:

- Any new issue in production → Slack
- Error frequency > 50/minute → page on-call
- Performance regression on `/api/payments/*` traces → email

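Sentry evaluates alert rules itself, so nothing below needs to be deployed. To make the "error frequency > 50/minute" rule concrete, here is a minimal sliding-window sketch; `ErrorRateWindow` is a hypothetical class, and timestamps are passed in explicitly so the logic is easy to test.

```ts
// Counts events inside a trailing time window and flags when the count exceeds a threshold.
class ErrorRateWindow {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Records one error at `nowMs`; returns true when the window holds more than `threshold` errors.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.threshold;
  }
}

// 50 errors per 60 s, matching the alert described above.
const pagingRule = new ErrorRateWindow(60 * 1000, 50);
```

The rule fires on the 51st error inside any 60-second span and resets once older errors age out of the window.
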
---

## 4. Logs

### Backend application logs

Routed through `src/utils/logger.ts`, currently a thin `console.log` wrapper with emoji prefixes. Output goes to stdout, where Docker captures it:

```bash
# Live tail
docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend

# Search for a request
docker logs nickapp-backend 2>&1 | grep "POST /api/payments"

# Limit output to the last hour
docker logs --since 1h nickapp-backend
```

Notable log lines to look for:

| Prefix | Meaning |
|--------|---------|
| `✅ Connected to MongoDB` | DB connection established |
| `🚀 Server running on port 5001` | App fully started |
| `🔌 User connected: <id>` | Socket.IO connection |
| `📥` | Inbound HTTP request log |
| `💳 Request Network` | Request Network webhook / API call |
| `🔐 Webhook verification` | Webhook signature check result |
| `❌ Error` | Manual error log (also captured by Sentry) |

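When scanning a long log dump, the prefix table above can be applied mechanically. The sketch below is a hypothetical helper (`classifyLogLine` is not part of the repo) and matches with `includes` on the assumption that Docker or the logger may put a timestamp in front of the emoji.

```ts
// Prefix → meaning pairs copied from the table above.
const LOG_PREFIXES: Array<[string, string]> = [
  ['✅ Connected to MongoDB', 'DB connection established'],
  ['🚀 Server running on port', 'App fully started'],
  ['🔌 User connected:', 'Socket.IO connection'],
  ['💳 Request Network', 'Request Network webhook / API call'],
  ['🔐 Webhook verification', 'Webhook signature check result'],
  ['❌ Error', 'Manual error log (also captured by Sentry)'],
  ['📥', 'Inbound HTTP request log'],
];

// Returns the meaning of the first matching prefix, or undefined for unrecognised lines.
function classifyLogLine(line: string): string | undefined {
  for (const [prefix, meaning] of LOG_PREFIXES) {
    if (line.includes(prefix)) return meaning;
  }
  return undefined;
}
```

For example, `classifyLogLine('2026-05-30T10:00:00Z 🚀 Server running on port 5001')` returns `'App fully started'`.
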
### Nginx access + error logs

Bind-mounted to `./nginx/logs/` on the host:

```bash
tail -f /opt/backend/nginx/logs/access.log
tail -f /opt/backend/nginx/logs/error.log
```

Rotate these with the host's `logrotate` so they don't fill the disk.

### Frontend logs

Next.js logs go to the container stdout:

```bash
docker logs -f nickapp-frontend
```

Browser-side logs that need attention go through Sentry (section 3): `src/utils/logger.ts` in the frontend forwards them as Sentry breadcrumbs.

---

## 5. Key metrics to watch

Today these are read manually from logs / Sentry. Once Prometheus is added, encode them as alerting rules.

### Application

| Metric | Where to check | Healthy | Alert |
|--------|---------------|---------|-------|
| 5xx rate | Sentry, Nginx access.log | < 0.5 % | > 2 % over 5 min |
| `/health` p95 latency | curl + timer | < 100 ms | > 1 s |
| Login success rate | Sentry custom event | > 95 % | < 90 % |
| Socket disconnect storm | `🔌 User disconnected` log frequency | < 1/s sustained | > 10/s sustained |
| OpenAI 429s | Backend log `OpenAI ... 429` | 0 | any |

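Until a metrics pipeline exists, the 5xx rate can be computed from a slice of `access.log`. The sketch below assumes Nginx's default `combined` log format, where the status code follows the quoted request line; a custom `log_format` would need a different pattern. `fiveXxRate` is a hypothetical helper.

```ts
// In the combined format the 3-digit status comes right after the quoted request line.
const STATUS_AFTER_REQUEST = /"[^"]*" (\d{3}) /;

// Returns the share of parsed lines with a 5xx status (0 when nothing parses).
function fiveXxRate(lines: string[]): number {
  let total = 0;
  let errors = 0;
  for (const line of lines) {
    const match = STATUS_AFTER_REQUEST.exec(line);
    if (!match) continue; // skip lines that do not look like access-log entries
    total += 1;
    const status = match[1] ?? '';
    if (status.charAt(0) === '5') errors += 1;
  }
  return total === 0 ? 0 : errors / total;
}

const sampleLines = [
  '203.0.113.7 - - [30/May/2026:10:00:00 +0000] "GET /api/health HTTP/1.1" 200 123 "-" "curl/8.0"',
  '203.0.113.7 - - [30/May/2026:10:00:01 +0000] "POST /api/auth/login HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
  '203.0.113.8 - - [30/May/2026:10:00:02 +0000] "GET /api/payments HTTP/1.1" 502 0 "-" "Mozilla/5.0"',
  '203.0.113.9 - - [30/May/2026:10:00:03 +0000] "GET / HTTP/1.1" 404 153 "-" "Mozilla/5.0"',
];
const sampleRate = fiveXxRate(sampleLines); // one 5xx out of four requests
```

Compare the result against the table: below `0.005` is healthy, above `0.02` over five minutes is an alert.
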
### Payments

| Metric | Where | Healthy | Alert |
|--------|-------|---------|-------|
| Payment success rate | `db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}])` | > 95 % completed of 24h-old payments | < 90 % |
| Webhook signature failures | log `Webhook verification failed` | 0 | > 0 |
| Request Network webhook 4xx | nginx access log `/api/payment/request-network/webhook` | 0 | any real provider delivery returning 4xx |
| Request Network safety-pending payments | `db.payments.find({"metadata.transactionSafety.status":"pending"})` | explained/short-lived | pending > 10 min without operator note |
| Request Network API errors (5xx) | log + Sentry | 0 | > 5/min sustained |
| Payouts stuck in `pending` > 30 min | `db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now()-30*60*1000)}})` | empty | non-empty |
| Missing `transactionHash` after `completed` | the same query that drives `fix-transaction-hashes.js` | empty | non-empty |

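The `$group` aggregate in the table returns one bucket per status, so the success rate still has to be derived. A minimal sketch follows, assuming the buckets were already restricted to the payments of interest (the aggregate as written has no `$match` stage); `successRate` and `stuckPayoutFilter` are hypothetical helpers.

```ts
// One bucket per status, as returned by the $group stage above.
interface StatusBucket { _id: string; n: number }

// Share of payments in status 'completed'; null when there is no data.
function successRate(buckets: StatusBucket[]): number | null {
  const total = buckets.reduce((sum, b) => sum + b.n, 0);
  if (total === 0) return null;
  const completed = buckets
    .filter((b) => b._id === 'completed')
    .reduce((sum, b) => sum + b.n, 0);
  return completed / total;
}

// Builds the filter for payouts that have been pending longer than `maxAgeMinutes`.
function stuckPayoutFilter(now: Date, maxAgeMinutes: number = 30) {
  const cutoff = new Date(now.getTime() - maxAgeMinutes * 60 * 1000);
  return { type: 'payout', status: 'pending', createdAt: { $lt: cutoff } };
}
```

Passing `now` explicitly keeps the cutoff deterministic, which makes the filter easy to check in a test or a dry run.
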
### MongoDB

```js
db.serverStatus().connections              // connection counts; alert if current > 1000
db.serverStatus().opcounters               // cumulative op counters; diff two samples for ops/sec
db.serverStatus().wiredTiger.cache         // raw cache counters; derive the hit ratio, aim > 95 %
db.currentOp({ secs_running: { $gte: 5 } }) // long-running queries
```

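`wiredTiger.cache` exposes raw counters rather than a ready-made hit ratio. A common approximation uses the `pages read into cache` and `pages requested from the cache` counters. The sketch below assumes both arrive as plain numbers and uses a hypothetical helper name, `cacheHitRatio`.

```ts
// Subset of the wiredTiger.cache section of db.serverStatus().
interface WiredTigerCacheStats {
  'pages read into cache': number;
  'pages requested from the cache': number;
}

// Approximate hit ratio: requests that did not need a read from disk.
function cacheHitRatio(cache: WiredTigerCacheStats): number | null {
  const requested = cache['pages requested from the cache'];
  if (requested === 0) return null; // no traffic yet, ratio undefined
  return 1 - cache['pages read into cache'] / requested;
}
```

Both counters are cumulative since server start, so a ratio computed from the difference between two samples reflects current behaviour better than the lifetime totals.
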
### Redis

```bash
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys
```

Alert thresholds: `rejected_connections > 0`, `evicted_keys` rising when no cache pressure is expected, and p99 latency > 5 ms. Latency is not part of `INFO stats`; measure it separately (for example, `redis-cli --latency` reports min/max/avg round-trip times).

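`INFO stats` prints `key:value` lines under a `# Stats` header. A minimal sketch of parsing that output and applying the `rejected_connections` threshold follows; `parseRedisInfo`, `keyspaceHitRatio`, and `redisWarnings` are hypothetical helpers, not existing tooling.

```ts
// Parses `key:value` lines, skipping blank lines and `# Section` headers.
function parseRedisInfo(raw: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const line of raw.split(/\r?\n/)) {
    if (line === '' || line.charAt(0) === '#') continue;
    const idx = line.indexOf(':');
    if (idx > 0) out[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return out;
}

// hits / (hits + misses); null when the keyspace has seen no lookups.
function keyspaceHitRatio(stats: Record<string, string>): number | null {
  const hits = Number(stats['keyspace_hits'] ?? '0');
  const misses = Number(stats['keyspace_misses'] ?? '0');
  return hits + misses === 0 ? null : hits / (hits + misses);
}

// Applies the documented threshold: any rejected connection is worth an alert.
function redisWarnings(stats: Record<string, string>): string[] {
  const warnings: string[] = [];
  const rejected = Number(stats['rejected_connections'] ?? '0');
  if (rejected > 0) warnings.push('rejected_connections=' + rejected);
  return warnings;
}
```

`evicted_keys` is cumulative, so spotting a rise means comparing two parsed samples taken some time apart.
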
### Host

| Metric | Tool | Healthy | Alert |
|--------|------|---------|-------|
| Disk usage on `/var/lib/docker` | `df -h` | < 80 % | > 90 % |
| `/opt/backend/uploads` size | `du -sh` | watch trend | bursty growth (>5 GB/day) |
| Memory pressure | `free -h`, `docker stats` | < 80 % | swap actively used |
| Open file descriptors | `ls /proc/<pid>/fd \| wc -l` against `cat /proc/<pid>/limits` | well under hard limit | nearing limit |

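The disk thresholds can be checked mechanically from `df` output. The sketch below assumes the POSIX layout produced by `df -P <path>` (one header line, one data line with a `Capacity` column). The `watch` level for the 80 to 90 % band is an assumption, since the table only defines healthy and alert; `usedPercent` and `diskLevel` are hypothetical helpers.

```ts
// Pulls the Capacity percentage out of the last line of `df -P <path>` output.
function usedPercent(dfOutput: string): number | null {
  const lines = dfOutput.trim().split('\n');
  const dataLine = lines[lines.length - 1] ?? '';
  const match = /(\d+)%/.exec(dataLine);
  return match ? Number(match[1]) : null;
}

// Thresholds from the table: < 80 % healthy, > 90 % alert; the band in between is 'watch' (assumed).
function diskLevel(percent: number): 'healthy' | 'watch' | 'alert' {
  if (percent > 90) return 'alert';
  if (percent < 80) return 'healthy';
  return 'watch';
}

const sampleDf = [
  'Filesystem     1024-blocks      Used Available Capacity Mounted on',
  '/dev/sda1        102400000  87040000  15360000      85% /var/lib/docker',
].join('\n');
```

With the sample above, `usedPercent(sampleDf)` yields `85`, which `diskLevel` places in the `watch` band.
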
---

## 6. Smoke tests after a deploy

Drop these in a runbook for the on-call:

```bash
# 1. API health (fields follow the /api/health response shape in section 1)
curl -fsS https://amn.gg/api/health | jq '.status,.version,.checks'

# 2. Login
curl -fsS -X POST https://amn.gg/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
  | jq '.success,.data.user.email'

# 3. Frontend HTML loads
curl -fsS https://amn.gg/ -I | head -1     # expect 200

# 4. Socket.IO handshake
curl -fsS "https://amn.gg/socket.io/?EIO=4&transport=polling" -I | head -1

# 5. Containers healthy
docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"
```

Any non-OK → see [[Incident Response]].

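Steps 3 and 4 print an HTTP status line via `head -1`. If the smoke tests are ever scripted, that line needs to be checked rather than eyeballed. A minimal sketch, with `parseStatusLine` and `failedChecks` as hypothetical helpers:

```ts
// Extracts the status code from lines such as "HTTP/2 200" or "HTTP/1.1 200 OK".
function parseStatusLine(line: string): number | null {
  const match = /^HTTP\/\d(?:\.\d)?\s+(\d{3})\b/.exec(line.trim());
  return match ? Number(match[1]) : null;
}

interface SmokeResult { name: string; statusLine: string }

// Names of the checks whose status line is missing or not 200.
function failedChecks(results: SmokeResult[]): string[] {
  return results
    .filter((r) => parseStatusLine(r.statusLine) !== 200)
    .map((r) => r.name);
}
```

An empty array from `failedChecks` means every captured status line was a `200`.
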
---

## 7. Future work

- **Prometheus + Grafana** with Node exporter + Mongo exporter + Redis exporter, for proper time-series.
- **OpenTelemetry** spans from backend → Sentry / Jaeger.
- **Healthcheck endpoint** that probes Mongo + Redis and returns `503` when degraded. `GET /api/health` (section 1) already probes both, but returns `503` only when `report.status === 'down'`.
- **PagerDuty / OpsGenie** wiring from Sentry alerts.
- **Synthetic checks** (Pingdom / UptimeRobot) hitting `/health` from multiple regions.

For now, Sentry + Docker healthchecks + manual log checks cover the basics. See [[Incident Response]] for what to do when something fires.