Initial commit: nick docs

2026-05-23 20:35:34 +03:30
commit 0da235ae27
90 changed files with 18268 additions and 0 deletions
--- a/Operations/Backup
+++ b/Operations/Backup
@@ -0,0 +1,315 @@
+---
+title: Backup & Recovery
+tags: [operations]
+---
+
+# Backup & Recovery
+
+How to keep the marketplace recoverable from data loss. Covers MongoDB, Redis, the `uploads/` directory, and environment secrets, plus the disaster-recovery runbook.
+
+---
+
+## 1. RTO / RPO targets
+
+| Asset | RPO (data loss tolerated) | RTO (downtime tolerated) | Backup cadence |
+|-------|---------------------------|--------------------------|----------------|
+| MongoDB | 1 hour | 1 hour | Hourly `mongodump` + nightly offsite |
+| `uploads/` directory | 24 hours | 2 hours | Nightly sync to offsite |
+| Redis | 1 hour (regeneratable) | 0 minutes (app survives empty cache) | Nightly RDB snapshot |
+| Production `.env` | n/a (manual) | 5 minutes | Stored in 1Password / Bitwarden vault |
+| Container images | n/a (CI rebuilds) | 15 minutes | Tagged in registry by version |
+
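+Adjust these targets when product SLAs change. Note that the hourly MongoDB dumps sit on the production host itself: if the host is lost outright (the scenario in section 6), only the nightly offsite copy survives, so the effective RPO for that case is up to 24 hours.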
+
+---
+
+## 2. MongoDB
+
+### 2.1 Dump
+
+```bash
+#!/usr/bin/env bash
+# scripts/backup-mongo.sh — run hourly via cron
+set -euo pipefail
+
+STAMP=$(date -u +%FT%H%M%SZ)
+DEST=/var/backups/mongo
+mkdir -p "$DEST"
+
+docker exec nickapp-mongodb \
+  mongodump --db=marketplace --archive --gzip \
+  > "$DEST/marketplace-$STAMP.gz"
+
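+# Retention: delete dumps older than 14 days. Every hourly dump inside that
+# window is kept; there is no separate hourly/daily tiering.
+find "$DEST" -name 'marketplace-*.gz' -mtime +14 -delete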
+```
+
+Cron entry:
+
+```
+0 * * * * /usr/local/bin/backup-mongo.sh >> /var/log/backup-mongo.log 2>&1
+```
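+
+The script above prunes purely by age. If the intent is a tiered policy (every dump from the last 24 hours, then one per day up to 14 days), a prune step could look roughly like the sketch below. It assumes bash, GNU `find`, and the `marketplace-<UTC stamp>.gz` naming used above; `prune_tiered` is a hypothetical helper, not part of the existing script.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical helper: keep every dump from the last 24 h, one dump per
+# UTC day for older ones, and nothing older than 14 days.
+prune_tiered() {
+  local dir=$1 seen=" " f day
+
+  # Oldest tier: drop everything older than 14 days.
+  find "$dir" -name 'marketplace-*.gz' -mtime +14 -delete
+
+  # Middle tier: among dumps older than 24 h, keep only the earliest per day.
+  # File names sort chronologically, so the first one seen per day wins.
+  while IFS= read -r f; do
+    day=${f##*/marketplace-}   # strip directory and prefix
+    day=${day%%T*}             # keep YYYY-MM-DD
+    case "$seen" in
+      *" $day "*) rm -f -- "$f" ;;
+      *)          seen+="$day " ;;
+    esac
+  done < <(find "$dir" -name 'marketplace-*.gz' -mmin +1440 | sort)
+}
+```
+
+Calling `prune_tiered "$DEST"` in place of the single `find ... -delete` line would apply it after each dump.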
+
+### 2.2 Offsite
+
+Push the most recent dump to S3 (or Backblaze B2, or `rclone` to any provider) nightly:
+
+```bash
+# Upload only the newest dump; the UTC stamps sort chronologically
+DEST=/var/backups/mongo
+LATEST=$(find "$DEST" -name 'marketplace-*.gz' | sort | tail -n 1)
+aws s3 cp "$LATEST" "s3://marketplace-backups/mongo/" \
+  --storage-class STANDARD_IA
+```
+
+Set a 90-day lifecycle policy on the bucket to age out old copies.
+
+### 2.3 Restore
+
+> [!warning] Restoring is **destructive** to the current data. Always practise on a staging clone before doing it for real.
+
+```bash
+# Restore against an empty database (fresh container)
+docker exec -i nickapp-mongodb \
+  mongorestore --archive --gzip --drop \
+  < /var/backups/mongo/marketplace-2026-05-20T030000Z.gz
+
+# Verify: pass the database name as an argument rather than relying on
+# the interactive `use` helper inside --eval
+docker exec nickapp-mongodb mongosh marketplace --quiet \
+  --eval "db.users.countDocuments()"
+```
+
+For partial restore (single collection):
+
+```bash
+docker exec -i nickapp-mongodb \
+  mongorestore --archive --gzip --drop \
+  --nsInclude='marketplace.payments' \
+  < /var/backups/mongo/marketplace-2026-05-20T030000Z.gz
+```
+
+### 2.4 Validate backups
+
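+A monthly drill: restore the latest dump into a throwaway container and run smoke queries. Because `sh -c` replaces the image's default command, no `mongod` is running in that container, so the snippet starts one before restoring:
+
+```bash
+docker run --rm -v "$(pwd)/marketplace-latest.gz:/dump.gz:ro" mongo:8.2 \
+  sh -c "mongod --fork --logpath /tmp/mongod.log --bind_ip 127.0.0.1 \
+    && mongorestore --archive=/dump.gz --gzip \
+    && mongosh --quiet --eval 'db.getMongo().getDBNames()'"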
+```
+
+If validation fails, treat as a sev-2 incident (see [[Incident Response]]).
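+
+Between drills, a cheap integrity check can catch truncated or empty dumps early. The sketch below is not a substitute for a real restore; it assumes the archive is a plain gzip stream and uses the file naming from section 2.1. `check_latest_dump` is a hypothetical helper name.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical pre-check: succeeds (and prints the path) only if the newest
+# dump exists, is non-empty, and is a structurally valid gzip stream.
+check_latest_dump() {
+  local dir=$1 latest
+  latest=$(find "$dir" -name 'marketplace-*.gz' -type f | sort | tail -n 1)
+  [ -n "$latest" ] || { echo "no dump found in $dir" >&2; return 1; }
+  [ -s "$latest" ] || { echo "empty dump: $latest" >&2; return 1; }
+  gzip -t "$latest" 2>/dev/null || { echo "corrupt gzip: $latest" >&2; return 1; }
+  echo "$latest"
+}
+```
+
+For example, `check_latest_dump /var/backups/mongo` could run at the end of the hourly cron job, with a non-zero exit code feeding whatever alerting is in place.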
+
+---
+
+## 3. Redis
+
+Redis data is regeneratable — losing it means logged-out users + cold caches, no business data lost. Still cheap to back up.
+
+### 3.1 Snapshot
+
+```bash
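+# Trigger a background save, wait until it finishes, then copy the file out
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
+until docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO persistence \
+    | grep -q 'rdb_bgsave_in_progress:0'; do
+  sleep 1
+done
+docker cp nickapp-redis:/data/dump.rdb "/var/backups/redis/redis-$(date -u +%FT%H%M%SZ).rdb"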
+```
+
+Daily cron is sufficient.
+
+### 3.2 Restore
+
+```bash
+# Stop redis, drop the RDB into the volume, start
+docker compose -f docker-compose.production.yml stop redis
+docker cp /var/backups/redis/redis-2026-05-20T030000Z.rdb nickapp-redis:/data/dump.rdb
+docker compose -f docker-compose.production.yml start redis
+```
+
+If you've enabled AOF, also copy the AOF files. Since Redis 7 these live in the `appendonlydir/` directory rather than in a single `appendonly.aof`. See [[Database Operations#persistence]].
+
+---
+
+## 4. `uploads/` directory
+
+Stored on the host at `/opt/backend/uploads/` and bind-mounted into both backend and nginx containers. This is where every user upload lives — losing it means broken images, missing dispute evidence, and unhappy users.
+
+### 4.1 Nightly sync
+
+```bash
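+# rsync cannot write to s3:// URLs, so use the AWS CLI.
+# --delete mirrors removals: a file deleted on the host disappears from the backup too.
+aws s3 sync /opt/backend/uploads/ \
+  s3://marketplace-backups/uploads/ --delete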
+
+# Or rclone to any provider
+rclone sync /opt/backend/uploads/ backblaze:marketplace-uploads --transfers 8
+```
+
+Cron:
+
+```
+30 3 * * * /usr/local/bin/backup-uploads.sh >> /var/log/backup-uploads.log 2>&1
+```
+
+### 4.2 Restore
+
+```bash
+aws s3 sync s3://marketplace-backups/uploads/ /opt/backend/uploads/
+# fix ownership for the marketplace container (uid 1001)
+chown -R 1001:1001 /opt/backend/uploads
+```
+
+Restart the backend container so any in-flight uploads find the right directory layout.
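+
+To confirm that a restore brought back what was backed up, a checksum manifest written at backup time can be compared afterwards. This is a sketch under assumptions: it relies on GNU coreutils (`sha256sum`, `sort -z`), the manifest must live outside the uploads tree and be passed as an absolute path, and both function names are hypothetical.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical helpers: record and verify SHA-256 checksums for a directory tree.
+write_manifest() {   # write_manifest <dir> <absolute-manifest-path>
+  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$2"
+}
+
+verify_manifest() {  # verify_manifest <dir> <absolute-manifest-path>
+  # Non-zero exit if any listed file is missing or its checksum differs.
+  ( cd "$1" && sha256sum --check --quiet --strict "$2" )
+}
+```
+
+Files uploaded after the manifest was written are not covered, so a passing check only says that the backed-up set is intact.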
+
+---
+
+## 5. Secrets & configuration
+
+### 5.1 `.env` files
+
+The production `.env` lives at `/opt/backend/.env`. It is **not** version-controlled and **not** in any standard backup. Source of truth: the team password manager (1Password / Bitwarden vault).
+
+After any change:
+
+1. Update the host file.
+2. Update the vault entry with the new value, a one-line "why", and the date.
+3. `docker compose -f docker-compose.production.yml up -d` to apply.
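+
+Before step 3, a quick pre-flight check can catch a key that was dropped or left empty while editing. A minimal sketch, assuming plain `KEY=value` lines; `check_env` is a hypothetical helper, and the keys in the usage line are only the ones this documentation mentions.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical pre-flight check: every named key must be present in the
+# env file with a non-empty value. The exit status is the number of
+# missing keys, so 0 means all were found.
+check_env() {
+  local file=$1 key missing=0
+  shift
+  for key in "$@"; do
+    if ! grep -Eq "^${key}=.+" "$file"; then
+      echo "missing or empty: $key" >&2
+      missing=$((missing + 1))
+    fi
+  done
+  return "$missing"
+}
+```
+
+Usage: `check_env /opt/backend/.env DB_NAME REDIS_PASSWORD && docker compose -f docker-compose.production.yml up -d`.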
+
+### 5.2 SSL certs
+
+If you run a host-level Caddy / Nginx with Let's Encrypt, certs auto-renew. Back up `/var/lib/caddy/.local/share/caddy/` (Caddy) or `/etc/letsencrypt/` (Certbot) — useful if you migrate hosts.
+
+### 5.3 Container registry credentials
+
+`/root/.docker/config.json` on the production host holds the `git.manko.yoga` login Watchtower uses. Recreate after a rebuild:
+
+```bash
+docker login git.manko.yoga -u manawenuz
+```
+
+---
+
+## 6. Disaster recovery runbook
+
+> Scenario: production host is unrecoverable (disk failure, cloud provider lost the VM, etc.).
+
+### Phase 1 — Provision
+
+1. Spin up a new VM matching the previous spec (≥ 4 vCPU, 8 GB RAM, 100 GB SSD).
+2. Install Docker Engine + compose plugin.
+3. Repoint DNS at the new host, or stand up a temporary subdomain (`recovery.amn.gg`).
+
+### Phase 2 — Code
+
+```bash
+cd /opt
+git clone ssh://git@git.manko.yoga:222/nick/backend.git
+git clone ssh://git@git.manko.yoga:222/nick/frontend.git
+cd backend && git checkout main
+```
+
+### Phase 3 — Config
+
+```bash
+# Restore .env from the vault
+nano /opt/backend/.env
+
+# Restore nginx config
+mkdir -p /opt/backend/nginx/logs
+# copy nginx.conf from the vault / repo / your laptop into /opt/backend/nginx/
+```
+
+### Phase 4 — Data
+
+```bash
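+# Mongo. "LATEST" is a placeholder for the newest object key;
+# list candidates with: aws s3 ls s3://marketplace-backups/mongo/
+mkdir -p /var/backups/mongo
+aws s3 cp s3://marketplace-backups/mongo/marketplace-LATEST.gz /var/backups/mongo/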
+
+# Uploads
+mkdir -p /opt/backend/uploads
+aws s3 sync s3://marketplace-backups/uploads/ /opt/backend/uploads/
+chown -R 1001:1001 /opt/backend/uploads
+
+# Redis (optional: empty is fine). Only possible if RDB snapshots are also
+# copied offsite, which section 3 does not set up.
+mkdir -p /var/backups/redis
+aws s3 cp s3://marketplace-backups/redis/redis-LATEST.rdb /var/backups/redis/
+```
+
+### Phase 5 — Start stack
+
+```bash
+cd /opt/backend
+docker login git.manko.yoga -u manawenuz
+docker compose -f docker-compose.production.yml up -d
+# wait ~60s
+docker compose -f docker-compose.production.yml ps
+```
+
+### Phase 6 — Restore data into running containers
+
+```bash
+# Mongo
+docker exec -i nickapp-mongodb \
+  mongorestore --archive --gzip --drop \
+  < /var/backups/mongo/marketplace-LATEST.gz
+
+# Redis
+docker compose -f docker-compose.production.yml stop redis
+docker cp /var/backups/redis/redis-LATEST.rdb nickapp-redis:/data/dump.rdb
+docker compose -f docker-compose.production.yml start redis
+```
+
+### Phase 7 — Verify
+
+```bash
+curl -fsS http://localhost:8083/api/health | jq
+docker exec nickapp-mongodb mongosh marketplace --quiet --eval "db.users.countDocuments()"
+docker compose -f docker-compose.production.yml logs --tail=200 nickapp-backend | grep -E "✅|❌"
+```
+
+### Phase 8 — Restart Watchtower & cut over DNS
+
+```bash
+docker run -d --name watchtower --restart unless-stopped \
+  -v /var/run/docker.sock:/var/run/docker.sock \
+  -v /root/.docker/config.json:/config.json \
+  -e WATCHTOWER_POLL_INTERVAL=300 \
+  -e WATCHTOWER_LABEL_ENABLE=true \
+  containrrr/watchtower
+
+# Update DNS for amn.gg / dev.amn.gg to the new host's IP
+```
+
+### Phase 9 — Post-mortem
+
+Write a post-mortem (template in [[Incident Response#postmortem-template]]) and update this runbook with anything that surprised you.
+
+---
+
+## 7. Quick-reference commands
+
+```bash
+# Mongo dump
+docker exec nickapp-mongodb mongodump --db=marketplace --archive --gzip > backup.gz
+# Mongo restore
+docker exec -i nickapp-mongodb mongorestore --archive --gzip --drop < backup.gz
+
+# Redis snapshot
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
+docker cp nickapp-redis:/data/dump.rdb redis.rdb
+
+# Uploads to S3
+rclone sync /opt/backend/uploads/ s3:marketplace-backups/uploads/
+
+# Restore .env
+# Pull from vault, paste into /opt/backend/.env, docker compose up -d
+```
+
+---
+
+## 8. Testing the plan
+
+> [!tip] Backups are not real until they've been restored. Drill quarterly:
+>
+> 1. Spin up a throwaway VM.
+> 2. Walk Phases 2–7 of the DR runbook with the most recent backups.
+> 3. Time it. If RTO is busted, fix the gap before the next drill.
+> 4. Capture lessons in this file.
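+
+Timing the drill is easier if each phase is wrapped in a small helper that compares elapsed time against a budget. The sketch below assumes bash (it uses the built-in `SECONDS` counter); `timed_step` and the script name in the usage line are hypothetical.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical helper: run a drill step and fail if it errors or
+# takes longer than its budget.
+timed_step() {   # timed_step <budget-seconds> <command...>
+  local budget=$1 start=$SECONDS elapsed
+  shift
+  "$@" || return 1
+  elapsed=$((SECONDS - start))
+  echo "step took ${elapsed}s (budget ${budget}s)"
+  [ "$elapsed" -le "$budget" ]
+}
+```
+
+For instance, `timed_step 3600 ./restore-mongo.sh` would check a restore script against the 1-hour MongoDB RTO from section 1.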
--- a/Operations/CI-CD
+++ b/Operations/CI-CD
@@ -0,0 +1,259 @@
+---
+title: CI-CD Pipeline
+tags: [operations]
+---
+
+# CI/CD Pipeline
+
+How code goes from a push to a running container in production. The CI is **Gitea Actions** running on the same Gitea instance that hosts the repos. The CD is **Watchtower** on the production host (covered in [[Deployment]]).
+
+---
+
+## 1. Where workflows live
+
+| Repo | Path | Files |
+|------|------|-------|
+| Backend | `.gitea/workflows/` | `docker-build-simple.yml`, `docker-build-dev.yml`, `docker-build-no-cache.yml` |
+| Frontend | `.gitea/workflows/` | `deploy.yml`, `devDeploy.yml` |
+
+Gitea Actions speaks the same YAML dialect as GitHub Actions — most third-party actions (`actions/checkout@v4`, `docker/login-action@v3`, `docker/build-push-action@v5`) work unchanged.
+
+---
+
+## 2. Required secrets
+
+Configured per repo at **Settings → Actions → Secrets**.
+
+| Secret | Repo | Purpose |
+|--------|------|---------|
+| `GITEATOKEN` | both | Personal access token for the `manawenuz` user with `write:packages` scope. Used by every workflow to log into the container registry at `git.manko.yoga`. |
+| `SENTRY_AUTH_TOKEN` | frontend | (Optional) For source-map upload during Next.js build. Skipped if absent. |
+
+The registry itself is implicit: `git.manko.yoga` with `manawenuz` as the user. Image paths are `git.manko.yoga/manawenuz/<image>`.
+
+> [!warning] If `GITEATOKEN` expires or is rotated, all workflows fail at the `docker/login-action` step. Rotate proactively (annual reminder).
+
+---
+
+## 3. Backend workflows
+
+### `docker-build-simple.yml` — manual build
+
+```yaml
+name: Manual Build and Push Docker Image
+
+on:
+  workflow_dispatch:
+    inputs:
+      version:
+        description: 'Version to build (leave empty for package.json)'
+        required: false
+        type: string
+```
+
+- **Trigger.** Manual only (via Gitea UI → Actions → "Run workflow").
+- **Steps.** Checkout → buildx → `docker login` → read version (input or `package.json`) → build `Dockerfile.prod` → push tags `:<version>` and `:dev` → echo result.
+- **When to use.** Cutting an ad-hoc build of a specific commit without merging to a branch. The `:dev` tag is overwritten — production (`:latest`) is **not** touched.
+- **Cache.** Uses `type=gha` cache to speed up subsequent runs.
+
+### `docker-build-dev.yml` — dev branch auto-build
+
+```yaml
+on:
+  push:
+    branches: [ development ]
+    tags: [ 'v*' ]
+```
+
+- **Trigger.** Every push to `development` and every tag matching `v*`.
+- **Tags pushed.** `:dev-<package-version>` + moving `:dev`.
+- **Effect.** Refreshes the dev image. The production Watchtower **does not** watch `:dev`, so this is safe to push as often as you want.
+
+### `docker-build-no-cache.yml` — production build
+
+```yaml
+on:
+  push:
+    branches: [ main, master ]
+    tags: [ 'v*' ]
+```
+
+- **Trigger.** Every push to `main` (or `master`) and every `v*` tag.
+- **Tags pushed.** `:<package-version>` + moving `:latest`.
+- **Effect.** Watchtower polls `:latest`, detects the new digest, restarts `nickapp-backend` on the production host. See [[Deployment#routine-deploy]].
+- **No cache.** As the file name suggests, the workflow passes neither `cache-from` nor `cache-to`, so each build starts from scratch. Slower (~5–8 min) but eliminates a class of stale-layer bugs. The `simple` workflow uses GHA cache for speed.
+
+> [!tip] If you need to invalidate a cached layer in the `simple` workflow, run `no-cache` once — the resulting tag overwrites the registry digest and `simple`'s next run will start from a cleaner base.
+
+---
+
+## 4. Frontend workflows
+
+Both workflows share the same shape: spin up a `node:22` container, run a deploy shell script that does `docker login + build + push`.
+
+### `deploy.yml` — production
+
+```yaml
+on:
+  push:
+    branches: [ main, master ]
+  workflow_dispatch:
+```
+
+Calls `./scripts/deploy.sh` — see [[Scripts#deployment]]. The script:
+
+1. Reads `package.json` version.
+2. `docker login git.manko.yoga -u manawenuz -p $GITEATOKEN`.
+3. Builds `git.manko.yoga/manawenuz/escrow-frontend:<version>` and `:latest` from `Dockerfile`.
+4. Pushes both tags.
+
+`:latest` is what production Watchtower watches → live deploy follows automatically.
+
+### `devDeploy.yml` — development branch
+
+Same as `deploy.yml` but triggered on `development` and runs `./scripts/deployDev.sh`, which pushes only `:dev`.
+
+---
+
+## 5. End-to-end timeline (production deploy)
+
+```
+t=0      Developer merges PR → main
+t+5s     Gitea webhook fires
+t+10s    Gitea Actions runner pulls repo, starts container
+t+30s    docker/setup-buildx-action initialised
+t+45s    docker/login-action authenticated
+t+2-5m   docker/build-push-action builds Dockerfile.prod
+t+5m     Push to git.manko.yoga/manawenuz/escrow-backend:latest
+t+5m+    Watchtower (next poll, up to 5 min) detects new digest
+t+10m    Watchtower stops old container, starts new one
+t+10m40s start_period=40s elapses, healthcheck passes
+t+11m    Nginx routes traffic to the new container
+```
+
+**Typical lead time: 10–12 minutes from merge to live.** For an emergency rollback see [[Deployment#roll-back]].
+
+---
+
+## 6. Versioning automation
+
+Tied to `backend/scripts/auto-version.sh` + `ai-enhanced.sh` (and the frontend mirror). Full reference in [[Git Workflow#versioning]] and [[Scripts#auto-version-sh]].
+
+In short:
+
+```bash
+# Developer side, on the branch they're releasing:
+npm run smart-release
+# → AI analyses last commit, picks bump (major/minor/patch/skip)
+# → bumps package.json
+# → commits "chore: bump version to vX.Y.Z"
+# → tags vX.Y.Z
+# → git push && git push --tags
+```
+
+The push to `main` (or the `v*` tag) then triggers `docker-build-no-cache.yml`, which:
+
+- Reads the new version from `package.json` (`node -p "require('./package.json').version"`)
+- Builds and pushes `:<version>` + `:latest`
+
+So the **image tag** (`X.Y.Z`, read from `package.json`) and the **git tag** (`vX.Y.Z`) carry the same version, which makes them easy to correlate when investigating an issue.
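+
+Since the image tag comes from `package.json` and the git tag from the release script, a CI guard can confirm that the two agree before a build is pushed. A minimal sketch; `tag_matches_version` is a hypothetical helper and is not part of the existing workflows.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical CI guard: succeed only if a git tag such as v2.6.3 matches
+# the package.json version 2.6.3 (a leading "v" is the only allowed difference).
+tag_matches_version() {   # tag_matches_version <git-tag> <package-version>
+  local tag=$1 version=$2
+  [ "${tag#v}" = "$version" ]
+}
+```
+
+In a workflow step this could be called as `tag_matches_version "$(git describe --tags --exact-match)" "$(node -p "require('./package.json').version")"`.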
+
+---
+
+## 7. Adding tests to the pipeline
+
+The workflows today only build + push; they do **not** run Jest or Playwright. To gate releases on tests, add a `test` job before the build:
+
+```yaml
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    container: node:22
+    steps:
+      - uses: actions/checkout@v4
+      - run: yarn install --frozen-lockfile
+      - run: yarn lint
+      - run: yarn test --ci --runInBand
+      - run: yarn test:e2e   # if a service container is available
+
+  build-and-push:
+    needs: test
+    runs-on: ubuntu-latest
+    # ...existing steps...
+```
+
+Or run lint + typecheck as a pre-gate using a separate workflow that triggers on PR opened/synchronised.
+
+---
+
+## 8. Inspecting a build
+
+In Gitea: **Actions → workflow → run** to see real-time logs.
+
+Useful CLI for the registry from your laptop:
+
+```bash
+# List images and tags
+curl -s -u "manawenuz:$GITEATOKEN" \
+  "https://git.manko.yoga/v2/manawenuz/escrow-backend/tags/list" | jq
+
+# Pull a specific tag
+docker login git.manko.yoga -u manawenuz
+docker pull git.manko.yoga/manawenuz/escrow-backend:2.6.3
+```
+
+---
+
+## 9. Self-hosted runner notes
+
+Gitea Actions can use either built-in `act_runner` or your own. Currently the workflows are written for `runs-on: ubuntu-latest`, which the act_runner supplies via a generic Ubuntu container. If you need:
+
+- More CPU/RAM for builds → register a beefier self-hosted runner and change `runs-on:` to its label.
+- A Docker-in-Docker setup (frontend `deploy.yml` does this with `options: --privileged`) — confirm the runner trusts the workflow.
+
+---
+
+## 10. Failure modes & remediation
+
+| Failure | Most likely cause | Fix |
+|---------|------------------|-----|
+| `unauthorized: authentication required` at push | `GITEATOKEN` expired or lacks `write:packages` | Rotate the token, update the repo secret |
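+| `Cannot perform an interactive login from a non TTY device` | Old docker-login-action version, or an empty `GITEATOKEN` causing `docker login` to fall back to a password prompt | Bump to `docker/login-action@v3`; confirm the secret is set |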
+| Build hangs at `yarn install` | npm registry timeout | Increase `network-timeout` (already 600000); re-run |
+| Image pushed but Watchtower doesn't roll | Watchtower can't reach the registry | `docker logs watchtower`; verify `/root/.docker/config.json` is mounted into the container |
+| New container fails healthcheck | App crash on boot | `docker logs nickapp-backend`; check env vars, follow [[Incident Response]] |
+| Multi-arch warnings about platform | Build runner is arm64 but prod is amd64 | Set `platforms: linux/amd64` in the `docker/build-push-action` inputs |
+| Image size grew suddenly | Dev dep crept into prod stage | Audit `Dockerfile.prod` for missing `--production` flag in the runtime stage |
+
+---
+
+## 11. Pipeline diagram
+
+```
+                Push to development                          Push to main
+                       │                                          │
+                       ▼                                          ▼
+       ┌───────────────────────────┐            ┌───────────────────────────┐
+       │ docker-build-dev.yml      │            │ docker-build-no-cache.yml │
+       │ (backend)                 │            │ (backend)                 │
+       │ devDeploy.yml (frontend)  │            │ deploy.yml (frontend)     │
+       └───────────────┬───────────┘            └───────────────┬───────────┘
+                       │                                        │
+                push :<version>,:dev               push :<version>,:latest
+                       │                                        │
+                       ▼                                        ▼
+            git.manko.yoga/manawenuz/...:dev      git.manko.yoga/manawenuz/...:latest
+                       │                                        │
+                       │                                        ▼
+                       │                            ┌──────────────────────────┐
+                       │                            │       Watchtower         │
+                       │                            │ (poll every 5 minutes)   │
+                       │                            └──────────────┬───────────┘
+                       │                                           │
+                  manual pull on staging                      restart containers
+                                                                   │
+                                                                   ▼
+                                                              Production live
+```
+
+Cross-links: [[Deployment]] for what happens on the host, [[Git Workflow]] for what happens upstream, [[Scripts]] for the deploy shell scripts.
--- a/Operations/Database
+++ b/Operations/Database
@@ -0,0 +1,301 @@
+---
+title: Database Operations
+tags: [operations]
+---
+
+# Database Operations
+
+Day-to-day operations for the two stateful services: **MongoDB 8.2** (primary data store) and **Redis 8** (cache, rate-limit counters, ephemeral session data).
+
+For schema details see [[Data Models]]. For backup procedures and disaster recovery see [[Backup & Recovery]].
+
+---
+
+## 1. MongoDB
+
+### 1.1 Connection
+
+| Env | URI in compose | Auth |
+|-----|---------------|------|
+| Dev | `mongodb://mongodb:27017` | none |
+| Prod | `mongodb://mongodb:27017` (private network) or with creds via `.env` | typically none on the private network, but enable `--auth` if exposed |
+
+The DB name comes from `DB_NAME` (e.g. `marketplace`). See [[Environment Variables#database]].
+
+Connect from a shell inside the host:
+
+```bash
+# Dev
+docker exec -it nickdev-mongodb mongosh
+
+# Prod
+docker exec -it nickapp-mongodb mongosh
+> use marketplace
+> show collections
+```
+
+If auth is enabled:
+
+```bash
+docker exec -it nickapp-mongodb mongosh \
+  -u "$MONGO_INITDB_ROOT_USERNAME" -p "$MONGO_INITDB_ROOT_PASSWORD" \
+  --authenticationDatabase admin
+```
+
+### 1.2 Init scripts (`mongo-init/`)
+
+The production compose bind-mounts `./mongo-init` into `/docker-entrypoint-initdb.d`. Mongo runs `*.js` and `*.sh` from this folder **only on a fresh datadir** (first boot of a new volume). Use this to:
+
+- Create application users (`db.createUser({...})`)
+- Bootstrap collections + indexes that must exist before the app starts
+
+Example `mongo-init/01-create-user.js`:
+
+```js
+db = db.getSiblingDB('marketplace');
+db.createUser({
+  user: 'marketplace_app',
+  pwd: process.env.MARKETPLACE_APP_PWD,
+  roles: [{ role: 'readWrite', db: 'marketplace' }],
+});
+```
+
+> [!warning] These scripts do **not** run when you restart an existing container. To force re-init, drop the `mongodb_data` volume — which destroys all data. Plan accordingly.
+
+### 1.3 Indexes
+
+Indexes are declared in Mongoose schemas under `backend/src/models/`. The app calls `Model.createIndexes()` on connection (via the model's `syncIndexes`/`ensureIndexes` lifecycle). Highlights:
+
+| Collection | Key indexes |
+|------------|-------------|
+| `users` | `email` (unique), `googleId` (sparse), `role`, `createdAt` |
+| `addresses` | `userId` + compound for primary lookup |
+| `purchaserequests` | `buyerId`, `status`, `createdAt`, text index on `title`+`description` |
+| `selleroffers` | `requestId`, `sellerId`, `status` |
+| `payments` | `providerPaymentId` (unique sparse), `userId`, `status`, `createdAt`, `transactionHash` |
+| `chats` | `participants` (array), `updatedAt` |
+| `notifications` | `userId` + `read`, `createdAt` |
+| `tempverifications` | TTL on `expiresAt` (auto-deletes expired OTPs) |
+
+To verify a specific collection:
+
+```js
+db.payments.getIndexes()
+```
+
+The preferred way to add an index is to declare it in the Mongoose schema and ship a deploy. For emergency hotfixes, create it directly in the shell:
+
+```js
+db.payments.createIndex({ providerPaymentId: 1 }, { unique: true, sparse: true });
+```
+
+### 1.4 TTL indexes
+
+Currently used on `tempverifications.expiresAt` (5-minute auto-purge of email OTPs / passkey challenges). Mongo's TTL monitor runs every 60 seconds — purge isn't immediate.
+
+If you add more TTL indexes:
+
+```js
+db.notifications.createIndex({ createdAt: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 90 });  // 90 days
+```
+
+### 1.5 Backup with `mongodump`
+
+```bash
+# Dump inside the container, then copy the archive out.
+# Compute the stamp once so both commands agree, even across midnight.
+STAMP=$(date +%F)
+docker exec nickapp-mongodb \
+  mongodump --db=marketplace --archive="/tmp/marketplace-$STAMP.archive" --gzip
+docker cp "nickapp-mongodb:/tmp/marketplace-$STAMP.archive" ./backups/
+
+# Or stream directly to host
+docker exec nickapp-mongodb \
+  mongodump --db=marketplace --archive --gzip \
+  > ./backups/marketplace-$(date +%F).gz
+```
+
+For full details (retention, RTO/RPO, offsite copies) see [[Backup & Recovery]].
+
+### 1.6 Restore
+
+```bash
+# Restore an archive to an empty database
+docker exec -i nickapp-mongodb \
+  mongorestore --archive --gzip --drop \
+  < ./backups/marketplace-2026-05-20.gz
+```
+
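+`--drop` drops each collection before restoring. Omit it to insert into the existing collections instead; documents whose `_id` already exists are skipped, not updated.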
+
+> [!warning] Restoring is **destructive** to current data. Always practise on a staging clone first.
+
+### 1.7 Migrations
+
+There is no formal migration framework. Two patterns are used:
+
+- **Mongoose schema changes** are forward-compatible (new optional fields default to `undefined`). Older documents will still load.
+- **Data backfills** are one-shot scripts in `backend/src/scripts/` (e.g. `migrateUserPoints.ts`, `fix-transaction-hashes.js`, `fix-dispute-sellers.js`).
+
+Pattern for a new migration:
+
+1. Add a `src/seeds/migrate<Thing>.ts` script that is idempotent (use `$exists: false` guards).
+2. Run on staging, confirm.
+3. Take a backup ([[Backup & Recovery]]).
+4. Run in production: `docker exec -it nickapp-backend node dist/seeds/migrate<Thing>.js`.
+5. Commit the script (it serves as a record of what changed).
+
+### 1.8 Common admin queries
+
+```js
+// Count documents matching a filter
+db.users.countDocuments({ role: 'buyer' })
+
+// Uncompressed data size of one collection, in MB (use .storageSize for on-disk size)
+db.runCommand({ collStats: 'payments', scale: 1024*1024 }).size
+
+// Slow queries
+db.setProfilingLevel(1, { slowms: 200 })   // log queries > 200ms
+db.system.profile.find().sort({ ts: -1 }).limit(10)
+
+// Lock contention
+db.serverStatus().locks
+```
+
+### 1.9 Seeding production safely
+
+Seed scripts are designed to be idempotent for **categories** but **destructive** for users/addresses. Don't run `seed:all` in production.
+
+Safe in production:
+
+```bash
+docker exec -it nickapp-backend node dist/seeds/seedCategories.js
+docker exec -it nickapp-backend node dist/seeds/seedLevels.js
+```
+
+Optional auto-seed on startup: set `AUTO_SEED_ON_START=true` in `.env`. The bootstrap code only seeds when no non-admin users exist — safe to leave on.
+
+> [!warning] **Never** run `seed:all` or `seed:users` against production. They drop the existing `users` and `addresses` collections.
+
+---
+
+## 2. Redis
+
+### 2.1 Connection
+
+Dev: `redis://redis:6379` (no password).
+Prod: `redis://:<REDIS_PASSWORD>@redis:6379`. The compose command line is `redis-server --requirepass "$REDIS_PASSWORD"`.
+
+Inspect:
+
+```bash
+docker exec -it nickapp-redis redis-cli -a "$REDIS_PASSWORD"
+> INFO server
+> DBSIZE
+> KEYS *           # prod-unsafe on large datasets, use SCAN
+```
+
+### 2.2 What we store
+
+- **Rate-limit counters** for `express-rate-limit`
+- **Session data** for refresh-token tracking and revocation lists
+- **Socket.IO adapter state** (when scaled horizontally — currently single-node)
+- **Application caches** (TTL'd keys for expensive aggregates)
+- **Idempotency keys** for webhook deduplication
+
+Key prefixes follow `<service>:<entity>:<id>`. E.g. `payment:idem:<requestId>`, `auth:refresh:<userId>`.
+
+### 2.3 Persistence
+
+Redis 8 defaults to **RDB snapshots** + optional **AOF**. Our compose uses the default config:
+
+- RDB snapshot triggers: `save 3600 1`, `save 300 100`, `save 60 10000`.
+- AOF is **disabled** by default.
+- RDB file lives at `/data/dump.rdb` inside the `redis_data` volume.
+
+**To enable AOF** for stronger durability, override the command in `docker-compose.production.yml`:
+
+```yaml
+redis:
+  command: ["sh","-lc","redis-server --requirepass \"$${REDIS_PASSWORD}\" --appendonly yes --appendfsync everysec"]
+```
+
+`appendfsync everysec` is the common compromise: at most 1 second of writes lost on crash, with negligible perf impact.
+
+### 2.4 Eviction policy
+
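+Default is `noeviction`: once `maxmemory` is reached, Redis refuses writes. For our use (mostly caches that can be regenerated), set a memory limit and an LRU policy. Be aware that `allkeys-lru` can also evict the session and idempotency keys listed in 2.2; `volatile-lru` restricts eviction to keys that carry a TTL. To apply: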
+
+```bash
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" \
+  CONFIG SET maxmemory 256mb
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" \
+  CONFIG SET maxmemory-policy allkeys-lru
+```
+
+Persist by adding to a custom `redis.conf` mounted at `/usr/local/etc/redis/redis.conf` (then change the compose `command:` to `["redis-server","/usr/local/etc/redis/redis.conf","--requirepass",...]`).
+
+### 2.5 Backup
+
+Redis backups are usually unnecessary (the data is regeneratable) but still cheap:
+
+```bash
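+# Snapshot now, wait for the background save to finish, then copy it out
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
+until docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO persistence \
+    | grep -q 'rdb_bgsave_in_progress:0'; do sleep 1; done
+docker cp nickapp-redis:/data/dump.rdb ./backups/redis-$(date +%F).rdb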
+```
+
+`BGSAVE` is non-blocking (it forks). With AOF enabled, also copy the `/data/appendonlydir/` directory; Redis 7 and later store the AOF there as multiple files.
+
+### 2.6 Cache flush
+
+When deploying breaking changes to cached schemas:
+
+```bash
+# Flush everything (DEV ONLY)
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL
+
+# Targeted (safer)
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" \
+  --scan --pattern 'payment:idem:*' | \
+  xargs -r -L 1 docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" DEL
+```
+
+> [!warning] `FLUSHALL` will sign out every user with an active refresh token and reset every rate-limit counter. Avoid in production unless that is what you want.
+
+### 2.7 Monitoring
+
+```bash
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO memory
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" SLOWLOG GET 10
+```
+
+Watch `evicted_keys`, `keyspace_misses`, `rejected_connections` — see [[Monitoring]] for thresholds.
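+
+To pull just those three counters out of the `INFO stats` output, a small filter along these lines may help. It is a sketch: `redis_watch_counters` is a hypothetical helper that reads the raw `INFO` text on stdin and strips the carriage returns Redis appends to each line.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical filter: read `INFO stats` output on stdin and print the
+# three counters named above as "name=value" lines.
+redis_watch_counters() {
+  tr -d '\r' | awk -F: '
+    $1 == "evicted_keys" || $1 == "keyspace_misses" || $1 == "rejected_connections" {
+      print $1 "=" $2
+    }'
+}
+```
+
+Usage: `docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats | redis_watch_counters`.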
+
+---
+
+## 3. Maintenance windows
+
+For both DBs, schedule a window when:
+
+- Bumping major version (Mongo 8 → 9, Redis 8 → 9)
+- Restoring from backup
+- Running a destructive migration
+
+Suggested checklist:
+
+1. Announce in #ops Slack / status page.
+2. Trigger `mongodump` (see [[Backup & Recovery]]).
+3. Stop the backend container so writes stop: `docker compose -f docker-compose.production.yml stop nickapp-backend`.
+4. Perform the operation.
+5. Restart backend: `docker compose -f docker-compose.production.yml start nickapp-backend`.
+6. Verify health: `curl https://amn.gg/api/health`.
+7. Close window.
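+
+Steps 3 to 5 can be wrapped so that the backend is started again even when the operation itself fails. A minimal sketch, assuming the compose file and service name used elsewhere on this page; `with_backend_stopped` and the `COMPOSE` variable are illustrative names, not existing tooling.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: COMPOSE can be overridden, e.g. for a dry run with COMPOSE=echo.
+COMPOSE=${COMPOSE:-docker compose -f docker-compose.production.yml}
+
+# Run the given command while the backend is stopped. The backend is
+# started again afterwards whether or not the command succeeded, and the
+# command's exit status is passed through.
+with_backend_stopped() {
+  local rc=0
+  $COMPOSE stop nickapp-backend || return 1
+  "$@" || rc=$?
+  $COMPOSE start nickapp-backend || return 1
+  return "$rc"
+}
+```
+
+For example, `with_backend_stopped ./run-migration.sh` (a hypothetical script) followed by the health check from step 6.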
+
+---
+
+## 4. Cross-links
+
+- [[Backup & Recovery]] — formal backup/restore procedures, RTO/RPO targets, offsite storage.
+- [[Monitoring]] — what metrics to watch (slow queries, evictions, replication lag).
+- [[Incident Response]] — runbooks for "MongoDB unreachable" and "Redis unreachable".
+- [[Data Models]] — schema details for every collection.
--- a/Operations/Deployment.md
+++ b/Operations/Deployment.md
@@ -0,0 +1,255 @@
+---
+title: Deployment
+tags: [operations]
+---
+
+# Deployment
+
+How the production stack runs and gets updated on the live host. The stack is fully containerised and self-updates via Watchtower from the Gitea container registry.
+
+---
+
+## 1. Topology
+
+```
+                                  ┌─────────────────────────┐
+            HTTPS  443 ──────────►│   External SSL term.    │
+                                  │ (DNS amn.gg, dev.amn.gg)│
+                                  └────────────┬────────────┘
+                                               │ HTTP 80 (in-VPC)
+                                               ▼
+                            ┌──────────────────────────────────┐
+                            │           Nginx container        │
+                            │       (nickapp-nginx, port 80)   │
+                            └─┬───────────────────┬────────────┘
+                              │                   │
+                              │ /                 │ /api  /socket.io
+                              ▼                   ▼
+                ┌─────────────────────┐  ┌──────────────────────────┐
+                │ nickapp-frontend    │  │ nickapp-backend          │
+                │ Next.js, port 8083  │  │ Express 5, port 5001     │
+                └─────────────────────┘  └──────┬────────────┬──────┘
+                                                │            │
+                                                ▼            ▼
+                                         ┌──────────┐  ┌──────────┐
+                                         │ mongodb  │  │  redis   │
+                                         │  8.2     │  │  8       │
+                                         └──────────┘  └──────────┘
+
+                            ┌──────────────────────────────────┐
+                            │            Watchtower            │
+                            │ Polls registry → restarts        │
+                            │ containers labelled enable=true  │
+                            └──────────────────────────────────┘
+```
+
+All containers run on the **`default`** Docker network defined by `docker-compose.production.yml`. Watchtower runs as a sidecar container on the same host.
+
+DNS resolves both `amn.gg` and `dev.amn.gg` to the production host's public IP. SSL termination happens **outside** the compose stack (typically via the hosting provider's edge or a host-level reverse proxy), and traffic is forwarded as HTTP to the `nginx` container on port `80` (mapped to host `8083`).
+
+---
+
+## 2. Compose file
+
+`backend/docker-compose.production.yml` is the single source of truth. Services:
+
+| Service | Image | Ports | Volumes | Notes |
+|---------|-------|-------|---------|-------|
+| `nginx` | `nginx:alpine` | `8083:80` | `./nginx/nginx.conf`, `./nginx/logs`, `./uploads` (served as `/uploads`) | Reverse proxy |
+| `nickapp-backend` | `nickapp-backend:latest` (build from `Dockerfile.prod`) | not exposed externally | `./uploads:/app/uploads` | Labelled for Watchtower |
+| `nickapp-frontend` | `nickapp-frontend:latest` (build from `../frontend/Dockerfile`) | `expose: 8083` | — | Labelled for Watchtower |
+| `mongodb` | `mongo:8.2` | not exposed | `mongodb_data:/data/db`, `./mongo-init:/docker-entrypoint-initdb.d` | Healthcheck via `mongosh ping` |
+| `redis` | `redis:8-alpine` | not exposed | `redis_data:/data` | Started with `--requirepass "$REDIS_PASSWORD"` |
+
+Healthchecks are configured for backend (`curl /health`), frontend (`curl /`), Mongo (`mongosh ping`), and Redis (`redis-cli -a $REDIS_PASSWORD ping`). See [[Monitoring]].
+
+Watchtower polls images labelled `com.centurylinklabs.watchtower.enable=true` — currently `nickapp-backend` and `nickapp-frontend`. MongoDB and Redis are **not** auto-updated.
+
+---
+
+## 3. Registry & images
+
+| Image | Registry path |
+|-------|---------------|
+| Backend prod | `git.manko.yoga/manawenuz/escrow-backend:latest` |
+| Backend dev | `git.manko.yoga/manawenuz/escrow-backend:dev` |
+| Backend tagged | `git.manko.yoga/manawenuz/escrow-backend:<package-version>` |
+| Frontend | `git.manko.yoga/manawenuz/escrow-frontend:latest` and `:<version>` |
+
+`docker-compose.production.yml` currently builds locally on first up (`build: context: .`). Once images are in the registry the file can be switched to `image: git.manko.yoga/manawenuz/escrow-backend:latest` to let Watchtower pull straight from there.
+
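+> [!tip] To pin a specific version while debugging, edit the compose file to `image: git.manko.yoga/manawenuz/escrow-backend:2.6.3` and re-run `docker compose -f docker-compose.production.yml up -d`. Watchtower follows the tag a container was started from, so a pinned tag is only replaced if that same tag is re-pushed; remove the Watchtower label or stop Watchtower if the container must stay untouched.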
+
+---
+
+## 4. Watchtower
+
+Watchtower runs as its own container (managed outside the compose file) with `WATCHTOWER_LABEL_ENABLE=true` so it only touches services that opt in. On each poll cycle (every 300 seconds on this host, set via `WATCHTOWER_POLL_INTERVAL`; Watchtower's built-in default is 24 hours) it:
+
+1. Pulls the latest digest for each enabled service's image.
+2. Compares to the running container's digest.
+3. If different, stops the container, removes it, and starts a new one from the new image, preserving all named volumes.
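+
+The comparison in steps 1 and 2 can be approximated by hand when you want to know whether a poll would change anything. This is a sketch, not Watchtower's implementation: it assumes the service runs a registry image (not a locally built one), and `image_is_stale` and `check_container` are hypothetical helper names.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: a manual version of "is there a newer image for this container?"
+
+# Pure comparison: succeeds (exit 0) when both IDs are set and differ.
+image_is_stale() {
+  local running_id="$1" current_id="$2"
+  [ -n "$running_id" ] && [ -n "$current_id" ] && [ "$running_id" != "$current_id" ]
+}
+
+# Pulls the tag the container was started from, then compares image IDs.
+check_container() {
+  local container="$1" image running_id current_id
+  image=$(docker inspect --format '{{.Config.Image}}' "$container")
+  docker pull --quiet "$image" >/dev/null
+  running_id=$(docker inspect --format '{{.Image}}' "$container")
+  current_id=$(docker image inspect --format '{{.Id}}' "$image")
+  if image_is_stale "$running_id" "$current_id"; then
+    echo "$container: update available"
+  else
+    echo "$container: up to date"
+  fi
+}
+```
+
+Usage on the host: `check_container nickapp-backend`. Unlike Watchtower, the sketch never restarts anything.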
+
+Configuration knobs typically set on the host:
+
+```bash
+# The config.json mount lets Watchtower pull from the private Gitea registry.
+docker run -d --name watchtower \
+  --restart unless-stopped \
+  -v /var/run/docker.sock:/var/run/docker.sock \
+  -v /root/.docker/config.json:/config.json \
+  -e WATCHTOWER_POLL_INTERVAL=300 \
+  -e WATCHTOWER_LABEL_ENABLE=true \
+  -e WATCHTOWER_CLEANUP=true \
+  -e WATCHTOWER_INCLUDE_RESTARTING=true \
+  containrrr/watchtower
+```
+
+The `~/.docker/config.json` must have a valid login for `git.manko.yoga` (created via `docker login git.manko.yoga -u manawenuz`).
+
+---
+
+## 5. First-time deploy (cold start)
+
+> [!warning] Run these steps on a fresh production host. They are destructive on an existing one. See [[Backup & Recovery]] before touching live data.
+
+### Prerequisites on the host
+
+- Ubuntu 22.04+ (or any systemd Linux), Docker Engine 24+, `docker compose` plugin
+- `git` installed
+- DNS `amn.gg` + `dev.amn.gg` already pointing here
+- An SSL terminator (Caddy / Nginx / Cloudflare) reverse-proxying to host port `8083`
+- Registry login: `docker login git.manko.yoga -u manawenuz`
+
+### Steps
+
+```bash
+# 1. Clone both repos as siblings (compose references ../frontend)
+cd /opt
+git clone ssh://git@git.manko.yoga:222/nick/backend.git
+git clone ssh://git@git.manko.yoga:222/nick/frontend.git
+cd backend
+git checkout main
+
+# 2. Create the production .env
+sudo nano .env       # fill from Environment Variables doc; production values, real secrets
+
+# 3. Provision the nginx config + uploads dir
+mkdir -p nginx/logs uploads mongo-init
+sudo cp /path/to/nginx.conf nginx/nginx.conf
+# (the nginx.conf forwards /api/* and /socket.io/* to nickapp-backend:5001,
+#  forwards /uploads/* to /uploads (volume), and everything else to nickapp-frontend:8083)
+
+# 4. Build & start the stack
+docker compose -f docker-compose.production.yml up --build -d
+
+# 5. Verify
+docker compose -f docker-compose.production.yml ps
+docker compose -f docker-compose.production.yml logs -f --tail=200
+curl -fsS http://localhost:8083/api/health | jq .
+
+# 6. Seed initial data (optional — if AUTO_SEED_ON_START=true is set, it's already done)
+docker compose -f docker-compose.production.yml exec nickapp-backend node dist/scripts/seedCategories.js
+
+# 7. Start Watchtower (one-time)
+docker run -d --name watchtower --restart unless-stopped \
+  -v /var/run/docker.sock:/var/run/docker.sock \
+  -v /root/.docker/config.json:/config.json \
+  -e WATCHTOWER_POLL_INTERVAL=300 \
+  -e WATCHTOWER_LABEL_ENABLE=true \
+  -e WATCHTOWER_CLEANUP=true \
+  containrrr/watchtower
+```
+
+### SSL / TLS
+
+Termination happens at the edge — outside the compose stack. The two common setups:
+
+- **Caddy on the host** forwarding `amn.gg` and `dev.amn.gg` to `127.0.0.1:8083`. Caddy handles Let's Encrypt automatically.
+- **Cloudflare Full (strict)** in front of the host. Use Cloudflare Origin certificates on the host's Caddy/Nginx.
+
+Either way, the compose stack itself sees only HTTP on port 80 inside the nginx container. The `nginx.conf` should set `proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto` and the backend already trusts the proxy when `NODE_ENV=production` (see `trust proxy` block in `src/app.ts`).
+
+---
+
+## 6. Routine deploy (after first deploy)
+
+The normal flow is **fully automatic**:
+
+1. Developer merges PR to `main` (see [[Git Workflow]]).
+2. Gitea Actions runs `.gitea/workflows/docker-build-no-cache.yml` (backend) or `deploy.yml` (frontend). The workflow builds the production image and pushes `:latest` + `:<version>` to the registry. See [[CI-CD Pipeline]].
+3. Watchtower polls the registry, sees a new digest, restarts the container.
+4. The new container's healthcheck passes (failures during the `start_period=40s` grace window are not counted) and traffic resumes.
+
+Total time from merge to live: **5–10 minutes** depending on Watchtower poll interval and image size.
+
+### Force an immediate deploy
+
+If you don't want to wait for the poll:
+
+```bash
+# On the production host:
+cd /opt/backend
+docker login git.manko.yoga -u manawenuz       # if creds expired
+docker compose -f docker-compose.production.yml pull nickapp-backend nickapp-frontend
+docker compose -f docker-compose.production.yml up -d nickapp-backend nickapp-frontend
+```
+
+The `up -d` will detect changed images and restart only the affected containers.
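+
+After a forced deploy it helps to wait for Docker's own healthcheck verdict instead of watching logs. The sketch below polls `docker inspect` until a container reports `healthy`; `container_health` and `wait_healthy` are hypothetical helper names, and the attempt count and delay are assumptions to tune.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: block until a container's Docker healthcheck reports "healthy".
+
+# Prints the health status of a container (healthy, unhealthy, starting).
+container_health() {
+  docker inspect --format '{{.State.Health.Status}}' "$1"
+}
+
+# wait_healthy <container> [attempts] [delay_seconds] [status_command]
+# status_command defaults to container_health; it is a parameter only so the
+# loop can be exercised without Docker.
+wait_healthy() {
+  local container="$1" attempts="${2:-30}" delay="${3:-2}" probe="${4:-container_health}"
+  local i=1 status=""
+  while [ "$i" -le "$attempts" ]; do
+    status=$("$probe" "$container" 2>/dev/null || true)
+    if [ "$status" = "healthy" ]; then
+      echo "$container is healthy after $i check(s)"
+      return 0
+    fi
+    i=$((i + 1))
+    sleep "$delay"
+  done
+  echo "$container did not become healthy (last status: ${status:-unknown})" >&2
+  return 1
+}
+```
+
+Usage: `wait_healthy nickapp-backend 30 2 && wait_healthy nickapp-frontend 30 2`. A non-zero exit is a reasonable trigger for the roll-back steps.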
+
+### Roll back
+
+```bash
+# Find available versions
+docker images git.manko.yoga/manawenuz/escrow-backend
+
+# Pin to the previous tag in the compose file
+sed -i 's|escrow-backend:latest|escrow-backend:2.6.2|' docker-compose.production.yml
+
+# Re-up
+docker compose -f docker-compose.production.yml up -d nickapp-backend
+
+# Pause Watchtower until you're ready to resume auto-updates
+docker stop watchtower
+```
+
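+> [!warning] Watchtower keeps tracking whatever tag a container was started from. A pinned version tag is only replaced if that same tag is re-pushed, but any service still on `:latest` keeps updating. To freeze everything during a rollback, pause Watchtower (`docker stop watchtower`) and start it again (`docker start watchtower`) when you are ready.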
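+
+Choosing the tag to pin can be scripted. The sketch below assumes tags follow a plain `MAJOR.MINOR.PATCH` scheme (an assumption; other tags such as `latest` are ignored) and relies on GNU `sort -V`. `previous_tag` is a hypothetical helper name.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: pick the newest version tag that sorts below the current one.
+
+# previous_tag <current-version>: reads tags on stdin, one per line.
+# Prints nothing when no lower version is available.
+previous_tag() {
+  local current="$1"
+  { cat; echo "$current"; } \
+    | grep -E '^[0-9]+\.[0-9]+\.[0-9]+$' \
+    | sort -V -u \
+    | awk -v cur="$current" '$0 == cur { exit } { prev = $0 } END { if (prev != "") print prev }'
+}
+```
+
+Usage: `docker images --format '{{.Tag}}' git.manko.yoga/manawenuz/escrow-backend | previous_tag 2.6.3`. This only sees tags already present on the host; tags that exist solely in the registry are not listed.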
+
+---
+
+## 7. Logs
+
+```bash
+# All services
+docker compose -f docker-compose.production.yml logs -f --tail=300
+
+# Single service
+docker compose -f docker-compose.production.yml logs -f nickapp-backend
+
+# Nginx access log
+tail -f /opt/backend/nginx/logs/access.log
+```
+
+Backend logs are also captured by Sentry breadcrumbs when an error occurs — see [[Monitoring]].
+
+---
+
+## 8. Maintenance window
+
+Plan a 5-minute window when bumping major versions or running migrations:
+
+```bash
+# Announce + drain
+# (set a maintenance banner in the frontend if possible)
+
+# Take a backup first
+./scripts/backup-mongo.sh    # or per Backup & Recovery
+
+# Pull new images, restart
+docker compose -f docker-compose.production.yml pull
+docker compose -f docker-compose.production.yml up -d
+
+# Verify
+curl -fsS https://amn.gg/api/health
+```
+
+If anything goes sideways, follow [[Incident Response]].
--- a/Operations/Docker
+++ b/Operations/Docker
@@ -0,0 +1,381 @@
+---
+title: Docker Setup
+tags: [operations]
+---
+
+# Docker Setup
+
+Walk-through of every Dockerfile, compose file, volume, and network used by the marketplace stack. Cross-references [[Deployment]] for the live-host configuration and [[Local Setup]] for developer use.
+
+---
+
+## 1. Backend — `Dockerfile.dev`
+
+Path: `/Users/mojtabaheidari/code/backend/Dockerfile.dev`
+
+```dockerfile
+FROM node:22-alpine
+RUN corepack enable
+WORKDIR /app
+COPY package.json ./
+RUN yarn install --frozen-lockfile
+COPY . .
+RUN mkdir -p uploads/{avatars,documents,products,temp}
+RUN addgroup -g 1001 -S nodejs && adduser -S marketplace -u 1001
+RUN chown -R marketplace:nodejs /app
+USER marketplace
+EXPOSE 5001
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+  CMD node healthcheck.js
+CMD ["yarn", "dev"]
+```
+
+Notes:
+
+- **Base.** `node:22-alpine`: small and musl-based (Alpine ships musl libc rather than glibc). Corepack is enabled to use the pinned Yarn 1.22.22.
+- **Install.** `yarn install --frozen-lockfile` brings dev dependencies (needed for `ts-node` + `nodemon` hot reload).
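+- **Uploads scaffold.** Intended to create the four canonical upload directories so the API doesn't have to `mkdir` at runtime. Alpine's `/bin/sh` (BusyBox ash) does not perform brace expansion, so confirm the four directories actually exist in the built image; naming each directory explicitly in the `mkdir` is the portable form.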
+- **Non-root user.** Process runs as `marketplace` (uid `1001`). Defence-in-depth.
+- **Healthcheck.** `healthcheck.js` does a local HTTP GET to `/health` (see [[Monitoring]]).
+- **CMD.** `yarn dev` → `nodemon --exec ts-node src/app.ts`. Source code is mounted from the host so saves trigger restarts.
+
+Used by `docker-compose.dev.yml`. Not pushed to the registry — dev images are local.
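+
+A quick way to confirm the uploads scaffold is to check for the four directories directly, for example if the shell that ran `mkdir` did not expand the braces. This is a minimal sketch; `missing_upload_dirs` is a hypothetical helper name.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: report which of the expected upload sub-directories are absent.
+
+# missing_upload_dirs <base>: prints one missing directory name per line.
+missing_upload_dirs() {
+  local base="$1" name
+  for name in avatars documents products temp; do
+    [ -d "$base/$name" ] || echo "$name"
+  done
+}
+```
+
+Run it against the host bind mount with `missing_upload_dirs ./uploads`, or inspect the image itself with `docker run --rm --entrypoint ls <image> -1 /app/uploads`, where `<image>` is whatever tag your local build produced.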
+
+---
+
+## 2. Backend — `Dockerfile.prod`
+
+Path: `/Users/mojtabaheidari/code/backend/Dockerfile.prod`
+
+Multi-stage build to keep the runtime image small and free of build tooling.
+
+```dockerfile
+# ---- builder ----
+FROM node:22-alpine AS builder
+RUN corepack enable
+WORKDIR /app
+COPY package.json ./
+COPY healthcheck.js ./
+RUN yarn install --frozen-lockfile
+COPY . .
+RUN yarn build              # tsc → ./dist
+
+# ---- production ----
+FROM node:22-alpine AS production
+RUN corepack enable
+WORKDIR /app
+COPY package.json ./
+COPY healthcheck.js ./
+RUN yarn install --frozen-lockfile --production && yarn cache clean
+COPY --from=builder /app/dist ./dist
+RUN mkdir -p uploads/{avatars,documents,products,temp}
+RUN addgroup -g 1001 -S nodejs && adduser -S marketplace -u 1001
+RUN chown -R marketplace:nodejs /app
+USER marketplace
+EXPOSE 5001
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+  CMD node healthcheck.js
+CMD ["node", "dist/app.js"]
+```
+
+Notes:
+
+- **Two stages.** `builder` compiles TS to JS; `production` keeps only the compiled output + production deps. Final image is ~150 MB.
+- **No dev deps.** `--production` flag in the second stage trims away TypeScript, Jest, ts-node etc.
+- **Same non-root pattern.** `marketplace:nodejs` (uid 1001).
+- **CMD.** Plain `node dist/app.js` — no transpilation at runtime.
+- **Uploads.** The directory is created inside the image, then the running container mounts `/app/uploads` from a host volume in compose (overrides the embedded dir).
+
+Built and pushed by `.gitea/workflows/docker-build-no-cache.yml` (and friends — see [[CI-CD Pipeline]]). The resulting image is `git.manko.yoga/manawenuz/escrow-backend:<version>` + `:latest`.
+
+---
+
+## 3. Frontend — `Dockerfile` (production)
+
+Path: `/Users/mojtabaheidari/code/frontend/Dockerfile`
+
+Multi-stage Next.js **standalone** build.
+
+```dockerfile
+# ---- builder ----
+FROM node:22-alpine AS builder
+# (NEXT_PUBLIC_* vars set here so they bake into the bundle)
+ENV NEXT_PUBLIC_API_URL=https://dev.amn.gg/api
+ENV NEXT_PUBLIC_BACKEND_URL=https://dev.amn.gg
+# ...more ENV lines (see file)...
+
+RUN apk add --no-cache git python3 make g++ py3-pip
+RUN corepack enable
+WORKDIR /app
+COPY package.json yarn.lock* ./
+RUN yarn install --frozen-lockfile --production=false --network-timeout 600000
+COPY src ./src
+COPY public ./public
+COPY next.config.ts tsconfig.json ./
+COPY *.config.mjs ./
+ENV NODE_ENV=production
+ENV NEXT_TELEMETRY_DISABLED=1
+RUN yarn build              # produces .next/standalone + .next/static
+
+# ---- runner ----
+FROM node:22-alpine AS runner
+RUN apk add --no-cache curl
+RUN addgroup --system --gid 1001 nodejs && adduser --system --uid 1001 nextjs
+WORKDIR /app
+COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
+COPY --from=builder --chown=nextjs:nodejs /app/public ./public
+ENV PORT=8083 HOSTNAME="0.0.0.0" NODE_ENV=production
+ENV NEXT_PUBLIC_SENTRY_DSN=https://...sentry.io/...
+USER nextjs
+EXPOSE 8083
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+  CMD curl -f http://localhost:8083 || exit 1
+CMD ["node", "server.js"]
+```
+
+Notes:
+
+- **Baked env vars.** `NEXT_PUBLIC_*` variables are set as `ENV` in the builder stage so Next inlines them into the static bundle at build time. To deploy to a different domain you must rebuild — there is no runtime override for `NEXT_PUBLIC_*`. See [[Environment Variables#how-env-is-loaded]].
+- **System packages.** `git python3 make g++ py3-pip` are needed by `node-gyp` for native modules (e.g. `sharp`, `@google-cloud/local-auth`).
+- **Standalone output.** `next.config.ts` sets `output: 'standalone'`, so the runner stage copies only `.next/standalone/` and `public/` — a self-contained tree with a built-in `server.js`. Final runtime image: ~250 MB.
+- **Non-root.** `nextjs` (uid 1001).
+- **`server.js`** is generated by Next.js — it embeds the necessary Node modules and starts the production server.
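+
+Because the `NEXT_PUBLIC_*` values are inlined at build time, it can be worth checking which origins actually ended up in a build before shipping it. The sketch below greps built files for `https://` origins; `baked_origins` is a hypothetical helper name and the `.next/static` path in the usage note is an assumption about where the client bundle lands.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: list the distinct https origins found in built files.
+
+# baked_origins <dir>: recursive search, one origin per line, de-duplicated.
+baked_origins() {
+  grep -rhoE 'https://[A-Za-z0-9.-]+' "$1" 2>/dev/null | sort -u
+}
+```
+
+Usage after `yarn build`: `baked_origins .next/static`. Seeing `https://dev.amn.gg` in a build meant for `amn.gg` means the image has to be rebuilt with different `ENV` values.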
+
+---
+
+## 4. Frontend — `Dockerfile.dev`
+
+Path: `/Users/mojtabaheidari/code/frontend/Dockerfile.dev`
+
+```dockerfile
+FROM node:22-alpine
+RUN apk add --no-cache git python3 make g++ py3-pip
+RUN corepack enable
+WORKDIR /app
+COPY package.json yarn.lock* ./
+RUN yarn config set network-timeout 600000 && \
+    yarn config set network-concurrency 1 && \
+    yarn install --frozen-lockfile --network-timeout 600000
+COPY . .
+EXPOSE 3000
+CMD ["yarn", "dev:docker"]
+```
+
+Notes:
+
+- Listens on **port 3000** in dev (matches the legacy convention).
+- `yarn dev:docker` is a variant of `dev` that binds 0.0.0.0 so the container is reachable from the host.
+- No multi-stage — speed > size.
+
+Used for local development if you choose to run the frontend in Docker instead of via `yarn dev`. Most developers run frontend natively for HMR speed; backend in Docker for parity.
+
+---
+
+## 5. `docker-compose.dev.yml`
+
+Path: `/Users/mojtabaheidari/code/backend/docker-compose.dev.yml`
+
+```yaml
+name: nickapp-development
+
+services:
+  nickdev-backend:
+    build: { context: ., dockerfile: Dockerfile.dev }
+    container_name: nickdev-backend
+    env_file: [.env.local]
+    ports: ["5001:5001"]
+    volumes:
+      - ./src:/app/src
+      - ./uploads:/app/uploads
+    depends_on: [mongodb, redis]
+    restart: unless-stopped
+    networks: [nickapp-network]
+
+  mongodb:
+    image: mongo:8.2
+    container_name: nickdev-mongodb
+    ports: ["27017:27017"]
+    env_file: [.env.local]
+    volumes: [mongodb_data:/data/db]
+    restart: unless-stopped
+    networks: [nickapp-network]
+
+  redis:
+    image: redis:8-alpine
+    container_name: nickdev-redis
+    env_file: [.env.local]
+    command: redis-server
+    volumes: [redis_data:/data]
+    restart: unless-stopped
+    networks: [nickapp-network]
+
+networks:
+  nickapp-network: { driver: bridge }
+
+volumes:
+  mongodb_data:
+  redis_data:
+```
+
+Highlights:
+
+- **No auth on Mongo/Redis in dev.** Mongo runs default; Redis runs plain `redis-server`.
+- **Source mounted.** `./src` is volume-mounted into the backend container so hot reload works.
+- **Uploads mounted.** `./uploads` on the host is bind-mounted to `/app/uploads` so files survive container restarts.
+- **Port mappings:** `5001` (backend) + `27017` (Mongo) exposed to host. Redis is **not** exposed by default.
+- **Network.** `nickapp-network` bridge — Mongo/Redis are reachable as `mongodb` / `redis` from the backend container.
+
+---
+
+## 6. `docker-compose.production.yml`
+
+Path: `/Users/mojtabaheidari/code/backend/docker-compose.production.yml`
+
+Five services. Reproducing only the most important bits — full file lives in the repo and is summarised in [[Deployment#compose-file]].
+
+```yaml
+name: nickapp-production
+
+services:
+  nginx:
+    image: nginx:alpine
+    container_name: nickapp-nginx
+    ports: ["8083:80"]
+    volumes:
+      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
+      - ./nginx/logs:/var/log/nginx
+      - ./uploads:/uploads
+    depends_on: [nickapp-backend, nickapp-frontend]
+    networks: [default]
+
+  nickapp-backend:
+    build: { context: ., dockerfile: Dockerfile.prod }
+    image: nickapp-backend:latest
+    container_name: nickapp-backend
+    platform: linux/amd64
+    env_file: [.env]
+    volumes: [./uploads:/app/uploads]
+    depends_on: [mongodb, redis]
+    networks: [default]
+    healthcheck: { test: ["CMD","curl","-f","http://localhost:5001/health"], ... }
+    labels: ["com.centurylinklabs.watchtower.enable=true"]
+
+  mongodb:
+    image: mongo:8.2
+    container_name: nickapp-mongodb
+    env_file: [.env]
+    volumes:
+      - mongodb_data:/data/db
+      - ./mongo-init:/docker-entrypoint-initdb.d
+    healthcheck: { test: ["CMD","mongosh","--eval","db.adminCommand('ping')"], ... }
+
+  redis:
+    image: redis:8-alpine
+    container_name: nickapp-redis
+    env_file: [.env]
+    command: ["sh","-lc","redis-server --requirepass \"$${REDIS_PASSWORD}\""]
+    volumes: [redis_data:/data]
+    healthcheck: { test: ["CMD","redis-cli","-a","$${REDIS_PASSWORD}","ping"], ... }
+
+  nickapp-frontend:
+    build: { context: ../frontend, dockerfile: Dockerfile }
+    image: nickapp-frontend:latest
+    container_name: nickapp-frontend
+    platform: linux/amd64
+    env_file: [.env]
+    environment: [PORT=8083, NODE_ENV=production]
+    expose: ["8083"]
+    healthcheck: { test: ["CMD","curl","-f","http://localhost:8083/"], ... }
+    labels: ["com.centurylinklabs.watchtower.enable=true"]
+
+networks:
+  default: { driver: bridge }
+
+volumes:
+  mongodb_data:
+  redis_data:
+```
+
+Key differences from dev:
+
+- **Nginx** added as the public entry point.
+- **Backend and frontend** are labelled for **Watchtower** auto-updates.
+- **Mongo and Redis** are **not** Watchtower-managed — their major versions need manual planning + backup ([[Backup & Recovery]]).
+- **Redis password** is read from `.env` (escaped `$$` so docker compose doesn't expand it).
+- **Frontend build context** points at `../frontend` — the two repos must live as siblings on disk.
+- **No host port mapping** for backend/frontend — they are reached only via the nginx container.
+- **platform: linux/amd64** is pinned because production hosts are x86_64; developers on ARM machines must pass `--platform=linux/amd64` when they build locally for prod.
+
+---
+
+## 7. Volumes
+
+| Volume | Mount point | Lifecycle | Notes |
+|--------|-------------|-----------|-------|
+| `mongodb_data` (named) | `/data/db` in `mongodb` | Persistent | The whole database. Back up via `mongodump`. |
+| `redis_data` (named) | `/data` in `redis` | Persistent | RDB snapshots + AOF if configured. |
+| `./uploads` (bind) | `/app/uploads` in backend, `/uploads` in nginx | Persistent on host | User-uploaded files. Critical — back up the directory. |
+| `./nginx/nginx.conf` (bind, RO) | `/etc/nginx/nginx.conf` | Static | Reverse-proxy config. |
+| `./nginx/logs` (bind) | `/var/log/nginx` | Append-only on host | Access + error logs. |
+| `./mongo-init` (bind, RO) | `/docker-entrypoint-initdb.d` | One-time | JS files Mongo runs **only on a fresh datadir** to create initial users / indexes. |
+
+Inspect named volumes:
+
+```bash
+docker volume ls
+docker volume inspect nickapp-production_mongodb_data
+```
+
+> [!warning] `docker compose down -v` deletes named volumes. Never run this in production unless you've backed up first.
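+
+Before any operation that could remove a named volume, a raw copy of its contents is cheap insurance. The sketch below uses the common pattern of mounting the volume into a throwaway `alpine` container and writing a tarball; `archive_name` and `backup_volume` are hypothetical helper names. For MongoDB, stop the container first or prefer `mongodump`, since copying live data files can produce an inconsistent archive.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: tar a named Docker volume into a host directory.
+
+# archive_name <volume>: timestamped (UTC) file name for the archive.
+archive_name() {
+  printf '%s-%s.tar.gz\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)"
+}
+
+# backup_volume <volume> <absolute-dest-dir>: mounts the volume read-only
+# and prints the path of the archive it wrote.
+backup_volume() {
+  local volume="$1" dest="$2" file
+  file=$(archive_name "$volume")
+  mkdir -p "$dest"
+  docker run --rm \
+    -v "$volume":/data:ro \
+    -v "$dest":/backup \
+    alpine tar czf "/backup/$file" -C /data .
+  echo "$dest/$file"
+}
+```
+
+Usage: `backup_volume nickapp-production_redis_data /var/backups/volumes`. The destination must be an absolute path, otherwise Docker treats it as a volume name.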
+
+---
+
+## 8. Networks
+
+- **Dev:** `nickapp-network` bridge. All three services join it; the backend reaches `mongodb` and `redis` by container name.
+- **Prod:** the default compose network (also a bridge), named `nickapp-production_default`. Same DNS-by-container-name semantics. Nginx talks to `nickapp-backend:5001` and `nickapp-frontend:8083` over this network.
+
+Inspect:
+
+```bash
+docker network ls
+docker network inspect nickapp-production_default
+```
+
+---
+
+## 9. Image build & push from a developer machine
+
+For a production-parity build locally (without going through CI):
+
+```bash
+cd ~/code/backend
+docker build --platform=linux/amd64 -f Dockerfile.prod \
+  -t git.manko.yoga/manawenuz/escrow-backend:test .
+
+# Sanity-check size + run
+docker images git.manko.yoga/manawenuz/escrow-backend
+docker run --rm -p 5001:5001 --env-file .env.local \
+  git.manko.yoga/manawenuz/escrow-backend:test
+```
+
+For the official path (build + push to registry) use `./scripts/build-and-push.sh` — see [[Scripts#build-and-push-sh]] — or rely on [[CI-CD Pipeline]] to do it on every push.
+
+---
+
+## 10. Image cleanup
+
+Builds accumulate. Periodically prune:
+
+```bash
+docker system prune -a -f
+docker volume prune -f          # ⚠ removes unused volumes (named ones too on Docker < 23, or with --all): check first
+docker builder prune -a -f      # buildx cache
+
+# scripted (backend)
+npm run docker:clean
+```
+
+`docker:clean` runs `docker system prune -a -f && docker volume prune -f` — confirm you don't need anything before you run it.
+
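+> [!warning] `docker volume prune` can delete `mongodb_data` and `redis_data` when their compose project is `down`: Docker releases before 23.0 remove every unused volume, and 23.0+ does the same when `--all` is passed (without it only anonymous volumes are removed). Run `docker compose up -d` first so the volumes are in use.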
--- a/Operations/Incident
+++ b/Operations/Incident
@@ -0,0 +1,393 @@
+---
+title: Incident Response
+tags: [operations]
+---
+
+# Incident Response
+
+Runbooks for the most likely production incidents, plus communication templates and a post-mortem template. Use this page during an active incident — keep [[Monitoring]], [[Database Operations]], and [[Backup & Recovery]] open in adjacent tabs.
+
+---
+
+## 1. Severity matrix
+
+| Sev | Meaning | Response time | Examples |
+|-----|---------|---------------|----------|
+| **Sev 1** | Site fully down or unable to process payments | 15 min | Backend container in crashloop; Mongo unreachable; SHKeeper API permanently failing |
+| **Sev 2** | Major feature broken for a large share of users | 1 hour | Email sending broken; Redis disk full; chat undelivered |
+| **Sev 3** | Minor / cosmetic issue, isolated user reports | next business day | Single failed webhook; one user can't upload PDF |
+| **Sev 4** | No user impact, hygiene item | backlog | Backup older than 24h; disk > 80%; missed deploy |
+
+Escalate one sev higher if more than 10 reports arrive within 5 minutes.
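+
+The escalation rule is simple enough to encode in alerting glue. A minimal sketch, assuming severities are the integers 1 to 4 from the table and that "higher" means a lower number; `effective_sev` is a hypothetical helper name.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: apply the "more than 10 reports in 5 minutes" escalation rule.
+
+# effective_sev <declared-sev 1-4> <reports-in-last-5-min>
+# Sev 1 is already the top level and is never escalated further.
+effective_sev() {
+  local sev="$1" reports="$2"
+  if [ "$reports" -gt 10 ] && [ "$sev" -gt 1 ]; then
+    sev=$((sev - 1))
+  fi
+  echo "$sev"
+}
+```
+
+For example, `effective_sev 3 11` prints `2`, while `effective_sev 3 10` stays at `3`.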
+
+---
+
+## 2. First 5 minutes — always do this
+
+1. **Acknowledge.** Reply in the on-call channel that you are taking it.
+2. **Open Sentry.** Filter to the last 15 minutes for new issue spikes.
+3. **Open the host shell.** `ssh prod` ready.
+4. **Health endpoint.** `curl -fsS https://amn.gg/api/health` → does it respond?
+5. **Container status.** `docker ps --format "table {{.Names}}\t{{.Status}}"`.
+6. **Recent deploy?** Was the `:latest` tag bumped in the last 30 min? If yes, **roll back first** (see [[Deployment#roll-back]]) and investigate after stability is restored.
+
+If you can't form a hypothesis in 5 minutes, **roll back to the previous image tag** anyway. Stability before forensics.
+
+---
+
+## 3. Common incidents
+
+### 3.1 Backend down (crashloop, no response on /health)
+
+**Symptoms.** `https://amn.gg/api/health` times out or 5xx; `nickapp-backend` shows `Restarting` in `docker ps`.
+
+**Runbook.**
+
+```bash
+# 1. Inspect last lines
+docker logs --tail=200 nickapp-backend
+
+# 2. Common causes:
+#    - Missing env var (`process.env.X!` throws on first read)
+#    - MongoDB unreachable (see 3.2)
+#    - Port conflict
+#    - Out of memory (look for OOMKilled)
+docker inspect nickapp-backend | jq '.[0].State'
+
+# 3. If OOM: increase memory limit in compose, restart
+#    If missing env: add to /opt/backend/.env, then
+#    `docker compose -f docker-compose.production.yml up -d`
+
+# 4. If recent deploy: roll back (run from /opt/backend)
+sed -i 's|backend:latest|backend:<previous-version>|' docker-compose.production.yml
+docker compose -f docker-compose.production.yml up -d nickapp-backend
+# Pause Watchtower (this stops updates for all services) so it doesn't re-pull
+docker stop watchtower
+```
+
+**Communication.** Post in #incidents using the template in §4.
+
+---
+
+### 3.2 MongoDB unreachable
+
+**Symptoms.** Backend logs show `MongoNetworkError`, `MongooseServerSelectionError`, or `Could not connect to server`.
+
+**Runbook.**
+
+```bash
+# 1. Container alive?
+docker ps -a --filter "name=mongodb"
+
+# 2. If exited:
+docker logs --tail=200 nickapp-mongodb
+# Common: corrupt journal, disk full, OOM
+
+# 3. Disk check
+df -h /var/lib/docker
+
+# 4. If disk full:
+#    - reclaim space: docker system prune (stopped containers, unused networks, dangling images, build cache)
+#    - rotate logs if needed
+#    - extend volume
+
+# 5. Restart
+docker compose -f docker-compose.production.yml up -d mongodb
+
+# 6. Verify
+docker exec nickapp-mongodb mongosh --eval "db.adminCommand('ping')"
+
+# 7. If data is corrupt, restore from latest dump — see Backup & Recovery
+```
+
+> [!warning] If Mongo is corrupted and you must restore, **stop the backend container first** to prevent partial writes during restore. See [[Database Operations#restore]].
+
+---
+
+### 3.3 Redis unreachable
+
+**Symptoms.** Logs show `ECONNREFUSED redis:6379` or `NOAUTH Authentication required`. Rate limits stop working, refresh tokens can't be revoked, but most read flows still work.
+
+**Runbook.**
+
+```bash
+# 1. Container alive?
+docker ps -a --filter "name=redis"
+
+# 2. If down:
+docker logs --tail=200 nickapp-redis
+docker compose -f docker-compose.production.yml up -d redis
+
+# 3. Auth issue? (single quotes: the password is read inside the container)
+docker exec nickapp-redis sh -c 'redis-cli -a "$REDIS_PASSWORD" PING'
+# Should return PONG
+
+# 4. If `$REDIS_PASSWORD` mismatch between .env and command:
+nano /opt/backend/.env       # confirm REDIS_PASSWORD set
+docker compose -f docker-compose.production.yml up -d redis nickapp-backend
+
+# 5. If memory full + noeviction policy → rejecting writes:
+docker exec nickapp-redis sh -c 'redis-cli -a "$REDIS_PASSWORD" CONFIG SET maxmemory-policy allkeys-lru'
+```
+
+The app gracefully degrades when Redis is unreachable for short windows — don't panic, but fix within an hour.
+
+---
+
+### 3.4 SHKeeper API down (payments blocked)
+
+**Symptoms.** Backend logs show repeated `SHKeeper request failed: ECONNREFUSED` or non-2xx responses from `$SHKEEPER_API_URL`. Buyers see "Payment unavailable" in checkout. Sev 1 — money is involved.
+
+**Runbook.**
+
+```bash
+# 1. Confirm SHKeeper itself is reachable
+curl -fsS -H "X-Shkeeper-Api-Key: $SHKEEPER_API_KEY" \
+  "$SHKEEPER_API_URL/api/v1/healthcheck"
+
+# 2. If 5xx from SHKeeper → it's their side
+#    - Check their status page / contact provider
+#    - Toggle a banner in the frontend warning buyers
+#    - Consider switching SHKEEPER_FORCE_PAYOUT_DEMO=true so QA still works
+#      (do NOT do this for real customer money)
+
+# 3. If our network can't reach it:
+#    - test from the host: curl from the host vs from inside the container
+docker exec nickapp-backend sh -c 'curl -v "$SHKEEPER_API_URL"'
+#    - DNS / firewall changes?
+
+# 4. While blocked, monitor stuck payments
+docker exec nickapp-mongodb mongosh marketplace --quiet --eval \
+  "db.payments.countDocuments({status:'pending', createdAt:{\$lt: new Date(Date.now() - 30*60*1000)}})"
+
+# 5. Once SHKeeper is back, the app retries automatically. Verify the
+#    backlog drains. If a payment is stuck > 24h, manually verify against
+#    SHKeeper and use fix-transaction-hashes.js if needed.
+```
+
+**Always communicate.** Even short payment outages erode trust — post a status update.
+
+---
+
+### 3.5 Email delivery failure
+
+**Symptoms.** Logs show `SMTPError` from `nodemailer`. Password resets, welcome emails, dispute notifications fail. Sev 2.
+
+**Runbook.**
+
+```bash
+# 1. Test SMTP credentials from the container
+docker exec nickapp-backend node -e "
+  const nm = require('nodemailer');
+  nm.createTransport({
+    host: process.env.SMTP_HOST,
+    port: Number(process.env.SMTP_PORT),
+    secure: process.env.SMTP_SECURE === 'true',
+    auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
+  }).verify().then(console.log).catch(console.error);
+"
+
+# 2. If auth failed → password rotated by provider, update SMTP_PASS in .env
+# 3. If connection timed out → provider rate-limit; switch provider/sender
+# 4. If specific domains bounce → check SPF / DKIM / DMARC records for amn.gg
+```
+
+Users can still operate the app without email; queue critical emails for retry once SMTP is restored.
+
+---
+
+### 3.6 WebSocket disconnect storm
+
+**Symptoms.** Backend logs flood with `🔌 User connected/disconnected` cycling; clients spinning on chat / notification badges. Sev 2.
+
+**Runbook.**
+
+```bash
+# 1. Confirm symptoms
+docker logs --tail=500 nickapp-backend | grep -c "🔌"
+
+# 2. Check Nginx access log for socket.io polling spam
+tail -f /opt/backend/nginx/logs/access.log | grep socket.io
+
+# 3. Common causes:
+#    - Nginx not configured for WebSocket upgrade (returns 502 → client falls back to polling → reconnect loop)
+#    - Client clock skew breaking JWT validation on every reconnect
+#    - Redis adapter mis-configured (if scaled horizontally — not the case today)
+
+# 4. Quick mitigation: increase Nginx proxy_read_timeout
+#    Permanent: ensure nginx.conf has:
+#      proxy_http_version 1.1;
+#      proxy_set_header Upgrade $http_upgrade;
+#      proxy_set_header Connection "upgrade";
+
+# 5. Restart nginx + backend
+docker compose -f docker-compose.production.yml restart nginx nickapp-backend
+```
+
+---
+
+### 3.7 Suspicious activity / abuse
+
+**Symptoms.** Sentry alerts on unusual error volume from one IP; rate-limit logs spiking; reports of brute-force on `/api/auth/login`.
+
+**Runbook.**
+
+```bash
+# 1. Identify the offender
+tail -n 10000 /opt/backend/nginx/logs/access.log \
+  | awk '{print $1}' | sort | uniq -c | sort -rn | head
+
+# 2. Block at the edge (Cloudflare / host firewall)
+#    Or use `ufw deny from <ip>` on the host
+
+# 3. Confirm rate limits in app
+grep "RATE_LIMIT" /opt/backend/.env
+# Defaults: 100 req / 15 min per IP. Tighten if abuse continues.
+
+# 4. If the abuse targets a specific user account:
+docker exec -it nickapp-backend node -e "
+  // disable the user via mongoose (\$set is escaped so the host shell does not expand it)
+  require('./dist/infrastructure/database/connection').connectDatabase()
+    .then(() => require('./dist/models').User.updateOne({email:'attacker@x.com'}, {\$set:{disabled:true}}))
+    .then(console.log)
+    .catch(console.error)
+    .finally(() => process.exit())
+"
+
+# 5. Preserve evidence: copy access logs to /var/incidents/<date>/
+```
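+
+Step 1 counts all requests per IP. For a suspected brute-force it can help to count only failed logins. The sketch below is illustrative: `failedLoginsByIp` is a hypothetical helper, it assumes Nginx's default `combined` log format, and it assumes the backend answers a failed login with `401`.
+
+```ts
+// Hypothetical helper: failed login attempts per client IP, most active first.
+// Assumes Nginx's default "combined" log format; adjust the regex if log_format differs.
+const LOGIN_LINE = /^(\S+) \S+ \S+ \[[^\]]+\] "POST \/api\/auth\/login[^"]*" (\d{3}) /;
+
+function failedLoginsByIp(lines: string[], minAttempts = 20): Array<[string, number]> {
+  const counts = new Map<string, number>();
+  for (const line of lines) {
+    const match = LOGIN_LINE.exec(line);
+    if (!match) continue;
+    const [, ip = "", status = ""] = match;
+    if (status !== "401") continue; // assumption: 401 marks a failed login
+    counts.set(ip, (counts.get(ip) ?? 0) + 1);
+  }
+  return Array.from(counts.entries())
+    .filter(([, n]) => n >= minAttempts)
+    .sort((a, b) => b[1] - a[1]);
+}
+```
+
+IPs returned here are candidates for the edge block in step 2.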
+
+If user data may have leaked, treat as sev 1 and follow your data-breach disclosure process.
+
+---
+
+## 4. Communication templates
+
+### Initial incident notification
+
+```
+🚨 [SEV-X] <one-line summary>
+Started: <time UTC>
+Impact: <which users / features>
+Status: investigating
+On-call: <@you>
+Updates: every 15 minutes in this thread
+```
+
+### Mid-incident update
+
+```
+[SEV-X] Update <n>
+Time: <UTC>
+Status: <investigating / mitigating / monitoring>
+What we know: <facts>
+What we're trying: <action>
+Next update: <time>
+```
+
+### Resolution
+
+```
+✅ [SEV-X] Resolved
+Started: <UTC>
+Ended:   <UTC>
+Duration: <minutes>
+Impact: <users / features / requests affected>
+Root cause: <one sentence>
+Permanent fix: <PR / ticket>
+Postmortem: <doc link, by <date>>
+```
+
+### Customer-facing status
+
+```
+We're investigating an issue affecting <feature> that started at <time>.
+We'll post an update by <time + 15 min>.
+```
+
+Avoid speculation in customer-facing copy. Say "investigating", "applying fix", "monitoring", "resolved" — and nothing else until you actually know.
+
+---
+
+## 5. Post-mortem template
+
+Complete within 5 business days of any sev 1 or sev 2 incident.
+
+```markdown
+---
+title: Post-mortem — <short title>
+date: <YYYY-MM-DD>
+severity: SEV-X
+duration: <minutes>
+authors: [<names>]
+tags: [postmortem]
+---
+
+## Summary
+One paragraph: what broke, who was affected, how long, how it was fixed.
+
+## Timeline (UTC)
+- HH:MM — first signal (alert, user report)
+- HH:MM — on-call ack
+- HH:MM — hypothesis: ...
+- HH:MM — mitigation deployed
+- HH:MM — verified resolved
+- HH:MM — incident closed
+
+## Impact
+- Users affected: <count or %>
+- Features affected: <list>
+- Money affected: <if payments>
+- Data loss: <yes/no — describe>
+
+## Root cause
+Honest, blameless. Distinguish trigger vs underlying cause.
+
+## What went well
+- ...
+
+## What went poorly
+- ...
+
+## Where we got lucky
+- ...
+
+## Action items
+| # | Item | Owner | Due | Ticket |
+|---|------|-------|-----|--------|
+| 1 | Add /health probe for MongoDB | @x | 2026-06-01 | OPS-123 |
+| 2 | Tighten rate limit on /auth/login | @y | 2026-05-30 | OPS-124 |
+
+## Detection improvements
+What new alert / dashboard would have caught this earlier?
+
+## Process improvements
+What runbook / docs need updating? Update [[Incident Response]] right now.
+```
+
+Store postmortems alongside this vault; suggested vault-relative path: `08 - Operations/postmortems/YYYY-MM-DD-<slug>.md`.
+
+---
+
+## 6. Escalation contacts
+
+(Fill in for your team; placeholder structure below.)
+
+| Role | Primary | Backup | Channel |
+|------|---------|--------|---------|
+| On-call engineer | <name> | <name> | #incidents |
+| Payments lead | <name> | <name> | DM |
+| Infrastructure | <name> | <name> | DM |
+| Product / customer comms | <name> | <name> | #customer-comms |
+| SHKeeper provider contact | <email> | — | email |
+| SMTP provider | <email> | — | email |
+
+---
+
+## 7. After every incident
+
+- [ ] Updated this page with any new gotchas?
+- [ ] Updated [[Monitoring]] with new metrics/alerts to add?
+- [ ] Updated [[Backup & Recovery]] if backup gaps were exposed?
+- [ ] Action items tracked?
+- [ ] Customer comms sent (if user-impacting)?
+- [ ] Post-mortem published?
+
+Cross-links: [[Deployment]] for rollback steps, [[Database Operations]] for DB diagnostics, [[Backup & Recovery]] for restore procedures, [[Monitoring]] for metrics to watch.
--- a/Operations/Monitoring.md
+++ b/Operations/Monitoring.md
@@ -0,0 +1,253 @@
+---
+title: Monitoring
+tags: [operations]
+---
+
+# Monitoring
+
+What's instrumented today and what to watch. Today's stack is intentionally lean — health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.
+
+---
+
+## 1. Health endpoint
+
+Path: `GET /health` (backend, port `5001`).
+
+Defined in `backend/src/app.ts`:
+
+```ts
+app.get("/health", (req, res) => {
+  res.json({
+    success: true,
+    message: "Marketplace Backend API is running",
+    timestamp: new Date().toISOString(),
+    environment: config.nodeEnv,
+    version: packageJson.version,
+  });
+});
+```
+
+Returns `200` with a JSON envelope as soon as Express is up. Does **not** currently probe MongoDB or Redis — they are checked via separate Docker healthchecks. If you want deep health, extend the endpoint to ping both data stores and return `503` on failure.
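+
+A deep health check could look roughly like the sketch below. It is framework-free so it stays self-contained; `deepHealth`, `withTimeout`, and the probe names are hypothetical, and the actual MongoDB and Redis ping calls are left to the caller.
+
+```ts
+// Hypothetical sketch: run one probe per data store and map the result to 200 / 503.
+type Probe = () => Promise<unknown>;
+type CheckState = "up" | "down";
+type HealthReport = {
+  status: 200 | 503;
+  body: { success: boolean; checks: Record<string, CheckState> };
+};
+
+// Reject if a probe does not settle in time, so a hung data store cannot hang /health.
+function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
+  return new Promise<T>((resolve, reject) => {
+    const timer = setTimeout(() => reject(new Error("probe timed out")), timeoutMs);
+    promise.then(
+      (value) => { clearTimeout(timer); resolve(value); },
+      (err) => { clearTimeout(timer); reject(err); },
+    );
+  });
+}
+
+async function deepHealth(probes: Record<string, Probe>, timeoutMs = 2000): Promise<HealthReport> {
+  const results = await Promise.all(
+    Object.entries(probes).map(async ([name, probe]): Promise<[string, CheckState]> => {
+      try {
+        await withTimeout(probe(), timeoutMs);
+        return [name, "up"];
+      } catch {
+        return [name, "down"];
+      }
+    }),
+  );
+  const checks: Record<string, CheckState> = {};
+  for (const [name, state] of results) checks[name] = state;
+  const healthy = results.every(([, state]) => state === "up");
+  return { status: healthy ? 200 : 503, body: { success: healthy, checks } };
+}
+
+// Possible wiring in Express (pingMongo / pingRedis are placeholders for real client calls):
+//   app.get("/health", async (_req, res) => {
+//     const report = await deepHealth({ mongodb: pingMongo, redis: pingRedis });
+//     res.status(report.status).json(report.body);
+//   });
+```
+
+Keeping the timeout short matters here: the Docker healthcheck and any external uptime probe would otherwise wait on a stalled data store.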
+
+Public URL behind Nginx: `https://amn.gg/api/health`.
+
+---
+
+## 2. Docker healthchecks
+
+Each long-lived container has a `HEALTHCHECK` baked in or declared in compose.
+
+| Container | Probe | Interval | Failure threshold |
+|-----------|-------|----------|-------------------|
+| `nickapp-backend` | `node healthcheck.js` (HTTP GET `/health`) | 30s | 3 retries |
+| `nickapp-frontend` | `curl -f http://localhost:8083/` | 30s | 3 retries |
+| `nickapp-mongodb` | `mongosh --eval "db.adminCommand('ping')"` | 30s | 3 retries |
+| `nickapp-redis` | `redis-cli -a $REDIS_PASSWORD ping` | 30s | 3 retries |
+
+`healthcheck.js` (backend) is a tiny Node script that does a local HTTP GET to `/health` and exits 0 / 1.
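+
+The core logic of such a script is small. The sketch below is an assumption about its shape, not a copy of the real `healthcheck.js`: `healthExitCode` and `HttpGet` are hypothetical names, and the HTTP call is injected so the decision logic can be exercised without a running server.
+
+```ts
+// Hypothetical sketch of a container healthcheck; the real healthcheck.js may differ.
+// HttpGet resolves to the HTTP status code, or rejects on network errors / timeouts.
+type HttpGet = (url: string, timeoutMs: number) => Promise<number>;
+
+async function healthExitCode(
+  get: HttpGet,
+  url = "http://localhost:5001/health",
+  timeoutMs = 5000,
+): Promise<0 | 1> {
+  try {
+    const status = await get(url, timeoutMs);
+    return status === 200 ? 0 : 1; // Docker treats exit code 0 as healthy
+  } catch {
+    return 1; // connection refused, timeout, DNS failure, ...
+  }
+}
+
+// Possible wiring on Node 18+ (global fetch):
+//   const get: HttpGet = async (url, ms) =>
+//     (await fetch(url, { signal: AbortSignal.timeout(ms) })).status;
+//   healthExitCode(get).then((code) => process.exit(code));
+```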
+
+Inspect health:
+
+```bash
+docker ps --format "table {{.Names}}\t{{.Status}}"
+
+# Detailed
+docker inspect --format='{{json .State.Health}}' nickapp-backend | jq
+```
+
+If a container is `unhealthy`, Watchtower will **not** roll it (it expects the new container to pass healthcheck). Investigate with `docker logs <container>`.
+
+---
+
+## 3. Sentry — error tracking
+
+### Frontend
+
+`@sentry/nextjs ^10.22.0` is wired in via three config files at the repo root:
+
+- `sentry.client.config.ts` — browser SDK (with Session Replay enabled at 10% session / 100% error rate).
+- `sentry.server.config.ts` — server-rendered components (no Replay).
+- `sentry.edge.config.ts` — edge runtime (not currently used heavily).
+
+Common settings:
+
+```ts
+import * as Sentry from '@sentry/nextjs';
+
+Sentry.init({
+  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
+  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
+  environment: process.env.NODE_ENV || 'development',
+  enabled: process.env.NODE_ENV === 'production', // no events are sent outside production
+  ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError' /* , ... */],
+});
+```
+
+Errors from `localhost` are filtered out — only prod errors land in the dashboard.
+
+### Backend
+
+`@sentry/node ^10.22.0` + `@sentry/profiling-node ^10.22.0` are initialised **first** in `src/app.ts` (before any other import) via `src/config/sentry.ts`. DSN comes from `SENTRY_DSN` env var (see [[Environment Variables#sentry]]).
+
+What's captured:
+
+- Uncaught exceptions in route handlers
+- Promise rejections inside `asyncHandler`-wrapped routes
+- Manually-captured errors via `Sentry.captureException(err)`
+- Performance traces (10% sample rate in prod)
+- Profiling samples via `@sentry/profiling-node`
+
+### Source maps
+
+Frontend uploads source maps to Sentry at build time when `SENTRY_AUTH_TOKEN`, `SENTRY_ORG`, and `SENTRY_PROJECT` are set in the CI env. Without them the build still succeeds but Sentry traces will show minified frames.
+
+### Alerts
+
+Configure in the Sentry dashboard (Issues → Alerts) — common alerts:
+
+- Any new issue in production → Slack
+- Error frequency > 50/minute → page on-call
+- Performance regression on `/api/payments/*` traces → email
+
+---
+
+## 4. Logs
+
+### Backend application logs
+
+Routed through `src/utils/logger.ts` — currently a thin `console.log` wrapper with emoji prefixes. Output goes to stdout, captured by Docker:
+
+```bash
+# Live tail
+docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend
+
+# Search for a request
+docker logs nickapp-backend 2>&1 | grep "POST /api/payments"
+
+# Restrict to the last hour
+docker logs --since 1h nickapp-backend
+```
+
+Notable log lines to look for:
+
+| Prefix | Meaning |
+|--------|---------|
+| `✅ Connected to MongoDB` | DB connection established |
+| `🚀 Server running on port 5001` | App fully started |
+| `🔌 User connected: <id>` | Socket.IO connection |
+| `📥` | Inbound HTTP request log |
+| `💳 SHKeeper` | SHKeeper webhook / API call |
+| `🔐 Webhook verification` | Webhook signature check result |
+| `❌ Error` | Manual error log (also captured by Sentry) |
+
+### Nginx access + error logs
+
+Bind-mounted to `./nginx/logs/` on the host:
+
+```bash
+tail -f /opt/backend/nginx/logs/access.log
+tail -f /opt/backend/nginx/logs/error.log
+```
+
+Rotate these via host `logrotate` to avoid disk fill.
+
+### Frontend logs
+
+Next.js logs go to the container stdout:
+
+```bash
+docker logs -f nickapp-frontend
+```
+
+Browser-side logs that need attention go through Sentry (above) — `src/utils/logger.ts` in the frontend forwards via Sentry breadcrumbs.
+
+---
+
+## 5. Key metrics to watch
+
+Today these are read manually from logs / Sentry. Once Prometheus is added, encode them as alerting rules.
+
+### Application
+
+| Metric | Where to check | Healthy | Alert |
+|--------|---------------|---------|-------|
+| 5xx rate | Sentry, Nginx access.log | < 0.5 % | > 2 % over 5 min |
+| `/health` p95 latency | curl + timer | < 100 ms | > 1 s |
+| Login success rate | Sentry custom event | > 95 % | < 90 % |
+| Socket disconnect storm | `🔌 User disconnected` log frequency | < 1/s sustained | > 10/s sustained |
+| OpenAI 429s | Backend log `OpenAI ... 429` | 0 | any |
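+
+Until a metrics pipeline exists, the 5xx rate can be derived from a slice of the Nginx access log. A minimal sketch, assuming the default `combined` log format; `fiveXxRate` is a hypothetical helper name.
+
+```ts
+// Hypothetical helper: share of 5xx responses in a batch of access-log lines (0..1).
+// Assumes Nginx's default "combined" format, where the status code follows the quoted request line.
+function fiveXxRate(lines: string[]): number {
+  let total = 0;
+  let errors = 0;
+  for (const line of lines) {
+    const match = /" (\d{3}) /.exec(line);
+    if (!match) continue; // not a parsable access-log line
+    const [, status = ""] = match;
+    total++;
+    if (status.charAt(0) === "5") errors++;
+  }
+  return total === 0 ? 0 : errors / total;
+}
+```
+
+Feeding it the lines from the last five minutes and comparing the result against `0.02` mirrors the alert threshold in the table.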
+
+### Payments
+
+| Metric | Where | Healthy | Alert |
+|--------|-------|---------|-------|
+| Payment success rate | `db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}])` | > 95 % completed of 24h-old payments | < 90 % |
+| Webhook signature failures | log `Webhook verification failed` | 0 | > 0 |
+| SHKeeper API errors (5xx) | log + Sentry | 0 | > 5/min sustained |
+| Payouts stuck in `pending` > 30 min | `db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now() - 30*60*1000)}})` | empty | non-empty |
+| Missing `transactionHash` after `completed` | the same query that drives `fix-transaction-hashes.js` | empty | non-empty |
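+
+The `$group` output in the first row still has to be turned into a percentage. The sketch below shows one way to do that and to compute the cutoff for stuck payouts; `paymentSuccessRate` and `stuckPayoutCutoff` are hypothetical helpers, and the input is assumed to be already restricted to payments old enough to have settled.
+
+```ts
+// Shape of one row returned by the $group stage: status name and document count.
+type StatusCount = { _id: string; n: number };
+
+// Hypothetical helper: fraction of payments in `completed` state, or null when there is no data.
+function paymentSuccessRate(groups: StatusCount[]): number | null {
+  let total = 0;
+  let completed = 0;
+  for (const group of groups) {
+    total += group.n;
+    if (group._id === "completed") completed += group.n;
+  }
+  return total === 0 ? null : completed / total;
+}
+
+// Hypothetical helper: payouts created before this instant have been pending too long.
+function stuckPayoutCutoff(now: Date, maxPendingMinutes = 30): Date {
+  return new Date(now.getTime() - maxPendingMinutes * 60_000);
+}
+```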
+
+### MongoDB
+
+```js
+db.serverStatus().connections           // connection counts; alert if current > 1000
+db.serverStatus().opcounters            // cumulative op counts since startup; diff two samples for ops/sec
+db.serverStatus().wiredTiger.cache      // raw cache stats; derive the hit ratio from them, aim > 95 %
+db.currentOp({ secs_running: { $gte: 5 } })  // long-running queries
+```
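+
+`wiredTiger.cache` exposes raw counters rather than a ready-made ratio. A common approximation, sketched below, is one minus the share of requested pages that had to be read from disk. `cacheHitRatio` is a hypothetical helper, and the input is assumed to be the cache document with its counters converted to plain numbers.
+
+```ts
+// Counters from db.serverStatus().wiredTiger.cache, keyed by their descriptive names.
+type WiredTigerCacheStats = { [name: string]: number | undefined };
+
+// Hypothetical helper: approximate cache hit ratio (0..1), or null if the counters are missing.
+function cacheHitRatio(cache: WiredTigerCacheStats): number | null {
+  const requested = cache["pages requested from the cache"];
+  const readFromDisk = cache["pages read into cache"];
+  if (requested === undefined || readFromDisk === undefined || requested === 0) return null;
+  return 1 - readFromDisk / requested;
+}
+```
+
+Because the counters are cumulative since startup, comparing two samples taken a few minutes apart gives a more current picture than a single reading.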
+
+### Redis
+
+```bash
+docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
+# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys
+```
+
+Alert thresholds: `rejected_connections > 0`, `evicted_keys` rising while you don't expect cache pressure, p99 command latency > 5 ms. Latency is not part of `INFO stats`; on Redis 7+ it is reported by `INFO latencystats`.
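+
+The `INFO stats` output is plain `key:value` text, so the fields worth watching are easy to extract. A small sketch, with `parseRedisInfo` and `redisStatsSummary` as hypothetical helper names:
+
+```ts
+// Parse `redis-cli INFO` output ("key:value" lines, "# Section" headers) into a flat map.
+function parseRedisInfo(text: string): Record<string, string> {
+  const fields: Record<string, string> = {};
+  for (const raw of text.split(/\r?\n/)) {
+    const line = raw.trim();
+    if (line === "" || line.charAt(0) === "#") continue; // blank line or section header
+    const separator = line.indexOf(":");
+    if (separator > 0) fields[line.slice(0, separator)] = line.slice(separator + 1);
+  }
+  return fields;
+}
+
+// Hypothetical summary of the counters named above; hitRatio is null before any lookup happened.
+function redisStatsSummary(text: string): {
+  hitRatio: number | null;
+  rejectedConnections: number;
+  evictedKeys: number;
+} {
+  const info = parseRedisInfo(text);
+  const num = (key: string): number => Number(info[key] ?? 0);
+  const hits = num("keyspace_hits");
+  const lookups = hits + num("keyspace_misses");
+  return {
+    hitRatio: lookups === 0 ? null : hits / lookups,
+    rejectedConnections: num("rejected_connections"),
+    evictedKeys: num("evicted_keys"),
+  };
+}
+```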
+
+### Host
+
+| Metric | Tool | Healthy | Alert |
+|--------|------|---------|-------|
+| Disk usage on `/var/lib/docker` | `df -h` | < 80 % | > 90 % |
+| `/opt/backend/uploads` size | `du -sh` | watch trend | bursty growth (>5 GB/day) |
+| Memory pressure | `free -h`, `docker stats` | < 80 % | swap actively used |
+| Open file descriptors | `ls /proc/<pid>/fd \| wc -l` (in use) vs `cat /proc/<pid>/limits` (limit) | well under hard limit | nearing limit |
+
+---
+
+## 6. Smoke tests after a deploy
+
+Drop these in a runbook for the on-call:
+
+```bash
+# 1. API health
+curl -fsS https://amn.gg/api/health | jq '.success,.version,.environment'
+
+# 2. Login
+curl -fsS -X POST https://amn.gg/api/auth/login \
+  -H "Content-Type: application/json" \
+  -d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
+  | jq '.success,.data.user.email'
+
+# 3. Frontend HTML loads
+curl -fsS https://amn.gg/ -I | head -1   # expect 200
+
+# 4. Socket.IO handshake
+curl -fsS "https://amn.gg/socket.io/?EIO=4&transport=polling" -I | head -1
+
+# 5. Containers healthy
+docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"
+```
+
+Any non-OK → see [[Incident Response]].
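+
+If the curl steps are later folded into a script, a tiny runner that collects failures instead of stopping at the first one can be handy. This is only a sketch: `runSmokeChecks` and the `Check` type are hypothetical, and each check wraps whatever request the corresponding curl step performs.
+
+```ts
+// Hypothetical smoke-test runner: executes every check and reports which ones failed.
+type Check = { name: string; run: () => Promise<boolean> };
+
+async function runSmokeChecks(checks: Check[]): Promise<{ ok: boolean; failed: string[] }> {
+  const failed: string[] = [];
+  for (const check of checks) {
+    try {
+      if (!(await check.run())) failed.push(check.name);
+    } catch {
+      failed.push(check.name); // a thrown error (timeout, DNS, ...) counts as a failure
+    }
+  }
+  return { ok: failed.length === 0, failed };
+}
+
+// Possible usage on Node 18+ (global fetch):
+//   runSmokeChecks([
+//     { name: "api health", run: async () => (await fetch("https://amn.gg/api/health")).ok },
+//     { name: "frontend", run: async () => (await fetch("https://amn.gg/")).ok },
+//   ]).then((result) => console.log(result));
+```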
+
+---
+
+## 7. Future work
+
+- **Prometheus + Grafana** with Node exporter + Mongo exporter + Redis exporter — for proper time-series.
+- **OpenTelemetry** spans from backend → Sentry / Jaeger.
+- **Healthcheck endpoint** that probes Mongo + Redis and returns `503` when degraded.
+- **PagerDuty / OpsGenie** wiring from Sentry alerts.
+- **Synthetic checks** (Pingdom / UptimeRobot) hitting `/health` from multiple regions.
+
+For now, Sentry + Docker healthchecks + manual log checks cover the basics. See [[Incident Response]] for what to do when something fires.