Initial commit: nick docs
315
08 - Operations/Backup & Recovery.md
Normal file
@@ -0,0 +1,315 @@
---
title: Backup & Recovery
tags: [operations]
---
|
||||
|
||||
# Backup & Recovery
|
||||
|
||||
How to keep the marketplace recoverable from data loss. Covers MongoDB, Redis, the `uploads/` directory, and environment secrets, plus the disaster-recovery runbook.
|
||||
|
||||
---
|
||||
|
||||
## 1. RTO / RPO targets
|
||||
|
||||
| Asset | RPO (data loss tolerated) | RTO (downtime tolerated) | Backup cadence |
|-------|---------------------------|--------------------------|----------------|
| MongoDB | 1 hour | 1 hour | Hourly `mongodump` + nightly offsite |
| `uploads/` directory | 24 hours | 2 hours | Nightly `rsync` to offsite |
| Redis | 1 hour (regeneratable) | 0 minutes (app survives empty cache) | Nightly RDB snapshot |
| Production `.env` | n/a (manual) | 5 minutes | Stored in 1Password / Bitwarden vault |
| Container images | n/a (CI rebuilds) | 15 minutes | Tagged in registry by version |
|
||||
|
||||
Adjust these targets when product SLAs change.
|
||||
|
||||
---
|
||||
|
||||
## 2. MongoDB
|
||||
|
||||
### 2.1 Dump
|
||||
|
||||
```bash
|
||||
#!/usr/bin/env bash
|
||||
# scripts/backup-mongo.sh — run hourly via cron
|
||||
set -euo pipefail
|
||||
|
||||
STAMP=$(date -u +%FT%H%M%SZ)
|
||||
DEST=/var/backups/mongo
|
||||
mkdir -p "$DEST"
|
||||
|
||||
docker exec nickapp-mongodb \
|
||||
mongodump --db=marketplace --archive --gzip \
|
||||
> "$DEST/marketplace-$STAMP.gz"
|
||||
|
||||
# Retention: delete dumps older than 14 days (every hourly dump inside that window is kept)
find "$DEST" -name 'marketplace-*.gz' -mtime +14 -delete
|
||||
```
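
The `find ... -mtime +14 -delete` line keeps every hourly dump for 14 days. If a tiered scheme is preferred (all dumps from the last 24 hours, then only the newest dump per day up to 14 days), a minimal sketch could look like the following. It assumes GNU `find` and the `marketplace-<UTC stamp>.gz` naming used by the script; `prune_dumps` is a hypothetical helper, not part of the existing script.

```bash
#!/usr/bin/env bash
# Sketch only: tiered retention for marketplace-YYYY-MM-DDTHHMMSSZ.gz dumps.
# prune_dumps is a hypothetical helper; it assumes GNU find (-printf, -delete).
prune_dumps() {
  local dest=$1

  # 1. Drop everything older than 14 days.
  find "$dest" -maxdepth 1 -name 'marketplace-*.gz' -mtime +14 -delete

  # 2. Among dumps at least 24 hours old, keep only the newest per calendar day.
  #    Characters 13-22 of the file name hold the YYYY-MM-DD part of the stamp.
  find "$dest" -maxdepth 1 -name 'marketplace-*.gz' -mtime +0 -printf '%f\n' \
    | sort -r \
    | awk '{ day = substr($0, 13, 10); if (seen[day]++) print }' \
    | while IFS= read -r name; do
        rm -f -- "$dest/$name"
      done
}
```

Calling `prune_dumps "$DEST"` at the end of the backup script would replace the single `find` line. Sorting in reverse puts the newest dump of each day first, so `awk` only prints (and the loop only removes) the older ones.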
|
||||
|
||||
Cron entry:
|
||||
|
||||
```
|
||||
0 * * * * /usr/local/bin/backup-mongo.sh >> /var/log/backup-mongo.log 2>&1
|
||||
```
|
||||
|
||||
### 2.2 Offsite
|
||||
|
||||
Push the most recent dump to S3 (or Backblaze B2, or `rclone` to any provider) nightly:
|
||||
|
||||
```bash
|
||||
# sync takes a directory source and only uploads dumps that are not in the bucket yet
aws s3 sync "$DEST/" "s3://marketplace-backups/mongo/" \
  --exclude "*" --include "marketplace-*.gz" \
  --storage-class STANDARD_IA
|
||||
```
|
||||
|
||||
Set a 90-day lifecycle policy on the bucket to age out old copies.
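
As an illustration of the 90-day rule, the sketch below writes a lifecycle document that expires objects under the `mongo/` prefix after 90 days. The rule ID and the `write_lifecycle_policy` helper are made-up names; the bucket name is the one used in the example above. Applying the document is left as a commented `aws s3api put-bucket-lifecycle-configuration` call because it needs credentials.

```bash
#!/usr/bin/env bash
# Sketch only: write an S3 lifecycle document that expires mongo/ dumps after 90 days.
# write_lifecycle_policy and the rule ID are illustrative names.
write_lifecycle_policy() {
  local out=$1
  cat > "$out" <<'JSON'
{
  "Rules": [
    {
      "ID": "expire-mongo-dumps",
      "Status": "Enabled",
      "Filter": { "Prefix": "mongo/" },
      "Expiration": { "Days": 90 }
    }
  ]
}
JSON
}

# Usage (needs AWS credentials, so it is left commented out):
#   write_lifecycle_policy /tmp/lifecycle.json
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket marketplace-backups \
#     --lifecycle-configuration file:///tmp/lifecycle.json
```

Scoping the rule to a prefix keeps it from touching other data in the same bucket, such as the `uploads/` mirror.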
|
||||
|
||||
### 2.3 Restore
|
||||
|
||||
> [!warning] Restoring is **destructive** to the current data. Always practise on a staging clone before doing it for real.
|
||||
|
||||
```bash
|
||||
# Restore against an empty database (fresh container)
|
||||
docker exec -i nickapp-mongodb \
  mongorestore --archive --gzip --drop \
  < /var/backups/mongo/marketplace-2026-05-20T0300Z.gz

# Verify (the database is passed as an argument; the interactive `use` helper is unreliable inside --eval)
docker exec nickapp-mongodb mongosh marketplace --quiet \
  --eval "db.users.countDocuments()"
|
||||
```
|
||||
|
||||
For partial restore (single collection):
|
||||
|
||||
```bash
|
||||
docker exec -i nickapp-mongodb \
|
||||
mongorestore --archive --gzip --drop \
|
||||
--nsInclude='marketplace.payments' \
|
||||
< /var/backups/mongo/marketplace-2026-05-20T0300Z.gz
|
||||
```
|
||||
|
||||
### 2.4 Validate backups
|
||||
|
||||
A monthly drill — restore the latest dump into a throwaway container and run smoke queries:
|
||||
|
||||
```bash
|
||||
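# Overriding the container command means no mongod is running, so start one before restoring
docker run --rm -v "$(pwd)/marketplace-latest.gz:/dump.gz:ro" mongo:8.2 sh -c \
  "mongod --fork --logpath /tmp/mongod.log --bind_ip 127.0.0.1 && \
   mongorestore --archive=/dump.gz --gzip && mongosh --quiet --eval 'db.getMongo().getDBNames()'"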
|
||||
```
|
||||
|
||||
If validation fails, treat as a sev-2 incident (see [[Incident Response]]).
|
||||
|
||||
---
|
||||
|
||||
## 3. Redis
|
||||
|
||||
Redis data is regeneratable — losing it means logged-out users + cold caches, no business data lost. Still cheap to back up.
|
||||
|
||||
### 3.1 Snapshot
|
||||
|
||||
```bash
|
||||
# Trigger a save and copy out
|
||||
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
|
||||
sleep 5
|
||||
docker cp nickapp-redis:/data/dump.rdb /var/backups/redis/redis-$(date -u +%FT%H%M%SZ).rdb
|
||||
```
|
||||
|
||||
Daily cron is sufficient.
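
`BGSAVE` returns immediately, so a fixed `sleep 5` can end up copying the previous snapshot when the dataset is large. One hedged alternative is to poll `INFO persistence` until `rdb_bgsave_in_progress` drops to 0. In this sketch `redis_info_persistence` and `wait_for_bgsave` are hypothetical helper names; the container name and password variable are the ones used above.

```bash
#!/usr/bin/env bash
# Sketch only: wait for a background save to finish before copying dump.rdb.
# redis_info_persistence and wait_for_bgsave are hypothetical helpers.
redis_info_persistence() {
  docker exec nickapp-redis \
    redis-cli -a "$REDIS_PASSWORD" --no-auth-warning INFO persistence
}

wait_for_bgsave() {
  local timeout=${1:-60} waited=0 info
  while :; do
    info=$(redis_info_persistence) || return 1
    case $info in
      *rdb_bgsave_in_progress:0*) return 0 ;;  # save finished
      *rdb_bgsave_in_progress:1*) ;;           # still saving, keep polling
      *) return 1 ;;                           # unexpected output (e.g. auth error)
    esac
    if (( waited >= timeout )); then
      return 1                                 # gave up waiting
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
}
```

In a snapshot script this would sit between the `BGSAVE` call and the `docker cp`, for example `wait_for_bgsave 120 || exit 1`.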
|
||||
|
||||
### 3.2 Restore
|
||||
|
||||
```bash
|
||||
# Stop redis, drop the RDB into the volume, start
|
||||
docker compose -f docker-compose.production.yml stop redis
|
||||
docker cp /var/backups/redis/redis-2026-05-20T0300Z.rdb nickapp-redis:/data/dump.rdb
|
||||
docker compose -f docker-compose.production.yml start redis
|
||||
```
|
||||
|
||||
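If you've enabled AOF, also copy the AOF files. Redis 7 and later keep them in the `appendonlydir/` directory rather than in a single `appendonly.aof`. See [[Database Operations#persistence]].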
|
||||
|
||||
---
|
||||
|
||||
## 4. `uploads/` directory
|
||||
|
||||
Stored on the host at `/opt/backend/uploads/` and bind-mounted into both backend and nginx containers. This is where every user upload lives — losing it means broken images, missing dispute evidence, and unhappy users.
|
||||
|
||||
### 4.1 Nightly sync
|
||||
|
||||
```bash
|
||||
# rsync cannot write to s3:// URLs, so use the AWS CLI for the S3 target
aws s3 sync /opt/backend/uploads/ \
  s3://marketplace-backups/uploads/ --delete
|
||||
|
||||
# Or rclone to any provider
|
||||
rclone sync /opt/backend/uploads/ backblaze:marketplace-uploads --transfers 8
|
||||
```
|
||||
|
||||
Cron:
|
||||
|
||||
```
|
||||
30 3 * * * /usr/local/bin/backup-uploads.sh >> /var/log/backup-uploads.log 2>&1
|
||||
```
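
A mirroring sync propagates deletions, so running it against an empty or unmounted `uploads/` directory would wipe the offsite copy. The cron entry points at `backup-uploads.sh`, whose contents are not shown here; a guard that such a script might start with is sketched below. `require_min_files` and the threshold of 10 files are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: refuse to run a mirroring sync when the source looks empty.
# require_min_files is a hypothetical helper; the threshold is an arbitrary example.
require_min_files() {
  local dir=$1 min=${2:-10} count
  [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
  count=$(find "$dir" -type f | wc -l)
  if (( count < min )); then
    echo "only $count files in $dir, refusing to sync" >&2
    return 1
  fi
}

# Usage inside a backup script:
#   require_min_files /opt/backend/uploads 10 || exit 1
#   rclone sync /opt/backend/uploads/ backblaze:marketplace-uploads --transfers 8
```

The right threshold depends on how many uploads the marketplace normally holds; anything well below the usual count is a reasonable tripwire.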
|
||||
|
||||
### 4.2 Restore
|
||||
|
||||
```bash
|
||||
aws s3 sync s3://marketplace-backups/uploads/ /opt/backend/uploads/
|
||||
# fix ownership for the marketplace container (uid 1001)
|
||||
chown -R 1001:1001 /opt/backend/uploads
|
||||
```
|
||||
|
||||
Restart the backend container so any in-flight uploads find the right directory layout.
|
||||
|
||||
---
|
||||
|
||||
## 5. Secrets & configuration
|
||||
|
||||
### 5.1 `.env` files
|
||||
|
||||
The production `.env` lives at `/opt/backend/.env`. It is **not** version-controlled and **not** in any standard backup. Source of truth: the team password manager (1Password / Bitwarden vault).
|
||||
|
||||
After any change:
|
||||
|
||||
1. Update the host file.
|
||||
2. Update the vault entry with the new value, a one-line "why", and the date.
|
||||
3. `docker compose -f docker-compose.production.yml up -d` to apply.
|
||||
|
||||
### 5.2 SSL certs
|
||||
|
||||
If you run a host-level Caddy / Nginx with Let's Encrypt, certs auto-renew. Back up `/var/lib/caddy/.local/share/caddy/` (Caddy) or `/etc/letsencrypt/` (Certbot) — useful if you migrate hosts.
|
||||
|
||||
### 5.3 Container registry credentials
|
||||
|
||||
`/root/.docker/config.json` on the production host holds the `git.manko.yoga` login Watchtower uses. Recreate after a rebuild:
|
||||
|
||||
```bash
|
||||
docker login git.manko.yoga -u manawenuz
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Disaster recovery runbook
|
||||
|
||||
> Scenario: production host is unrecoverable (disk failure, cloud provider lost the VM, etc.).
|
||||
|
||||
### Phase 1 — Provision
|
||||
|
||||
1. Spin up a new VM matching the previous spec (≥ 4 vCPU, 8 GB RAM, 100 GB SSD).
|
||||
2. Install Docker Engine + compose plugin.
|
||||
3. Repoint DNS at the new VM, or stand up a temporary subdomain (`recovery.amn.gg`).
|
||||
|
||||
### Phase 2 — Code
|
||||
|
||||
```bash
|
||||
cd /opt
|
||||
git clone ssh://git@git.manko.yoga:222/nick/backend.git
|
||||
git clone ssh://git@git.manko.yoga:222/nick/frontend.git
|
||||
cd backend && git checkout main
|
||||
```
|
||||
|
||||
### Phase 3 — Config
|
||||
|
||||
```bash
|
||||
# Restore .env from the vault
|
||||
nano /opt/backend/.env
|
||||
|
||||
# Restore nginx config
|
||||
mkdir -p nginx/logs
|
||||
# copy nginx.conf from the vault / repo / your laptop
|
||||
```
|
||||
|
||||
### Phase 4 — Data
|
||||
|
||||
```bash
|
||||
# Mongo
|
||||
mkdir -p /var/backups/mongo
|
||||
aws s3 cp s3://marketplace-backups/mongo/marketplace-LATEST.gz /var/backups/mongo/
|
||||
|
||||
# Uploads
|
||||
mkdir -p /opt/backend/uploads
|
||||
aws s3 sync s3://marketplace-backups/uploads/ /opt/backend/uploads/
|
||||
chown -R 1001:1001 /opt/backend/uploads
|
||||
|
||||
# Redis (optional — empty is fine)
|
||||
mkdir -p /var/backups/redis
|
||||
aws s3 cp s3://marketplace-backups/redis/redis-LATEST.rdb /var/backups/redis/
|
||||
```
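
`marketplace-LATEST.gz` and `redis-LATEST.rdb` read as placeholders: the backup commands shown earlier write timestamped names only. Assuming that naming, the newest object could be resolved from an `aws s3 ls` listing as sketched here. `newest_key` is a hypothetical helper that reads the listing on stdin.

```bash
#!/usr/bin/env bash
# Sketch only: pick the newest timestamped object from `aws s3 ls` output.
# newest_key is a hypothetical helper; it relies on the UTC stamp in the
# file name sorting lexicographically in chronological order.
newest_key() {
  local prefix=$1
  awk '{ print $4 }' \
    | grep -E "^${prefix}-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]+Z\." \
    | sort \
    | tail -n 1
}

# Usage (needs AWS credentials, so it is left commented out):
#   key=$(aws s3 ls s3://marketplace-backups/mongo/ | newest_key marketplace)
#   aws s3 cp "s3://marketplace-backups/mongo/$key" /var/backups/mongo/
```

Sorting on the file name rather than the listing date avoids picking an old dump that was merely re-uploaded recently.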
|
||||
|
||||
### Phase 5 — Start stack
|
||||
|
||||
```bash
|
||||
cd /opt/backend
|
||||
docker login git.manko.yoga -u manawenuz
|
||||
docker compose -f docker-compose.production.yml up -d
|
||||
# wait ~60s
|
||||
docker compose -f docker-compose.production.yml ps
|
||||
```
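
Rather than a fixed `# wait ~60s`, the wait can be made explicit by polling the health endpoint used in Phase 7. A minimal sketch, where `wait_until` is a hypothetical helper that retries any command:

```bash
#!/usr/bin/env bash
# Sketch only: retry a command until it succeeds or the attempts run out.
# wait_until is a hypothetical helper, not part of the repository.
wait_until() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Usage: poll the backend health endpoint for up to about 2 minutes.
#   wait_until 24 5 curl -fsS http://localhost:8083/api/health
```

A non-zero exit tells the operator the stack did not come up in time, which is a cue to read the container logs before moving on to Phase 6.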
|
||||
|
||||
### Phase 6 — Restore data into running containers
|
||||
|
||||
```bash
|
||||
# Mongo
|
||||
docker exec -i nickapp-mongodb \
|
||||
mongorestore --archive --gzip --drop \
|
||||
< /var/backups/mongo/marketplace-LATEST.gz
|
||||
|
||||
# Redis
docker compose -f docker-compose.production.yml stop redis
docker cp /var/backups/redis/redis-LATEST.rdb nickapp-redis:/data/dump.rdb
docker compose -f docker-compose.production.yml start redis
|
||||
```
|
||||
|
||||
### Phase 7 — Verify
|
||||
|
||||
```bash
|
||||
curl -fsS http://localhost:8083/api/health | jq
docker exec nickapp-mongodb mongosh marketplace --quiet --eval "db.users.countDocuments()"
docker logs --tail=200 nickapp-backend 2>&1 | grep -E "✅|❌"
|
||||
```
|
||||
|
||||
### Phase 8 — Restart Watchtower & cut over DNS
|
||||
|
||||
```bash
|
||||
docker run -d --name watchtower --restart unless-stopped \
|
||||
-v /var/run/docker.sock:/var/run/docker.sock \
|
||||
-v /root/.docker/config.json:/config.json \
|
||||
-e WATCHTOWER_POLL_INTERVAL=300 \
|
||||
-e WATCHTOWER_LABEL_ENABLE=true \
|
||||
containrrr/watchtower
|
||||
|
||||
# Update DNS for amn.gg / dev.amn.gg to the new host's IP
|
||||
```
|
||||
|
||||
### Phase 9 — Post-mortem
|
||||
|
||||
Write a post-mortem (template in [[Incident Response#postmortem-template]]) and update this runbook with anything that surprised you.
|
||||
|
||||
---
|
||||
|
||||
## 7. Quick-reference commands
|
||||
|
||||
```bash
|
||||
# Mongo dump
|
||||
docker exec nickapp-mongodb mongodump --db=marketplace --archive --gzip > backup.gz
|
||||
# Mongo restore
|
||||
docker exec -i nickapp-mongodb mongorestore --archive --gzip --drop < backup.gz
|
||||
|
||||
# Redis snapshot
|
||||
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
|
||||
docker cp nickapp-redis:/data/dump.rdb redis.rdb
|
||||
|
||||
# Uploads to S3
|
||||
rclone sync /opt/backend/uploads/ s3:marketplace-backups/uploads/
|
||||
|
||||
# Restore .env
|
||||
# Pull from vault, paste into /opt/backend/.env, docker compose up -d
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Testing the plan
|
||||
|
||||
> [!tip] Backups are not real until they've been restored. Drill quarterly:
|
||||
>
|
||||
> 1. Spin up a throwaway VM.
|
||||
> 2. Walk Phases 2–7 of the DR runbook with the most recent backups.
|
||||
> 3. Time it. If RTO is busted, fix the gap before the next drill.
|
||||
> 4. Capture lessons in this file.
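
Step 3 of the drill is easier to act on when the timing is recorded rather than estimated. A minimal sketch, assuming each phase is wrapped in a script or function of its own; `timed_step` is a hypothetical helper and `restore-mongo.sh` in the usage line is a made-up script name.

```bash
#!/usr/bin/env bash
# Sketch only: time one drill step and compare it with an RTO budget in seconds.
# Returns 0 when within budget, 1 when over budget, 2 when the step itself failed.
timed_step() {
  local budget=$1
  shift
  local start=$SECONDS
  "$@" || return 2
  local elapsed=$(( SECONDS - start ))
  echo "elapsed=${elapsed}s budget=${budget}s"
  (( elapsed <= budget ))
}

# Usage: the MongoDB RTO in section 1 is 1 hour, i.e. 3600 seconds.
#   timed_step 3600 ./restore-mongo.sh
```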
|
||||
259
08 - Operations/CI-CD Pipeline.md
Normal file
@@ -0,0 +1,259 @@
---
title: CI-CD Pipeline
tags: [operations]
---
|
||||
|
||||
# CI/CD Pipeline
|
||||
|
||||
How code goes from a push to a running container in production. The CI is **Gitea Actions** running on the same Gitea instance that hosts the repos. The CD is **Watchtower** on the production host (covered in [[Deployment]]).
|
||||
|
||||
---
|
||||
|
||||
## 1. Where workflows live
|
||||
|
||||
| Repo | Path | Files |
|------|------|-------|
| Backend | `.gitea/workflows/` | `docker-build-simple.yml`, `docker-build-dev.yml`, `docker-build-no-cache.yml` |
| Frontend | `.gitea/workflows/` | `deploy.yml`, `devDeploy.yml` |
|
||||
|
||||
Gitea Actions speaks the same YAML dialect as GitHub Actions — most third-party actions (`actions/checkout@v4`, `docker/login-action@v3`, `docker/build-push-action@v5`) work unchanged.
|
||||
|
||||
---
|
||||
|
||||
## 2. Required secrets
|
||||
|
||||
Configured per repo at **Settings → Actions → Secrets**.
|
||||
|
||||
| Secret | Repo | Purpose |
|--------|------|---------|
| `GITEATOKEN` | both | Personal access token for the `manawenuz` user with `write:packages` scope. Used by every workflow to log into the container registry at `git.manko.yoga`. |
| `SENTRY_AUTH_TOKEN` | frontend | (Optional) For source-map upload during Next.js build. Skipped if absent. |
|
||||
|
||||
The registry itself is implicit: `git.manko.yoga` with `manawenuz` as the user. Image paths are `git.manko.yoga/manawenuz/<image>`.
|
||||
|
||||
> [!warning] If `GITEATOKEN` expires, or is rotated without updating the repo secret, all workflows fail at the `docker/login-action` step. Rotate proactively (annual reminder) and update the secret in both repos.
|
||||
|
||||
---
|
||||
|
||||
## 3. Backend workflows
|
||||
|
||||
### `docker-build-simple.yml` — manual build
|
||||
|
||||
```yaml
|
||||
name: Manual Build and Push Docker Image

on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Version to build (leave empty for package.json)'
        required: false
        type: string
|
||||
```
|
||||
|
||||
- **Trigger.** Manual only (via Gitea UI → Actions → "Run workflow").
|
||||
- **Steps.** Checkout → buildx → `docker login` → read version (input or `package.json`) → build `Dockerfile.prod` → push tags `:<version>` and `:dev` → echo result.
|
||||
- **When to use.** Cutting an ad-hoc build of a specific commit without merging to a branch. The `:dev` tag is overwritten — production (`:latest`) is **not** touched.
|
||||
- **Cache.** Uses `type=gha` cache to speed up subsequent runs.
|
||||
|
||||
### `docker-build-dev.yml` — dev branch auto-build
|
||||
|
||||
```yaml
|
||||
on:
  push:
    branches: [ development ]
    tags: [ 'v*' ]
|
||||
```
|
||||
|
||||
- **Trigger.** Every push to `development` and every tag matching `v*`.
|
||||
- **Tags pushed.** `:dev-<package-version>` + moving `:dev`.
|
||||
- **Effect.** Refreshes the dev image. The production Watchtower **does not** watch `:dev`, so this is safe to push as often as you want.
|
||||
|
||||
### `docker-build-no-cache.yml` — production build
|
||||
|
||||
```yaml
|
||||
on:
  push:
    branches: [ main, master ]
    tags: [ 'v*' ]
|
||||
```
|
||||
|
||||
- **Trigger.** Every push to `main` (or `master`) and every `v*` tag.
|
||||
- **Tags pushed.** `:<package-version>` + moving `:latest`.
|
||||
- **Effect.** Watchtower polls `:latest`, detects the new digest, restarts `nickapp-backend` on the production host. See [[Deployment#routine-deploy]].
|
||||
- **No cache.** True to its name, the workflow passes neither `cache-from` nor `cache-to`, so every build starts from scratch. That is slower (~5–8 min) but eliminates a class of stale-layer bugs. The `simple` workflow uses GHA cache for speed.
|
||||
|
||||
> [!tip] If you need to invalidate a cached layer in the `simple` workflow, run `no-cache` once — the resulting tag overwrites the registry digest and `simple`'s next run will start from a cleaner base.
|
||||
|
||||
---
|
||||
|
||||
## 4. Frontend workflows
|
||||
|
||||
Both workflows share the same shape: spin up a `node:22` container, run a deploy shell script that does `docker login + build + push`.
|
||||
|
||||
### `deploy.yml` — production
|
||||
|
||||
```yaml
|
||||
on:
  push:
    branches: [ main, master ]
  workflow_dispatch:
|
||||
```
|
||||
|
||||
Calls `./scripts/deploy.sh` — see [[Scripts#deployment]]. The script:
|
||||
|
||||
1. Reads `package.json` version.
|
||||
2. `docker login git.manko.yoga -u manawenuz -p $GITEATOKEN`.
|
||||
3. Builds `git.manko.yoga/manawenuz/escrow-frontend:<version>` and `:latest` from `Dockerfile`.
|
||||
4. Pushes both tags.
|
||||
|
||||
`:latest` is what production Watchtower watches → live deploy follows automatically.
|
||||
|
||||
### `devDeploy.yml` — development branch
|
||||
|
||||
Same as `deploy.yml` but triggered on `development` and runs `./scripts/deployDev.sh`, which pushes only `:dev`.
|
||||
|
||||
---
|
||||
|
||||
## 5. End-to-end timeline (production deploy)
|
||||
|
||||
```
|
||||
t=0         Developer merges PR → main
t+5s        Gitea webhook fires
t+10s       Gitea Actions runner pulls repo, starts container
t+30s       docker/setup-buildx-action initialised
t+45s       docker/login-action authenticated
t+2-5m      docker/build-push-action builds Dockerfile.prod
t+5m        Push to git.manko.yoga/manawenuz/escrow-backend:latest
t+5m+       Watchtower (next poll, up to 5 min) detects new digest
t+10m       Watchtower stops old container, starts new one
t+10m40s    start_period=40s elapses, healthcheck passes
t+11m       Nginx routes traffic to the new container
|
||||
```
|
||||
|
||||
**Typical SLA: 10–12 minutes from merge to live.** For an emergency rollback see [[Deployment#roll-back]].
|
||||
|
||||
---
|
||||
|
||||
## 6. Versioning automation
|
||||
|
||||
Tied to `backend/scripts/auto-version.sh` + `ai-enhanced.sh` (and the frontend mirror). Full reference in [[Git Workflow#versioning]] and [[Scripts#auto-version-sh]].
|
||||
|
||||
In short:
|
||||
|
||||
```bash
|
||||
# Developer side, on the branch they're releasing:
|
||||
npm run smart-release
|
||||
# → AI analyses last commit, picks bump (major/minor/patch/skip)
|
||||
# → bumps package.json
|
||||
# → commits "chore: bump version to vX.Y.Z"
|
||||
# → tags vX.Y.Z
|
||||
# → git push && git push --tags
|
||||
```
|
||||
|
||||
The push to `main` (or the `v*` tag) then triggers `docker-build-no-cache.yml`, which:
|
||||
|
||||
- Reads the new version from `package.json` (`node -p "require('./package.json').version"`)
|
||||
- Builds and pushes `:<version>` + `:latest`
|
||||
|
||||
So the **image tag** and the **git tag** carry the same version (`X.Y.Z` on the image, `vX.Y.Z` on the git tag), which makes them easy to correlate when investigating an issue.
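
The backend tagging rules from section 3 can be restated as a small function. This is only a summary of the rules described in this document, not code taken from the workflows; `image_tags` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch only: which image tags a backend push produces, per the rules in section 3.
# image_tags is a hypothetical helper; it prints one tag per line.
image_tags() {
  local branch=$1 version=$2
  case $branch in
    main|master) printf '%s\n' "$version" latest ;;       # docker-build-no-cache.yml
    development) printf '%s\n' "dev-$version" dev ;;      # docker-build-dev.yml
    *)           return 1 ;;                              # no automatic build
  esac
}

# Usage:
#   image_tags main "$(node -p "require('./package.json').version")"
```

Only the `latest` line matters for production, since that is the tag Watchtower polls.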
|
||||
|
||||
---
|
||||
|
||||
## 7. Adding tests to the pipeline
|
||||
|
||||
The workflows today only build + push; they do **not** run Jest or Playwright. To gate releases on tests, add a `test` job before the build:
|
||||
|
||||
```yaml
|
||||
jobs:
  test:
    runs-on: ubuntu-latest
    container: node:22
    steps:
      - uses: actions/checkout@v4
      - run: yarn install --frozen-lockfile
      - run: yarn lint
      - run: yarn test --ci --runInBand
      - run: yarn test:e2e   # if a service container is available

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    # ...existing steps...
|
||||
```
|
||||
|
||||
Or run lint + typecheck as a pre-gate using a separate workflow that triggers on PR opened/synchronised.
|
||||
|
||||
---
|
||||
|
||||
## 8. Inspecting a build
|
||||
|
||||
In Gitea: **Actions → workflow → run** to see real-time logs.
|
||||
|
||||
Useful CLI for the registry from your laptop:
|
||||
|
||||
```bash
|
||||
# List images and tags
|
||||
curl -s -u "manawenuz:$GITEATOKEN" \
|
||||
"https://git.manko.yoga/v2/manawenuz/escrow-backend/tags/list" | jq
|
||||
|
||||
# Pull a specific tag
|
||||
docker login git.manko.yoga -u manawenuz
|
||||
docker pull git.manko.yoga/manawenuz/escrow-backend:2.6.3
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 9. Self-hosted runner notes
|
||||
|
||||
Gitea Actions can use either built-in `act_runner` or your own. Currently the workflows are written for `runs-on: ubuntu-latest`, which the act_runner supplies via a generic Ubuntu container. If you need:
|
||||
|
||||
- More CPU/RAM for builds → register a beefier self-hosted runner and change `runs-on:` to its label.
|
||||
- A Docker-in-Docker setup (frontend `deploy.yml` does this with `options: --privileged`) → confirm the runner is configured to allow privileged containers.
|
||||
|
||||
---
|
||||
|
||||
## 10. Failure modes & remediation
|
||||
|
||||
| Failure | Most likely cause | Fix |
|---------|------------------|-----|
| `unauthorized: authentication required` at push | `GITEATOKEN` expired or lacks `write:packages` | Rotate the token, update the repo secret |
| `Cannot perform an interactive login from a non TTY device` | Old docker-login-action version | Bump to `docker/login-action@v3` |
| Build hangs at `yarn install` | npm registry timeout | Increase `network-timeout` (already 600000); re-run |
| Image pushed but Watchtower doesn't roll | Watchtower can't reach the registry | `docker logs watchtower`; verify `/root/.docker/config.json` is mounted into the container |
| New container fails healthcheck | App crash on boot | `docker logs nickapp-backend`; check env vars, follow [[Incident Response]] |
| Multi-arch warnings about platform | Build runner is arm64 but prod is amd64 | Set `platforms: linux/amd64` in the `docker/build-push-action` inputs |
| Image size grew suddenly | Dev dep crept into prod stage | Audit `Dockerfile.prod` for missing `--production` flag in the runtime stage |
|
||||
|
||||
---
|
||||
|
||||
## 11. Pipeline diagram
|
||||
|
||||
```
|
||||
       Push to development                   Push to main
                │                                  │
                ▼                                  ▼
┌───────────────────────────┐      ┌───────────────────────────┐
│ docker-build-dev.yml      │      │ docker-build-no-cache.yml │
│ (backend)                 │      │ (backend)                 │
│ devDeploy.yml (frontend)  │      │ deploy.yml (frontend)     │
└───────────────┬───────────┘      └───────────────┬───────────┘
                │                                  │
      push :<version>,:dev              push :<version>,:latest
                │                                  │
                ▼                                  ▼
git.manko.yoga/manawenuz/...:dev  git.manko.yoga/manawenuz/...:latest
                │                                  │
                │                                  ▼
                │                   ┌──────────────────────────┐
                │                   │ Watchtower               │
                │                   │ (poll every 5 minutes)   │
                │                   └──────────────┬───────────┘
                │                                  │
     manual pull on staging               restart containers
                                                   │
                                                   ▼
                                            Production live
|
||||
```
|
||||
|
||||
Cross-links: [[Deployment]] for what happens on the host, [[Git Workflow]] for what happens upstream, [[Scripts]] for the deploy shell scripts.
|
||||
301
08 - Operations/Database Operations.md
Normal file
@@ -0,0 +1,301 @@
---
title: Database Operations
tags: [operations]
---
|
||||
|
||||
# Database Operations
|
||||
|
||||
Day-to-day operations for the two stateful services: **MongoDB 8.2** (primary data store) and **Redis 8** (cache, rate-limit counters, ephemeral session data).
|
||||
|
||||
For schema details see [[Data Models]]. For backup procedures and disaster recovery see [[Backup & Recovery]].
|
||||
|
||||
---
|
||||
|
||||
## 1. MongoDB
|
||||
|
||||
### 1.1 Connection
|
||||
|
||||
| Env | URI in compose | Auth |
|-----|---------------|------|
| Dev | `mongodb://mongodb:27017` | none |
| Prod | `mongodb://mongodb:27017` (private network) or with creds via `.env` | typically none on the private network, but enable `--auth` if exposed |
|
||||
|
||||
The DB name comes from `DB_NAME` (e.g. `marketplace`). See [[Environment Variables#database]].
|
||||
|
||||
Connect from a shell inside the host:
|
||||
|
||||
```bash
|
||||
# Dev
|
||||
docker exec -it nickdev-mongodb mongosh
|
||||
|
||||
# Prod
|
||||
docker exec -it nickapp-mongodb mongosh
|
||||
> use marketplace
|
||||
> show collections
|
||||
```
|
||||
|
||||
If auth is enabled:
|
||||
|
||||
```bash
|
||||
docker exec -it nickapp-mongodb mongosh \
|
||||
-u "$MONGO_INITDB_ROOT_USERNAME" -p "$MONGO_INITDB_ROOT_PASSWORD" \
|
||||
--authenticationDatabase admin
|
||||
```
|
||||
|
||||
### 1.2 Init scripts (`mongo-init/`)
|
||||
|
||||
The production compose bind-mounts `./mongo-init` into `/docker-entrypoint-initdb.d`. Mongo runs `*.js` and `*.sh` from this folder **only on a fresh datadir** (first boot of a new volume). Use this to:
|
||||
|
||||
- Create application users (`db.createUser({...})`)
|
||||
- Bootstrap collections + indexes that must exist before the app starts
|
||||
|
||||
Example `mongo-init/01-create-user.js`:
|
||||
|
||||
```js
|
||||
db = db.getSiblingDB('marketplace');
db.createUser({
  user: 'marketplace_app',
  pwd: process.env.MARKETPLACE_APP_PWD,
  roles: [{ role: 'readWrite', db: 'marketplace' }],
});
|
||||
```
|
||||
|
||||
> [!warning] These scripts do **not** run when you restart an existing container. To force re-init, drop the `mongodb_data` volume — which destroys all data. Plan accordingly.
|
||||
|
||||
### 1.3 Indexes
|
||||
|
||||
Indexes are declared in Mongoose schemas under `backend/src/models/`. The app calls `Model.createIndexes()` on connection (via the model's `syncIndexes`/`ensureIndexes` lifecycle). Highlights:
|
||||
|
||||
| Collection | Key indexes |
|------------|-------------|
| `users` | `email` (unique), `googleId` (sparse), `role`, `createdAt` |
| `addresses` | `userId` + compound for primary lookup |
| `purchaserequests` | `buyerId`, `status`, `createdAt`, text index on `title`+`description` |
| `selleroffers` | `requestId`, `sellerId`, `status` |
| `payments` | `providerPaymentId` (unique sparse), `userId`, `status`, `createdAt`, `transactionHash` |
| `chats` | `participants` (array), `updatedAt` |
| `notifications` | `userId` + `read`, `createdAt` |
| `tempverifications` | TTL on `expiresAt` (auto-deletes expired OTPs) |
|
||||
|
||||
To verify a specific collection:
|
||||
|
||||
```js
|
||||
db.payments.getIndexes()
|
||||
```
|
||||
|
||||
The preferred way to add an index is to declare it in the Mongoose schema and ship a deploy. For emergency hotfixes, create it directly in the shell:
|
||||
|
||||
```js
|
||||
db.payments.createIndex({ providerPaymentId: 1 }, { unique: true, sparse: true });
|
||||
```
|
||||
|
||||
### 1.4 TTL indexes
|
||||
|
||||
Currently used on `tempverifications.expiresAt` (5-minute auto-purge of email OTPs / passkey challenges). Mongo's TTL monitor runs every 60 seconds — purge isn't immediate.
|
||||
|
||||
If you add more TTL indexes:
|
||||
|
||||
```js
|
||||
db.notifications.createIndex({ createdAt: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 90 }); // 90 days
|
||||
```
|
||||
|
||||
### 1.5 Backup with `mongodump`
|
||||
|
||||
```bash
|
||||
# Connect into the container, dump locally, copy out
|
||||
docker exec nickapp-mongodb sh -c \
|
||||
"mongodump --db=marketplace --archive=/tmp/marketplace-$(date +%F).archive --gzip"
|
||||
docker cp nickapp-mongodb:/tmp/marketplace-$(date +%F).archive ./backups/
|
||||
|
||||
# Or stream directly to host
|
||||
docker exec nickapp-mongodb \
|
||||
mongodump --db=marketplace --archive --gzip \
|
||||
> ./backups/marketplace-$(date +%F).gz
|
||||
```
|
||||
|
||||
For full details (retention, RTO/RPO, offsite copies) see [[Backup & Recovery]].
|
||||
|
||||
### 1.6 Restore
|
||||
|
||||
```bash
|
||||
# Restore an archive to an empty database
|
||||
docker exec -i nickapp-mongodb \
|
||||
mongorestore --archive --gzip --drop \
|
||||
< ./backups/marketplace-2026-05-20.gz
|
||||
```
|
||||
|
||||
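`--drop` drops each collection before restoring it. Omit it to merge: `mongorestore` only inserts, so documents whose `_id` already exists are skipped rather than overwritten.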
|
||||
|
||||
> [!warning] Restoring is **destructive** to current data. Always practise on a staging clone first.
|
||||
|
||||
### 1.7 Migrations
|
||||
|
||||
There is no formal migration framework. Two patterns are used:
|
||||
|
||||
- **Mongoose schema changes** are forward-compatible (new optional fields default to `undefined`). Older documents will still load.
|
||||
- **Data backfills** are one-shot scripts in `backend/src/scripts/` (e.g. `migrateUserPoints.ts`, `fix-transaction-hashes.js`, `fix-dispute-sellers.js`).
|
||||
|
||||
Pattern for a new migration:
|
||||
|
||||
1. Add a `src/seeds/migrate<Thing>.ts` script that is idempotent (use `$exists: false` guards).
|
||||
2. Run on staging, confirm.
|
||||
3. Take a backup ([[Backup & Recovery]]).
|
||||
4. Run in production: `docker exec -it nickapp-backend node dist/seeds/migrate<Thing>.js`.
|
||||
5. Commit the script (it serves as a record of what changed).
|
||||
|
||||
### 1.8 Common admin queries
|
||||
|
||||
```js
|
||||
// Count documents matching a filter
db.users.countDocuments({ role: 'buyer' })

// Disk usage per collection, in MiB (`size` would be the uncompressed data size)
db.runCommand({ collStats: 'payments', scale: 1024*1024 }).storageSize
|
||||
|
||||
// Slow queries
|
||||
db.setProfilingLevel(1, { slowms: 200 }) // log queries > 200ms
|
||||
db.system.profile.find().sort({ ts: -1 }).limit(10)
|
||||
|
||||
// Lock contention
|
||||
db.serverStatus().locks
|
||||
```
|
||||
|
||||
### 1.9 Seeding production safely
|
||||
|
||||
Seed scripts are designed to be idempotent for **categories** but **destructive** for users/addresses. Don't run `seed:all` in production.
|
||||
|
||||
Safe in production:
|
||||
|
||||
```bash
|
||||
docker exec -it nickapp-backend node dist/seeds/seedCategories.js
|
||||
docker exec -it nickapp-backend node dist/seeds/seedLevels.js
|
||||
```
|
||||
|
||||
Optional auto-seed on startup: set `AUTO_SEED_ON_START=true` in `.env`. The bootstrap code only seeds when no non-admin users exist — safe to leave on.
|
||||
|
||||
> [!warning] **Never** run `seed:all` or `seed:users` against production. They drop the existing `users` and `addresses` collections.
|
||||
|
||||
---
|
||||
|
||||
## 2. Redis
|
||||
|
||||
### 2.1 Connection
|
||||
|
||||
Dev: `redis://redis:6379` (no password).
|
||||
Prod: `redis://:<REDIS_PASSWORD>@redis:6379`. The compose command line is `redis-server --requirepass "$REDIS_PASSWORD"`.
|
||||
|
||||
Inspect:
|
||||
|
||||
```bash
|
||||
docker exec -it nickapp-redis redis-cli -a "$REDIS_PASSWORD"
|
||||
> INFO server
|
||||
> DBSIZE
|
||||
> KEYS * # prod-unsafe on large datasets, use SCAN
|
||||
```
|
||||
|
||||
### 2.2 What we store
|
||||
|
||||
- **Rate-limit counters** for `express-rate-limit`
|
||||
- **Session data** for refresh-token tracking and revocation lists
|
||||
- **Socket.IO adapter state** (when scaled horizontally — currently single-node)
|
||||
- **Application caches** (TTL'd keys for expensive aggregates)
|
||||
- **Idempotency keys** for webhook deduplication
|
||||
|
||||
Key prefixes follow `<service>:<entity>:<id>`. E.g. `payment:idem:<requestId>`, `auth:refresh:<userId>`.
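
Because keys follow `<service>:<entity>:<id>`, a `SCAN`-based listing can be grouped by its first two segments to see what occupies the keyspace. A minimal sketch, with `count_by_prefix` as a hypothetical helper that reads key names on stdin:

```bash
#!/usr/bin/env bash
# Sketch only: count keys per <service>:<entity> prefix.
# count_by_prefix is a hypothetical helper; keys without a colon are ignored.
count_by_prefix() {
  awk -F: 'NF >= 2 { counts[$1 ":" $2]++ }
           END { for (p in counts) print counts[p], p }' \
    | sort -k1,1nr -k2,2
}

# Usage (--scan iterates with SCAN, so it is safe on large datasets):
#   docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" --no-auth-warning --scan \
#     | count_by_prefix
```

The output is one line per prefix, largest first, which makes an unexpectedly large group of keys easy to spot.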
|
||||
|
||||
### 2.3 Persistence
|
||||
|
||||
Redis 8 defaults to **RDB snapshots** + optional **AOF**. Our compose uses the default config:
|
||||
|
||||
- RDB snapshot triggers: `save 3600 1`, `save 300 100`, `save 60 10000`.
|
||||
- AOF is **disabled** by default.
|
||||
- RDB file lives at `/data/dump.rdb` inside the `redis_data` volume.
|
||||
|
||||
**To enable AOF** for stronger durability, override the command in `docker-compose.production.yml`:
|
||||
|
||||
```yaml
|
||||
redis:
  command: ["sh","-lc","redis-server --requirepass \"$${REDIS_PASSWORD}\" --appendonly yes --appendfsync everysec"]
|
||||
```
|
||||
|
||||
`appendfsync everysec` is the common compromise: at most 1 second of writes lost on crash, with negligible perf impact.
|
||||
|
||||
### 2.4 Eviction policy
|
||||
|
||||
Default is `noeviction` — Redis refuses writes when memory is full. For our use (caches that can be regenerated), set:
|
||||
|
||||
```bash
|
||||
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" \
|
||||
CONFIG SET maxmemory 256mb
|
||||
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" \
|
||||
CONFIG SET maxmemory-policy allkeys-lru
|
||||
```
|
||||
|
||||
Persist by adding to a custom `redis.conf` mounted at `/usr/local/etc/redis/redis.conf` (then change the compose `command:` to `["redis-server","/usr/local/etc/redis/redis.conf","--requirepass",...]`).
|
||||
|
||||
### 2.5 Backup
|
||||
|
||||
Redis backups are usually unnecessary (the data is regeneratable) but still cheap:
|
||||
|
||||
```bash
|
||||
# Snapshot now
|
||||
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
|
||||
docker cp nickapp-redis:/data/dump.rdb ./backups/redis-$(date +%F).rdb
|
||||
```
|
||||
|
||||
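`BGSAVE` is non-blocking (it forks) and returns before the snapshot is written, so wait for it to finish before copying. For AOF, also copy the `/data/appendonlydir/` directory (Redis 7 and later store the AOF as multiple files there).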
|
||||
|
||||
### 2.6 Cache flush
|
||||
|
||||
When deploying breaking changes to cached schemas:
|
||||
|
||||
```bash
|
||||
# Flush everything (DEV ONLY)
|
||||
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL
|
||||
|
||||
# Targeted (safer)
|
||||
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" \
|
||||
--scan --pattern 'payment:idem:*' | \
|
||||
xargs -L 1 docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" DEL
|
||||
```
|
||||
|
||||
> [!warning] `FLUSHALL` will sign out every user with an active refresh token and reset every rate-limit counter. Avoid in production unless that is what you want.
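
A sketch of a batched variant of the targeted deletion, assuming the container name and `REDIS_PASSWORD` variable used on this page. It swaps `DEL` for `UNLINK` (which reclaims memory in the background) and sends keys in batches instead of one `docker exec` per key. `redis_cmd` and `delete_by_pattern` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around redis-cli inside the nickapp-redis container.
redis_cmd() {
  docker exec nickapp-redis redis-cli --no-auth-warning -a "$REDIS_PASSWORD" "$@"
}

# Delete every key matching a pattern, batch_size keys per UNLINK call.
delete_by_pattern() {
  local pattern="$1" batch_size="${2:-100}" keys=() key
  while IFS= read -r key; do
    [ -n "$key" ] || continue
    keys+=("$key")
    if [ "${#keys[@]}" -ge "$batch_size" ]; then
      redis_cmd UNLINK "${keys[@]}" >/dev/null
      keys=()
    fi
  done < <(redis_cmd --scan --pattern "$pattern")
  # Flush the final partial batch, if any
  if [ "${#keys[@]}" -gt 0 ]; then
    redis_cmd UNLINK "${keys[@]}" >/dev/null
  fi
}
```

Usage: `delete_by_pattern 'payment:idem:*'`. Keys containing newlines are not handled by this sketch.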
|
||||
|
||||
### 2.7 Monitoring
|
||||
|
||||
```bash
|
||||
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
|
||||
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO memory
|
||||
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" SLOWLOG GET 10
|
||||
```
|
||||
|
||||
Watch `evicted_keys`, `keyspace_misses`, `rejected_connections` — see [[Monitoring]] for thresholds.
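
To pull a single counter out of the `INFO` output (for a cron check or an alert script), a small filter is enough. This is a sketch; `info_field` is a hypothetical name, and it relies only on the `key:value` line format that `INFO` prints, with CRLF line endings.

```bash
#!/usr/bin/env bash
# Read INFO output on stdin and print the value of one field.
info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}
```

Usage: `docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats | info_field evicted_keys`.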
|
||||
|
||||
---
|
||||
|
||||
## 3. Maintenance windows
|
||||
|
||||
For both DBs, schedule a window when:
|
||||
|
||||
- Bumping major version (Mongo 8 → 9, Redis 8 → 9)
|
||||
- Restoring from backup
|
||||
- Running a destructive migration
|
||||
|
||||
Suggested checklist:
|
||||
|
||||
1. Announce in #ops Slack / status page.
|
||||
2. Trigger `mongodump` (see [[Backup & Recovery]]).
|
||||
3. Stop the backend container so writes stop: `docker compose -f docker-compose.production.yml stop nickapp-backend`.
|
||||
4. Perform the operation.
|
||||
5. Restart backend: `docker compose -f docker-compose.production.yml start nickapp-backend`.
|
||||
6. Verify health: `curl https://amn.gg/api/health`.
|
||||
7. Close window.
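
Steps 3 to 6 of the checklist can be wrapped so the backend is always restarted, even when the operation fails. This is a sketch under assumptions: `run_in_window`, `wait_healthy`, `COMPOSE_FILE_PATH`, and `HEALTH_URL` are hypothetical names, and the health URL is the one from step 6.

```bash
#!/usr/bin/env bash
COMPOSE_FILE_PATH="${COMPOSE_FILE_PATH:-docker-compose.production.yml}"
HEALTH_URL="${HEALTH_URL:-https://amn.gg/api/health}"

# Poll the health endpoint until it answers, up to N attempts (2 s apart).
wait_healthy() {
  local attempts="${1:-30}" i
  for ((i = 1; i <= attempts; i++)); do
    if curl -fsS "$HEALTH_URL" >/dev/null 2>&1; then
      return 0
    fi
    sleep 2
  done
  return 1
}

# Stop the backend, run the given command, restart the backend, verify health.
# Returns the command's exit status if the restart and health check succeed.
run_in_window() {
  docker compose -f "$COMPOSE_FILE_PATH" stop nickapp-backend || return 1
  local status=0
  "$@" || status=$?
  docker compose -f "$COMPOSE_FILE_PATH" start nickapp-backend || return 1
  wait_healthy || return 1
  return "$status"
}
```

Usage: `run_in_window ./scripts/my-migration.sh`, where the script path is a placeholder for whatever operation the window is for.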
|
||||
|
||||
---
|
||||
|
||||
## 4. Cross-links
|
||||
|
||||
- [[Backup & Recovery]] — formal backup/restore procedures, RTO/RPO targets, offsite storage.
|
||||
- [[Monitoring]] — what metrics to watch (slow queries, evictions, replication lag).
|
||||
- [[Incident Response]] — runbooks for "MongoDB unreachable" and "Redis unreachable".
|
||||
- [[Data Models]] — schema details for every collection.
|
||||
255
08 - Operations/Deployment.md
Normal file
@@ -0,0 +1,255 @@
|
||||
---
|
||||
title: Deployment
|
||||
tags: [operations]
|
||||
---
|
||||
|
||||
# Deployment
|
||||
|
||||
How the production stack runs and gets updated on the live host. The stack is fully containerised and self-updates via Watchtower from the Gitea container registry.
|
||||
|
||||
---
|
||||
|
||||
## 1. Topology
|
||||
|
||||
```
|
||||
                     ┌─────────────────────────┐
HTTPS 443 ──────────►│ External SSL term.      │
                     │ (DNS amn.gg, dev.amn.gg)│
                     └────────────┬────────────┘
                                  │ HTTP 80 (in-VPC)
                                  ▼
                  ┌──────────────────────────────────┐
                  │ Nginx container                  │
                  │ (nickapp-nginx, port 80)         │
                  └─┬───────────────────┬────────────┘
                    │ /                 │ /api /socket.io
                    ▼                   ▼
         ┌─────────────────────┐ ┌──────────────────────────┐
         │ nickapp-frontend    │ │ nickapp-backend          │
         │ Next.js, port 8083  │ │ Express 5, port 5001     │
         └─────────────────────┘ └──────┬────────────┬──────┘
                                        │            │
                                        ▼            ▼
                                  ┌──────────┐ ┌──────────┐
                                  │ mongodb  │ │ redis    │
                                  │ 8.2      │ │ 8        │
                                  └──────────┘ └──────────┘

                  ┌──────────────────────────────────┐
                  │ Watchtower                       │
                  │ Polls registry → restarts        │
                  │ containers labelled enable=true  │
                  └──────────────────────────────────┘
|
||||
```
|
||||
|
||||
All containers run on the **`default`** Docker network defined by `docker-compose.production.yml`. Watchtower runs as a sidecar container on the same host.
|
||||
|
||||
DNS resolves both `amn.gg` and `dev.amn.gg` to the production host's public IP. SSL termination happens **outside** the compose stack (typically via the hosting provider's edge or a host-level reverse proxy), and traffic is forwarded as HTTP to the `nginx` container on port `80` (mapped to host `8083`).
|
||||
|
||||
---
|
||||
|
||||
## 2. Compose file
|
||||
|
||||
`backend/docker-compose.production.yml` is the single source of truth. Services:
|
||||
|
||||
| Service | Image | Ports | Volumes | Notes |
|---------|-------|-------|---------|-------|
| `nginx` | `nginx:alpine` | `8083:80` | `./nginx/nginx.conf`, `./nginx/logs`, `./uploads` (served as `/uploads`) | Reverse proxy |
| `nickapp-backend` | `nickapp-backend:latest` (build from `Dockerfile.prod`) | not exposed externally | `./uploads:/app/uploads` | Labelled for Watchtower |
| `nickapp-frontend` | `nickapp-frontend:latest` (build from `../frontend/Dockerfile`) | `expose: 8083` | none | Labelled for Watchtower |
| `mongodb` | `mongo:8.2` | not exposed | `mongodb_data:/data/db`, `./mongo-init:/docker-entrypoint-initdb.d` | Healthcheck via `mongosh ping` |
| `redis` | `redis:8-alpine` | not exposed | `redis_data:/data` | Started with `--requirepass "$REDIS_PASSWORD"` |
|
||||
|
||||
Healthchecks are configured for backend (`curl /health`), frontend (`curl /`), Mongo (`mongosh ping`), and Redis (`redis-cli -a $REDIS_PASSWORD ping`). See [[Monitoring]].
|
||||
|
||||
Watchtower polls images labelled `com.centurylinklabs.watchtower.enable=true` — currently `nickapp-backend` and `nickapp-frontend`. MongoDB and Redis are **not** auto-updated.
|
||||
|
||||
---
|
||||
|
||||
## 3. Registry & images
|
||||
|
||||
| Image | Registry path |
|-------|---------------|
| Backend prod | `git.manko.yoga/manawenuz/escrow-backend:latest` |
| Backend dev | `git.manko.yoga/manawenuz/escrow-backend:dev` |
| Backend tagged | `git.manko.yoga/manawenuz/escrow-backend:<package-version>` |
| Frontend | `git.manko.yoga/manawenuz/escrow-frontend:latest` and `:<version>` |
|
||||
|
||||
`docker-compose.production.yml` currently builds locally on first up (`build: context: .`). Once images are in the registry the file can be switched to `image: git.manko.yoga/manawenuz/escrow-backend:latest` to let Watchtower pull straight from there.
|
||||
|
||||
> [!tip] To pin a specific version while debugging, edit the compose file to `image: git.manko.yoga/manawenuz/escrow-backend:2.6.3` and re-run `docker compose up -d`. Remove the Watchtower label or the agent will undo it on next poll.
|
||||
|
||||
---
|
||||
|
||||
## 4. Watchtower
|
||||
|
||||
Watchtower runs as its own container (managed outside the compose file) with `WATCHTOWER_LABEL_ENABLE=true` so it only touches services that opt in. On each poll cycle (every 5 minutes here, set via `WATCHTOWER_POLL_INTERVAL=300`; Watchtower's own default is 24 hours) it:
|
||||
|
||||
1. Pulls the latest digest for each enabled service's image.
|
||||
2. Compares to the running container's digest.
|
||||
3. If different, stops the container, removes it, and starts a new one from the new image, preserving all named volumes.
|
||||
|
||||
Configuration knobs typically set on the host:
|
||||
|
||||
```bash
|
||||
# The config.json mount lets Watchtower pull from the private Gitea registry.
# (A comment after a trailing backslash would break the line continuation.)
docker run -d --name watchtower \
  --restart unless-stopped \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /root/.docker/config.json:/config.json \
  -e WATCHTOWER_POLL_INTERVAL=300 \
  -e WATCHTOWER_LABEL_ENABLE=true \
  -e WATCHTOWER_CLEANUP=true \
  -e WATCHTOWER_INCLUDE_RESTARTING=true \
  containrrr/watchtower
|
||||
```
|
||||
|
||||
The `~/.docker/config.json` must have a valid login for `git.manko.yoga` (created via `docker login git.manko.yoga -u manawenuz`).
|
||||
|
||||
---
|
||||
|
||||
## 5. First-time deploy (cold start)
|
||||
|
||||
> [!warning] Run these steps on a fresh production host. They are destructive on an existing one. See [[Backup & Recovery]] before touching live data.
|
||||
|
||||
### Prerequisites on the host
|
||||
|
||||
- Ubuntu 22.04+ (or any systemd Linux), Docker Engine 24+, `docker compose` plugin
|
||||
- `git` installed
|
||||
- DNS `amn.gg` + `dev.amn.gg` already pointing here
|
||||
- An SSL terminator (Caddy / Nginx / Cloudflare) reverse-proxying to host port `8083`
|
||||
- Registry login: `docker login git.manko.yoga -u manawenuz`
|
||||
|
||||
### Steps
|
||||
|
||||
```bash
|
||||
# 1. Clone both repos as siblings (compose references ../frontend)
|
||||
cd /opt
|
||||
git clone ssh://git@git.manko.yoga:222/nick/backend.git
|
||||
git clone ssh://git@git.manko.yoga:222/nick/frontend.git
|
||||
cd backend
|
||||
git checkout main
|
||||
|
||||
# 2. Create the production .env
|
||||
sudo nano .env # fill from Environment Variables doc; production values, real secrets
|
||||
|
||||
# 3. Provision the nginx config + uploads dir
|
||||
mkdir -p nginx/logs uploads mongo-init
|
||||
sudo cp /path/to/nginx.conf nginx/nginx.conf
|
||||
# (the nginx.conf forwards /api/* and /socket.io/* to nickapp-backend:5001,
|
||||
# forwards /uploads/* to /uploads (volume), and everything else to nickapp-frontend:8083)
|
||||
|
||||
# 4. Build & start the stack
|
||||
docker compose -f docker-compose.production.yml up --build -d
|
||||
|
||||
# 5. Verify
|
||||
docker compose -f docker-compose.production.yml ps
|
||||
docker compose -f docker-compose.production.yml logs -f --tail=200
|
||||
curl -fsS http://localhost:8083/api/health | jq .
|
||||
|
||||
# 6. Seed initial data (optional — if AUTO_SEED_ON_START=true is set, it's already done)
|
||||
docker compose -f docker-compose.production.yml exec nickapp-backend node dist/scripts/seedCategories.js
|
||||
|
||||
# 7. Start Watchtower (one-time)
|
||||
docker run -d --name watchtower --restart unless-stopped \
|
||||
-v /var/run/docker.sock:/var/run/docker.sock \
|
||||
-v /root/.docker/config.json:/config.json \
|
||||
-e WATCHTOWER_POLL_INTERVAL=300 \
|
||||
-e WATCHTOWER_LABEL_ENABLE=true \
|
||||
-e WATCHTOWER_CLEANUP=true \
|
||||
containrrr/watchtower
|
||||
```
|
||||
|
||||
### SSL / TLS
|
||||
|
||||
Termination happens at the edge — outside the compose stack. The two common setups:
|
||||
|
||||
- **Caddy on the host** forwarding `amn.gg` and `dev.amn.gg` to `127.0.0.1:8083`. Caddy handles Let's Encrypt automatically.
|
||||
- **Cloudflare Full (strict)** in front of the host. Use Cloudflare Origin certificates on the host's Caddy/Nginx.
|
||||
|
||||
Either way, the compose stack itself sees only HTTP on port 80 inside the nginx container. The `nginx.conf` should set `proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto` and the backend already trusts the proxy when `NODE_ENV=production` (see `trust proxy` block in `src/app.ts`).
|
||||
|
||||
---
|
||||
|
||||
## 6. Routine deploy (after first deploy)
|
||||
|
||||
The normal flow is **fully automatic**:
|
||||
|
||||
1. Developer merges PR to `main` (see [[Git Workflow]]).
|
||||
2. Gitea Actions runs `.gitea/workflows/docker-build-no-cache.yml` (backend) or `deploy.yml` (frontend). The workflow builds the production image and pushes `:latest` + `:<version>` to the registry. See [[CI-CD Pipeline]].
|
||||
3. Watchtower polls the registry, sees a new digest, restarts the container.
|
||||
4. Healthcheck on the new container passes after `start_period=40s`, traffic resumes.
|
||||
|
||||
Total time from merge to live: **5–10 minutes** depending on Watchtower poll interval and image size.
|
||||
|
||||
### Force an immediate deploy
|
||||
|
||||
If you don't want to wait for the poll:
|
||||
|
||||
```bash
|
||||
# On the production host:
|
||||
cd /opt/backend
|
||||
docker login git.manko.yoga -u manawenuz # if creds expired
|
||||
docker compose -f docker-compose.production.yml pull nickapp-backend nickapp-frontend
|
||||
docker compose -f docker-compose.production.yml up -d nickapp-backend nickapp-frontend
|
||||
```
|
||||
|
||||
The `up -d` will detect changed images and restart only the affected containers.
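
After a forced deploy it helps to block until Docker reports the new container as healthy. A minimal sketch, assuming the container has a healthcheck configured (as the compose file describes); `wait_container_healthy` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Poll Docker's health status for a container until it is healthy,
# reports unhealthy, or the timeout (in seconds) expires.
wait_container_healthy() {
  local name="$1" timeout="${2:-120}" waited=0 status
  while [ "$waited" -lt "$timeout" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null || true)
    case "$status" in
      healthy) return 0 ;;
      unhealthy) return 1 ;;
    esac
    sleep 2
    waited=$((waited + 2))
  done
  return 1
}
```

Usage: `wait_container_healthy nickapp-backend 120 && echo "backend is up"`.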
|
||||
|
||||
### Roll back
|
||||
|
||||
```bash
|
||||
# Find available versions
|
||||
docker images git.manko.yoga/manawenuz/escrow-backend
|
||||
|
||||
# Pin to the previous tag in the compose file
|
||||
sed -i 's|escrow-backend:latest|escrow-backend:2.6.2|' docker-compose.production.yml
|
||||
|
||||
# Re-up
|
||||
docker compose -f docker-compose.production.yml up -d nickapp-backend
|
||||
|
||||
# Pause Watchtower until you're ready to resume auto-updates
docker stop watchtower   # later: docker start watchtower
|
||||
```
|
||||
|
||||
> [!warning] Watchtower will undo a pin to a non-`latest` tag on its next poll if the container still has the `watchtower.enable=true` label. Either remove the label temporarily or pause Watchtower (`docker stop watchtower`).
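
The `sed` one-liner above only works while the compose file still says `:latest`. A sketch of a pin helper that replaces whatever tag is currently set, assuming the service uses an `image:` line that ends in `<image-name>:<tag>`. `pin_image_tag` is a hypothetical name; it keeps a `.bak` copy of the file.

```bash
#!/usr/bin/env bash
# Rewrite the tag on every "image: ...<image>:<tag>" line in a compose file.
pin_image_tag() {
  local file="$1" image="$2" tag="$3"
  sed -i.bak -E "s|(image:[[:space:]]*.*${image}):[^[:space:]\"]+|\1:${tag}|" "$file"
}
```

Usage: `pin_image_tag docker-compose.production.yml escrow-backend 2.6.2`, followed by `docker compose -f docker-compose.production.yml up -d nickapp-backend`.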
|
||||
|
||||
---
|
||||
|
||||
## 7. Logs
|
||||
|
||||
```bash
|
||||
# All services
|
||||
docker compose -f docker-compose.production.yml logs -f --tail=300
|
||||
|
||||
# Single service
|
||||
docker compose -f docker-compose.production.yml logs -f nickapp-backend
|
||||
|
||||
# Nginx access log
|
||||
tail -f /opt/backend/nginx/logs/access.log
|
||||
```
|
||||
|
||||
Backend logs are also captured by Sentry breadcrumbs when an error occurs — see [[Monitoring]].
|
||||
|
||||
---
|
||||
|
||||
## 8. Maintenance window
|
||||
|
||||
Plan a 5-minute window when bumping major versions or running migrations:
|
||||
|
||||
```bash
|
||||
# Announce + drain
|
||||
# (set a maintenance banner in the frontend if possible)
|
||||
|
||||
# Take a backup first
|
||||
./scripts/backup-mongo.sh # or per Backup & Recovery
|
||||
|
||||
# Pull new images, restart
|
||||
docker compose -f docker-compose.production.yml pull
|
||||
docker compose -f docker-compose.production.yml up -d
|
||||
|
||||
# Verify
|
||||
curl -fsS https://amn.gg/api/health
|
||||
```
|
||||
|
||||
If anything goes sideways, follow [[Incident Response]].
|
||||
381
08 - Operations/Docker Setup.md
Normal file
@@ -0,0 +1,381 @@
|
||||
---
|
||||
title: Docker Setup
|
||||
tags: [operations]
|
||||
---
|
||||
|
||||
# Docker Setup
|
||||
|
||||
Walk-through of every Dockerfile, compose file, volume, and network used by the marketplace stack. Cross-references [[Deployment]] for the live-host configuration and [[Local Setup]] for developer use.
|
||||
|
||||
---
|
||||
|
||||
## 1. Backend — `Dockerfile.dev`
|
||||
|
||||
Path: `/Users/mojtabaheidari/code/backend/Dockerfile.dev`
|
||||
|
||||
```dockerfile
|
||||
FROM node:22-alpine
|
||||
RUN corepack enable
|
||||
WORKDIR /app
|
||||
COPY package.json ./
|
||||
RUN yarn install --frozen-lockfile
|
||||
COPY . .
|
||||
RUN mkdir -p uploads/{avatars,documents,products,temp}
|
||||
RUN addgroup -g 1001 -S nodejs && adduser -S marketplace -u 1001
|
||||
RUN chown -R marketplace:nodejs /app
|
||||
USER marketplace
|
||||
EXPOSE 5001
|
||||
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
|
||||
CMD node healthcheck.js
|
||||
CMD ["yarn", "dev"]
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- **Base.** `node:22-alpine`: small, built on musl libc rather than glibc. Corepack is enabled to use the pinned Yarn 1.22.22.
|
||||
- **Install.** `yarn install --frozen-lockfile` brings dev dependencies (needed for `ts-node` + `nodemon` hot reload).
|
||||
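- **Uploads scaffold.** Intended to create the four canonical upload directories so the API doesn't have to `mkdir` at runtime. Caveat: `RUN` uses Alpine's `/bin/sh` (BusyBox ash), which does not perform `{a,b}` brace expansion, so the line as written creates a single directory literally named `{avatars,documents,products,temp}`; listing the four paths explicitly avoids this.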
|
||||
- **Non-root user.** Process runs as `marketplace` (uid `1001`). Defence-in-depth.
|
||||
- **Healthcheck.** `healthcheck.js` does a local HTTP GET to `/health` (see [[Monitoring]]).
|
||||
- **CMD.** `yarn dev` → `nodemon --exec ts-node src/app.ts`. Source code is mounted from the host so saves trigger restarts.
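
For the uploads scaffold, a portable form that does not depend on brace expansion (BusyBox ash, the `/bin/sh` in Alpine images, does not support it) could look like the sketch below. The function name `make_upload_dirs` is illustrative; in a Dockerfile the same loop would sit inside a single `RUN` line.

```bash
#!/bin/sh
# Create the four upload directories using only POSIX shell features.
make_upload_dirs() {
  root="${1:-uploads}"
  for name in avatars documents products temp; do
    mkdir -p "$root/$name"
  done
}
```

Usage: `make_upload_dirs /app/uploads`.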
|
||||
|
||||
Used by `docker-compose.dev.yml`. Not pushed to the registry — dev images are local.
|
||||
|
||||
---
|
||||
|
||||
## 2. Backend — `Dockerfile.prod`
|
||||
|
||||
Path: `/Users/mojtabaheidari/code/backend/Dockerfile.prod`
|
||||
|
||||
Multi-stage build to keep the runtime image small and free of build tooling.
|
||||
|
||||
```dockerfile
|
||||
# ---- builder ----
|
||||
FROM node:22-alpine AS builder
|
||||
RUN corepack enable
|
||||
WORKDIR /app
|
||||
COPY package.json ./
|
||||
COPY healthcheck.js ./
|
||||
RUN yarn install --frozen-lockfile
|
||||
COPY . .
|
||||
RUN yarn build # tsc → ./dist
|
||||
|
||||
# ---- production ----
|
||||
FROM node:22-alpine AS production
|
||||
RUN corepack enable
|
||||
WORKDIR /app
|
||||
COPY package.json ./
|
||||
COPY healthcheck.js ./
|
||||
RUN yarn install --frozen-lockfile --production && yarn cache clean
|
||||
COPY --from=builder /app/dist ./dist
|
||||
RUN mkdir -p uploads/{avatars,documents,products,temp}
|
||||
RUN addgroup -g 1001 -S nodejs && adduser -S marketplace -u 1001
|
||||
RUN chown -R marketplace:nodejs /app
|
||||
USER marketplace
|
||||
EXPOSE 5001
|
||||
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
|
||||
CMD node healthcheck.js
|
||||
CMD ["node", "dist/app.js"]
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- **Two stages.** `builder` compiles TS to JS; `production` keeps only the compiled output + production deps. Final image is ~150 MB.
|
||||
- **No dev deps.** `--production` flag in the second stage trims away TypeScript, Jest, ts-node etc.
|
||||
- **Same non-root pattern.** `marketplace:nodejs` (uid 1001).
|
||||
- **CMD.** Plain `node dist/app.js` — no transpilation at runtime.
|
||||
- **Uploads.** The directory is created inside the image, then the running container mounts `/app/uploads` from a host volume in compose (overrides the embedded dir).
|
||||
|
||||
Built and pushed by `.gitea/workflows/docker-build-no-cache.yml` (and friends — see [[CI-CD Pipeline]]). The resulting image is `git.manko.yoga/manawenuz/escrow-backend:<version>` + `:latest`.
|
||||
|
||||
---
|
||||
|
||||
## 3. Frontend — `Dockerfile` (production)
|
||||
|
||||
Path: `/Users/mojtabaheidari/code/frontend/Dockerfile`
|
||||
|
||||
Multi-stage Next.js **standalone** build.
|
||||
|
||||
```dockerfile
|
||||
# ---- builder ----
|
||||
FROM node:22-alpine AS builder
|
||||
# (NEXT_PUBLIC_* vars set here so they bake into the bundle)
|
||||
ENV NEXT_PUBLIC_API_URL=https://dev.amn.gg/api
|
||||
ENV NEXT_PUBLIC_BACKEND_URL=https://dev.amn.gg
|
||||
# ...more ENV lines (see file)...
|
||||
|
||||
RUN apk add --no-cache git python3 make g++ py3-pip
|
||||
RUN corepack enable
|
||||
WORKDIR /app
|
||||
COPY package.json yarn.lock* ./
|
||||
RUN yarn install --frozen-lockfile --production=false --network-timeout 600000
|
||||
COPY src ./src
|
||||
COPY public ./public
|
||||
COPY next.config.ts tsconfig.json ./
|
||||
COPY *.config.mjs ./
|
||||
ENV NODE_ENV=production
|
||||
ENV NEXT_TELEMETRY_DISABLED=1
|
||||
RUN yarn build # produces .next/standalone + .next/static
|
||||
|
||||
# ---- runner ----
|
||||
FROM node:22-alpine AS runner
|
||||
RUN apk add --no-cache curl
|
||||
RUN addgroup --system --gid 1001 nodejs && adduser --system --uid 1001 nextjs
|
||||
WORKDIR /app
|
||||
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
|
||||
COPY --from=builder --chown=nextjs:nodejs /app/public ./public
|
||||
ENV PORT=8083 HOSTNAME="0.0.0.0" NODE_ENV=production
|
||||
ENV NEXT_PUBLIC_SENTRY_DSN=https://...sentry.io/...
|
||||
USER nextjs
|
||||
EXPOSE 8083
|
||||
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
|
||||
CMD curl -f http://localhost:8083 || exit 1
|
||||
CMD ["node", "server.js"]
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- **Baked env vars.** `NEXT_PUBLIC_*` variables are set as `ENV` in the builder stage so Next inlines them into the static bundle at build time. To deploy to a different domain you must rebuild — there is no runtime override for `NEXT_PUBLIC_*`. See [[Environment Variables#how-env-is-loaded]].
|
||||
- **System packages.** `git python3 make g++ py3-pip` are needed by `node-gyp` for native modules (e.g. `sharp`, `@google-cloud/local-auth`).
|
||||
- **Standalone output.** `next.config.ts` sets `output: 'standalone'`, so the runner stage copies only `.next/standalone/` and `public/` — a self-contained tree with a built-in `server.js`. Final runtime image: ~250 MB.
|
||||
- **Non-root.** `nextjs` (uid 1001).
|
||||
- **`server.js`** is generated by Next.js — it embeds the necessary Node modules and starts the production server.
|
||||
|
||||
---
|
||||
|
||||
## 4. Frontend — `Dockerfile.dev`
|
||||
|
||||
Path: `/Users/mojtabaheidari/code/frontend/Dockerfile.dev`
|
||||
|
||||
```dockerfile
|
||||
FROM node:22-alpine
|
||||
RUN apk add --no-cache git python3 make g++ py3-pip
|
||||
RUN corepack enable
|
||||
WORKDIR /app
|
||||
COPY package.json yarn.lock* ./
|
||||
RUN yarn config set network-timeout 600000 && \
|
||||
yarn config set network-concurrency 1 && \
|
||||
yarn install --frozen-lockfile --network-timeout 600000
|
||||
COPY . .
|
||||
EXPOSE 3000
|
||||
CMD ["yarn", "dev:docker"]
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- Listens on **port 3000** in dev (matches the legacy convention).
|
||||
- `yarn dev:docker` is a variant of `dev` that binds 0.0.0.0 so the container is reachable from the host.
|
||||
- No multi-stage — speed > size.
|
||||
|
||||
Used for local development if you choose to run the frontend in Docker instead of via `yarn dev`. Most developers run frontend natively for HMR speed; backend in Docker for parity.
|
||||
|
||||
---
|
||||
|
||||
## 5. `docker-compose.dev.yml`
|
||||
|
||||
Path: `/Users/mojtabaheidari/code/backend/docker-compose.dev.yml`
|
||||
|
||||
```yaml
|
||||
name: nickapp-development
|
||||
|
||||
services:
|
||||
nickdev-backend:
|
||||
build: { context: ., dockerfile: Dockerfile.dev }
|
||||
container_name: nickdev-backend
|
||||
env_file: [.env.local]
|
||||
ports: ["5001:5001"]
|
||||
volumes:
|
||||
- ./src:/app/src
|
||||
- ./uploads:/app/uploads
|
||||
depends_on: [mongodb, redis]
|
||||
restart: unless-stopped
|
||||
networks: [nickapp-network]
|
||||
|
||||
mongodb:
|
||||
image: mongo:8.2
|
||||
container_name: nickdev-mongodb
|
||||
ports: ["27017:27017"]
|
||||
env_file: [.env.local]
|
||||
volumes: [mongodb_data:/data/db]
|
||||
restart: unless-stopped
|
||||
networks: [nickapp-network]
|
||||
|
||||
redis:
|
||||
image: redis:8-alpine
|
||||
container_name: nickdev-redis
|
||||
env_file: [.env.local]
|
||||
command: redis-server
|
||||
volumes: [redis_data:/data]
|
||||
restart: unless-stopped
|
||||
networks: [nickapp-network]
|
||||
|
||||
networks:
|
||||
nickapp-network: { driver: bridge }
|
||||
|
||||
volumes:
|
||||
mongodb_data:
|
||||
redis_data:
|
||||
```
|
||||
|
||||
Highlights:
|
||||
|
||||
- **No auth on Mongo/Redis in dev.** Mongo runs default; Redis runs plain `redis-server`.
|
||||
- **Source mounted.** `./src` is volume-mounted into the backend container so hot reload works.
|
||||
- **Uploads mounted.** `./uploads` on the host is bind-mounted to `/app/uploads` so files survive container restarts.
|
||||
- **Port mappings:** `5001` (backend) + `27017` (Mongo) exposed to host. Redis is **not** exposed by default.
|
||||
- **Network.** `nickapp-network` bridge — Mongo/Redis are reachable as `mongodb` / `redis` from the backend container.
|
||||
|
||||
---
|
||||
|
||||
## 6. `docker-compose.production.yml`
|
||||
|
||||
Path: `/Users/mojtabaheidari/code/backend/docker-compose.production.yml`
|
||||
|
||||
Five services. Reproducing only the most important bits — full file lives in the repo and is summarised in [[Deployment#compose-file]].
|
||||
|
||||
```yaml
|
||||
name: nickapp-production
|
||||
|
||||
services:
|
||||
nginx:
|
||||
image: nginx:alpine
|
||||
container_name: nickapp-nginx
|
||||
ports: ["8083:80"]
|
||||
volumes:
|
||||
- ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
|
||||
- ./nginx/logs:/var/log/nginx
|
||||
- ./uploads:/uploads
|
||||
depends_on: [nickapp-backend, nickapp-frontend]
|
||||
networks: [default]
|
||||
|
||||
nickapp-backend:
|
||||
build: { context: ., dockerfile: Dockerfile.prod }
|
||||
image: nickapp-backend:latest
|
||||
container_name: nickapp-backend
|
||||
platform: linux/amd64
|
||||
env_file: [.env]
|
||||
volumes: [./uploads:/app/uploads]
|
||||
depends_on: [mongodb, redis]
|
||||
networks: [default]
|
||||
healthcheck: { test: ["CMD","curl","-f","http://localhost:5001/health"], ... }
|
||||
labels: ["com.centurylinklabs.watchtower.enable=true"]
|
||||
|
||||
mongodb:
|
||||
image: mongo:8.2
|
||||
container_name: nickapp-mongodb
|
||||
env_file: [.env]
|
||||
volumes:
|
||||
- mongodb_data:/data/db
|
||||
- ./mongo-init:/docker-entrypoint-initdb.d
|
||||
healthcheck: { test: ["CMD","mongosh","--eval","db.adminCommand('ping')"], ... }
|
||||
|
||||
redis:
|
||||
image: redis:8-alpine
|
||||
container_name: nickapp-redis
|
||||
env_file: [.env]
|
||||
command: ["sh","-lc","redis-server --requirepass \"$${REDIS_PASSWORD}\""]
|
||||
volumes: [redis_data:/data]
|
||||
healthcheck: { test: ["CMD","redis-cli","-a","$${REDIS_PASSWORD}","ping"], ... }
|
||||
|
||||
nickapp-frontend:
|
||||
build: { context: ../frontend, dockerfile: Dockerfile }
|
||||
image: nickapp-frontend:latest
|
||||
container_name: nickapp-frontend
|
||||
platform: linux/amd64
|
||||
env_file: [.env]
|
||||
environment: [PORT=8083, NODE_ENV=production]
|
||||
expose: ["8083"]
|
||||
healthcheck: { test: ["CMD","curl","-f","http://localhost:8083/"], ... }
|
||||
labels: ["com.centurylinklabs.watchtower.enable=true"]
|
||||
|
||||
networks:
|
||||
default: { driver: bridge }
|
||||
|
||||
volumes:
|
||||
mongodb_data:
|
||||
redis_data:
|
||||
```
|
||||
|
||||
Key differences from dev:
|
||||
|
||||
- **Nginx** added as the public entry point.
|
||||
- **Backend and frontend** are labelled for **Watchtower** auto-updates.
|
||||
- **Mongo and Redis** are **not** Watchtower-managed — their major versions need manual planning + backup ([[Backup & Recovery]]).
|
||||
- **Redis password** is read from `.env` (escaped `$$` so docker compose doesn't expand it).
|
||||
- **Frontend build context** points at `../frontend` — the two repos must live as siblings on disk.
|
||||
- **No host port mapping** for backend/frontend — they are reached only via the nginx container.
|
||||
- **platform: linux/amd64** is pinned because production hosts are x86_64; developers on ARM machines must pass `--platform=linux/amd64` when they build locally for prod.
|
||||
|
||||
---
|
||||
|
||||
## 7. Volumes
|
||||
|
||||
| Volume | Mount point | Lifecycle | Notes |
|--------|-------------|-----------|-------|
| `mongodb_data` (named) | `/data/db` in `mongodb` | Persistent | The whole database. Back up via `mongodump`. |
| `redis_data` (named) | `/data` in `redis` | Persistent | RDB snapshots + AOF if configured. |
| `./uploads` (bind) | `/app/uploads` in backend, `/uploads` in nginx | Persistent on host | User-uploaded files. Critical: back up the directory. |
| `./nginx/nginx.conf` (bind, RO) | `/etc/nginx/nginx.conf` | Static | Reverse-proxy config. |
| `./nginx/logs` (bind) | `/var/log/nginx` | Append-only on host | Access + error logs. |
| `./mongo-init` (bind, RO) | `/docker-entrypoint-initdb.d` | One-time | JS files Mongo runs **only on a fresh datadir** to create initial users / indexes. |
|
||||
|
||||
Inspect named volumes:
|
||||
|
||||
```bash
|
||||
docker volume ls
|
||||
docker volume inspect nickapp-production_mongodb_data
|
||||
```
|
||||
|
||||
> [!warning] `docker compose down -v` deletes named volumes. Never run this in production unless you've backed up first.
|
||||
|
||||
---
|
||||
|
||||
## 8. Networks
|
||||
|
||||
- **Dev:** `nickapp-network` bridge. All three services join it; the backend reaches `mongodb` and `redis` by container name.
|
||||
- **Prod:** the default compose network (also a bridge), named `nickapp-production_default`. Same DNS-by-container-name semantics. Nginx talks to `nickapp-backend:5001` and `nickapp-frontend:8083` over this network.
|
||||
|
||||
Inspect:
|
||||
|
||||
```bash
|
||||
docker network ls
|
||||
docker network inspect nickapp-production_default
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 9. Image build & push from a developer machine
|
||||
|
||||
For a production-parity build locally (without going through CI):
|
||||
|
||||
```bash
|
||||
cd ~/code/backend
|
||||
docker build --platform=linux/amd64 -f Dockerfile.prod \
|
||||
-t git.manko.yoga/manawenuz/escrow-backend:test .
|
||||
|
||||
# Sanity-check size + run
|
||||
docker images git.manko.yoga/manawenuz/escrow-backend
|
||||
docker run --rm -p 5001:5001 --env-file .env.local \
|
||||
git.manko.yoga/manawenuz/escrow-backend:test
|
||||
```
|
||||
|
||||
For the official path (build + push to registry) use `./scripts/build-and-push.sh` — see [[Scripts#build-and-push-sh]] — or rely on [[CI-CD Pipeline]] to do it on every push.
|
||||
|
||||
---
|
||||
|
||||
## 10. Image cleanup
|
||||
|
||||
Builds accumulate. Periodically prune:
|
||||
|
||||
```bash
|
||||
docker system prune -a -f
|
||||
docker volume prune -f # ⚠ removes unused named volumes — check first
|
||||
docker builder prune -a -f # buildx cache
|
||||
|
||||
# scripted (backend)
|
||||
npm run docker:clean
|
||||
```
|
||||
|
||||
`docker:clean` runs `docker system prune -a -f && docker volume prune -f` — confirm you don't need anything before you run it.
|
||||
|
||||
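> [!warning] `docker volume prune` can delete `mongodb_data` and `redis_data` if their compose project is currently `down`: engines older than Docker 23 prune every unused volume, and newer ones do the same when `--all` is passed (by default they remove only anonymous volumes). Always run `docker compose up -d` first to keep the volumes "in use".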
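
A guard along these lines can refuse to prune while any of the project's volumes are unattached. It is a sketch, assuming the compose project name `nickapp-production` (so volumes are prefixed `nickapp-production_`); `at_risk_volumes` and `safe_volume_prune` are hypothetical names.

```bash
#!/usr/bin/env bash
# List dangling (unattached) volumes that belong to the given compose project.
at_risk_volumes() {
  local project="${1:-nickapp-production}"
  docker volume ls -q --filter dangling=true | grep "^${project}_" || true
}

# Prune volumes only when none of the project's volumes would be affected.
safe_volume_prune() {
  local at_risk
  at_risk=$(at_risk_volumes "$@")
  if [ -n "$at_risk" ]; then
    echo "Refusing to prune; unattached project volumes:" >&2
    echo "$at_risk" >&2
    return 1
  fi
  docker volume prune -f
}
```

Usage: call `safe_volume_prune` in place of `docker volume prune -f`.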
|
||||
393
08 - Operations/Incident Response.md
Normal file
@@ -0,0 +1,393 @@
|
||||
---
|
||||
title: Incident Response
|
||||
tags: [operations]
|
||||
---
|
||||
|
||||
# Incident Response
|
||||
|
||||
Runbooks for the most likely production incidents, plus communication templates and a post-mortem template. Use this page during an active incident — keep [[Monitoring]], [[Database Operations]], and [[Backup & Recovery]] open in adjacent tabs.
|
||||
|
||||
---
|
||||
|
||||
## 1. Severity matrix
|
||||
|
||||
| Sev | Meaning | Response time | Examples |
|-----|---------|---------------|----------|
| **Sev 1** | Site fully down or unable to process payments | 15 min | Backend container in crashloop; Mongo unreachable; SHKeeper API permanently failing |
| **Sev 2** | Major feature broken for a large share of users | 1 hour | Email sending broken; Redis disk full; chat undelivered |
| **Sev 3** | Minor / cosmetic issue, isolated user reports | next business day | Single failed webhook; one user can't upload PDF |
| **Sev 4** | No user impact, hygiene item | backlog | Backup older than 24h; disk > 80%; missed deploy |
|
||||
|
||||
Escalate by one severity level if more than 10 user reports arrive within 5 minutes.
|
||||
|
||||
---
|
||||
|
||||
## 2. First 5 minutes — always do this
|
||||
|
||||
1. **Acknowledge.** Reply in the on-call channel that you are taking it.
|
||||
2. **Open Sentry.** Filter to the last 15 minutes for new issue spikes.
|
||||
3. **Open the host shell.** Have an `ssh prod` session open and ready.
|
||||
4. **Health endpoint.** `curl -fsS https://amn.gg/api/health` → does it respond?
|
||||
5. **Container status.** `docker ps --format "table {{.Names}}\t{{.Status}}"`.
|
||||
6. **Recent deploy?** Was the `:latest` tag bumped in the last 30 min? If yes, **roll back first** (see [[Deployment#roll-back]]) and investigate after stability is restored.
|
||||
|
||||
If you can't form a hypothesis in 5 minutes, **roll back to the previous image tag** anyway. Stability before forensics.
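
Steps 4 and 5 above can be combined into one command for the on-call shell. This is a sketch; `first_look` is a hypothetical name, and the default URL is the health endpoint from step 4.

```bash
#!/usr/bin/env bash
# Print a one-line health verdict, then the container status table.
first_look() {
  local url="${1:-https://amn.gg/api/health}"
  if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
    echo "health: OK"
  else
    echo "health: FAILING"
  fi
  docker ps --format 'table {{.Names}}\t{{.Status}}'
}
```

Usage: run `first_look` on the production host; pass a different URL as the first argument to check another environment.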
|
||||
|
||||
---
|
||||
|
||||
## 3. Common incidents

### 3.1 Backend down (crashloop, no response on /health)

**Symptoms.** `https://amn.gg/api/health` times out or returns 5xx; `nickapp-backend` shows `Restarting` in `docker ps`.

**Runbook.**

```bash
# 1. Inspect last lines
docker logs --tail=200 nickapp-backend

# 2. Common causes:
#    - Missing env var (`process.env.X!` throws on first read)
#    - MongoDB unreachable (see 3.2)
#    - Port conflict
#    - Out of memory (look for OOMKilled)
docker inspect nickapp-backend | jq '.[0].State'

# 3. If OOM: increase memory limit in compose, restart
#    If missing env: add to /opt/backend/.env, then `docker compose up -d`

# 4. If recent deploy: roll back
sed -i 's|:latest|:<previous-version>|' docker-compose.production.yml
# Pass -f so compose reads the file that was just edited
docker compose -f docker-compose.production.yml up -d nickapp-backend
# Stop Watchtower so it doesn't re-pull (this pauses auto-updates for every container)
docker stop watchtower
```

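
The OOM and crashloop checks in step 2 can also be read from `docker inspect` without `jq`. The following is a minimal sketch, not part of the existing tooling: the helper name `classify_state` and its output labels are hypothetical, and it assumes that exit code 137 (128 + SIGKILL) without the OOM flag means the container was killed from outside.

```bash
# classify_state OOM_KILLED EXIT_CODE RESTARTING -> one-word diagnosis
# Hypothetical helper: maps three .State fields from `docker inspect` to a label.
classify_state() {
  local oom="$1" code="$2" restarting="$3"
  if [ "$oom" = "true" ]; then
    echo "oom"          # kernel OOM killer: raise the memory limit
  elif [ "$restarting" = "true" ]; then
    echo "crashloop"    # keeps exiting and restarting: read the logs
  elif [ "$code" = "137" ]; then
    echo "killed"       # SIGKILL without the OOM flag
  elif [ "$code" != "0" ]; then
    echo "crashed"
  else
    echo "clean"
  fi
}

# Usage on the host (not run here):
#   classify_state $(docker inspect --format \
#     '{{.State.OOMKilled}} {{.State.ExitCode}} {{.State.Restarting}}' nickapp-backend)
```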

**Communication.** Post in #incidents using the template in §4.

---

### 3.2 MongoDB unreachable

**Symptoms.** Backend logs show `MongoNetworkError`, `MongooseServerSelectionError`, or `Could not connect to server`.

**Runbook.**

```bash
# 1. Container alive?
docker ps -a --filter "name=mongodb"

# 2. If exited:
docker logs --tail=200 nickapp-mongodb
#    Common: corrupt journal, disk full, OOM

# 3. Disk check
df -h /var/lib/docker

# 4. If disk full:
#    - reclaim space: `docker system prune` removes stopped containers, unused
#      networks, dangling images and build cache. It does NOT truncate logs of
#      running containers, and it would delete an exited nickapp-mongodb
#      container (volumes are kept unless --volumes is passed).
#    - rotate logs if needed
#    - extend volume

# 5. Restart
docker compose -f docker-compose.production.yml up -d mongodb

# 6. Verify
docker exec nickapp-mongodb mongosh --eval "db.adminCommand('ping')"

# 7. If data is corrupt, restore from latest dump (see Backup & Recovery)
```

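
The disk check in step 3 can be turned into a yes/no answer. This is a minimal sketch under stated assumptions: the helper name `disk_level` is hypothetical, and the thresholds are taken from the host metrics on the Monitoring page (healthy below 80 %, alert above 90 %).

```bash
# disk_level PERCENT -> ok | warn | alert
# Hypothetical helper; accepts "87" or "87%".
disk_level() {
  local pct="${1%\%}"   # strip a trailing percent sign if present
  if [ "$pct" -gt 90 ]; then
    echo "alert"
  elif [ "$pct" -ge 80 ]; then
    echo "warn"
  else
    echo "ok"
  fi
}

# Usage on the host (not run here); -P gives stable POSIX columns:
#   disk_level "$(df -P /var/lib/docker | awk 'NR==2 {print $5}')"
```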

> [!warning] If Mongo is corrupted and you must restore, **stop the backend container first** to prevent partial writes during restore. See [[Database Operations#restore]].

---

### 3.3 Redis unreachable

**Symptoms.** Logs show `ECONNREFUSED redis:6379` or `NOAUTH Authentication required`. Rate limits stop working, refresh tokens can't be revoked, but most read flows still work.

**Runbook.**

```bash
# 1. Container alive?
docker ps -a --filter "name=redis"

# 2. If down:
docker logs --tail=200 nickapp-redis
docker compose -f docker-compose.production.yml up -d redis

# 3. Auth issue? ($REDIS_PASSWORD must be set in your host shell for this command)
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" PING
#    Should return PONG

# 4. If `$REDIS_PASSWORD` mismatch between .env and command:
nano /opt/backend/.env       # confirm REDIS_PASSWORD set
docker compose -f docker-compose.production.yml up -d redis backend

# 5. If memory full + noeviction policy → rejecting writes:
#    (CONFIG SET is lost on restart unless the Redis config is updated too)
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" CONFIG SET maxmemory-policy allkeys-lru
```

The app gracefully degrades when Redis is unreachable for short windows. Don't panic, but fix within an hour.

---

### 3.4 SHKeeper API down (payments blocked)

**Symptoms.** Backend logs show repeated `SHKeeper request failed: ECONNREFUSED` or non-2xx responses from `$SHKEEPER_API_URL`. Buyers see "Payment unavailable" in checkout. Sev 1: money is involved.

**Runbook.**

```bash
# 1. Confirm SHKeeper itself is reachable
curl -fsS -H "X-Shkeeper-Api-Key: $SHKEEPER_API_KEY" \
  "$SHKEEPER_API_URL/api/v1/healthcheck"

# 2. If 5xx from SHKeeper → it's their side
#    - Check their status page / contact provider
#    - Toggle a banner in the frontend warning buyers
#    - Consider switching SHKEEPER_FORCE_PAYOUT_DEMO=true so QA still works
#      (do NOT do this for real customer money)

# 3. If our network can't reach it:
#    - compare curl from the host with curl from inside the container
#      (sh -c makes the container's own $SHKEEPER_API_URL expand, not the host's)
docker exec nickapp-backend sh -c 'curl -v "$SHKEEPER_API_URL"'
#    - DNS / firewall changes?

# 4. While blocked, monitor stuck payments (pending for more than 30 min)
docker exec nickapp-mongodb mongosh marketplace --quiet --eval \
  "db.payments.countDocuments({status:'pending', createdAt:{\$lt: new Date(Date.now() - 30*60*1000)}})"

# 5. Once SHKeeper is back, the app retries automatically. Verify the
#    backlog drains. If a payment is stuck > 24h, manually verify against
#    SHKeeper and use fix-transaction-hashes.js if needed.
```

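
Steps 2 and 3 branch on what the healthcheck returns. As a hedged sketch, the branch can be made explicit from the HTTP status alone; the helper name `classify_shkeeper` and its labels are hypothetical, and it relies on `curl -w '%{http_code}'` printing `000` when no HTTP response arrived.

```bash
# classify_shkeeper HTTP_CODE -> ok | provider | network | client
# Hypothetical helper for deciding which runbook branch applies.
classify_shkeeper() {
  case "$1" in
    000) echo "network" ;;   # no HTTP response at all: step 3 (our network, DNS, firewall)
    2??) echo "ok" ;;
    5??) echo "provider" ;;  # step 2: their side
    *)   echo "client" ;;    # 3xx/4xx: check the URL and API key on our side
  esac
}

# Usage on the host (not run here):
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "X-Shkeeper-Api-Key: $SHKEEPER_API_KEY" "$SHKEEPER_API_URL/api/v1/healthcheck")
#   classify_shkeeper "$code"
```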

**Always communicate.** Even short payment outages erode trust, so post a status update.

---

### 3.5 Email delivery failure

**Symptoms.** Logs show `SMTPError` from `nodemailer`. Password resets, welcome emails, dispute notifications fail. Sev 2.

**Runbook.**

```bash
# 1. Test SMTP credentials from the container
docker exec nickapp-backend node -e "
  const nm = require('nodemailer');
  nm.createTransport({
    host: process.env.SMTP_HOST,
    port: Number(process.env.SMTP_PORT),
    secure: process.env.SMTP_SECURE === 'true',
    auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
  }).verify().then(console.log).catch(console.error);
"

# 2. If auth failed → password rotated by provider, update SMTP_PASS in .env
# 3. If connection timed out → provider rate-limit or blocked outbound port; switch provider/sender
# 4. If specific domains bounce → check SPF / DKIM / DMARC records for amn.gg
```

Users can still operate the app without email; queue critical emails for retry once SMTP is restored.

---

### 3.6 WebSocket disconnect storm

**Symptoms.** Backend logs flood with `🔌 User connected/disconnected` cycling; clients spinning on chat / notification badges. Sev 2.

**Runbook.**

```bash
# 1. Confirm symptoms (2>&1 so lines written to stderr are counted too)
docker logs --tail=500 nickapp-backend 2>&1 | grep -c "🔌"

# 2. Check Nginx access log for socket.io polling spam
tail -f /opt/backend/nginx/logs/access.log | grep socket.io

# 3. Common causes:
#    - Nginx not configured for WebSocket upgrade (returns 502 → client falls back to polling → reconnect loop)
#    - Client clock skew breaking JWT validation on every reconnect
#    - Redis adapter mis-configured (if scaled horizontally, which is not the case today)

# 4. Quick mitigation: increase Nginx proxy_read_timeout
#    Permanent: ensure nginx.conf has:
#      proxy_http_version 1.1;
#      proxy_set_header Upgrade $http_upgrade;
#      proxy_set_header Connection "upgrade";

# 5. Restart nginx + backend
docker compose -f docker-compose.production.yml restart nginx nickapp-backend
```

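
To decide whether the reconnect cycling is actually a storm, the raw count from step 1 can be converted into a rate. A minimal sketch, assuming the log line contains the text `User disconnected` and using the thresholds from the Monitoring page (healthy below 1/s, alert above 10/s sustained); both helper names are hypothetical.

```bash
# disconnects_per_sec WINDOW_SECONDS  (log lines on stdin) -> whole disconnects per second
disconnects_per_sec() {
  local window="$1" count
  # grep -c prints 0 but exits non-zero when nothing matches, hence `|| true`
  count=$(grep -c "User disconnected" || true)
  echo $(( count / window ))
}

# storm_level RATE -> ok | watch | alert
storm_level() {
  if [ "$1" -gt 10 ]; then
    echo "alert"
  elif [ "$1" -ge 1 ]; then
    echo "watch"
  else
    echo "ok"
  fi
}

# Usage on the host (not run here):
#   rate=$(docker logs --since 60s nickapp-backend 2>&1 | disconnects_per_sec 60)
#   storm_level "$rate"
```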
---

### 3.7 Suspicious activity / abuse

**Symptoms.** Sentry alerts on unusual error volume from one IP; rate-limit logs spiking; reports of brute-force on `/api/auth/login`.

**Runbook.**

```bash
# 1. Identify the offender
tail -n 10000 /opt/backend/nginx/logs/access.log \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head

# 2. Block at the edge (Cloudflare / host firewall)
#    Or use `ufw deny from <ip>` on the host

# 3. Confirm rate limits in app
grep "RATE_LIMIT" /opt/backend/.env
#    Defaults: 100 req / 15 min per IP. Tighten if abuse continues.

# 4. If the abuse targets a specific user account:
#    (\$set is escaped so the host shell does not expand it inside the double quotes)
docker exec -it nickapp-backend node -e "
  // disable the user via mongoose
  require('./dist/infrastructure/database/connection').connectDatabase()
    .then(() => require('./dist/models').User.updateOne({email:'attacker@x.com'}, {\$set:{disabled:true}}))
    .then(console.log)
    .catch(console.error)
    .finally(() => process.exit())
"

# 5. Preserve evidence: copy access logs to /var/incidents/<date>/
```

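
Step 1 counts every request per IP. For a suspected brute-force on `/api/auth/login`, it can help to count only login requests. The sketch below assumes the Nginx combined log format, where field 1 is the client IP and field 7 is the request path; the helper name `top_login_ips` is hypothetical.

```bash
# top_login_ips [N]  (access log on stdin) -> "COUNT IP" lines, busiest first
# Assumes combined log format: $1 = client IP, $7 = request path.
top_login_ips() {
  local n="${1:-10}"
  awk '$7 ~ /^\/api\/auth\/login/ {print $1}' \
    | sort | uniq -c | sort -rn | head -n "$n" \
    | awk '{print $1, $2}'   # normalise the padding that uniq -c adds
}

# Usage on the host (not run here):
#   tail -n 10000 /opt/backend/nginx/logs/access.log | top_login_ips 5
```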

If user data may have leaked, treat as Sev 1 and follow your data-breach disclosure process.

---

## 4. Communication templates

### Initial incident notification

```
🚨 [SEV-X] <one-line summary>
Started: <time UTC>
Impact: <which users / features>
Status: investigating
On-call: <@you>
Updates: every 15 minutes in this thread
```

### Mid-incident update

```
[SEV-X] Update <n>
Time: <UTC>
Status: <investigating / mitigating / monitoring>
What we know: <facts>
What we're trying: <action>
Next update: <time>
```

### Resolution

```
✅ [SEV-X] Resolved
Started: <UTC>
Ended: <UTC>
Duration: <minutes>
Impact: <users / features / requests affected>
Root cause: <one sentence>
Permanent fix: <PR / ticket>
Postmortem: <doc link, by <date>>
```

### Customer-facing status

```
We're investigating an issue affecting <feature> that started at <time>.
We'll post an update by <time + 15 min>.
```

Avoid speculation in customer-facing copy. Say "investigating", "applying fix", "monitoring", "resolved", and nothing else until you actually know.

---

## 5. Post-mortem template

Use within 5 business days of any Sev 1 or Sev 2.

```markdown
---
title: Post-mortem - <short title>
date: <YYYY-MM-DD>
severity: SEV-X
duration: <minutes>
authors: [<names>]
tags: [postmortem]
---

## Summary
One paragraph: what broke, who was affected, how long, how it was fixed.

## Timeline (UTC)
- HH:MM - first signal (alert, user report)
- HH:MM - on-call ack
- HH:MM - hypothesis: ...
- HH:MM - mitigation deployed
- HH:MM - verified resolved
- HH:MM - incident closed

## Impact
- Users affected: <count or %>
- Features affected: <list>
- Money affected: <if payments>
- Data loss: <yes/no, describe>

## Root cause
Honest, blameless. Distinguish trigger vs underlying cause.

## What went well
- ...

## What went poorly
- ...

## Where we got lucky
- ...

## Action items
| # | Item | Owner | Due | Ticket |
|---|------|-------|-----|--------|
| 1 | Add /health probe for MongoDB | @x | 2026-06-01 | OPS-123 |
| 2 | Tighten rate limit on /auth/login | @y | 2026-05-30 | OPS-124 |

## Detection improvements
What new alert / dashboard would have caught this earlier?

## Process improvements
What runbook / docs need updating? Update [[Incident Response]] right now.
```

Store postmortems alongside this vault. Suggested path, relative to the vault root: `08 - Operations/postmortems/YYYY-MM-DD-<slug>.md`.

---

## 6. Escalation contacts

(Fill in for your team; placeholder structure below.)

| Role | Primary | Backup | Channel |
|------|---------|--------|---------|
| On-call engineer | <name> | <name> | #incidents |
| Payments lead | <name> | <name> | DM |
| Infrastructure | <name> | <name> | DM |
| Product / customer comms | <name> | <name> | #customer-comms |
| SHKeeper provider contact | <email> | n/a | email |
| SMTP provider | <email> | n/a | email |

---

## 7. After every incident

- [ ] Updated this page with any new gotchas?
- [ ] Updated [[Monitoring]] with new metrics/alerts to add?
- [ ] Updated [[Backup & Recovery]] if backup gaps were exposed?
- [ ] Action items tracked?
- [ ] Customer comms sent (if user-impacting)?
- [ ] Post-mortem published?

Cross-links: [[Deployment]] for rollback steps, [[Database Operations]] for DB diagnostics, [[Backup & Recovery]] for restore procedures, [[Monitoring]] for metrics to watch.
253
08 - Operations/Monitoring.md
Normal file
@@ -0,0 +1,253 @@
---
title: Monitoring
tags: [operations]
---

# Monitoring

What's instrumented today and what to watch. Today's stack is intentionally lean: health endpoints, Docker healthchecks, Sentry, and access logs. Bigger metric pipelines (Prometheus, Grafana, OpenSearch) are a future addition.

---

## 1. Health endpoint

Path: `GET /health` (backend, port `5001`).

Defined in `backend/src/app.ts`:

```ts
app.get("/health", (req, res) => {
  res.json({
    success: true,
    message: "Marketplace Backend API is running",
    timestamp: new Date().toISOString(),
    environment: config.nodeEnv,
    version: packageJson.version,
  });
});
```

Returns `200` with a JSON envelope as soon as Express is up. Does **not** currently probe MongoDB or Redis; they are checked via separate Docker healthchecks. If you want deep health, extend the endpoint to ping both data stores and return `503` on failure.

Public URL behind Nginx: `https://amn.gg/api/health`.

---

## 2. Docker healthchecks

Each long-lived container has a `HEALTHCHECK` baked in or declared in compose.

| Container | Probe | Interval | Failure threshold |
|-----------|-------|----------|-------------------|
| `nickapp-backend` | `node healthcheck.js` (HTTP GET `/health`) | 30s | 3 retries |
| `nickapp-frontend` | `curl -f http://localhost:8083/` | 30s | 3 retries |
| `nickapp-mongodb` | `mongosh --eval "db.adminCommand('ping')"` | 30s | 3 retries |
| `nickapp-redis` | `redis-cli -a $REDIS_PASSWORD ping` | 30s | 3 retries |

`healthcheck.js` (backend) is a tiny Node script that does a local HTTP GET to `/health` and exits 0 / 1.

Inspect health:

```bash
docker ps --format "table {{.Names}}\t{{.Status}}"

# Detailed
docker inspect --format='{{json .State.Health}}' nickapp-backend | jq
```

If a container is `unhealthy`, Watchtower will **not** roll it (it expects the new container to pass healthcheck). Investigate with `docker logs <container>`.

---

## 3. Sentry: error tracking

### Frontend

`@sentry/nextjs ^10.22.0` is wired in via three config files at the repo root:

- `sentry.client.config.ts`: browser SDK (with Session Replay enabled at 10% session / 100% error rate).
- `sentry.server.config.ts`: server-rendered components (no Replay).
- `sentry.edge.config.ts`: edge runtime (not currently used heavily).

Common settings:

```ts
Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  environment: process.env.NODE_ENV || 'development',
  enabled: process.env.NODE_ENV === 'production',
  ignoreErrors: ['ResizeObserver loop limit exceeded', 'ChunkLoadError', ...],
});
```

Errors from `localhost` are filtered out, so only prod errors land in the dashboard.

### Backend

`@sentry/node ^10.22.0` + `@sentry/profiling-node ^10.22.0` are initialised **first** in `src/app.ts` (before any other import) via `src/config/sentry.ts`. DSN comes from `SENTRY_DSN` env var (see [[Environment Variables#sentry]]).

What's captured:

- Uncaught exceptions in route handlers
- Promise rejections inside `asyncHandler`-wrapped routes
- Manually-captured errors via `Sentry.captureException(err)`
- Performance traces (10% sample rate in prod)
- Profiling samples via `@sentry/profiling-node`

### Source maps

Frontend uploads source maps to Sentry at build time when `SENTRY_AUTH_TOKEN`, `SENTRY_ORG`, and `SENTRY_PROJECT` are set in the CI env. Without them the build still succeeds but Sentry stack traces will show minified frames.

### Alerts

Configure in the Sentry dashboard (Issues → Alerts). Common alerts:

- Any new issue in production → Slack
- Error frequency > 50/minute → page on-call
- Performance regression on `/api/payments/*` traces → email

---

## 4. Logs

### Backend application logs

Routed through `src/utils/logger.ts`, currently a thin `console.log` wrapper with emoji prefixes. Output goes to stdout, captured by Docker:

```bash
# Live tail
docker compose -f docker-compose.production.yml logs -f --tail=200 nickapp-backend

# Search for a request
docker logs nickapp-backend 2>&1 | grep "POST /api/payments"

# Limit to a recent time window
docker logs --since 1h nickapp-backend
```

Notable log lines to look for:

| Prefix | Meaning |
|--------|---------|
| `✅ Connected to MongoDB` | DB connection established |
| `🚀 Server running on port 5001` | App fully started |
| `🔌 User connected: <id>` | Socket.IO connection |
| `📥` | Inbound HTTP request log |
| `💳 SHKeeper` | SHKeeper webhook / API call |
| `🔐 Webhook verification` | Webhook signature check result |
| `❌ Error` | Manual error log (also captured by Sentry) |

### Nginx access + error logs

Bind-mounted to `./nginx/logs/` on the host:

```bash
tail -f /opt/backend/nginx/logs/access.log
tail -f /opt/backend/nginx/logs/error.log
```

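
The access log is also the simplest source for the 5xx rate tracked under key metrics (healthy below 0.5 %, alert above 2 % over 5 minutes). A minimal sketch, assuming the Nginx combined log format where field 9 is the response status; the helper name `rate_5xx` is hypothetical.

```bash
# rate_5xx  (access log on stdin) -> percentage of 5xx responses, two decimals
# Assumes combined log format: $9 = HTTP status code.
rate_5xx() {
  awk '$9 ~ /^5[0-9][0-9]$/ { bad++ }
       { total++ }
       END {
         if (total == 0) print "0.00"
         else printf "%.2f\n", 100 * bad / total
       }'
}

# Usage on the host (not run here); pick a line count that roughly covers 5 minutes:
#   tail -n 5000 /opt/backend/nginx/logs/access.log | rate_5xx
```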

Rotate these via host `logrotate` to avoid disk fill.

### Frontend logs

Next.js logs go to the container stdout:

```bash
docker logs -f nickapp-frontend
```

Browser-side logs that need attention go through Sentry (above); `src/utils/logger.ts` in the frontend forwards them as Sentry breadcrumbs.

---

## 5. Key metrics to watch

Today these are read manually from logs / Sentry. As Prometheus is added, encode them as alerting rules.

### Application

| Metric | Where to check | Healthy | Alert |
|--------|---------------|---------|-------|
| 5xx rate | Sentry, Nginx access.log | < 0.5 % | > 2 % over 5 min |
| `/health` p95 latency | curl + timer | < 100 ms | > 1 s |
| Login success rate | Sentry custom event | > 95 % | < 90 % |
| Socket disconnect storm | `🔌 User disconnected` log frequency | < 1/s sustained | > 10/s sustained |
| OpenAI 429s | Backend log `OpenAI ... 429` | 0 | any |

### Payments

| Metric | Where | Healthy | Alert |
|--------|-------|---------|-------|
| Payment success rate | `db.payments.aggregate([{$group:{_id:"$status",n:{$sum:1}}}])` | > 95 % completed of 24h-old payments | < 90 % |
| Webhook signature failures | log `Webhook verification failed` | 0 | > 0 |
| SHKeeper API errors (5xx) | log + Sentry | 0 | > 5/min sustained |
| Payouts stuck in `pending` > 30 min | `db.payments.find({type:'payout',status:'pending',createdAt:{$lt:new Date(Date.now() - 30*60*1000)}})` | empty | non-empty |
| Missing `transactionHash` after `completed` | the same query that drives `fix-transaction-hashes.js` | empty | non-empty |

### MongoDB

```js
db.serverStatus().connections       // active connections; alert if >1000
db.serverStatus().opcounters        // ops/sec
db.serverStatus().wiredTiger.cache  // raw cache counters; derive the hit ratio from them, aim > 95 %
db.currentOp({ secs_running: { $gte: 5 } })  // long-running queries
```

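
`wiredTiger.cache` reports raw counters rather than a ratio. One common way to derive a hit ratio is `1 - (pages read into cache / pages requested from the cache)`. The sketch below only does that arithmetic; the helper name `cache_hit_ratio` is hypothetical, and treating this quotient as "the" hit ratio is an assumption, not something the backend computes today.

```bash
# cache_hit_ratio PAGES_READ_INTO_CACHE PAGES_REQUESTED -> percent, one decimal
# Feed it the counters "pages read into cache" and "pages requested from the cache"
# taken from db.serverStatus().wiredTiger.cache.
cache_hit_ratio() {
  awk -v miss="$1" -v req="$2" 'BEGIN {
    if (req == 0) { print "n/a"; exit }   # no requests yet, ratio undefined
    printf "%.1f\n", 100 * (1 - miss / req)
  }'
}

# Example (values are made up): 50 pages read from disk out of 1000 requested
#   cache_hit_ratio 50 1000
```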
### Redis

```bash
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats
# Watch: instantaneous_ops_per_sec, keyspace_hits/misses, rejected_connections, evicted_keys
```

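
`keyspace_hits` and `keyspace_misses` are easier to read as a single ratio. A minimal sketch, assuming the `INFO stats` output is piped in as-is (Redis terminates lines with CRLF, so carriage returns are stripped first); the helper name `redis_hit_ratio` is hypothetical.

```bash
# redis_hit_ratio  (INFO stats output on stdin) -> percent of lookups served, one decimal
redis_hit_ratio() {
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      if (hits + misses == 0) print "n/a"   # no lookups recorded yet
      else printf "%.1f\n", 100 * hits / (hits + misses)
    }'
}

# Usage on the host (not run here):
#   docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO stats | redis_hit_ratio
```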

Alert thresholds: `rejected_connections > 0`, `evicted_keys` rising while you don't expect cache pressure, p99 latency > 5 ms (latency is not part of `INFO stats`; sample it with `redis-cli --latency`).

### Host

| Metric | Tool | Healthy | Alert |
|--------|------|---------|-------|
| Disk usage on `/var/lib/docker` | `df -h` | < 80 % | > 90 % |
| `/opt/backend/uploads` size | `du -sh` | watch trend | bursty growth (>5 GB/day) |
| Memory pressure | `free -h`, `docker stats` | < 80 % | swap actively used |
| Open file descriptors | `cat /proc/<pid>/limits` | well under hard limit | nearing limit |

---

## 6. Smoke tests after a deploy

Drop these in a runbook for the on-call:

```bash
# 1. API health
curl -fsS https://amn.gg/api/health | jq '.success,.version,.environment'

# 2. Login
curl -fsS -X POST https://amn.gg/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@marketplace.com","password":"<prod-admin-pwd>"}' \
  | jq '.success,.data.user.email'

# 3. Frontend HTML loads
curl -fsS https://amn.gg/ -I | head -1   # expect 200

# 4. Socket.IO handshake
curl -fsS "https://amn.gg/socket.io/?EIO=4&transport=polling" -I | head -1

# 5. Containers healthy
docker ps --filter "name=nickapp-" --format "table {{.Names}}\t{{.Status}}"
```

Any non-OK → see [[Incident Response]].

---

## 7. Future work

- **Prometheus + Grafana** with Node exporter + Mongo exporter + Redis exporter, for proper time-series.
- **OpenTelemetry** spans from backend → Sentry / Jaeger.
- **Healthcheck endpoint** that probes Mongo + Redis and returns `503` when degraded.
- **PagerDuty / OpsGenie** wiring from Sentry alerts.
- **Synthetic checks** (Pingdom / UptimeRobot) hitting `/health` from multiple regions.

For now, Sentry + Docker healthchecks + manual log checks cover the basics. See [[Incident Response]] for what to do when something fires.