---
title: Incident Response
tags: [operations]
---

# Incident Response

Runbooks for the most likely production incidents, plus communication templates and a post-mortem template. Use this page during an active incident, and keep [[Monitoring]], [[Database Operations]], and [[Backup & Recovery]] open in adjacent tabs.

---

## 1. Severity matrix

| Sev | Meaning | Response time | Examples |
|-----|---------|---------------|----------|
| **Sev 1** | Site fully down or unable to process payments | 15 min | Backend container in crashloop; Mongo unreachable; SHKeeper API permanently failing |
| **Sev 2** | Major feature broken for a large share of users | 1 hour | Email sending broken; Redis disk full; chat undelivered |
| **Sev 3** | Minor / cosmetic issue, isolated user reports | next business day | Single failed webhook; one user can't upload PDF |
| **Sev 4** | No user impact, hygiene item | backlog | Backup older than 24h; disk > 80%; missed deploy |

Escalate one sev higher if more than 10 reports inside 5 minutes.

---

## 2. First 5 minutes: always do this

1. **Acknowledge.** Reply in the on-call channel that you are taking it.
2. **Open Sentry.** Filter to the last 15 minutes for new issue spikes.
3. **Open the host shell.** `ssh prod` ready.
4. **Health endpoint.** `curl -fsS https://amn.gg/api/health` → does it respond?
5. **Container status.** `docker ps --format "table {{.Names}}\t{{.Status}}"`.
6. **Recent deploy?** Was the `:latest` tag bumped in the last 30 min? If yes, **roll back first** (see [[Deployment#roll-back]]) and investigate after stability is restored.

If you can't form a hypothesis in 5 minutes, **roll back to the previous image tag** anyway. Stability before forensics.

---

## 3. Common incidents

### 3.1 Backend down (crashloop, no response on /health)

**Symptoms.** `https://amn.gg/api/health` times out or 5xx; `nickapp-backend` shows `Restarting` in `docker ps`.

**Runbook.**

```bash
# 1.
# Inspect the last log lines
docker logs --tail=200 nickapp-backend

# 2. Common causes:
#    - Missing env var (`process.env.X!` is unchecked at runtime, so the crash surfaces on first use)
#    - MongoDB unreachable (see 3.2)
#    - Port conflict
#    - Out of memory (look for OOMKilled)
docker inspect nickapp-backend | jq '.[0].State'

# 3. If OOM: increase memory limit in compose, restart
#    If missing env: add to /opt/backend/.env, then `docker compose up -d`

# 4. If recent deploy: roll back
sed -i 's|:latest|:|' docker-compose.production.yml
docker compose up -d nickapp-backend

# Stop Watchtower so it doesn't re-pull nickapp-backend
# (this pauses auto-updates for every container it manages until it is started again)
docker stop watchtower
```

**Communication.** Post in #incidents using the template in §4.

---

### 3.2 MongoDB unreachable

**Symptoms.** Backend logs show `MongoNetworkError`, `MongooseServerSelectionError`, or `Could not connect to server`.

**Runbook.**

```bash
# 1. Container alive?
docker ps -a --filter "name=mongodb"

# 2. If exited:
docker logs --tail=200 nickapp-mongodb
# Common: corrupt journal, disk full, OOM

# 3. Disk check
df -h /var/lib/docker

# 4. If disk full:
#    - free space: docker system prune
#      (removes stopped containers, unused networks, dangling images, and build cache;
#       it does not truncate container logs, and an exited mongodb container is removed too)
#    - rotate logs if needed
#    - extend volume

# 5. Restart
docker compose -f docker-compose.production.yml up -d mongodb

# 6. Verify
docker exec nickapp-mongodb mongosh --eval "db.adminCommand('ping')"

# 7. If data is corrupt, restore from latest dump (see Backup & Recovery)
```

> [!warning] If Mongo is corrupted and you must restore, **stop the backend container first** to prevent partial writes during restore. See [[Database Operations#restore]].

---

### 3.3 Redis unreachable

**Symptoms.** Logs show `ECONNREFUSED redis:6379` or `NOAUTH Authentication required`. Rate limits stop working and refresh tokens can't be revoked, but most read flows still work.

**Runbook.**

```bash
# 1. Container alive?
docker ps -a --filter "name=redis"

# 2. If down:
docker logs --tail=200 nickapp-redis
docker compose -f docker-compose.production.yml up -d redis

# 3. Auth issue?
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" PING
# Should return PONG

# 4. If `$REDIS_PASSWORD` mismatch between .env and command:
nano /opt/backend/.env   # confirm REDIS_PASSWORD set
docker compose up -d redis backend

# 5. If memory full + noeviction policy → rejecting writes:
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" CONFIG SET maxmemory-policy allkeys-lru
```

The app degrades gracefully when Redis is unreachable for short windows, so don't panic, but fix it within an hour.

---

### 3.4 SHKeeper API down (payments blocked)

**Symptoms.** Backend logs show repeated `SHKeeper request failed: ECONNREFUSED` or non-2xx responses from `$SHKEEPER_API_URL`. Buyers see "Payment unavailable" in checkout. Sev 1: money is involved.

**Runbook.**

```bash
# 1. Confirm SHKeeper itself is reachable
curl -fsS -H "X-Shkeeper-Api-Key: $SHKEEPER_API_KEY" \
  "$SHKEEPER_API_URL/api/v1/healthcheck"

# 2. If 5xx from SHKeeper → it's their side
#    - Check their status page / contact provider
#    - Toggle a banner in the frontend warning buyers
#    - Consider switching SHKEEPER_FORCE_PAYOUT_DEMO=true so QA still works
#      (do NOT do this for real customer money)

# 3. If our network can't reach it:
#    - compare curl from the host vs from inside the container
docker exec nickapp-backend curl -v "$SHKEEPER_API_URL"
#    - DNS / firewall changes?

# 4. While blocked, monitor stuck payments (pending for more than 30 minutes)
docker exec nickapp-mongodb mongosh --eval \
  "db.getSiblingDB('marketplace').payments.countDocuments({status:'pending', createdAt:{\$lt: new Date(Date.now() - 30*60*1000)}})"

# 5. Once SHKeeper is back, the app retries automatically. Verify the
#    backlog drains. If a payment is stuck > 24h, manually verify against
#    SHKeeper and use fix-transaction-hashes.js if needed.
```

**Always communicate.** Even short payment outages erode trust, so post a status update.

---

### 3.5 Email delivery failure

**Symptoms.** Logs show `SMTPError` from `nodemailer`.
Password resets, welcome emails, and dispute notifications fail. Sev 2.

**Runbook.**

```bash
# 1. Test SMTP credentials from the container
docker exec nickapp-backend node -e "
const nm = require('nodemailer');
nm.createTransport({
  host: process.env.SMTP_HOST,
  port: Number(process.env.SMTP_PORT),
  secure: process.env.SMTP_SECURE === 'true',
  auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
}).verify().then(console.log).catch(console.error);
"

# 2. If auth failed → password rotated by provider, update SMTP_PASS in .env
# 3. If connection timed out → provider rate-limit; switch provider/sender
# 4. If specific domains bounce → check SPF / DKIM / DMARC records for amn.gg
```

Users can still operate the app without email; queue critical emails for retry once SMTP is restored.

---

### 3.6 WebSocket disconnect storm

**Symptoms.** Backend logs flood with `🔌 User connected/disconnected` cycling; clients spin on chat / notification badges. Sev 2.

**Runbook.**

```bash
# 1. Confirm symptoms (docker logs replays container stderr on stderr, so merge it into the pipe)
docker logs --tail=500 nickapp-backend 2>&1 | grep -c "🔌"

# 2. Check Nginx access log for socket.io polling spam
tail -f /opt/backend/nginx/logs/access.log | grep socket.io

# 3. Common causes:
#    - Nginx not configured for WebSocket upgrade
#      (returns 502 → client falls back to polling → reconnect loop)
#    - Client clock skew breaking JWT validation on every reconnect
#    - Redis adapter mis-configured (only if scaled horizontally, which is not the case today)

# 4. Quick mitigation: increase Nginx proxy_read_timeout
#    Permanent: ensure nginx.conf has:
#      proxy_http_version 1.1;
#      proxy_set_header Upgrade $http_upgrade;
#      proxy_set_header Connection "upgrade";

# 5. Restart nginx + backend
docker compose restart nginx nickapp-backend
```

---

### 3.7 Suspicious activity / abuse

**Symptoms.** Sentry alerts on unusual error volume from one IP; rate-limit logs spiking; reports of brute-force on `/api/auth/login`.

**Runbook.**

```bash
# 1.
# Identify the offender
tail -n 10000 /opt/backend/nginx/logs/access.log \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head

# 2. Block at the edge (Cloudflare / host firewall)
#    Or use `ufw deny from ` on the host

# 3. Confirm rate limits in app
grep "RATE_LIMIT" /opt/backend/.env
# Defaults: 100 req / 15 min per IP. Tighten if abuse continues.

# 4. If the abuse targets a specific user account:
#    (\$set is escaped so the host shell does not expand it inside the double quotes)
docker exec -it nickapp-backend node -e "
// disable the user via mongoose
require('./dist/infrastructure/database/connection').connectDatabase()
  .then(() => require('./dist/models').User.updateOne({email:'attacker@x.com'}, {\$set:{disabled:true}}))
  .then(console.log)
"

# 5. Preserve evidence: copy access logs to /var/incidents//
```

If user data may have leaked, treat as sev 1 and follow your data-breach disclosure process.

---

### 3.8 Request Network rollback + reconciliation

Use when Request Network payments are failing, stalled, or out of sync with local payment state.

**Immediate rollback (minutes):**

1. Stop routing new intents to Request Network by setting:
   - `REQUEST_NETWORK_ENABLED=false`
   - `PAYMENT_ENABLED_PROVIDERS=shkeeper`
   - keep `PAYMENT_ROLLBACK_PROVIDER=shkeeper`
2. Restart backend and confirm new `/api/payment/request-network/*` checks are no longer in your checkout path.
3. Confirm `PAYMENT_PROVIDER_MODE` is in a safe operational mode:
   - `live`: standard operations
   - `read-only`: observe only, no writes
   - `dry-run`: status updates without on-chain actions

**Reconciliation before re-enabling:**

1. Keep `PAYMENT_RECONCILIATION_ENABLED=false` until investigation is complete.
2. Run a dry reconciliation pass (dry-run) using the Request Network reconciliation service and capture summary counters.
3. If summary is healthy, run with `apply=true` for the intended payment window.
4. Re-enable RN intentionally only after two deployment health checks pass.
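
Before re-enabling, it can help to confirm that the rollback flags from the immediate-rollback list are still in place. The following is a minimal sketch, not the project's own tooling: it assumes the payment flags live in a `KEY=value` env file such as `/opt/backend/.env`, and the helper names `env_value` and `check_rn_rollback` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: verify that an env file reflects the Request Network rollback state.
# Assumption: the flags are stored as KEY=value lines in the file passed as $1.

# Print the value of KEY from FILE (last assignment wins, quotes stripped).
# Prints nothing if the key is absent.
env_value() {
  local key="$1" file="$2"
  { grep -E "^${key}=" "$file" || true; } | tail -n 1 | cut -d= -f2- | tr -d "\"'"
}

# Compare each rollback flag with its expected value.
# Returns 0 if all match, 1 otherwise.
check_rn_rollback() {
  local file="$1" status=0 key expected actual
  while read -r key expected; do
    actual="$(env_value "$key" "$file")"
    if [ "$actual" = "$expected" ]; then
      echo "ok   $key=$actual"
    else
      echo "FAIL $key=${actual:-<unset>} (expected $expected)"
      status=1
    fi
  done <<'EOF'
REQUEST_NETWORK_ENABLED false
PAYMENT_ENABLED_PROVIDERS shkeeper
PAYMENT_ROLLBACK_PROVIDER shkeeper
EOF
  return "$status"
}
```

After sourcing the snippet, run `check_rn_rollback /opt/backend/.env`. A non-zero exit status means at least one flag differs from the rollback values listed above. The check only reads the file; it does not confirm that the running backend was restarted with those values.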

Escalate if repeated `lookup_failed`, `missing_reference`, or coordinator-blocked outcomes block reconciliation for more than 10 minutes.

---

## 4. Communication templates

### Initial incident notification

```
🚨 [SEV-X] Started: