nick-doc/08 - Operations/Incident Response.md
2026-05-23 20:35:34 +03:30

Incident Response

Runbooks for the most likely production incidents, plus communication templates and a post-mortem template. Use this page during an active incident — keep Monitoring, Database Operations, and Backup & Recovery open in adjacent tabs.


1. Severity matrix

| Sev | Meaning | Response time | Examples |
|-----|---------|---------------|----------|
| Sev 1 | Site fully down or unable to process payments | 15 min | Backend container in crashloop; Mongo unreachable; SHKeeper API permanently failing |
| Sev 2 | Major feature broken for a large share of users | 1 hour | Email sending broken; Redis disk full; chat undelivered |
| Sev 3 | Minor / cosmetic issue, isolated user reports | next business day | Single failed webhook; one user can't upload PDF |
| Sev 4 | No user impact, hygiene item | backlog | Backup older than 24h; disk > 80%; missed deploy |

Escalate one severity level (for example, Sev 3 to Sev 2) if more than 10 reports arrive within 5 minutes.
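
The escalation rule above is simple enough to encode, which helps if report counts are ever fed in from a script or bot. The sketch below is one possible reading of the rule; `escalate_sev` is a hypothetical helper name, and it assumes Sev 1 is the ceiling.

```shell
# Sketch of the escalation rule: move one level up (numerically down) when
# more than 10 reports arrive within 5 minutes. Sev 1 cannot escalate further.
# escalate_sev is a hypothetical helper, not an existing tool.
escalate_sev() {
  sev=$1        # current severity, 1 (worst) to 4
  reports=$2    # reports counted in the last 5 minutes
  if [ "$reports" -gt 10 ] && [ "$sev" -gt 1 ]; then
    sev=$((sev - 1))
  fi
  echo "$sev"
}
```

For example, `escalate_sev 3 11` prints `2`, while `escalate_sev 3 10` leaves the severity at `3` because the rule requires strictly more than 10 reports.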


2. First 5 minutes — always do this

  1. Acknowledge. Reply in the on-call channel that you are taking it.
  2. Open Sentry. Filter to the last 15 minutes for new issue spikes.
  3. Open the host shell. Have an ssh prod session ready.
  4. Health endpoint. curl -fsS https://amn.gg/api/health → does it respond?
  5. Container status. docker ps --format "table {{.Names}}\t{{.Status}}".
  6. Recent deploy? Was the :latest tag bumped in the last 30 min? If yes, roll back first (see Deployment#roll-back) and investigate after stability is restored.

If you can't form a hypothesis in 5 minutes, roll back to the previous image tag anyway. Stability before forensics.
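
Steps 4 and 5 of the checklist can be bundled into one helper to run on the host. This is a minimal sketch, not an existing script: `triage` and the `HEALTH_URL` variable are names invented here, and the 10-second timeout is an assumption.

```shell
# Hypothetical triage helper: probes the health endpoint, then lists every
# container whose status is not "Up". HEALTH_URL defaults to the endpoint
# named in the checklist; override it for staging.
HEALTH_URL=${HEALTH_URL:-https://amn.gg/api/health}

triage() {
  if curl -fsS --max-time 10 "$HEALTH_URL" >/dev/null 2>&1; then
    echo "health: OK"
  else
    echo "health: FAILING"
  fi
  # Restarting, Exited, and Created containers are the ones to look at first.
  docker ps -a --format '{{.Names}}: {{.Status}}' | grep -v ': Up ' || true
}
```

Run `triage` right after acknowledging the incident; an empty container list after the health line means every container reports Up.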


3. Common incidents

3.1 Backend down (crashloop, no response on /api/health)

Symptoms. https://amn.gg/api/health times out or returns 5xx; nickapp-backend shows Restarting in docker ps.

Runbook.

# 1. Inspect last lines
docker logs --tail=200 nickapp-backend

# 2. Common causes:
#    - Missing env var (`process.env.X!` throws on first read)
#    - MongoDB unreachable (see 3.2)
#    - Port conflict
#    - Out of memory (look for OOMKilled)
docker inspect nickapp-backend | jq '.[0].State'

# 3. If OOM: increase memory limit in compose, restart
#    If missing env: add to /opt/backend/.env, then `docker compose up -d`

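# 4. If recent deploy: roll back
# Stop Watchtower first so it can't re-pull while you pin the old tag
docker stop watchtower
# Note: this rewrites every :latest in the file; edit by hand if other services use it
sed -i 's|:latest|:<previous-version>|' docker-compose.production.yml
docker compose -f docker-compose.production.yml up -d nickapp-backend
# Restart Watchtower once a fixed image has been published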

Communication. Post in #incidents using the template in §4.
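
Step 2 of the runbook reads the container state by eye. The sketch below turns the same `docker inspect` fields into a one-line hint. `crash_hint` is a hypothetical helper, and the wording of each hint is a suggestion rather than an established diagnosis.

```shell
# Hypothetical helper: summarise why a container stopped, using the
# OOMKilled and ExitCode fields that docker inspect exposes under .State.
crash_hint() {
  # Output of the format string looks like "true 137" or "false 1".
  state=$(docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' "$1") || return 1
  oom=${state% *}
  code=${state#* }
  if [ "$oom" = "true" ]; then
    echo "OOM killed: raise the memory limit in compose"
  elif [ "$code" = "0" ]; then
    echo "clean exit: check restart policy and recent deploys"
  else
    echo "exit code $code: read docker logs for a missing env var or failed connection"
  fi
}
```

Call it as `crash_hint nickapp-backend`. A non-zero exit code without OOMKilled usually means the answer is in the last lines of `docker logs`.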


3.2 MongoDB unreachable

Symptoms. Backend logs show MongoNetworkError, MongooseServerSelectionError, or Could not connect to server.

Runbook.

# 1. Container alive?
docker ps -a --filter "name=mongodb"

# 2. If exited:
docker logs --tail=200 nickapp-mongodb
# Common: corrupt journal, disk full, OOM

# 3. Disk check
df -h /var/lib/docker

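# 4. If disk full:
#    - remove dangling images: docker image prune -f
#      (avoid docker system prune here: it also deletes stopped containers,
#      including an exited nickapp-mongodb)
#    - truncate or rotate oversized container logs
#    - extend volume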

# 5. Restart
docker compose -f docker-compose.production.yml up -d mongodb

# 6. Verify
docker exec nickapp-mongodb mongosh --eval "db.adminCommand('ping')"

# 7. If data is corrupt, restore from latest dump — see Backup & Recovery

[!warning] If Mongo is corrupted and you must restore, stop the backend container first to prevent partial writes during restore. See Database Operations#restore.
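
The disk check in step 3 can be scripted so the number is compared against a threshold instead of read by eye. This is a sketch under assumptions: `disk_use_pct` and `disk_level` are hypothetical helpers, the 80% mark comes from the severity matrix, and the 95% "full" mark is an arbitrary choice made here.

```shell
# Reads `df -P <path>` output on stdin and prints the use percentage
# of the first filesystem as a bare integer.
disk_use_pct() {
  awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Maps a percentage to a coarse level.
disk_level() {
  pct=$1
  if [ "$pct" -ge 95 ]; then
    echo "full"    # assumed threshold, not taken from the runbook
  elif [ "$pct" -gt 80 ]; then
    echo "sev4"    # matches "disk > 80%" in the severity matrix
  else
    echo "ok"
  fi
}
```

Usage on the host: `disk_level "$(df -P /var/lib/docker | disk_use_pct)"`. The `-P` flag keeps each filesystem on a single line, which is what the `NR == 2` match relies on.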


3.3 Redis unreachable

Symptoms. Logs show ECONNREFUSED redis:6379 or NOAUTH Authentication required. Rate limits stop working, refresh tokens can't be revoked, but most read flows still work.

Runbook.

# 1. Container alive?
docker ps -a --filter "name=redis"

# 2. If down:
docker logs --tail=200 nickapp-redis
docker compose -f docker-compose.production.yml up -d redis

# 3. Auth issue?
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" PING
# Should return PONG

# 4. If `$REDIS_PASSWORD` mismatch between .env and command:
nano /opt/backend/.env       # confirm REDIS_PASSWORD set
docker compose -f docker-compose.production.yml up -d redis backend

# 5. If memory full + noeviction policy → rejecting writes:
#    (CONFIG SET is runtime-only; persist the policy in the Redis config or compose command, or it reverts on restart)
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" CONFIG SET maxmemory-policy allkeys-lru

The app gracefully degrades when Redis is unreachable for short windows — don't panic, but fix within an hour.
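
The three failure modes in this runbook each leave a recognisable string in the logs, so a log line can be mapped straight to the step that addresses it. The sketch below assumes those strings appear verbatim; `redis_hint` is a hypothetical helper, and `WRONGPASS` is added because Redis 6 and later report a bad password that way.

```shell
# Hypothetical helper: map a Redis-related log line to the runbook step
# most likely to fix it.
redis_hint() {
  case "$1" in
    *ECONNREFUSED*)              echo "container down or unreachable: see step 2" ;;
    *NOAUTH*|*WRONGPASS*)        echo "password mismatch: see steps 3 and 4" ;;
    *"OOM command not allowed"*) echo "maxmemory reached: see step 5" ;;
    *)                           echo "unknown: read the full redis log" ;;
  esac
}
```

For instance, `redis_hint "Error: connect ECONNREFUSED redis:6379"` points at step 2.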


3.4 SHKeeper API down (payments blocked)

Symptoms. Backend logs show repeated SHKeeper request failed: ECONNREFUSED or non-2xx responses from $SHKEEPER_API_URL. Buyers see "Payment unavailable" in checkout. Sev 1 — money is involved.

Runbook.

# 1. Confirm SHKeeper itself is reachable
curl -fsS -H "X-Shkeeper-Api-Key: $SHKEEPER_API_KEY" \
  "$SHKEEPER_API_URL/api/v1/healthcheck"

# 2. If 5xx from SHKeeper → it's their side
#    - Check their status page / contact provider
#    - Toggle a banner in the frontend warning buyers
#    - Consider switching SHKEEPER_FORCE_PAYOUT_DEMO=true so QA still works
#      (do NOT do this for real customer money)

# 3. If our network can't reach it:
#    - compare curl from the host with curl from inside the container
curl -v "$SHKEEPER_API_URL"
docker exec nickapp-backend sh -c 'curl -v "$SHKEEPER_API_URL"'
#    - DNS / firewall changes?

# 4. While blocked, monitor stuck payments (pending for more than 30 minutes)
docker exec nickapp-mongodb mongosh marketplace --quiet --eval \
  "db.payments.countDocuments({status:'pending', createdAt:{\$lt: new Date(Date.now() - 30*60*1000)}})"

# 5. Once SHKeeper is back, the app retries automatically. Verify the
#    backlog drains. If a payment is stuck > 24h, manually verify against
#    SHKeeper and use fix-transaction-hashes.js if needed.

Always communicate. Even short payment outages erode trust — post a status update.
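
Steps 1 to 3 hinge on telling "their side" from "our side". One way to make that split explicit is to capture the HTTP status and classify it, as in the sketch below. `shkeeper_hint` is a hypothetical helper, and the 401/403 branch is an assumption about how a bad API key would surface.

```shell
# Hypothetical helper: classify the HTTP status returned by the SHKeeper probe.
# curl reports 000 via %{http_code} when it received no HTTP response at all.
shkeeper_hint() {
  case "$1" in
    000)     echo "no response: our network, DNS, or firewall" ;;
    2??)     echo "healthy" ;;
    401|403) echo "rejected: check SHKEEPER_API_KEY" ;;
    5??)     echo "provider side: check their status page" ;;
    *)       echo "unexpected status $1: check SHKEEPER_API_URL" ;;
  esac
}

# Usage (not executed here):
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "X-Shkeeper-Api-Key: $SHKEEPER_API_KEY" "$SHKEEPER_API_URL/api/v1/healthcheck")
#   shkeeper_hint "$code"
```

Dropping `-f` from the probe is deliberate in this sketch, so that curl still reports the status code for 4xx and 5xx responses.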


3.5 Email delivery failure

Symptoms. Logs show SMTPError from nodemailer. Password resets, welcome emails, dispute notifications fail. Sev 2.

Runbook.

# 1. Test SMTP credentials from the container
docker exec nickapp-backend node -e "
  const nm = require('nodemailer');
  nm.createTransport({
    host: process.env.SMTP_HOST,
    port: Number(process.env.SMTP_PORT),
    secure: process.env.SMTP_SECURE === 'true',
    auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
  }).verify().then(console.log).catch(console.error);
"

# 2. If auth failed → password rotated by provider, update SMTP_PASS in .env
# 3. If connection timed out → check SMTP_HOST / SMTP_PORT and the outbound firewall first; if the provider is throttling, switch provider/sender
# 4. If specific domains bounce → check SPF / DKIM / DMARC records for amn.gg

Users can still operate the app without email; queue critical emails for retry once SMTP is restored.
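
For step 4, a quick first pass is to confirm that SPF and DMARC records exist at all. The helpers below are a sketch that only checks for the version tags in TXT output piped in from `dig`; `has_spf` and `has_dmarc` are hypothetical names, and they do not validate the policy itself. DKIM is left out because the selector is provider-specific.

```shell
# Each helper reads TXT records on stdin and succeeds if the expected
# version tag is present. They check presence only, not correctness.
has_spf()   { grep -q 'v=spf1'; }
has_dmarc() { grep -q 'v=DMARC1'; }

# Usage (not executed here):
#   dig +short TXT amn.gg        | has_spf   || echo "SPF record missing"
#   dig +short TXT _dmarc.amn.gg | has_dmarc || echo "DMARC record missing"
```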


3.6 WebSocket disconnect storm

Symptoms. Backend logs flood with 🔌 User connected/disconnected cycling; clients spinning on chat / notification badges. Sev 2.

Runbook.

# 1. Confirm symptoms
docker logs --tail=500 nickapp-backend | grep -c "🔌"

# 2. Check Nginx access log for socket.io polling spam
tail -f /opt/backend/nginx/logs/access.log | grep socket.io

# 3. Common causes:
#    - Nginx not configured for WebSocket upgrade (returns 502 → client falls back to polling → reconnect loop)
#    - Client clock skew breaking JWT validation on every reconnect
#    - Redis adapter mis-configured (if scaled horizontally — not the case today)

# 4. Quick mitigation: increase Nginx proxy_read_timeout
#    Permanent: ensure nginx.conf has:
#      proxy_http_version 1.1;
#      proxy_set_header Upgrade $http_upgrade;
#      proxy_set_header Connection "upgrade";

# 5. Restart nginx + backend
docker compose -f docker-compose.production.yml restart nginx nickapp-backend
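
To test the first cause in step 3, it helps to count how Socket.IO requests split between transports in the access log. Socket.IO puts `transport=polling` or `transport=websocket` in the query string, so the sketch below counts both. `transport_counts` is a hypothetical helper, and it assumes the request line with its query string is logged.

```shell
# Hypothetical helper: count Socket.IO requests per transport from an
# Nginx access log supplied on stdin.
transport_counts() {
  awk '
    /socket\.io/ && /transport=polling/   { p++ }
    /socket\.io/ && /transport=websocket/ { w++ }
    END { printf "polling=%d websocket=%d\n", p, w }
  '
}
```

Usage: `tail -n 5000 /opt/backend/nginx/logs/access.log | transport_counts`. Polling naturally produces more log lines than a long-lived WebSocket, so the telling result is a websocket count at or near zero alongside heavy polling, which suggests the upgrade is failing.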

3.7 Suspicious activity / abuse

Symptoms. Sentry alerts on unusual error volume from one IP; rate-limit logs spiking; reports of brute-force on /api/auth/login.

Runbook.

# 1. Identify the offender
tail -n 10000 /opt/backend/nginx/logs/access.log \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head

# 2. Block at the edge (Cloudflare / host firewall)
#    Or use `ufw deny from <ip>` on the host

# 3. Confirm rate limits in app
grep "RATE_LIMIT" /opt/backend/.env
# Defaults: 100 req / 15 min per IP. Tighten if abuse continues.

# 4. If the abuse targets a specific user account:
#    (\$set is escaped so the host shell does not expand it inside the double quotes)
docker exec -it nickapp-backend node -e "
  // disable the user via mongoose
  require('./dist/infrastructure/database/connection').connectDatabase()
    .then(() => require('./dist/models').User.updateOne({email:'attacker@x.com'}, {\$set:{disabled:true}}))
    .then(console.log)
    .catch(console.error)
    .finally(() => process.exit())
"

# 5. Preserve evidence: copy access logs to /var/incidents/<date>/

If user data may have leaked, treat as sev 1 and follow your data-breach disclosure process.
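
Step 1 lists the busiest IPs but leaves the cut-off to judgement. The sketch below adds an explicit threshold so only addresses above it are printed. `offenders` is a hypothetical helper, and it assumes the client IP is the first field of each access-log line, as in the default Nginx format.

```shell
# Hypothetical helper: print "<ip> <count>" for every client IP with more
# than $1 requests in the access log supplied on stdin, busiest first.
offenders() {
  threshold=$1
  awk '{ print $1 }' | sort | uniq -c | sort -rn |
    awk -v t="$threshold" '$1 > t { print $2, $1 }'
}
```

Usage: `tail -n 10000 /opt/backend/nginx/logs/access.log | offenders 1000`. Pick the threshold relative to the window you are tailing; behind a CDN or proxy the first field may be the proxy address rather than the real client.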


4. Communication templates

Initial incident notification

🚨 [SEV-X] <one-line summary>
Started: <time UTC>
Impact: <which users / features>
Status: investigating
On-call: <@you>
Updates: every 15 minutes in this thread

Mid-incident update

[SEV-X] Update <n>
Time: <UTC>
Status: <investigating / mitigating / monitoring>
What we know: <facts>
What we're trying: <action>
Next update: <time>

Resolution

✅ [SEV-X] Resolved
Started: <UTC>
Ended:   <UTC>
Duration: <minutes>
Impact: <users / features / requests affected>
Root cause: <one sentence>
Permanent fix: <PR / ticket>
Postmortem: <doc link, by <date>>
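
The Duration field is easy to get wrong under pressure, especially across midnight UTC. The sketch below computes it from two HH:MM timestamps. `duration_minutes` is a hypothetical helper, and it assumes two-digit fields and an incident shorter than 24 hours.

```shell
# Hypothetical helper: minutes between two UTC times given as HH:MM.
# If the end time is earlier than the start, the incident is assumed to
# have crossed midnight once.
duration_minutes() {
  start_h=${1%:*}; start_m=${1#*:}
  end_h=${2%:*};   end_m=${2#*:}
  # Strip one leading zero so values like "08" are not parsed as octal.
  start_h=${start_h#0}; start_m=${start_m#0}
  end_h=${end_h#0};     end_m=${end_m#0}
  d=$(( (end_h * 60 + end_m) - (start_h * 60 + start_m) ))
  if [ "$d" -lt 0 ]; then
    d=$((d + 1440))
  fi
  echo "$d"
}
```

For example, `duration_minutes 23:50 00:08` prints `18`.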

Customer-facing status

We're investigating an issue affecting <feature> that started at <time>.
We'll post an update by <time + 15 min>.

Avoid speculation in customer-facing copy. Say "investigating", "applying fix", "monitoring", "resolved" — and nothing else until you actually know.


5. Post-mortem template

Complete within 5 business days of any Sev 1 or Sev 2 incident.

---
title: Post-mortem — <short title>
date: <YYYY-MM-DD>
severity: SEV-X
duration: <minutes>
authors: [<names>]
tags: [postmortem]
---

## Summary
One paragraph: what broke, who was affected, how long, how it was fixed.

## Timeline (UTC)
- HH:MM — first signal (alert, user report)
- HH:MM — on-call ack
- HH:MM — hypothesis: ...
- HH:MM — mitigation deployed
- HH:MM — verified resolved
- HH:MM — incident closed

## Impact
- Users affected: <count or %>
- Features affected: <list>
- Money affected: <if payments>
- Data loss: <yes/no — describe>

## Root cause
Honest, blameless. Distinguish trigger vs underlying cause.

## What went well
- ...

## What went poorly
- ...

## Where we got lucky
- ...

## Action items
| # | Item | Owner | Due | Ticket |
|---|------|-------|-----|--------|
| 1 | Add /health probe for MongoDB | @x | 2026-06-01 | OPS-123 |
| 2 | Tighten rate limit on /auth/login | @y | 2026-05-30 | OPS-124 |

## Detection improvements
What new alert / dashboard would have caught this earlier?

## Process improvements
What runbook / docs need updating? Update [[Incident Response]] right now.

Store postmortems alongside this vault — suggested path /Users/mojtabaheidari/code/docs/08 - Operations/postmortems/YYYY-MM-DD-<slug>.md.
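
The YYYY-MM-DD-<slug>.md naming pattern can be generated from the post-mortem title so that file names stay consistent. This is a minimal sketch; `postmortem_filename` is a hypothetical helper, and the slug rule (lowercase, runs of non-alphanumerics collapsed to one hyphen) is an assumption, since the page does not define one.

```shell
# Hypothetical helper: build a post-mortem file name from a date and title.
postmortem_filename() {
  # $1 = date as YYYY-MM-DD, $2 = short title
  slug=$(printf '%s' "$2" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//')
  echo "$1-$slug.md"
}
```

For example, `postmortem_filename 2026-05-23 "Redis OOM, writes rejected"` prints `2026-05-23-redis-oom-writes-rejected.md`.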


6. Escalation contacts

(Fill in for your team; placeholder structure below.)

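| Role | Primary | Backup | Channel |
|------|---------|--------|---------|
| On-call engineer | | | #incidents |
| Payments lead | | | DM |
| Infrastructure | | | DM |
| Product / customer comms | | | #customer-comms |
| SHKeeper provider contact | | | email |
| SMTP provider | | | email |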

7. After every incident

  • Updated this page with any new gotchas?
  • Updated Monitoring with any new metrics/alerts?
  • Updated Backup & Recovery if backup gaps were exposed?
  • Action items tracked?
  • Customer comms sent (if user-impacting)?
  • Post-mortem published?

Cross-links: Deployment for rollback steps, Database Operations for DB diagnostics, Backup & Recovery for restore procedures, Monitoring for metrics to watch.