---
title: Incident Response
tags: [operations]
---

# Incident Response

Runbooks for the most likely production incidents, plus communication templates and a post-mortem template. Use this page during an active incident, and keep [[Monitoring]], [[Database Operations]], and [[Backup & Recovery]] open in adjacent tabs.

---

## 1. Severity matrix

| Sev | Meaning | Response time | Examples |
|-----|---------|---------------|----------|
| **Sev 1** | Site fully down or unable to process payments | 15 min | Backend container in crashloop; Mongo unreachable; SHKeeper API persistently failing |
| **Sev 2** | Major feature broken for a large share of users | 1 hour | Email sending broken; Redis disk full; chat undelivered |
| **Sev 3** | Minor / cosmetic issue, isolated user reports | next business day | Single failed webhook; one user can't upload PDF |
| **Sev 4** | No user impact, hygiene item | backlog | Backup older than 24h; disk > 80%; missed deploy |

Escalate one severity level (for example, Sev 3 to Sev 2) if more than 10 reports arrive within 5 minutes.
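The escalation rule can be written as a small shell function, which may help if report counts are ever fed in from a script. This is a minimal sketch; `escalate_sev` is a hypothetical name, and it assumes Sev 1 is the ceiling.

```bash
# Hypothetical helper: apply the "more than 10 reports in 5 minutes" rule.
# Usage: escalate_sev <current_sev 1-4> <reports_in_last_5_min>
escalate_sev() {
  local sev="$1" reports="$2"
  # Sev 1 cannot go higher; otherwise more than 10 reports bumps one level.
  if [ "$reports" -gt 10 ] && [ "$sev" -gt 1 ]; then
    echo $((sev - 1))
  else
    echo "$sev"
  fi
}
```

For example, `escalate_sev 3 11` prints `2`, while exactly 10 reports leaves the severity unchanged.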

---

## 2. First 5 minutes — always do this

1. **Acknowledge.** Reply in the on-call channel that you are taking it.
2. **Open Sentry.** Filter to the last 15 minutes for new issue spikes.
3. **Open the host shell.** Have `ssh prod` ready.
4. **Health endpoint.** `curl -fsS https://amn.gg/api/health` → does it respond?
5. **Container status.** `docker ps --format "table {{.Names}}\t{{.Status}}"`.
6. **Recent deploy?** Was the `:latest` tag bumped in the last 30 min? If yes, **roll back first** (see [[Deployment#roll-back]]) and investigate after stability is restored.

If you can't form a hypothesis in 5 minutes, **roll back to the previous image tag** anyway. Stability before forensics.
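The "recent deploy?" check in step 6 can be sketched as a function that compares two epoch timestamps. `recent_deploy` is a hypothetical name, and using the container creation time as a proxy for deploy time is an assumption (it holds if the updater recreates the container on each deploy).

```bash
# Hypothetical helper: did the deploy happen within the rollback window?
# Usage: recent_deploy <created_epoch> <now_epoch> [window_minutes]
recent_deploy() {
  local created="$1" now="$2" window="${3:-30}"
  local age_min=$(( (now - created) / 60 ))
  # Succeeds (exit 0) only when the age is between 0 and the window.
  [ "$age_min" -ge 0 ] && [ "$age_min" -le "$window" ]
}

# Example wiring (assumes GNU date can parse Docker's RFC 3339 timestamp):
#   created=$(date -d "$(docker inspect --format '{{.Created}}' nickapp-backend)" +%s)
#   recent_deploy "$created" "$(date +%s)" && echo "recent deploy: roll back first"
```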

---

## 3. Common incidents

### 3.1 Backend down (crashloop, no response on /health)

**Symptoms.** `https://amn.gg/api/health` times out or returns 5xx; `nickapp-backend` shows `Restarting` in `docker ps`.

**Runbook.**

```bash
# 1. Inspect last lines
docker logs --tail=200 nickapp-backend

# 2. Common causes:
# - Missing env var (`process.env.X!` is undefined at runtime and fails on first use)
# - MongoDB unreachable (see 3.2)
# - Port conflict
# - Out of memory (look for OOMKilled)
docker inspect nickapp-backend | jq '.[0].State'

# 3. If OOM: increase memory limit in compose, restart
# If missing env: add to /opt/backend/.env, then `docker compose up -d`

# 4. If recent deploy: roll back
# Stop Watchtower first so it doesn't re-pull :latest mid-rollback
# (this pauses updates for every container it watches, not just the backend)
docker stop watchtower
# Pin the previous tag. This rewrites every `:latest` in the file,
# so review the diff if other services also use :latest.
sed -i 's|:latest|:<previous-version>|' docker-compose.production.yml
# `backend` is the compose service name used elsewhere on this page;
# the container it creates is nickapp-backend.
docker compose -f docker-compose.production.yml up -d backend
# Start Watchtower again (`docker start watchtower`) once a fixed image is published
```

**Communication.** Post in #incidents using the template in §4.
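The causes listed in step 2 can be narrowed down from two fields of the container state. The sketch below is illustrative only: `classify_exit` is a hypothetical helper, and the mapping from exit codes to causes is a rule of thumb, not a diagnosis.

```bash
# Hypothetical helper: map container state fields to a likely cause.
# Usage: classify_exit <oom_killed true|false> <exit_code>
classify_exit() {
  local oom="$1" code="$2"
  if [ "$oom" = "true" ]; then
    echo "oom: raise the memory limit in compose"
  elif [ "$code" = "137" ]; then
    echo "sigkill: killed externally, possibly host-level OOM"
  elif [ "$code" = "0" ]; then
    echo "clean-exit: process ended on its own, check the entrypoint"
  else
    echo "app-error: read the logs (env var, Mongo, port conflict)"
  fi
}

# Example wiring:
#   classify_exit $(docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' nickapp-backend)
```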

---

### 3.2 MongoDB unreachable

**Symptoms.** Backend logs show `MongoNetworkError`, `MongooseServerSelectionError`, or `Could not connect to server`.

**Runbook.**

```bash
# 1. Container alive?
docker ps -a --filter "name=mongodb"

# 2. If exited:
docker logs --tail=200 nickapp-mongodb
# Common: corrupt journal, disk full, OOM

# 3. Disk check
df -h /var/lib/docker

# 4. If disk full:
# - reclaim space: `docker image prune` (dangling images) or `docker builder prune`
#   (avoid `docker system prune` here: it also deletes stopped containers,
#   including an exited nickapp-mongodb, and does not shrink logs of running containers)
# - truncate or rotate oversized container logs
# - extend volume

# 5. Restart
docker compose -f docker-compose.production.yml up -d mongodb

# 6. Verify
docker exec nickapp-mongodb mongosh --eval "db.adminCommand('ping')"

# 7. If data is corrupt, restore from latest dump (see Backup & Recovery)
```

> [!warning] If Mongo is corrupted and you must restore, **stop the backend container first** to prevent partial writes during restore. See [[Database Operations#restore]].
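The disk check in step 3 can be turned into a scriptable threshold test. This is a sketch under assumptions: `df_use_pct` and `disk_over` are hypothetical helpers, and the default of 80 mirrors the "disk > 80%" hygiene item in the severity matrix.

```bash
# Hypothetical helper: read `df -P` output on stdin and print the use% of the
# last filesystem line as a bare number.
df_use_pct() {
  awk 'NR > 1 { gsub("%", "", $5); pct = $5 } END { print pct }'
}

# Hypothetical helper: succeed when usage is above the threshold.
# Usage: disk_over <percent> [threshold]
disk_over() {
  local pct="$1" threshold="${2:-80}"
  [ "$pct" -gt "$threshold" ]
}

# Example wiring:
#   pct=$(df -P /var/lib/docker | df_use_pct)
#   disk_over "$pct" 90 && echo "disk at ${pct}%: free space before restarting Mongo"
```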

---

### 3.3 Redis unreachable

**Symptoms.** Logs show `ECONNREFUSED redis:6379` or `NOAUTH Authentication required`. Rate limits stop working and refresh tokens can't be revoked, but most read flows still work.

**Runbook.**

```bash
# 1. Container alive?
docker ps -a --filter "name=redis"

# 2. If down:
docker logs --tail=200 nickapp-redis
docker compose -f docker-compose.production.yml up -d redis

# 3. Auth issue?
# ($REDIS_PASSWORD is expanded by the host shell, so export it there first)
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" PING
# Should return PONG

# 4. If `$REDIS_PASSWORD` mismatch between .env and command:
nano /opt/backend/.env # confirm REDIS_PASSWORD set
docker compose -f docker-compose.production.yml up -d redis backend

# 5. If memory full + noeviction policy → rejecting writes:
# (CONFIG SET is a runtime-only change and is lost on restart; allkeys-lru can
# also evict keys without a TTL, so confirm that is acceptable for stored tokens)
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" CONFIG SET maxmemory-policy allkeys-lru
```

The app degrades gracefully when Redis is unreachable for short windows. Don't panic, but fix it within an hour.
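Before changing the eviction policy in step 5, it can help to confirm that memory really is the problem. The sketch below parses the output of `redis-cli INFO memory`; `redis_mem_pct` is a hypothetical helper name.

```bash
# Hypothetical helper: read `redis-cli INFO memory` output on stdin and print
# used memory as a whole-number percentage of maxmemory ("unlimited" if 0).
redis_mem_pct() {
  # INFO output uses CRLF line endings, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) print "unlimited"
      else printf "%d\n", used * 100 / max
    }'
}

# Example wiring:
#   docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO memory | redis_mem_pct
```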

---

### 3.4 SHKeeper API down (payments blocked)

**Symptoms.** Backend logs show repeated `SHKeeper request failed: ECONNREFUSED` or non-2xx responses from `$SHKEEPER_API_URL`. Buyers see "Payment unavailable" in checkout. Sev 1, because money is involved.

**Runbook.**

```bash
# 1. Confirm SHKeeper itself is reachable
curl -fsS -H "X-Shkeeper-Api-Key: $SHKEEPER_API_KEY" \
  "$SHKEEPER_API_URL/api/v1/healthcheck"

# 2. If 5xx from SHKeeper → it's their side
# - Check their status page / contact provider
# - Toggle a banner in the frontend warning buyers
# - Consider switching SHKEEPER_FORCE_PAYOUT_DEMO=true so QA still works
#   (do NOT do this for real customer money)

# 3. If our network can't reach it:
# - compare curl from the host with curl from inside the container
#   (sh -c lets the container expand its own $SHKEEPER_API_URL)
docker exec nickapp-backend sh -c 'curl -v "$SHKEEPER_API_URL"'
# - DNS / firewall changes?

# 4. While blocked, monitor stuck payments (pending for more than 30 minutes)
# (getSiblingDB selects the database inside --eval, where the `use` helper is unreliable)
docker exec nickapp-mongodb mongosh --quiet --eval \
  'db.getSiblingDB("marketplace").payments.countDocuments({status:"pending", createdAt:{$lt: new Date(Date.now() - 30*60*1000)}})'

# 5. Once SHKeeper is back, the app retries automatically. Verify the
# backlog drains. If a payment is stuck > 24h, manually verify against
# SHKeeper and use fix-transaction-hashes.js if needed.
```

**Always communicate.** Even short payment outages erode trust, so post a status update.
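Steps 2 and 3 branch on whose side the failure is on, which can be read from the HTTP status alone. This is a minimal sketch; `classify_http` is a hypothetical helper, and the 4xx interpretation is an assumption about how the API reports a bad key or URL.

```bash
# Hypothetical helper: decide where to look next from an HTTP status code.
# curl reports 000 when it could not connect at all.
classify_http() {
  case "$1" in
    2??) echo "ok" ;;
    000) echo "unreachable: check DNS / firewall from host and container" ;;
    5??) echo "provider-side: check status page, post buyer banner" ;;
    4??) echo "our-request: check API key and URL" ;;
    *)   echo "unexpected: $1" ;;
  esac
}

# Example wiring (same endpoint and header as step 1):
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "X-Shkeeper-Api-Key: $SHKEEPER_API_KEY" "$SHKEEPER_API_URL/api/v1/healthcheck")
#   classify_http "$code"
```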

---

### 3.5 Email delivery failure

**Symptoms.** Logs show `SMTPError` from `nodemailer`. Password resets, welcome emails, and dispute notifications fail. Sev 2.

**Runbook.**

```bash
# 1. Test SMTP credentials from the container
docker exec nickapp-backend node -e "
const nm = require('nodemailer');
nm.createTransport({
  host: process.env.SMTP_HOST,
  port: Number(process.env.SMTP_PORT),
  secure: process.env.SMTP_SECURE === 'true',
  auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
}).verify().then(console.log).catch(console.error);
"

# 2. If auth failed → password rotated by provider, update SMTP_PASS in .env
# 3. If connection timed out → possibly a provider rate limit or a blocked outbound SMTP port; switch provider/sender if it persists
# 4. If specific domains bounce → check SPF / DKIM / DMARC records for amn.gg
```

Users can still operate the app without email; queue critical emails for retry once SMTP is restored.
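For step 4, a quick way to see whether SPF and DMARC records exist at all is to scan the TXT records that `dig` returns. The sketch below only checks for presence, not correctness; `has_txt_tag` is a hypothetical helper.

```bash
# Hypothetical helper: read TXT records on stdin (one per line, as printed by
# `dig +short TXT <name>`) and report whether one starts with the given tag.
has_txt_tag() {
  # dig wraps each record in double quotes, so strip them before matching.
  awk -v tag="$1" '
    { gsub(/"/, "") }
    index($0, tag) == 1 { found = 1 }
    END { print (found ? "present" : "missing") }'
}

# Example wiring (the DKIM selector depends on the SMTP provider, so it is omitted):
#   dig +short TXT amn.gg        | has_txt_tag "v=spf1"
#   dig +short TXT _dmarc.amn.gg | has_txt_tag "v=DMARC1"
```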

---

### 3.6 WebSocket disconnect storm

**Symptoms.** Backend logs flood with `🔌 User connected/disconnected` cycling; clients are stuck spinning on chat / notification badges. Sev 2.

**Runbook.**

```bash
# 1. Confirm symptoms (2>&1 because docker logs also writes to stderr)
docker logs --tail=500 nickapp-backend 2>&1 | grep -c "🔌"

# 2. Check Nginx access log for socket.io polling spam
tail -f /opt/backend/nginx/logs/access.log | grep socket.io

# 3. Common causes:
# - Nginx not configured for WebSocket upgrade (returns 502 → client falls back to polling → reconnect loop)
# - Client clock skew breaking JWT validation on every reconnect
# - Redis adapter mis-configured (if scaled horizontally; not the case today)

# 4. Quick mitigation: increase Nginx proxy_read_timeout
# Permanent: ensure nginx.conf has:
#   proxy_http_version 1.1;
#   proxy_set_header Upgrade $http_upgrade;
#   proxy_set_header Connection "upgrade";

# 5. Restart nginx + backend (compose service names, not container names)
docker compose -f docker-compose.production.yml restart nginx backend
```
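To tell a reconnect loop from normal traffic, it can help to compare connects against disconnects over a short window. The sketch below assumes the log lines contain the literal strings "User connected" and "User disconnected", inferred from the symptom description; `socket_churn` is a hypothetical helper.

```bash
# Hypothetical helper: read backend log lines on stdin and tally connects
# versus disconnects. Adjust the patterns to the real log format.
socket_churn() {
  awk '
    /User connected/    { c++ }
    /User disconnected/ { d++ }
    END { printf "connected=%d disconnected=%d\n", c, d }'
}

# Example wiring:
#   docker logs --since 5m nickapp-backend 2>&1 | socket_churn
```

Roughly equal, rapidly growing counts over a few minutes suggest clients are cycling rather than staying connected.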

---

### 3.7 Suspicious activity / abuse

**Symptoms.** Sentry alerts on unusual error volume from one IP; rate-limit logs spiking; reports of brute-force on `/api/auth/login`.

**Runbook.**

```bash
# 1. Identify the offender
tail -n 10000 /opt/backend/nginx/logs/access.log \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head

# 2. Block at the edge (Cloudflare / host firewall)
# Or use `ufw deny from <ip>` on the host
# (ports published by Docker bypass ufw rules by default, so verify the block took effect)

# 3. Confirm rate limits in app
grep "RATE_LIMIT" /opt/backend/.env
# Defaults: 100 req / 15 min per IP. Tighten if abuse continues.

# 4. If the abuse targets a specific user account:
# (\$set is escaped so the host shell doesn't expand it inside the double quotes)
docker exec -it nickapp-backend node -e "
// disable the user via mongoose
require('./dist/infrastructure/database/connection').connectDatabase()
  .then(() => require('./dist/models').User.updateOne({email:'attacker@x.com'}, {\$set:{disabled:true}}))
  .then(console.log)
  .catch(console.error)
  .finally(() => process.exit())
"

# 5. Preserve evidence: copy access logs to /var/incidents/<date>/
```

If user data may have leaked, treat as sev 1 and follow your data-breach disclosure process.
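Step 1 lists the busiest IPs; a threshold filter makes it easier to separate abusers from ordinary heavy users. This is a sketch: `ips_over` is a hypothetical helper, the default of 100 mirrors the rate limit quoted in step 3, and if Nginx sits behind a proxy such as Cloudflare, field 1 may hold the proxy address rather than the client's.

```bash
# Hypothetical helper: read access-log lines on stdin (client IP in field 1)
# and print "<count> <ip>" for every IP above the request threshold.
# Usage: ips_over [threshold]
ips_over() {
  local threshold="${1:-100}"
  awk -v t="$threshold" '
    { hits[$1]++ }
    END { for (ip in hits) if (hits[ip] > t) print hits[ip], ip }' \
    | sort -rn
}

# Example wiring:
#   tail -n 10000 /opt/backend/nginx/logs/access.log | ips_over 100
```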

---

## 4. Communication templates

### Initial incident notification

```
🚨 [SEV-X] <one-line summary>
Started: <time UTC>
Impact: <which users / features>
Status: investigating
On-call: <@you>
Updates: every 15 minutes in this thread
```

### Mid-incident update

```
[SEV-X] Update <n>
Time: <UTC>
Status: <investigating / mitigating / monitoring>
What we know: <facts>
What we're trying: <action>
Next update: <time>
```

### Resolution

```
✅ [SEV-X] Resolved
Started: <UTC>
Ended: <UTC>
Duration: <minutes>
Impact: <users / features / requests affected>
Root cause: <one sentence>
Permanent fix: <PR / ticket>
Postmortem: <doc link, by <date>>
```

### Customer-facing status

```
We're investigating an issue affecting <feature> that started at <time>.
We'll post an update by <time + 15 min>.
```

Avoid speculation in customer-facing copy. Say "investigating", "applying fix", "monitoring", or "resolved", and nothing else until you actually know.
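If you would rather generate the initial notification than copy and edit it under pressure, a small function can fill the template. This is a sketch only; `incident_notice` is a hypothetical helper, and it defaults the start time to the current UTC time when none is given.

```bash
# Hypothetical helper: fill the initial-notification template.
# Usage: incident_notice <sev 1-4> <summary> <impact> <on-call handle> [start time]
incident_notice() {
  local sev="$1" summary="$2" impact="$3" oncall="$4" started="$5"
  # Default to "now" in UTC when no start time is supplied.
  [ -n "$started" ] || started="$(date -u '+%Y-%m-%d %H:%M') UTC"
  cat <<EOF
🚨 [SEV-${sev}] ${summary}
Started: ${started}
Impact: ${impact}
Status: investigating
On-call: ${oncall}
Updates: every 15 minutes in this thread
EOF
}

# Example:
#   incident_notice 2 "Email sending broken" "password resets, welcome emails" "@you"
```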

---

## 5. Post-mortem template

Complete within 5 business days of any sev 1 or sev 2 incident.

```markdown
---
title: Post-mortem — <short title>
date: <YYYY-MM-DD>
severity: SEV-X
duration: <minutes>
authors: [<names>]
tags: [postmortem]
---

## Summary
One paragraph: what broke, who was affected, how long, how it was fixed.

## Timeline (UTC)
- HH:MM — first signal (alert, user report)
- HH:MM — on-call ack
- HH:MM — hypothesis: ...
- HH:MM — mitigation deployed
- HH:MM — verified resolved
- HH:MM — incident closed

## Impact
- Users affected: <count or %>
- Features affected: <list>
- Money affected: <if payments>
- Data loss: <yes/no — describe>

## Root cause
Honest, blameless. Distinguish trigger vs underlying cause.

## What went well
- ...

## What went poorly
- ...

## Where we got lucky
- ...

## Action items
| # | Item | Owner | Due | Ticket |
|---|------|-------|-----|--------|
| 1 | Add /health probe for MongoDB | @x | 2026-06-01 | OPS-123 |
| 2 | Tighten rate limit on /auth/login | @y | 2026-05-30 | OPS-124 |

## Detection improvements
What new alert / dashboard would have caught this earlier?

## Process improvements
What runbook / docs need updating? Update [[Incident Response]] right now.
```

Store postmortems alongside this vault. Suggested path: `/Users/mojtabaheidari/code/docs/08 - Operations/postmortems/YYYY-MM-DD-<slug>.md`.
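The `YYYY-MM-DD-<slug>.md` naming convention can be produced consistently with a small function. This is a sketch; `postmortem_filename` is a hypothetical helper, and the slug rules (lowercase, hyphens for anything that is not a letter or digit) are an assumption about what "slug" means here.

```bash
# Hypothetical helper: build the YYYY-MM-DD-<slug>.md file name.
# Usage: postmortem_filename <YYYY-MM-DD> <short title>
postmortem_filename() {
  local day="$1" title="$2" slug
  # Lowercase, turn runs of non-alphanumerics into single hyphens, trim the ends.
  slug=$(printf '%s\n' "$title" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//')
  printf '%s-%s.md\n' "$day" "$slug"
}

# Example:
#   postmortem_filename 2026-05-20 "Redis Disk Full"
```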

---

## 6. Escalation contacts

(Fill in for your team; placeholder structure below.)

| Role | Primary | Backup | Channel |
|------|---------|--------|---------|
| On-call engineer | <name> | <name> | #incidents |
| Payments lead | <name> | <name> | DM |
| Infrastructure | <name> | <name> | DM |
| Product / customer comms | <name> | <name> | #customer-comms |
| SHKeeper provider contact | <email> | — | email |
| SMTP provider | <email> | — | email |

---

## 7. After every incident

- [ ] Updated this page with any new gotchas?
- [ ] Updated [[Monitoring]] with new metrics/alerts to add?
- [ ] Updated [[Backup & Recovery]] if backup gaps were exposed?
- [ ] Action items tracked?
- [ ] Customer comms sent (if user-impacting)?
- [ ] Post-mortem published?

Cross-links: [[Deployment]] for rollback steps, [[Database Operations]] for DB diagnostics, [[Backup & Recovery]] for restore procedures, [[Monitoring]] for metrics to watch.