Incident Response
Runbooks for the most likely production incidents, plus communication templates and a post-mortem template. Use this page during an active incident — keep Monitoring, Database Operations, and Backup & Recovery open in adjacent tabs.
1. Severity matrix
| Sev | Meaning | Response time | Examples |
|---|---|---|---|
| Sev 1 | Site fully down or unable to process payments | 15 min | Backend container in crashloop; Mongo unreachable; SHKeeper API persistently failing |
| Sev 2 | Major feature broken for a large share of users | 1 hour | Email sending broken; Redis disk full; chat undelivered |
| Sev 3 | Minor / cosmetic issue, isolated user reports | next business day | Single failed webhook; one user can't upload PDF |
| Sev 4 | No user impact, hygiene item | backlog | Backup older than 24h; disk > 80%; missed deploy |
Escalate one severity level higher if more than 10 reports arrive within 5 minutes.
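The escalation rule can be sketched as a small shell helper. The function name `effective_sev` and its arguments are hypothetical; the threshold of 10 reports is the one stated above.

```bash
# Hypothetical helper: bump the severity one level when reports spike.
# Usage: effective_sev <current_sev 1-4> <reports_in_last_5_min>
effective_sev() (
  sev=$1
  reports=$2
  # More than 10 reports inside 5 minutes escalates by one level.
  # Sev 1 is already the highest, so it stays at 1.
  if [ "$reports" -gt 10 ] && [ "$sev" -gt 1 ]; then
    sev=$((sev - 1))
  fi
  echo "$sev"
)

# Example: a Sev 3 with 11 reports in the window is handled as Sev 2.
# effective_sev 3 11
```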
2. First 5 minutes — always do this
- Acknowledge. Reply in the on-call channel that you are taking it.
- Open Sentry. Filter to the last 15 minutes for new issue spikes.
- Open the host shell. `ssh prodready`.
- Health endpoint. `curl -fsS https://amn.gg/api/health` → does it respond?
- Container status. `docker ps --format "table {{.Names}}\t{{.Status}}"`.
- Recent deploy? Was the `:latest` tag bumped in the last 30 min? If yes, roll back first (see Deployment#roll-back) and investigate after stability is restored.
If you can't form a hypothesis in 5 minutes, roll back to the previous image tag anyway. Stability before forensics.
3. Common incidents
3.1 Backend down (crashloop, no response on /health)
Symptoms. https://amn.gg/api/health times out or 5xx; nickapp-backend shows Restarting in docker ps.
Runbook.
# 1. Inspect last lines
docker logs --tail=200 nickapp-backend
# 2. Common causes:
# - Missing env var (`process.env.X!` throws on first read)
# - MongoDB unreachable (see 3.2)
# - Port conflict
# - Out of memory (look for OOMKilled)
docker inspect nickapp-backend | jq '.[0].State'
# 3. If OOM: increase memory limit in compose, restart
# If missing env: add to /opt/backend/.env, then `docker compose up -d`
# 4. If recent deploy: roll back
sed -i 's|:latest|:<previous-version>|' docker-compose.production.yml
docker compose -f docker-compose.production.yml up -d nickapp-backend
# Pause Watchtower so it doesn't re-pull (this pauses auto-updates for every container it manages)
docker stop watchtower
Communication. Post in #incidents using the template in §4.
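A minimal sketch of turning the `docker inspect` state into the next runbook step. The template fields (`.State.Status`, `.State.OOMKilled`, `.State.ExitCode`, `.RestartCount`) are standard `docker inspect` fields; the helper names and the restart threshold of 3 are assumptions. Exit code 137 usually means the process was killed with SIGKILL, which is often but not always the OOM killer.

```bash
# Print "<status> <oomkilled> <exit_code> <restart_count>" for a container.
backend_state() {
  docker inspect \
    --format '{{.State.Status}} {{.State.OOMKilled}} {{.State.ExitCode}} {{.RestartCount}}' \
    "${1:-nickapp-backend}"
}

# Hypothetical helper: map that summary line to the next runbook step.
next_step() (
  set -- $1   # split the summary line into positional fields
  status=$1 oom=$2 code=$3 restarts=$4
  if [ "$oom" = "true" ] || [ "$code" = "137" ]; then
    echo "oom: raise the memory limit in compose, then restart"
  elif [ "$status" = "restarting" ] || [ "$restarts" -gt 3 ]; then
    echo "crashloop: read the logs for a missing env var, port conflict, or Mongo error"
  elif [ "$status" = "running" ]; then
    echo "running: container is up, check /health and upstream dependencies"
  else
    echo "stopped: start it again with docker compose up -d"
  fi
)

# Usage on the host:
# next_step "$(backend_state nickapp-backend)"
```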
3.2 MongoDB unreachable
Symptoms. Backend logs show MongoNetworkError, MongooseServerSelectionError, or Could not connect to server.
Runbook.
# 1. Container alive?
docker ps -a --filter "name=mongodb"
# 2. If exited:
docker logs --tail=200 nickapp-mongodb
# Common: corrupt journal, disk full, OOM
# 3. Disk check
df -h /var/lib/docker
# 4. If disk full:
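# - reclaim space: docker system prune (removes stopped containers, unused networks, dangling images, and build cache; it does not truncate logs of running containers)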
# - rotate logs if needed
# - extend volume
# 5. Restart
docker compose -f docker-compose.production.yml up -d mongodb
# 6. Verify
docker exec nickapp-mongodb mongosh --eval "db.adminCommand('ping')"
# 7. If data is corrupt, restore from latest dump — see Backup & Recovery
[!warning] If Mongo is corrupted and you must restore, stop the backend container first to prevent partial writes during restore. See Database Operations#restore.
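The disk check in step 3 can be scripted. This is a sketch that assumes POSIX `df -P` output, where the fifth column is the use percentage; the function names are hypothetical and the default threshold of 80 mirrors the Sev 4 example in the severity matrix.

```bash
# Read `df -P <path>` output on stdin and print the use percentage as a bare number.
df_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Exit 0 when usage of <path> is above <threshold> percent (defaults: /var/lib/docker, 80).
disk_over_threshold() (
  pct=$(df -P "${1:-/var/lib/docker}" | df_use_pct)
  [ "$pct" -gt "${2:-80}" ]
)

# Usage on the host:
# disk_over_threshold /var/lib/docker 80 && echo "disk above 80%, free space before restarting Mongo"
```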
3.3 Redis unreachable
Symptoms. Logs show ECONNREFUSED redis:6379 or NOAUTH Authentication required. Rate limits stop working, refresh tokens can't be revoked, but most read flows still work.
Runbook.
# 1. Container alive?
docker ps -a --filter "name=redis"
# 2. If down:
docker logs --tail=200 nickapp-redis
docker compose -f docker-compose.production.yml up -d redis
# 3. Auth issue?
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" PING
# Should return PONG
# 4. If `$REDIS_PASSWORD` mismatch between .env and command:
nano /opt/backend/.env # confirm REDIS_PASSWORD set
docker compose -f docker-compose.production.yml up -d redis backend
# 5. If memory full + noeviction policy → rejecting writes:
docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" CONFIG SET maxmemory-policy allkeys-lru
The app gracefully degrades when Redis is unreachable for short windows — don't panic, but fix within an hour.
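Before changing the eviction policy in step 5, it helps to confirm that memory pressure is the actual cause. The sketch below parses `redis-cli INFO memory` output, which reports `used_memory`, `maxmemory`, and `maxmemory_policy`; the function name is hypothetical.

```bash
# Read `INFO memory` output on stdin and print "<policy> <percent used>",
# or "<policy> unlimited" when maxmemory is 0 (no limit configured).
redis_memory_summary() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory"      { used = $2 }
    $1 == "maxmemory"        { max = $2 }
    $1 == "maxmemory_policy" { policy = $2 }
    END {
      if (max > 0) printf "%s %d\n", policy, used * 100 / max
      else         printf "%s unlimited\n", policy
    }'
}

# Usage on the host:
# docker exec nickapp-redis redis-cli -a "$REDIS_PASSWORD" INFO memory | redis_memory_summary
```

A result such as `noeviction 99` matches the "rejecting writes" case. Note that `CONFIG SET` changes the running instance only; the setting is lost on restart unless it is also written to the Redis configuration.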
3.4 SHKeeper API down (payments blocked)
Symptoms. Backend logs show repeated SHKeeper request failed: ECONNREFUSED or non-2xx responses from $SHKEEPER_API_URL. Buyers see "Payment unavailable" in checkout. Sev 1 — money is involved.
Runbook.
# 1. Confirm SHKeeper itself is reachable
curl -fsS -H "X-Shkeeper-Api-Key: $SHKEEPER_API_KEY" \
"$SHKEEPER_API_URL/api/v1/healthcheck"
# 2. If 5xx from SHKeeper → it's their side
# - Check their status page / contact provider
# - Toggle a banner in the frontend warning buyers
# - Consider switching SHKEEPER_FORCE_PAYOUT_DEMO=true so QA still works
# (do NOT do this for real customer money)
# 3. If our network can't reach it:
# - compare curl from the host with curl from inside the container
#   (sh -c makes the container expand its own $SHKEEPER_API_URL, not the host's)
docker exec nickapp-backend sh -c 'curl -v "$SHKEEPER_API_URL"'
# - DNS / firewall changes?
# 4. While blocked, monitor stuck payments
docker exec nickapp-mongodb mongosh marketplace --quiet --eval \
"db.payments.countDocuments({status:'pending', createdAt:{\$lt: new Date(Date.now() - 30*60*1000)}})"
# 5. Once SHKeeper is back, the app retries automatically. Verify the
# backlog drains. If a payment is stuck > 24h, manually verify against
# SHKeeper and use fix-transaction-hashes.js if needed.
Always communicate. Even short payment outages erode trust — post a status update.
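The split between "their side" (step 2) and "our network" (step 3) can usually be read off curl's exit code. The codes below are documented curl exit codes (22 is only returned when `-f` is used); the helper name and the wording of the hints are assumptions.

```bash
# Hypothetical helper: translate a curl exit code into the matching runbook step.
explain_curl_rc() {
  case "$1" in
    0)  echo "ok: SHKeeper answered with a success status" ;;
    22) echo "http-error: SHKeeper answered with 4xx/5xx, check the API key, then treat 5xx as provider-side (step 2)" ;;
    6)  echo "dns: could not resolve the host, check DNS (step 3)" ;;
    7)  echo "connect: connection refused or unreachable, check firewall and routing (step 3)" ;;
    28) echo "timeout: no answer in time, check firewall and provider status" ;;
    *)  echo "other: see the EXIT CODES section of the curl man page" ;;
  esac
}

# Usage on the host:
# curl -fsS --max-time 10 -H "X-Shkeeper-Api-Key: $SHKEEPER_API_KEY" \
#   "$SHKEEPER_API_URL/api/v1/healthcheck" >/dev/null
# explain_curl_rc $?
```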
3.5 Email delivery failure
Symptoms. Logs show SMTPError from nodemailer. Password resets, welcome emails, dispute notifications fail. Sev 2.
Runbook.
# 1. Test SMTP credentials from the container
docker exec nickapp-backend node -e "
const nm = require('nodemailer');
nm.createTransport({
host: process.env.SMTP_HOST,
port: Number(process.env.SMTP_PORT),
secure: process.env.SMTP_SECURE === 'true',
auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
}).verify().then(console.log).catch(console.error);
"
# 2. If auth failed → password rotated by provider, update SMTP_PASS in .env
# 3. If connection timed out → provider rate-limit; switch provider/sender
# 4. If specific domains bounce → check SPF / DKIM / DMARC records for amn.gg
Users can still operate the app without email; queue critical emails for retry once SMTP is restored.
3.6 WebSocket disconnect storm
Symptoms. Backend logs flood with 🔌 User connected/disconnected cycling; clients spinning on chat / notification badges. Sev 2.
Runbook.
# 1. Confirm symptoms
docker logs --tail=500 nickapp-backend | grep -c "🔌"
# 2. Check Nginx access log for socket.io polling spam
tail -f /opt/backend/nginx/logs/access.log | grep socket.io
# 3. Common causes:
# - Nginx not configured for WebSocket upgrade (returns 502 → client falls back to polling → reconnect loop)
# - Client clock skew breaking JWT validation on every reconnect
# - Redis adapter mis-configured (if scaled horizontally — not the case today)
# 4. Quick mitigation: increase Nginx proxy_read_timeout
# Permanent: ensure nginx.conf has:
# proxy_http_version 1.1;
# proxy_set_header Upgrade $http_upgrade;
# proxy_set_header Connection "upgrade";
# 5. Restart nginx + backend
docker compose restart nginx nickapp-backend
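To check whether clients are stuck on polling, count Socket.IO requests per transport. This sketch relies on the `transport=polling` / `transport=websocket` query parameter that Socket.IO clients send; the function name is hypothetical and the access-log path is the one used above.

```bash
# Count Socket.IO requests per transport from access-log lines on stdin.
socketio_transports() {
  grep -F '/socket.io/' \
    | grep -oE 'transport=(polling|websocket)' \
    | sort | uniq -c | sort -rn
}

# Usage on the host:
# tail -n 5000 /opt/backend/nginx/logs/access.log | socketio_transports
```

Polling requests with few or no `transport=websocket` lines point toward the upgrade not getting through Nginx. Treat the ratio as a hint only, since a healthy WebSocket connection logs a single line for its whole lifetime.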
3.7 Suspicious activity / abuse
Symptoms. Sentry alerts on unusual error volume from one IP; rate-limit logs spiking; reports of brute-force on /api/auth/login.
Runbook.
# 1. Identify the offender
tail -n 10000 /opt/backend/nginx/logs/access.log \
| awk '{print $1}' | sort | uniq -c | sort -rn | head
# 2. Block at the edge (Cloudflare / host firewall)
# Or use `ufw deny from <ip>` on the host
# 3. Confirm rate limits in app
grep "RATE_LIMIT" /opt/backend/.env
# Defaults: 100 req / 15 min per IP. Tighten if abuse continues.
# 4. If the abuse targets a specific user account:
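docker exec -it nickapp-backend node -e "
// disable the user via mongoose
// (\$set is backslash-escaped in this command so the host shell does not expand it)
require('./dist/infrastructure/database/connection').connectDatabase()
.then(() => require('./dist/models').User.updateOne({email:'attacker@x.com'}, {\$set:{disabled:true}}))
.then(console.log)
.catch(console.error)
.finally(() => process.exit())
"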
# 5. Preserve evidence: copy access logs to /var/incidents/<date>/
If user data may have leaked, treat as sev 1 and follow your data-breach disclosure process.
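Step 1 can be narrowed to only the IPs above a request count. A sketch, assuming the client IP is the first field of each log line (as in Nginx's default `combined` format); the function name and threshold argument are hypothetical.

```bash
# Print "<count> <ip>" for every client IP with more than <threshold> requests
# in the log lines read from stdin (default threshold: 100).
noisy_ips() {
  awk -v min="${1:-100}" '
    { hits[$1]++ }
    END { for (ip in hits) if (hits[ip] > min) print hits[ip], ip }
  ' | sort -rn
}

# Usage on the host:
# tail -n 10000 /opt/backend/nginx/logs/access.log | noisy_ips 500
```

If the site sits behind Cloudflare and Nginx is not configured to restore the real client address, the first field may be a Cloudflare edge IP; check that before blocking anything.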
3.8 Request Network rollback + reconciliation
Use when Request Network payments are failing, stalled, or out of sync with local payment state.
First triage:
- Check whether RN reached nginx: `grep '/api/payment/request-network/webhook' /opt/backend/nginx/logs/access.log | tail -50`
- If RN deliveries returned `404`, treat it as a backend correlation/config bug. Do not run another paid probe until the correlation fix is deployed and smoke-tested.
- If deliveries returned `202` or `200` but the payment is still pending, inspect `metadata.transactionSafety` on the `Payment` document. A safety-pending payment is captured but not credited; look for missing tx hash, insufficient confirmations, transfer mismatch, or AML provider blockers.
- If Cloudflare Worker durable ingress is enabled, replay from the Worker delivery id/time window after backend repair instead of asking the buyer to pay again.
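The 404 versus 200/202 split can be read directly from the access log. This sketch assumes Nginx's default `combined` log format, where the HTTP status is the ninth whitespace-separated field; adjust the field number if the log format is customized. The function name is hypothetical.

```bash
# Tally HTTP status codes for RN webhook deliveries from access-log lines on stdin.
# Output: one "<status> <count>" line per status, sorted by status.
rn_webhook_statuses() {
  grep -F '/api/payment/request-network/webhook' \
    | awk '{ count[$9]++ } END { for (s in count) print s, count[s] }' \
    | sort
}

# Usage on the host:
# tail -n 20000 /opt/backend/nginx/logs/access.log | rn_webhook_statuses
```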
Immediate rollback (minutes):
- Stop routing new intents to Request Network by setting:
  - `REQUEST_NETWORK_ENABLED=false`
  - `PAYMENT_ENABLED_PROVIDERS=shkeeper`
  - keep `PAYMENT_ROLLBACK_PROVIDER=shkeeper`
- Restart backend and confirm new `/api/payment/request-network/*` checks are no longer in your checkout path.
- Confirm `PAYMENT_PROVIDER_MODE` is in a safe operational mode:
  - `live`: standard operations
  - `read-only`: observe only, no writes
  - `dry-run`: status updates without on-chain actions
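Before restarting the backend, the rollback values can be verified against the env file. A minimal sketch, assuming the variables live in `/opt/backend/.env` as plain unquoted `KEY=value` lines; the function name is hypothetical.

```bash
# Check that an env file carries the three rollback values.
# Prints one line per key and returns non-zero if any expected line is missing.
check_rn_rollback() (
  env_file=${1:-/opt/backend/.env}
  rc=0
  for want in \
    REQUEST_NETWORK_ENABLED=false \
    PAYMENT_ENABLED_PROVIDERS=shkeeper \
    PAYMENT_ROLLBACK_PROVIDER=shkeeper
  do
    # -x matches the whole line, -F treats the pattern as a literal string
    if grep -qxF "$want" "$env_file"; then
      echo "ok   $want"
    else
      echo "FAIL $want"
      rc=1
    fi
  done
  exit "$rc"
)

# Usage on the host:
# check_rn_rollback /opt/backend/.env && docker compose -f docker-compose.production.yml up -d backend
```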
Reconciliation before re-enabling:
- Keep `PAYMENT_RECONCILIATION_ENABLED=false` until investigation is complete.
- Run a dry reconciliation pass (dry-run) using the Request Network reconciliation service and capture summary counters.
- If summary is healthy, run with `apply=true` for the intended payment window.
- Re-enable RN intentionally only after two deployment health checks pass.
Escalate if repeated lookup_failed, missing_reference, or coordinator-blocked outcomes block reconciliation for more than 10 minutes.
4. Communication templates
Initial incident notification
🚨 [SEV-X] <one-line summary>
Started: <time UTC>
Impact: <which users / features>
Status: investigating
On-call: <@you>
Updates: every 15 minutes in this thread
Mid-incident update
[SEV-X] Update <n>
Time: <UTC>
Status: <investigating / mitigating / monitoring>
What we know: <facts>
What we're trying: <action>
Next update: <time>
Resolution
✅ [SEV-X] Resolved
Started: <UTC>
Ended: <UTC>
Duration: <minutes>
Impact: <users / features / requests affected>
Root cause: <one sentence>
Permanent fix: <PR / ticket>
Postmortem: <doc link, by <date>>
Customer-facing status
We're investigating an issue affecting <feature> that started at <time>.
We'll post an update by <time + 15 min>.
Avoid speculation in customer-facing copy. Say "investigating", "applying fix", "monitoring", "resolved" — and nothing else until you actually know.
5. Post-mortem template
Use within 5 business days of any sev 1 or sev 2.
---
title: Post-mortem — <short title>
date: <YYYY-MM-DD>
severity: SEV-X
duration: <minutes>
authors: [<names>]
tags: [postmortem]
---
## Summary
One paragraph: what broke, who was affected, how long, how it was fixed.
## Timeline (UTC)
- HH:MM — first signal (alert, user report)
- HH:MM — on-call ack
- HH:MM — hypothesis: ...
- HH:MM — mitigation deployed
- HH:MM — verified resolved
- HH:MM — incident closed
## Impact
- Users affected: <count or %>
- Features affected: <list>
- Money affected: <if payments>
- Data loss: <yes/no — describe>
## Root cause
Honest, blameless. Distinguish trigger vs underlying cause.
## What went well
- ...
## What went poorly
- ...
## Where we got lucky
- ...
## Action items
| # | Item | Owner | Due | Ticket |
|---|------|-------|-----|--------|
| 1 | Add /health probe for MongoDB | @x | 2026-06-01 | OPS-123 |
| 2 | Tighten rate limit on /auth/login | @y | 2026-05-30 | OPS-124 |
## Detection improvements
What new alert / dashboard would have caught this earlier?
## Process improvements
What runbook / docs need updating? Update [[Incident Response]] right now.
Store postmortems alongside this vault — suggested path /Users/mojtabaheidari/code/docs/08 - Operations/postmortems/YYYY-MM-DD-<slug>.md.
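A small sketch for producing file names in the `YYYY-MM-DD-<slug>.md` shape. The function name is hypothetical, and the slug rule (lowercase, runs of non-alphanumerics collapsed to a single hyphen) is an assumption rather than an established convention of this vault.

```bash
# Build "<date>-<slug>.md" from a date and a free-text title.
postmortem_filename() (
  date_part=$1
  slug=$(printf '%s' "$2" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//')
  echo "${date_part}-${slug}.md"
)

# Usage:
# postmortem_filename "$(date -u +%Y-%m-%d)" "Mongo disk full"
```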
6. Escalation contacts
(Fill in for your team; placeholder structure below.)
| Role | Primary | Backup | Channel |
|---|---|---|---|
| On-call engineer | | | #incidents |
| Payments lead | | | DM |
| Infrastructure | | | DM |
| Product / customer comms | | | #customer-comms |
| SHKeeper provider contact | | | — |
| SMTP provider | | | — |
7. After every incident
- Updated this page with any new gotchas?
- Updated Monitoring with new metrics/alerts to add?
- Updated Backup & Recovery if backup gaps were exposed?
- Action items tracked?
- Customer comms sent (if user-impacting)?
- Post-mortem published?
Cross-links: Deployment for rollback steps, Database Operations for DB diagnostics, Backup & Recovery for restore procedures, Monitoring for metrics to watch.