---
title: Concurrency and Performance Profile
tags: [testing, performance, concurrency, profiling, e2e]
created: 2026-06-06
---
# Concurrency and Performance Profile
This procedure defines the ramp test for simultaneous escrow E2E flows and the
report format for performance characteristics.
The purpose is not only load generation. The test must also prove that business
behavior remains correct under concurrency: payments confirm exactly once,
notifications reach the right users, and no request, offer, or payment state
leaks across parallel workers.
## Test Shape
One worker is one complete isolated E2E flow:
```text
buyer + sellers -> request -> bids -> accept -> payment intent -> tUSDT payment
-> scanner confirmation -> seller delivery -> buyer confirmation
```
Each worker must use unique:
- run id suffix;
- buyer and seller users;
- purchase request;
- selected offer;
- payment id;
- scanner destination/baseline;
- tx hash or simulated payment fixture, depending on mode.
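
One way to keep the per-worker data listed above from colliding is to derive every identifier from the stage and the worker index. The sketch below assumes the `C8-W03` and `20260606-perf-C8-W03` naming used in the worker result schema; the `WorkerIdentity` type, the `make_worker_identity` helper, and the `example.test` email scheme are hypothetical and are not part of the implemented runner.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Tuple


@dataclass(frozen=True)
class WorkerIdentity:
    worker_id: str                 # e.g. "C8-W03"
    run_id: str                    # e.g. "20260606-perf-C8-W03"
    buyer_email: str               # hypothetical naming scheme
    seller_emails: Tuple[str, ...]  # hypothetical naming scheme


def make_worker_identity(stage: int, index: int, run_date: date,
                         sellers: int = 3) -> WorkerIdentity:
    """Derive all per-worker names from the stage and worker index."""
    worker_id = f"C{stage}-W{index:02d}"
    run_id = f"{run_date:%Y%m%d}-perf-{worker_id}"
    buyer = f"buyer+{run_id}@example.test"
    seller_emails = tuple(
        f"seller{n}+{run_id}@example.test" for n in range(1, sellers + 1)
    )
    return WorkerIdentity(worker_id, run_id, buyer, seller_emails)


def assert_no_overlap(identities: Iterable[WorkerIdentity]) -> None:
    """Raise if two workers would share a run id or a user."""
    seen = set()
    for ident in identities:
        for key in (ident.run_id, ident.buyer_email, *ident.seller_emails):
            if key in seen:
                raise ValueError(f"duplicate test identity: {key}")
            seen.add(key)
```

Running `assert_no_overlap` over all identities of a stage before the workers start catches setup mistakes early, so they are not misread later as cross-worker state leaks.
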
Notification assertions are mandatory inside every worker. See
[[Notification Assertion Procedure]].
Implemented runner:
```bash
cd ~/CascadeProjects/escrow/backend
BASE_URL=https://dev.amn.gg \
PAYMENT_MODE=status \
CONCURRENCY_LEVELS=1,2,4,8,16,32 \
ROUNDS=1 \
bash scripts/smoke/marketplace-e2e-notifications.sh
```
Use `PAYMENT_MODE=live` for low-concurrency BSC Testnet tUSDT confirmation.
Use `PAYMENT_MODE=status` for high-concurrency marketplace/notification
profiling without consuming gas.
## Ramp Plan
Start with one simultaneous worker and double the worker count at each stage until a stop condition is reached:
| Stage | Simultaneous workers | Purpose |
|---|---:|---|
| C1 | `1` | Baseline correctness and latency. |
| C2 | `2` | Detect simple race conditions. |
| C4 | `4` | Validate small parallel seller/payment load. |
| C8 | `8` | First meaningful contention check. |
| C16 | `16` | Stress DB/API/socket fanout. |
| C32 | `32` | Upper dev-stack target before release planning. |
| C64+ | `64+` | Only if C32 passes and infrastructure headroom is clear. |
Hold each stage long enough to complete at least one full E2E round per worker.
For API-only profiling, also support a fixed-duration mode such as 5 minutes per
stage.
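
The doubling ramp can be sketched as a small loop that stops at the first failed stage. This is a minimal illustration, not the implemented runner: `ramp_stages`, `run_ramp`, and the `run_stage` callback are hypothetical names, and the callback is assumed to return `True` only when no stop condition was hit.

```python
from typing import Callable, Iterator, List


def ramp_stages(start: int = 1, ceiling: int = 32) -> Iterator[int]:
    """Yield worker counts 1, 2, 4, ... up to and including the ceiling."""
    workers = start
    while workers <= ceiling:
        yield workers
        workers *= 2


def run_ramp(run_stage: Callable[[int], bool], ceiling: int = 32) -> List[int]:
    """Run stages in order and return the worker counts that passed.

    The ramp ends at the first stage whose callback returns False.
    """
    completed: List[int] = []
    for workers in ramp_stages(ceiling=ceiling):
        if not run_stage(workers):
            break
        completed.append(workers)
    return completed
```

Raising `ceiling` to 64 or more corresponds to the optional C64+ stage and should only happen after C32 has passed.
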
## Modes
| Mode | Payment behavior | Use |
|---|---|---|
| Live-chain mode (`PAYMENT_MODE=live`) | Real BSC Testnet tUSDT transfers | Final confidence at low concurrency; slower and more expensive because it consumes gas. |
| Status-only smoke mode (`PAYMENT_MODE=status`) | Moves the accepted request to `payment` through the status route | Implemented. High-concurrency marketplace/notification profiling without chain variables. |
| Scanner fixture mode | Deterministic scanner/balance fixture or controlled test endpoint | High concurrency without chain bottleneck. Must not be enabled in production. |
| API-only dry run | Runs request/offer/delivery and skips payment finalization | Marketplace/notification profiling without chain variables. |
Live-chain mode should usually stop at low concurrency unless there is enough
tBNB/tUSDT and the chain/RPC is reliable. Higher stages should use status-only
mode for now and scanner fixture mode once it is implemented.
## Metrics To Collect
### Business correctness
| Metric | Target |
|---|---|
| completed worker success rate | `100%` for C1-C8, `>= 99%` for C16+ after retries are classified |
| duplicate payment credit count | `0` |
| wrong-recipient notification count | `0` |
| cross-worker state leak count | `0` |
| successful delivery confirmations by non-buyers | `0` |
| ledger inconsistency count | `0` |
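
The targets in the table above can be checked mechanically per stage. The sketch below is an assumption-laden illustration: `required_success_rate` and `correctness_failures` are hypothetical helpers, and `passed` is assumed to be counted after retries have already been classified.

```python
from typing import Dict, List


def required_success_rate(workers: int) -> float:
    """100% for C1-C8, at least 99% for C16 and above."""
    return 1.0 if workers <= 8 else 0.99


def correctness_failures(workers: int, passed: int, total: int,
                         zero_counters: Dict[str, int]) -> List[str]:
    """Return a list of violated business-correctness targets."""
    if total <= 0:
        return ["no worker results recorded"]
    failures: List[str] = []
    target = required_success_rate(workers)
    rate = passed / total
    if rate < target:
        failures.append(f"success rate {rate:.2%} below target {target:.0%}")
    # Every counter in this mapping has a target of exactly 0.
    for name, count in zero_counters.items():
        if count != 0:
            failures.append(f"{name} = {count}, expected 0")
    return failures
```

With a single round, one failed worker at C16 gives 15/16 = 93.75% and at C32 gives 31/32 = 96.875%, both below 99%. The 99% allowance therefore only has an effect once results from several rounds are aggregated.
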
### API latency
Initial performance goals for dev profiling:
| Operation | p50 goal | p95 goal | p99 watch |
|---|---:|---:|---:|
| login | `< 300 ms` | `< 1 s` | `< 2 s` |
| create request | `< 400 ms` | `< 1.5 s` | `< 3 s` |
| create offer | `< 400 ms` | `< 1.5 s` | `< 3 s` |
| accept offer | `< 500 ms` | `< 2 s` | `< 4 s` |
| create payment intent | `< 750 ms` | `< 3 s` | `< 6 s` |
| scanner balance check | `< 1 s` | `< 5 s` | `< 10 s` |
| seller delivery | `< 500 ms` | `< 2 s` | `< 4 s` |
| buyer delivery confirmation | `< 500 ms` | `< 2 s` | `< 4 s` |
| notification visibility | `< 1 s` | `< 5 s` | `< 10 s` |
These are starting goals, not final SLOs. The first complete C1-C32 run should
produce a baseline report and then adjust targets with evidence.
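
To compare a stage against these goals, per-operation samples need to be reduced to percentiles. The sketch below uses the nearest-rank method, which is one of several common percentile definitions and is an assumption here. The operation keys in `GOALS_MS` are hypothetical names loosely modeled on the `timingsMs` fields; the millisecond values are copied from the table above.

```python
import math
from typing import Dict, List, Sequence

# (p50 goal, p95 goal, p99 watch) in milliseconds.
GOALS_MS = {
    "login": (300, 1000, 2000),
    "createRequest": (400, 1500, 3000),
    "createOffer": (400, 1500, 3000),
    "acceptOffer": (500, 2000, 4000),
    "createPaymentIntent": (750, 3000, 6000),
    "scannerBalanceCheck": (1000, 5000, 10000),
    "sellerDelivery": (500, 2000, 4000),
    "buyerConfirmDelivery": (500, 2000, 4000),
    "notificationVisibility": (1000, 5000, 10000),
}


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def goal_breaches(timings: Dict[str, Sequence[float]]) -> List[str]:
    """List every operation/percentile pair that misses its goal."""
    breaches: List[str] = []
    for op, samples in timings.items():
        if op not in GOALS_MS or not samples:
            continue
        for pct, goal in zip((50, 95, 99), GOALS_MS[op]):
            observed = percentile(samples, pct)
            if observed >= goal:  # goals are strict "less than"
                breaches.append(f"{op} p{pct} = {observed} ms (goal < {goal} ms)")
    return breaches
```

At low stages the sample count is small, so p95 and p99 collapse to the slowest sample. Treat percentiles from C1-C4 as indicative only.
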
### Infrastructure
Collect per stage:
- backend CPU and memory;
- frontend CPU and memory;
- scanner CPU and memory;
- MongoDB CPU, memory, connections, slow queries;
- Postgres CPU, memory, connections, locks;
- Redis CPU, memory, connected clients;
- container restarts;
- Docker image/version;
- BSC Testnet RPC latency/error rate;
- Socket.IO connected clients and emitted event count;
- notification insert count and error count.
Suggested host commands:
```bash
docker stats --no-stream
docker ps --format '{{.Names}}\t{{.Image}}\t{{.Status}}'
docker logs --since 5m escrow-backend
docker logs --since 5m escrow-scanner
```
Do not paste secrets from environment output into reports.
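
If environment dumps or log excerpts must be attached to a report, a redaction pass can reduce the risk of leaking credentials. The helper below is a hypothetical sketch: the keyword list is an assumption, it only covers `KEY=value` lines, and it is not a substitute for reviewing the output by hand.

```python
import re

# Matches KEY=value lines whose key contains a sensitive-looking keyword.
# The keyword list is an assumption and should be extended for the real stack.
_SENSITIVE_LINE = re.compile(
    r"^([ \t]*(?:export[ \t]+)?[A-Za-z0-9_]*"
    r"(?:SECRET|TOKEN|PASSWORD|PRIVATE_KEY|API_KEY|MNEMONIC)"
    r"[A-Za-z0-9_]*[ \t]*=[ \t]*).*$",
    re.IGNORECASE | re.MULTILINE,
)


def redact_env(text: str) -> str:
    """Replace the value of every sensitive-looking KEY=value line."""
    return _SENSITIVE_LINE.sub(r"\g<1><redacted>", text)
```
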
## Stop Conditions
Stop the ramp immediately if any P0 condition appears:
- payment marked paid without correct chain/token/destination/amount evidence;
- duplicate ledger credit;
- notification delivered to wrong user;
- expected notification missing for a step without approved known-gap classification;
- backend, scanner, Mongo, Postgres, or Redis container restarts;
- sustained HTTP 5xx rate above `1%`;
- p95 create payment intent exceeds `10 s` for two consecutive stages;
- scanner confirmation/check p95 exceeds `30 s` outside known BSC Testnet RPC issues;
- queue/backlog grows without draining after the stage ends;
- host CPU remains above `85%` or memory above `90%` after cooldown.
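
The numeric stop conditions above lend themselves to an automated check between stages. The sketch below covers only the measurable thresholds; the payment, ledger, and notification conditions still need the correctness assertions inside each worker. `StageMetrics` and `stop_reasons` are hypothetical names, and treating the stage-wide 5xx rate as "sustained" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageMetrics:
    http_5xx_rate: float          # fraction of responses, 0.01 == 1%
    p95_payment_intent_ms: float
    p95_scanner_ms: float
    known_rpc_issue: bool         # True if a BSC Testnet RPC incident explains slow scans
    container_restarts: int
    cooldown_cpu_pct: float       # host CPU after cooldown
    cooldown_mem_pct: float       # host memory after cooldown


def stop_reasons(current: StageMetrics,
                 previous: Optional[StageMetrics] = None) -> List[str]:
    """Return the measurable stop conditions triggered by this stage."""
    reasons: List[str] = []
    if current.container_restarts > 0:
        reasons.append("container restart")
    if current.http_5xx_rate > 0.01:
        reasons.append("HTTP 5xx rate above 1%")
    # The payment-intent rule needs two consecutive stages over the limit.
    if (previous is not None
            and previous.p95_payment_intent_ms > 10_000
            and current.p95_payment_intent_ms > 10_000):
        reasons.append("p95 create payment intent above 10 s for two stages")
    if current.p95_scanner_ms > 30_000 and not current.known_rpc_issue:
        reasons.append("p95 scanner confirmation above 30 s")
    if current.cooldown_cpu_pct > 85 or current.cooldown_mem_pct > 90:
        reasons.append("host CPU or memory high after cooldown")
    return reasons
```

An empty list means none of these thresholds was crossed; it does not mean the stage passed overall.
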
## Stage Procedure
For each stage:
1. Verify dev stack health.
2. Capture container stats baseline.
3. Create isolated worker test data.
4. Start all workers at a barrier time.
5. For every worker, execute full E2E and notification assertions.
6. Capture per-operation timings.
7. Capture infrastructure metrics during run.
8. Wait for queues/notifications to settle.
9. Capture cooldown metrics.
10. Classify failures:
    - product bug;
    - test data/setup bug;
    - BSC Testnet/RPC external issue;
    - infrastructure capacity issue;
    - known product gap.
11. Decide whether to proceed to the next stage.
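
Step 4, starting all workers at a barrier, can be sketched with `threading.Barrier` from the Python standard library. This is an illustration under assumptions, not the implemented runner (which is a shell script): `run_stage_at_barrier` is a hypothetical name, and `flow` stands in for the full E2E flow plus notification assertions of one worker.

```python
import threading
import time
from typing import Callable, Dict, List


def run_stage_at_barrier(worker_count: int,
                         flow: Callable[[int], Dict]) -> List[Dict]:
    """Release all workers at the same moment and collect one result each."""
    barrier = threading.Barrier(worker_count)
    results: List[Dict] = [{} for _ in range(worker_count)]

    def worker(index: int) -> None:
        barrier.wait()  # blocks until every worker thread has arrived
        started = time.monotonic()
        try:
            outcome = flow(index) or {}
            status = "pass"
        except Exception as exc:  # record the failure instead of losing it
            outcome = {"errors": [str(exc)]}
            status = "fail"
        total_ms = round((time.monotonic() - started) * 1000)
        results[index] = {"status": status, "totalMs": total_ms, **outcome}

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Each worker writes only to its own slot in `results`, so no lock is needed for result collection. Threads are sufficient here because the flow is I/O-bound HTTP traffic.
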
## Worker Result Schema
Each worker should produce a JSON result:
```json
{
  "workerId": "C8-W03",
  "stage": 8,
  "runId": "20260606-perf-C8-W03",
  "status": "pass",
  "buyerUserId": "<id>",
  "sellerUserIds": ["<id>", "<id>", "<id>"],
  "purchaseRequestId": "<uuid>",
  "selectedOfferId": "<uuid>",
  "paymentId": "<uuid>",
  "txHash": "0x...",
  "timingsMs": {
    "login": 180,
    "createRequest": 420,
    "createOffers": 910,
    "acceptOffer": 330,
    "createPaymentIntent": 850,
    "scannerConfirm": 4200,
    "sellerDelivery": 380,
    "buyerConfirmDelivery": 410,
    "total": 12100
  },
  "notifications": [
    {
      "step": "seller_offer_created",
      "recipient": "buyer",
      "observed": true,
      "latencyMs": 640
    }
  ],
  "errors": []
}
```
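
Before aggregating, each worker result can be checked against the schema above. The validator below is a minimal sketch with stated assumptions: `txHash` is treated as optional because status-only and fixture runs have no transaction, `"fail"` is assumed to be the only other `status` value, and a `pass` result is assumed to carry no errors. None of these rules is confirmed by the runner.

```python
from typing import Any, Dict, List

# Field names and types taken from the example result; txHash is optional.
REQUIRED_FIELDS = {
    "workerId": str,
    "stage": int,
    "runId": str,
    "status": str,
    "buyerUserId": str,
    "sellerUserIds": list,
    "purchaseRequestId": str,
    "selectedOfferId": str,
    "paymentId": str,
    "timingsMs": dict,
    "notifications": list,
    "errors": list,
}


def validate_worker_result(result: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the result is usable."""
    problems: List[str] = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in result:
            problems.append(f"missing field: {field}")
        elif not isinstance(result[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    if problems:
        return problems  # skip value checks when the shape is already wrong
    if result["status"] not in ("pass", "fail"):  # "fail" is an assumed value
        problems.append(f"unexpected status: {result['status']}")
    for name, value in result["timingsMs"].items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or value < 0:
            problems.append(f"timingsMs.{name} must be a non-negative number")
    if result["status"] == "pass" and result["errors"]:
        problems.append("pass result must not carry errors")
    return problems
```
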
## Report Template
Create one report per full ramp:
```markdown
# Performance Profile Report - <date>
## Summary
- Target:
- Backend/frontend/scanner versions:
- Commit SHAs:
- Payment mode: live-chain / status-only / scanner fixture / API-only
- Ramp stages completed:
- Overall result:
## Key Findings
| Finding | Severity | Evidence | Next action |
|---|---|---|---|
## Stage Results
| Stage | Workers | Pass | Fail | p95 total | p95 payment intent | p95 scanner | p95 notification | 5xx rate |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
## Notification Results
| Step | Expected | Observed | Missing | Wrong recipient | p95 latency |
|---|---:|---:|---:|---:|---:|
## Infrastructure
| Stage | Backend CPU/mem | Scanner CPU/mem | Mongo | Postgres | Redis | Restarts |
|---|---|---|---|---|---|---:|
## Payment Correctness
- Duplicate credits:
- Under/overpayment anomalies:
- Scanner mismatches:
- Ledger mismatches:
## Bottlenecks
- API:
- Database:
- Scanner:
- Socket/notifications:
- RPC/chain:
## Decisions
- Current safe dev concurrency:
- Recommended production target:
- Required fixes before next ramp:
```
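
A row of the Stage Results table can be generated from the worker results instead of being filled in by hand. The sketch below assumes the `timingsMs` and `notifications` fields from the worker result schema and a nearest-rank p95; `stage_row` is a hypothetical helper, and the 5xx rate is passed in separately because worker results do not contain it.

```python
import math
from typing import Dict, List, Optional


def p95(values: List[float]) -> Optional[float]:
    """Nearest-rank 95th percentile, or None when there are no samples."""
    if not values:
        return None
    ordered = sorted(values)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


def fmt_ms(value: Optional[float]) -> str:
    return "n/a" if value is None else f"{value} ms"


def stage_row(stage: int, results: List[Dict], http_5xx_rate: float) -> str:
    """Render one Markdown row for the Stage Results table."""
    passed = sum(1 for r in results if r.get("status") == "pass")

    def timings(key: str) -> List[float]:
        # Collect one timing per worker, skipping workers that never got that far.
        return [r["timingsMs"][key] for r in results
                if key in r.get("timingsMs", {})]

    notification_latencies = [
        n["latencyMs"]
        for r in results
        for n in r.get("notifications", [])
        if "latencyMs" in n
    ]
    cells = [
        f"C{stage}",
        str(len(results)),
        str(passed),
        str(len(results) - passed),
        fmt_ms(p95(timings("total"))),
        fmt_ms(p95(timings("createPaymentIntent"))),
        fmt_ms(p95(timings("scannerConfirm"))),
        fmt_ms(p95(notification_latencies)),
        f"{http_5xx_rate:.2%}",
    ]
    return "| " + " | ".join(cells) + " |"
```
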
## Initial Performance Characteristic Hypotheses
These are the expectations to validate:
- Request/offer APIs should scale mostly with Mongo/Postgres write throughput.
- Notification latency is likely to become a visible bottleneck before raw API latency if every offer/status change creates individual Mongo inserts and socket emits.
- Scanner live-chain checks are likely bounded by BSC Testnet RPC latency and should be separated from API-only profiling.
- Payment intent creation may become slower if destination derivation, token registry lookup, and scanner registration are serial.
- Socket fanout should be watched at C16+ because each worker has multiple actors and multiple tabs/devices may multiply room membership.