This commit fixes the most significant hot-path bottleneck in the
client: the tcp_client_rx_loop was scanning up to 256KB byte-by-byte
on every read() call looking for interleaved 12-byte status messages.
Changes:
- client.rs (tcp_client_rx_loop): Replace the O(n) for-loop scan
with a three-stage approach:
1. Split-message check: An 11-byte carry buffer stores trailing
bytes from the previous read. We check every possible alignment
where a status message (0x07 + cpu_byte) could span the carry
and the start of the current buffer. This fixes a latent bug
where the old code would miss status messages split across TCP
read boundaries.
2. Fast scan: memchr::memchr (AVX2/NEON SIMD) finds 0x07 bytes
in the 256KB buffer. On all-zero data packets this exits in
~4096 SIMD-width operations instead of 262,144 byte compares.
~64x faster scan path.
3. Carry save: Save up to 11 trailing bytes for the next read.
- client.rs (unit tests): Add scan_status_message() helper and
five unit tests covering:
- Status message fully within buffer
- Status message split across reads (5+7 bytes)
- Status message split at boundary (1+11 bytes)
- All-zero buffer (no false positive)
- Short buffer (no panic)
- Cargo.toml / Cargo.lock: Add memchr as an explicit dependency.
Verified against live MikroTik RouterOS (TCP both + receive modes
with EC-SRP5 auth). Status messages detected correctly. No wire
protocol changes — 100% MikroTik compatible.
# btest-rs Performance Audit

**Date:** 2026-04-30

**Scope:** Full codebase (`src/`, `tests/`, `Cargo.toml`)

**Methodology:** Static code analysis, hot-path tracing, lock/contention review, algorithmic complexity analysis

---
## Executive Summary

The codebase is generally well-structured for a network I/O tool, with good use of atomics in the per-packet hot path and zero-allocation protocol serialization. However, **three critical bottlenecks** significantly limit throughput and scalability:

1. **O(n) buffer scan on every TCP read** in the client RX loop (up to 256 KB scanned per `read()` call)
2. **Expensive EC curve reconstruction on every authentication** (heavy `BigUint` modular arithmetic)
3. **Single SQLite connection mutex** serializing all DB operations in `server_pro`

Additionally, there is **no benchmark or profiling infrastructure** in the project, making it impossible to measure improvements or catch regressions.

---
## Severity Legend

| Icon | Severity | Impact |
|------|----------|--------|
| 🔴 | **Critical** | Direct throughput/latency hit in hot path; fix immediately |
| 🟠 | **High** | Significant overhead under load; fix in next sprint |
| 🟡 | **Medium** | Noticeable at scale or under specific conditions |
| 🟢 | **Low** | Cosmetic / easy wins; batch with other work |

---
## 🔴 Critical Bottlenecks
### 1. O(n) Linear Buffer Scan in `tcp_client_rx_loop` (`src/client.rs:210-216`)

**Problem:** On every TCP `read()` call (up to 256 KB), the client performs a byte-by-byte scan looking for interleaved 12-byte status messages:

```rust
for i in 0..=(n - STATUS_MSG_SIZE) {
    if buf[i] == STATUS_MSG_TYPE && buf[i + 1] >= 0x80 {
        // ...
    }
}
```

Since data packets are all zeros and status bytes are extremely rare, this **almost always scans the entire 256 KB buffer** uselessly. At high bandwidth (many reads per second), this wastes substantial CPU and pollutes cache lines.

**Impact:** CPU-bound slowdown on the client RX side during bidirectional TCP tests. The compiler *may* auto-vectorize the simple loop, but it still processes ~256 KB per read.

**Fix Options (pick one):**

- **Best:** Use the `memchr` crate (SIMD-accelerated `memchr::memchr`) to find `0x07` bytes. On all-zero buffers this exits after a few SIMD-width checks.
- **Alternative:** Since status messages are injected at `write_all` boundaries and data is all zeros, maintain a small 12-byte sliding ring buffer across reads and process the stream with a tiny state machine instead of scanning the whole buffer.
- **Alternative:** Track read bytes modulo the expected packet size. Status messages are injected between full packets, so they appear at predictable offsets *if* the client knows the server's `effective_size`. This requires protocol coordination.
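
A minimal stdlib sketch of this fast path, using `Iterator::position` as a stand-in for `memchr::memchr` (`scan_status_messages` and the validation rule are illustrative; the real loop would also carry trailing bytes across reads):

```rust
const STATUS_MSG_TYPE: u8 = 0x07;
const STATUS_MSG_SIZE: usize = 12;

/// Return the start offsets of candidate status messages in `buf`.
/// `position` hunts for the rare 0x07 marker; swapping it for
/// `memchr::memchr(STATUS_MSG_TYPE, &buf[from..])` gives the SIMD path.
fn scan_status_messages(buf: &[u8]) -> Vec<usize> {
    let mut found = Vec::new();
    let mut from = 0;
    while let Some(off) = buf[from..].iter().position(|&b| b == STATUS_MSG_TYPE) {
        let i = from + off;
        // A valid status message fits in the buffer and has a CPU byte >= 0x80.
        if i + STATUS_MSG_SIZE <= buf.len() && buf[i + 1] >= 0x80 {
            found.push(i);
            from = i + STATUS_MSG_SIZE;
        } else {
            from = i + 1;
        }
    }
    found
}

fn main() {
    // All-zero data packet: no marker, so the scan exits almost immediately.
    assert!(scan_status_messages(&[0u8; 4096]).is_empty());

    // One status message embedded at offset 100.
    let mut buf = vec![0u8; 4096];
    buf[100] = STATUS_MSG_TYPE;
    buf[101] = 0x9b; // cpu byte >= 0x80
    assert_eq!(scan_status_messages(&buf), vec![100]);
}
```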
---
### 2. `WCurve::new()` Recomputes Generator on Every Auth (`src/ecsrp5.rs:363,499`)

**Problem:** Every EC-SRP5 authentication (client and server) calls `WCurve::new()`, which performs `lift_x(9)` → `prime_mod_sqrt()`, heavy `BigUint` modular arithmetic, to derive the curve generator point.

```rust
pub async fn client_authenticate<S: ...>(stream: &mut S, username: &str, password: &str) -> Result<()> {
    let w = WCurve::new(); // <-- expensive, same result every time
    // ...
}
```

The curve constants (`P`, `CURVE_ORDER`, `WEIERSTRASS_A`) are already cached as `LazyLock` statics, but the generator point is not.

**Impact:** Auth latency spikes, especially on the server under many concurrent connections. Each auth performs redundant `BigUint` allocations and modular square roots.

**Fix:** Cache `WCurve` (or at least the generator point) in a global `LazyLock`:

```rust
static WCURVE: std::sync::LazyLock<WCurve> = std::sync::LazyLock::new(WCurve::new);
```

Then use `&*WCURVE` in both `client_authenticate` and `server_authenticate`.
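
The effect of the cache can be sketched with a counter standing in for the expensive derivation (the struct body below is illustrative, not the real `WCurve`):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

// Counts how many times the "expensive" constructor actually runs.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

struct WCurve {
    generator_x: u64, // placeholder field for the derived generator point
}

impl WCurve {
    fn new() -> Self {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        // Placeholder for the lift_x(9) / prime_mod_sqrt() work.
        WCurve { generator_x: 9 }
    }
}

static WCURVE: LazyLock<WCurve> = LazyLock::new(WCurve::new);

fn main() {
    // Many "authentications" share the single cached curve.
    for _ in 0..1000 {
        let w = &*WCURVE;
        assert_eq!(w.generator_x, 9);
    }
    // The constructor ran exactly once.
    assert_eq!(INIT_COUNT.load(Ordering::SeqCst), 1);
}
```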
---
### 3. Single SQLite Mutex Serializes All DB Operations (`src/server_pro/user_db.rs:15-18`)

**Problem:** The entire `server_pro` database layer uses a single shared `Connection` behind a `std::sync::Mutex`:

```rust
pub struct UserDb {
    conn: Arc<Mutex<Connection>>,
}
```

While SQLite WAL mode is enabled (allowing readers to proceed during writes), **the Rust mutex still serializes all access to the connection object**. Under concurrent load with many tests starting/finishing, this becomes the primary bottleneck.

**Critical sub-issue:** `QuotaManager::remaining_budget()` (`src/server_pro/quota.rs:387`) performs **up to 15 separate SQLite queries** in sequence, locking the mutex 15+ times per pre-test check.

**Impact:** Connection setup/teardown latency increases linearly with concurrency. Quota checks and usage recording block each other.

**Fix Options:**

- **Connection pooling:** Use `r2d2_sqlite` or `deadpool-sqlite` to maintain a small pool of connections (SQLite handles this well in WAL mode).
- **Separate read/write paths:** Open a read-only connection for quota checks (`remaining_budget`) and a dedicated write connection for usage recording. SQLite WAL allows this.
- **Batch quota checks:** Cache quota results for a few seconds per user/IP to avoid redundant queries.
- **Channel-based writer:** Use a single dedicated DB writer task with an `mpsc` channel so only one task ever touches the connection, eliminating lock contention entirely.
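
A synchronous sketch of the channel-based-writer option using `std::sync::mpsc` (in the real server this would be a Tokio task with `tokio::sync::mpsc`; the `DbWrite` commands are illustrative):

```rust
use std::sync::mpsc;
use std::thread;

// Commands the writer understands; stand-ins for real SQL statements.
enum DbWrite {
    RecordUsage { user: String, bytes: u64 },
    Shutdown,
}

fn main() {
    let (tx, rx) = mpsc::channel::<DbWrite>();

    // Single writer task: the sole owner of the (mock) connection,
    // so no mutex around the connection is needed at all.
    let writer = thread::spawn(move || {
        let mut total = 0u64;
        for cmd in rx {
            match cmd {
                DbWrite::RecordUsage { bytes, .. } => total += bytes, // conn.execute(...) here
                DbWrite::Shutdown => break,
            }
        }
        total
    });

    // Any number of connection handlers can send without blocking each other.
    for i in 0..10 {
        tx.send(DbWrite::RecordUsage { user: format!("u{i}"), bytes: 100 }).unwrap();
    }
    tx.send(DbWrite::Shutdown).unwrap();
    assert_eq!(writer.join().unwrap(), 1000);
}
```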
---
## 🟠 High Severity Issues
### 4. 100ms Busy-Poll Wait in Multi-Connection TCP (`src/server.rs:313-332`)

**Problem:** When waiting for secondary TCP connections to join a multi-connection session, the primary connection busy-polls the session map every 100 ms:

```rust
loop {
    let count = { let map = sessions.lock().await; ... };
    if count + 1 >= conn_count as usize { break; }
    tokio::time::sleep(Duration::from_millis(100)).await;
}
```

This adds up to **100 ms of unnecessary latency** to every multi-connection test startup and hammers the async mutex needlessly.

**Fix:** Replace the poll with `tokio::sync::Notify`. When a secondary connection registers itself, it calls `notify_one()`; the primary waits on `notified().await` with a timeout and wakes instantly when the last connection joins.
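
The same handshake can be sketched synchronously, with `std::sync::Condvar` standing in for `tokio::sync::Notify` (the counts and timeout are illustrative):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let conn_count = 4usize;
    // Shared (joined-count, condvar) pair between primary and secondaries.
    let state = Arc::new((Mutex::new(1usize), Condvar::new()));

    // Secondary connections register themselves and notify; no polling loop.
    for _ in 1..conn_count {
        let state = Arc::clone(&state);
        thread::spawn(move || {
            let (count, cv) = &*state;
            *count.lock().unwrap() += 1;
            cv.notify_one();
        });
    }

    // Primary blocks until the last secondary joins (or the timeout fires),
    // waking immediately on each notification instead of every 100 ms.
    let (count, cv) = &*state;
    let (guard, timeout) = cv
        .wait_timeout_while(count.lock().unwrap(), Duration::from_secs(5), |c| {
            *c < conn_count
        })
        .unwrap();
    assert!(!timeout.timed_out());
    assert_eq!(*guard, conn_count);
}
```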
---
### 5. FreeBSD CPU Sampling Spawns Process Every Second (`src/cpu.rs:142`)

**Problem:** On FreeBSD, `get_cpu_times()` spawns `sysctl -n kern.cp_time` via `std::process::Command` every second:

```rust
fn get_cpu_times() -> (u64, u64) {
    if let Ok(output) = std::process::Command::new("sysctl")
        .arg("-n").arg("kern.cp_time").output()
    { ... }
}
```

A `fork()` + `exec()` is extremely expensive relative to the work being done (reading five integers).

**Fix:** Use `libc::sysctl()` via FFI, matching the macOS implementation style. Cache the `mib` array and call the syscall directly.
---
### 6. Per-Call Timer Registration in UDP RX Loops (`src/client.rs:393`, `src/server.rs:925`)

**Problem:** Both UDP RX loops create a new `tokio::time::timeout` timer on **every single `recv`/`recv_from` call**:

```rust
match tokio::time::timeout(Duration::from_secs(5), socket.recv(&mut buf)).await
```

At high packet rates (e.g., 100K pps), registering and canceling timers on the Tokio timer wheel adds measurable overhead.

**Fix:** Use `tokio::select!` with a long-lived `tokio::time::sleep` future that is periodically reset, or set `SO_RCVTIMEO` on the socket if using blocking I/O (Tokio's nonblocking sockets cannot use it). Alternatively, since UDP is connectionless, consider whether a 5-second timeout is needed on every call or whether the outer test-duration timer is sufficient.
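
For the blocking-socket variant, the kernel-level timeout is a single call on `std::net::UdpSocket` (a sketch; the timeout surfaces as `WouldBlock` on Unix or `TimedOut` on Windows):

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // SO_RCVTIMEO set once: no per-call userspace timer registration at all.
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    socket.set_read_timeout(Some(Duration::from_millis(50)))?;

    let mut buf = [0u8; 1500];
    match socket.recv_from(&mut buf) {
        Ok((n, from)) => println!("got {n} bytes from {from}"),
        // Nothing was sent to this socket, so the timeout fires.
        Err(e) => assert!(matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut)),
    }
    Ok(())
}
```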
---
## 🟡 Medium Severity Issues
### 7. String Error Matching with Allocation (`src/server_pro/enforcer.rs:157-161`)

```rust
match format!("{}", e).as_str() {
    s if s.contains("daily") => ...
}
```

`format!("{}", e)` allocates a `String` from the error just to do substring matching (`e.to_string()` would allocate just the same). Match on typed error variants directly where possible, so no formatting is needed at all.
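
A sketch of the typed-variant alternative (the `QuotaError` enum and `retry_after` helper are hypothetical, standing in for the enforcer's real error type):

```rust
use std::fmt;

// Typed quota errors: matching on variants needs no string formatting.
#[derive(Debug, PartialEq)]
enum QuotaError {
    DailyExceeded,
    MonthlyExceeded,
}

impl fmt::Display for QuotaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            QuotaError::DailyExceeded => write!(f, "daily quota exceeded"),
            QuotaError::MonthlyExceeded => write!(f, "monthly quota exceeded"),
        }
    }
}

fn retry_after(e: &QuotaError) -> &'static str {
    // Zero-allocation dispatch, versus format!("{e}").contains("daily").
    match e {
        QuotaError::DailyExceeded => "tomorrow",
        QuotaError::MonthlyExceeded => "next month",
    }
}

fn main() {
    assert_eq!(retry_after(&QuotaError::DailyExceeded), "tomorrow");
    assert_eq!(retry_after(&QuotaError::MonthlyExceeded), "next month");
}
```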
---
### 8. `ip.to_string()` Called Repeatedly in Quota Checks (`src/server_pro/quota.rs:389`)

```rust
let ip_str = ip.to_string();
// ... used in 6+ DB calls
```

This allocates a `String` on every quota check. Accept `&str` or `IpAddr` directly in the DB methods, or cache the string for the duration of the check.
---
### 9. `chrono_date_today()` Recomputes Calendar from Epoch (`src/server_pro/user_db.rs:617-638`)
A hand-rolled date calculation loops through years from 1970 and then through months on every call, and it is called before almost every DB write. The `chrono` crate is already used indirectly by `rusqlite`; add it as a direct dependency and replace the loop with `chrono::Local::now().format("%Y-%m-%d")`.
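
If adding `chrono` is undesirable, the year/month loop can instead become a constant-time conversion; a sketch following Howard Hinnant's civil_from_days algorithm:

```rust
// O(1) days-since-epoch -> (year, month, day); no year/month loop required.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468;
    let era = if z >= 0 { z } else { z - 146_096 } / 146_097;
    let doe = (z - era * 146_097) as u64;                              // day of era   [0, 146096]
    let yoe = (doe - doe / 1460 + doe / 36_524 - doe / 146_096) / 365; // year of era  [0, 399]
    let y = yoe as i64 + era * 400;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);                 // day of year  [0, 365]
    let mp = (5 * doy + 2) / 153;                                      // month index  [0, 11]
    let d = (doy - (153 * mp + 2) / 5 + 1) as u32;                     // day          [1, 31]
    let m: u32 = if mp < 10 { (mp + 3) as u32 } else { (mp - 9) as u32 }; // month     [1, 12]
    (if m <= 2 { y + 1 } else { y }, m, d)
}

fn main() {
    assert_eq!(civil_from_days(0), (1970, 1, 1));      // Unix epoch
    assert_eq!(civil_from_days(11_017), (2000, 3, 1)); // leap-century check
    assert_eq!(civil_from_days(19_723), (2024, 1, 1));
}
```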
---
## 🟢 Low Severity / Easy Wins
### 10. CSV File Reopened on Every Write (`src/csv_output.rs:77`)

```rust
if let Ok(mut f) = OpenOptions::new().append(true).open(path) {
    let _ = writeln!(f, "{}", row);
}
```

Called once per test, not per packet, but still suboptimal. Keep a lazily-initialized `Mutex<Option<File>>`, or open the file once at init and store the handle.
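
A sketch of the cached-handle approach using `OnceLock` (the `append_row` helper and path handling are illustrative):

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::sync::{Mutex, OnceLock};

// Open the CSV once and reuse the handle; OnceLock defers the open
// until the first row is written. (Single-path usage assumed.)
static CSV_FILE: OnceLock<Mutex<File>> = OnceLock::new();

fn append_row(path: &str, row: &str) -> std::io::Result<()> {
    let file = CSV_FILE.get_or_init(|| {
        Mutex::new(
            OpenOptions::new()
                .create(true)
                .append(true)
                .open(path)
                .expect("failed to open CSV output"),
        )
    });
    writeln!(file.lock().unwrap(), "{row}")
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("btest_results.csv");
    let path = path.to_str().unwrap().to_owned();
    append_row(&path, "timestamp,tx_bps,rx_bps")?;
    append_row(&path, "1764000000,95000000,94000000")?;
    let contents = std::fs::read_to_string(&path)?;
    assert!(contents.contains("tx_bps"));
    std::fs::remove_file(&path)?;
    Ok(())
}
```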
---
### 11. Global Syslog Mutex Held During I/O (`src/syslog_logger.rs`)

```rust
static SYSLOG: Mutex<Option<SyslogSender>> = Mutex::new(None);
```

The global `std::sync::Mutex` is held while formatting the timestamp (expensive manual calendar math) and sending the UDP datagram. Switch to `parking_lot::Mutex` (faster) or `tokio::sync::Mutex` if async, and format the message outside the lock. Better yet, route logging through `tracing` with a syslog-capable subscriber or a structured appender.
---
### 12. `hash_password()` Allocates in Both Input and Hex Paths (`src/server_pro/user_db.rs:612-614`)

```rust
hasher.update(format!("{}:{}", username, password).as_bytes());
result.iter().map(|b| format!("{:02x}", b)).collect() // N allocs for N bytes
```

The hex encoding allocates one `String` per byte. Use a small fixed buffer or the `hex` crate (already used elsewhere in `ecsrp5.rs`).
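
A minimal fixed-table encoder with a single allocation (the `hex_encode` helper is illustrative, roughly what the `hex` crate does internally):

```rust
const HEX_DIGITS: &[u8; 16] = b"0123456789abcdef";

/// Hex-encode with exactly one String allocation, versus one per byte.
fn hex_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_DIGITS[(b >> 4) as usize] as char);
        out.push(HEX_DIGITS[(b & 0x0f) as usize] as char);
    }
    out
}

fn main() {
    assert_eq!(hex_encode(&[0x00, 0xff, 0x1a]), "00ff1a");
    assert_eq!(hex_encode(&[]), "");
}
```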
---
### 13. Redundant `Instant::now()` Calls in TX Loop (`src/server.rs:593,606`)

```rust
if send_status && Instant::now() >= next_status {
    // ...
    next_status = Instant::now() + Duration::from_secs(1);
}
```

Two monotonic clock reads per loop iteration. Cache `let now = Instant::now();` at the top of the loop and reuse it.
---
## Architecture Observations

### What the Code Does Well

- **Zero-allocation protocol layer:** `serialize()` returns fixed-size stack arrays (`[u8; 12]`, `[u8; 16]`). Excellent.
- **Atomic bandwidth tracking:** `BandwidthState` uses `AtomicU64` with `Relaxed` ordering in the per-packet path. No locks in the data plane.
- **Buffer reuse:** TX/RX loops allocate `vec![0u8; ...]` once before the loop. Good.
- **Aggressive release profile:** `lto = true`, `codegen-units = 1`, `opt-level = 3`.

### Async Runtime Usage

- `tokio` with `full` features is used. For a primarily I/O-bound tool, this is appropriate.
- `tokio::task::yield_now().await` is used in unlimited-rate mode to prevent starving the runtime. This is correct, but consider whether `tokio::task::spawn_blocking` or dedicated CPU pinning is warranted for the EC-SRP5 math, which is CPU-bound and currently runs on the async runtime during auth.

### Memory Safety

- Several `unwrap()`/`expect()` calls sit in setup paths (socket binding, address parsing). These are acceptable for config errors but should use `?` propagation where possible to allow graceful degradation.
---
## Missing Performance Infrastructure

| Infrastructure | Status | Recommendation |
|----------------|--------|----------------|
| **Benchmarks** | ❌ None | Add `criterion` + `benches/` for `BandwidthState`, protocol ser/de, and EC-SRP5 auth |
| **Profiling hooks** | ❌ None | Add optional `pprof` or `dhat` dev-deps for heap profiling |
| **Throughput regression tests** | ⚠️ Partial | Integration tests assert `tx > 0` and `rx > 0` but don't measure sustained throughput |
| **Load tests** | ❌ None | Add a `benches/load_test.rs` that spawns 100+ concurrent tests against a local server |
| **CI performance gates** | ❌ None | Consider a benchmark action that fails on >5% regression |
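
Until `criterion` is wired in, a minimal stdlib harness can give rough ns/iter numbers (the `bench` helper is illustrative; criterion adds warmup, statistics, and regression detection on top of this idea):

```rust
use std::hint::black_box;
use std::time::Instant;

/// Time `iters` calls of `f` and report the mean ns per iteration.
fn bench<F: FnMut() -> u64>(name: &str, iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from deleting the workload.
        black_box(f());
    }
    let ns_per_iter = start.elapsed().as_nanos() as f64 / iters as f64;
    println!("{name}: {ns_per_iter:.1} ns/iter");
    ns_per_iter
}

fn main() {
    // Stand-in workload for e.g. protocol serialization or BandwidthState updates.
    let ns = bench("sum_1k", 10_000, || (0u64..1000).sum());
    assert!(ns >= 0.0);
}
```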
---
## Priority Action Plan

### Phase 1: Hot-Path Fixes (1-2 days)

1. Replace the buffer scan with `memchr` or the ring-buffer approach in `tcp_client_rx_loop`
2. Cache `WCurve` in a global `LazyLock`
3. Replace the 100 ms poll with `tokio::sync::Notify` in the multi-connection wait

### Phase 2: Scalability (2-3 days)

4. Add SQLite connection pooling or a channel-based writer in `server_pro`
5. Cache `remaining_budget()` results for 5-10 seconds
6. Fix FreeBSD CPU sampling to use `libc::sysctl` FFI

### Phase 3: Polish & Tooling (1-2 days)

7. Replace manual date arithmetic with `chrono`
8. Add `criterion` benchmarks for auth and bandwidth state
9. Fix low-severity allocation issues (CSV, syslog, hex encoding)
---
## Appendix: File-by-File Quick Reference

| File | Lines | Hot Path? | Key Concern |
|------|-------|-----------|-------------|
| `src/client.rs` | 531 | ✅ Yes | O(n) 256 KB scan per TCP read |
| `src/server.rs` | 1094 | ✅ Yes | 100 ms poll wait, status injection timing |
| `src/ecsrp5.rs` | 660 | ✅ Yes (auth) | `WCurve::new()` recomputed per auth |
| `src/bandwidth.rs` | 263 | ✅ Yes (atomics) | Well-designed; no issues |
| `src/protocol.rs` | 214 | ✅ Yes (ser/de) | Zero-allocation; excellent |
| `src/cpu.rs` | 215 | ⚠️ Periodic | FreeBSD `fork`+`exec` every second |
| `src/server_pro/quota.rs` | 470 | ⚠️ Periodic | 15 DB queries per budget check |
| `src/server_pro/user_db.rs` | 641 | ⚠️ All DB ops | Single mutex serializes everything |
| `src/server_pro/server_loop.rs` | 449 | ✅ Yes | DB auth locks during connection setup |
| `src/server_pro/enforcer.rs` | 411 | ⚠️ Periodic | String error matching allocates |
| `src/csv_output.rs` | 86 | ❌ No | File reopen per write |
| `src/syslog_logger.rs` | 154 | ❌ No | Global mutex + manual calendar math |
| `src/auth.rs` | 164 | ⚠️ Auth only | Minor; double MD5 per auth |
| `src/main.rs` | 243 | ❌ No | Entry point only |