# btest-rs Performance Audit

**Date:** 2026-04-30
**Scope:** Full codebase (`src/`, `tests/`, `Cargo.toml`)
**Methodology:** Static code analysis, hot-path tracing, lock/contention review, algorithmic complexity analysis

---

## Executive Summary

The codebase is generally well-structured for a network I/O tool, with good use of atomics in the per-packet hot path and zero-allocation protocol serialization. However, **three critical bottlenecks** significantly limit throughput and scalability:

1. **O(n) buffer scan on every TCP read** in the client RX loop (up to 256KB scanned per `read()` call)
2. **Expensive EC curve reconstruction on every authentication** (heavy `BigUint` modular arithmetic)
3. **Single SQLite connection mutex** serializing all DB operations in `server_pro`

Additionally, there is **no benchmark or profiling infrastructure** in the project, making it impossible to measure improvements or catch regressions.

---

## Severity Legend

| Icon | Severity | Impact |
|------|----------|--------|
| 🔴 | **Critical** | Direct throughput/latency hit in hot path; fix immediately |
| 🟠 | **High** | Significant overhead under load; fix in next sprint |
| 🟡 | **Medium** | Noticeable at scale or under specific conditions |
| 🟢 | **Low** | Cosmetic / easy wins; batch with other work |

---

## 🔴 Critical Bottlenecks

### 1. O(n) Linear Buffer Scan in `tcp_client_rx_loop` (`src/client.rs:210-216`)

**Problem:** On every TCP `read()` call (up to 256KB), the client performs a byte-by-byte scan looking for interleaved 12-byte status messages:

```rust
for i in 0..=(n - STATUS_MSG_SIZE) {
    if buf[i] == STATUS_MSG_TYPE && buf[i + 1] >= 0x80 {
        // ...
    }
}
```

Since data packets are all zeros and status bytes are extremely rare, this **almost always scans the entire 256KB buffer** uselessly. At high bandwidth (many reads per second), this wastes massive CPU cycles and pollutes cache lines.

**Impact:** CPU-bound slowdown on the client RX side during bidirectional TCP tests. The compiler *may* auto-vectorize the simple loop, but it still processes ~256K bytes per read.

**Fix Options (pick one):**

- **Best:** Use `memchr` (the dedicated crate; the standard library's equivalent is still nightly-only) to find `0x07` bytes. On all-zero buffers this exits after a few SIMD-width checks. A minimal sketch follows this list.
- **Alternative:** Since status messages are injected at `write_all` boundaries and data is all zeros, maintain a small 12-byte sliding ring buffer across reads. Process the stream with a tiny state machine instead of scanning the whole buffer.
- **Alternative:** Track read bytes modulo the expected packet size. Status messages are injected between full packets, so they will appear at predictable offsets *if* the client knows the server's `effective_size`. This requires protocol coordination.
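To illustrate the `memchr` option, the sketch below replaces the per-byte scan with `memchr_iter`, which only visits candidate positions. The constant values are taken from the description above (type byte `0x07`, 12-byte messages); the constant names and the `parse_status` handler are placeholders for whatever the real code in `src/client.rs`/`src/protocol.rs` defines.

```rust
// Sketch only. Assumes the `memchr` crate has been added as a dependency.
use memchr::memchr_iter;

// Placeholder constants; the audited code defines its own (values per the text above).
const STATUS_MSG_TYPE: u8 = 0x07;
const STATUS_MSG_SIZE: usize = 12;

/// Visit only the positions where the status type byte occurs, instead of
/// walking every byte of a (mostly all-zero) read buffer.
fn scan_status_messages(buf: &[u8]) {
    for i in memchr_iter(STATUS_MSG_TYPE, buf) {
        // Confirm a full, plausible status message before handing it off.
        if i + STATUS_MSG_SIZE <= buf.len() && buf[i + 1] >= 0x80 {
            parse_status(&buf[i..i + STATUS_MSG_SIZE]); // hypothetical handler
        }
    }
}

fn parse_status(_msg: &[u8]) {
    // Stand-in for the existing status-message handling.
}
```

On an all-zero buffer `memchr_iter` finds no candidates and the loop body never runs, so the cost per `read()` drops to a single SIMD sweep.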
---

### 2. `WCurve::new()` Recomputes Generator on Every Auth (`src/ecsrp5.rs:363,499`)

**Problem:** Every EC-SRP5 authentication (client and server) calls `WCurve::new()`, which performs `lift_x(9)` → `prime_mod_sqrt()`, heavy `BigUint` modular arithmetic, to derive the curve generator point.

```rust
pub async fn client_authenticate<S>(stream: &mut S, username: &str, password: &str) -> Result<()> {
    let w = WCurve::new(); // <-- expensive, same result every time
    // ...
}
```

The curve constants (`P`, `CURVE_ORDER`, `WEIERSTRASS_A`) are already cached as `LazyLock` statics, but the generator point is not.

**Impact:** Auth latency spikes, especially on the server under many concurrent connections. Each auth does redundant `BigUint` allocations and modular square roots.

**Fix:** Cache `WCurve` (or at least the generator point) in a global `LazyLock`:

```rust
static WCURVE: std::sync::LazyLock<WCurve> = std::sync::LazyLock::new(WCurve::new);
```

Then use `&*WCURVE` in both `client_authenticate` and `server_authenticate`.

---

### 3. Single SQLite Mutex Serializes All DB Operations (`src/server_pro/user_db.rs:15-18`)

**Problem:** The entire `server_pro` database layer uses a single shared `Connection` behind a `std::sync::Mutex`:

```rust
pub struct UserDb {
    conn: Arc<Mutex<Connection>>,
}
```

While SQLite WAL mode is enabled (allowing readers to proceed during writes), **the Rust mutex still serializes all access to the connection object**. Under concurrent load with many tests starting/finishing, this becomes the primary bottleneck.

**Critical sub-issue:** `QuotaManager::remaining_budget()` (`src/server_pro/quota.rs:387`) performs **up to 15 separate SQLite queries** in sequence, locking the mutex 15+ times per pre-test check.

**Impact:** Connection setup/teardown latency increases linearly with concurrency. Quota checks and usage recording block each other.

**Fix Options:**

- **Connection pooling:** Use `r2d2_sqlite` or `deadpool-sqlite` to maintain a small pool of connections (SQLite handles this well in WAL mode).
- **Separate read/write paths:** Open a read-only connection for quota checks (`remaining_budget`) and a dedicated write connection for usage recording. SQLite WAL allows this.
- **Batch quota checks:** Cache quota results for a few seconds per user/IP to avoid redundant queries.
- **Channel-based writer:** Use a single dedicated DB writer task with an `mpsc` channel so only one task ever touches the connection, eliminating lock contention entirely (see the sketch after this list).
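A rough sketch of the channel-based writer option follows. It assumes `rusqlite::Connection` (which `server_pro` already uses) and a Tokio runtime; the `DbCommand` variants, table names, and SQL are illustrative placeholders, not the project's real schema.

```rust
use rusqlite::Connection;
use tokio::sync::{mpsc, oneshot};

/// Illustrative command set; the real variants would mirror UserDb's methods.
enum DbCommand {
    RecordUsage { user: String, bytes: i64 },
    RemainingBudget { user: String, reply: oneshot::Sender<i64> },
}

/// Spawn one blocking task that owns the connection outright. Every caller
/// goes through the channel, so no mutex is needed and quota checks never
/// contend with usage writes on a Rust-level lock.
fn spawn_db_writer(conn: Connection) -> mpsc::Sender<DbCommand> {
    let (tx, mut rx) = mpsc::channel::<DbCommand>(256);
    tokio::task::spawn_blocking(move || {
        while let Some(cmd) = rx.blocking_recv() {
            match cmd {
                DbCommand::RecordUsage { user, bytes } => {
                    // Placeholder SQL; the real schema lives in user_db.rs.
                    let _ = conn.execute(
                        "UPDATE usage SET bytes = bytes + ?1 WHERE user = ?2",
                        rusqlite::params![bytes, user],
                    );
                }
                DbCommand::RemainingBudget { user, reply } => {
                    let budget: i64 = conn
                        .query_row(
                            "SELECT budget FROM quotas WHERE user = ?1",
                            rusqlite::params![user],
                            |row| row.get(0),
                        )
                        .unwrap_or(0);
                    let _ = reply.send(budget);
                }
            }
        }
    });
    tx
}
```

Callers send commands through the returned handle (awaiting the `oneshot` reply when they need a result), and the bounded channel provides natural back-pressure during bursts of DB work.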
---

## 🟠 High Severity Issues

### 4. 100ms Busy-Poll Wait in Multi-Connection TCP (`src/server.rs:313-332`)

**Problem:** When waiting for secondary TCP connections to join a multi-connection session, the primary connection busy-polls the session map every 100ms:

```rust
loop {
    let count = {
        let map = sessions.lock().await;
        ...
    };
    if count + 1 >= conn_count as usize {
        break;
    }
    tokio::time::sleep(Duration::from_millis(100)).await;
}
```

This adds up to **100ms of unnecessary latency** to every multi-connection test startup. It also hammers the async mutex needlessly.

**Fix:** Replace with `tokio::sync::Notify`. When a secondary connection registers itself, it calls `notify_one()`. The primary waits on `notified().await` with a timeout, waking instantly when ready.

---

### 5. FreeBSD CPU Sampling Spawns Process Every Second (`src/cpu.rs:142`)

**Problem:** On FreeBSD, `get_cpu_times()` spawns `sysctl -n kern.cp_time` via `std::process::Command` every second:

```rust
fn get_cpu_times() -> (u64, u64) {
    if let Ok(output) = std::process::Command::new("sysctl")
        .arg("-n").arg("kern.cp_time").output()
    {
        ...
    }
}
```

`fork()` + `exec()` is extremely expensive relative to the work being done (reading 5 integers).

**Fix:** Use `libc::sysctl()` via FFI, matching the macOS implementation style. Cache the `mib` array and call the syscall directly.

---

### 6. Per-Call Timer Registration in UDP RX Loops (`src/client.rs:393`, `src/server.rs:925`)

**Problem:** Both UDP RX loops create a new `tokio::time::timeout` timer on **every single `recv`/`recv_from` call**:

```rust
match tokio::time::timeout(Duration::from_secs(5), socket.recv(&mut buf)).await
```

At high packet rates (e.g., 100K pps), registering and canceling timers on the Tokio timer wheel adds measurable overhead.

**Fix:** Use `tokio::select!` with a long-lived `tokio::time::sleep` future that is reset, or use the socket's built-in SO_RCVTIMEO if available via `socket2`. Alternatively, since UDP is connectionless, consider whether a 5-second timeout is needed on every call or if the outer test-duration timer is sufficient.

---

## 🟡 Medium Severity Issues

### 7. String Error Matching with Allocation (`src/server_pro/enforcer.rs:157-161`)

```rust
match format!("{}", e).as_str() {
    s if s.contains("daily") => ...
}
```

`format!("{}", e)` allocates a `String` from the error just to do substring matching. Match on error types (enum variants) directly where possible; note that `e.to_string().contains(...)` allocates just the same and only tidies the syntax.

---

### 8. `ip.to_string()` Called Repeatedly in Quota Checks (`src/server_pro/quota.rs:389`)

```rust
let ip_str = ip.to_string();
// ... used in 6+ DB calls
```

This allocates a `String` on every quota check. Accept `&str` or `IpAddr` directly in DB methods, or cache the string.

---

### 9. `chrono_date_today()` Recomputes Calendar from Epoch (`src/server_pro/user_db.rs:617-638`)

A hand-rolled date calculation loops through years from 1970 and months every time it is called (which is before almost every DB write). The `chrono` crate is already used indirectly by `rusqlite`; add it as a direct dependency and replace the loop with `chrono::Local::now().format("%Y-%m-%d")`.

---

## 🟢 Low Severity / Easy Wins

### 10. CSV File Reopened on Every Write (`src/csv_output.rs:77`)

```rust
if let Ok(mut f) = OpenOptions::new().append(true).open(path) {
    let _ = writeln!(f, "{}", row);
}
```

Called once per test, not per packet, but still suboptimal. Consider keeping a lazily-initialized `Mutex<Option<File>>`, or open the file once at init via `std::fs::OpenOptions` and store the handle.

---

### 11. Global Syslog Mutex Held During I/O (`src/syslog_logger.rs`)

```rust
static SYSLOG: Mutex<Option<…>> = Mutex::new(None); // inner logger type elided
```

The global `std::sync::Mutex` is held while formatting the timestamp (expensive manual calendar math) and sending UDP. Switch to `parking_lot::Mutex` (faster) or `tokio::sync::Mutex` if async, and format the message outside the lock. Better yet, route logging through `tracing` with a syslog-capable subscriber or structured appender.

---

### 12. `hash_password()` Uses `format!` for Input and a Per-Byte `format!` Hex Loop (`src/server_pro/user_db.rs:612-614`)

```rust
hasher.update(format!("{}:{}", username, password).as_bytes());
result.iter().map(|b| format!("{:02x}", b)).collect() // N allocs for N bytes
```

The hex encoding allocates one `String` per byte. Use a small fixed buffer or the `hex` crate (already used elsewhere in `ecsrp5.rs`).

---

### 13. Redundant `Instant::now()` Calls in TX Loop (`src/server.rs:593,606`)

```rust
if send_status && Instant::now() >= next_status {
    // ...
    next_status = Instant::now() + Duration::from_secs(1);
}
```

Two monotonic clock reads per loop iteration. Cache `let now = Instant::now();` at the top of the loop.

---

## Architecture Observations

### What the Code Does Well

- **Zero-allocation protocol layer:** `serialize()` returns fixed-size stack arrays (`[u8; 12]`, `[u8; 16]`). Excellent.
- **Atomic bandwidth tracking:** `BandwidthState` uses `AtomicU64` with `Relaxed` ordering in the per-packet path. No locks in the data plane.
- **Buffer reuse:** TX/RX loops allocate `vec![0u8; ...]` once before the loop. Good.
- **Aggressive release profile:** `lto = true`, `codegen-units = 1`, `opt-level = 3`.

### Async Runtime Usage

- `tokio` with `full` features is used. For a primarily I/O-bound tool, this is appropriate.
- `tokio::task::yield_now().await` is used in unlimited-rate mode to prevent starving the runtime. This is correct, but consider whether `tokio::task::spawn_blocking` or dedicated CPU pinning is needed for the EC-SRP5 math, which is CPU-bound and currently runs on the async runtime during auth. A sketch of the `spawn_blocking` option follows this list.
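If profiling shows the auth math interfering with packet I/O, one option is to run only the CPU-heavy step on Tokio's blocking pool. The sketch below is illustrative: `authenticate_offloaded` and `derive_session_key` are hypothetical names standing in for the real EC-SRP5 routines, not functions in the codebase.

```rust
use tokio::task;

/// Sketch: offload the pure-CPU portion of an EC-SRP5 exchange so it cannot
/// stall the reactor threads that are driving packet I/O for running tests.
async fn authenticate_offloaded(username: String, password: String) -> Vec<u8> {
    task::spawn_blocking(move || derive_session_key(&username, &password))
        .await
        // A JoinError here means the blocking task panicked.
        .expect("key derivation task panicked")
}

fn derive_session_key(_username: &str, _password: &str) -> Vec<u8> {
    Vec::new() // placeholder for the heavy BigUint / curve arithmetic
}
```

The network half of the handshake stays on the async task; only the modular arithmetic between reads and writes moves to the blocking pool.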
### Error Handling

- Several `unwrap()`/`expect()` calls in setup paths (socket binding, address parsing). These are acceptable for config errors but should use `?` propagation where possible to allow graceful degradation.

---

## Missing Performance Infrastructure

| Infrastructure | Status | Recommendation |
|----------------|--------|----------------|
| **Benchmarks** | ❌ None | Add `criterion` + `benches/` for `BandwidthState`, protocol ser/de, and EC-SRP5 auth |
| **Profiling hooks** | ❌ None | Add optional `pprof` or `dhat` dev-deps for heap profiling |
| **Throughput regression tests** | ⚠️ Partial | Integration tests assert `tx > 0` and `rx > 0` but don't measure sustained throughput |
| **Load tests** | ❌ None | Add a `benches/load_test.rs` that spawns 100+ concurrent tests against a local server |
| **CI performance gates** | ❌ None | Consider a benchmark action that fails on >5% regression |

---

## Priority Action Plan

### Phase 1: Hot-Path Fixes (1-2 days)

1. Replace the buffer scan with the `memchr` or ring-buffer approach in `tcp_client_rx_loop`
2. Cache `WCurve` in a global `LazyLock`
3. Replace the 100ms poll with `tokio::sync::Notify` in the multi-connection wait

### Phase 2: Scalability (2-3 days)

4. Add SQLite connection pooling or a channel-based writer in `server_pro`
5. Cache `remaining_budget()` results for 5-10 seconds
6. Fix FreeBSD CPU sampling to use `libc::sysctl` FFI

### Phase 3: Polish & Tooling (1-2 days)

7. Replace manual date arithmetic with `chrono`
8. Add `criterion` benchmarks for auth and bandwidth state (a starter skeleton is sketched at the end of this report)
9. Fix low-severity allocation issues (CSV, syslog, hex encoding)

---

## Appendix: File-by-File Quick Reference

| File | Lines | Hot Path? | Key Concern |
|------|-------|-----------|-------------|
| `src/client.rs` | 531 | ✅ Yes | O(n) 256KB scan per TCP read |
| `src/server.rs` | 1094 | ✅ Yes | 100ms poll wait, status injection timing |
| `src/ecsrp5.rs` | 660 | ✅ Yes (auth) | `WCurve::new()` recomputed per auth |
| `src/bandwidth.rs` | 263 | ✅ Yes (atomics) | Well-designed; no issues |
| `src/protocol.rs` | 214 | ✅ Yes (ser/de) | Zero-allocation; excellent |
| `src/cpu.rs` | 215 | ⚠️ Periodic | FreeBSD `fork+exec` every second |
| `src/server_pro/quota.rs` | 470 | ⚠️ Periodic | 15 DB queries per budget check |
| `src/server_pro/user_db.rs` | 641 | ⚠️ All DB ops | Single mutex serializes everything |
| `src/server_pro/server_loop.rs` | 449 | ✅ Yes | DB auth locks during connection setup |
| `src/server_pro/enforcer.rs` | 411 | ⚠️ Periodic | String error matching allocates |
| `src/csv_output.rs` | 86 | ❌ No | File reopen per write |
| `src/syslog_logger.rs` | 154 | ❌ No | Global mutex + manual calendar math |
| `src/auth.rs` | 164 | ⚠️ Auth only | Minor; double MD5 per auth |
| `src/main.rs` | 243 | ❌ No | Entry point only |
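As a starting point for Phase 3 item 8, a minimal `criterion` harness could look like the sketch below. The `btest_rs::protocol::StatusMessage` path and its `Default`/`serialize()` API are assumptions made for illustration (the audit only confirms that `serialize()` returns fixed-size arrays); it also requires `criterion` as a dev-dependency and a `[[bench]] name = "protocol"` entry with `harness = false` in `Cargo.toml`.

```rust
// benches/protocol.rs -- sketch only; adjust the type and constructor to the
// real API in src/protocol.rs.
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_status_serialize(c: &mut Criterion) {
    c.bench_function("status_msg_serialize", |b| {
        // Hypothetical constructor; the goal is to measure serialize() alone.
        let msg = btest_rs::protocol::StatusMessage::default();
        b.iter(|| black_box(msg.serialize()));
    });
}

criterion_group!(benches, bench_status_serialize);
criterion_main!(benches);
```

Similar harnesses for `BandwidthState` updates and a full EC-SRP5 round trip would give the regression baseline called for in the infrastructure table above.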