This commit fixes the most significant hot-path bottleneck in the
client: the tcp_client_rx_loop was scanning up to 256KB byte-by-byte
on every read() call looking for interleaved 12-byte status messages.
Changes:
- client.rs (tcp_client_rx_loop): Replace the O(n) for-loop scan
with a three-stage approach:
1. Split-message check: An 11-byte carry buffer stores trailing
bytes from the previous read. We check every possible alignment
where a status message (0x07 + cpu_byte) could span the carry
and the start of the current buffer. This fixes a latent bug
where the old code would miss status messages split across TCP
read boundaries.
2. Fast scan: memchr::memchr (AVX2/NEON SIMD) finds 0x07 bytes
in the 256KB buffer. On all-zero data packets this exits in
~4096 SIMD-width operations instead of 262,144 byte compares.
~64x faster scan path.
3. Carry save: Save up to 11 trailing bytes for the next read.
- client.rs (unit tests): Add scan_status_message() helper and
five unit tests covering:
- Status message fully within buffer
- Status message split across reads (5+7 bytes)
- Status message split at boundary (1+11 bytes)
- All-zero buffer (no false positive)
- Short buffer (no panic)
- Cargo.toml / Cargo.lock: Add memchr as an explicit dependency.
Verified against live MikroTik RouterOS (TCP both + receive modes
with EC-SRP5 auth). Status messages detected correctly. No wire
protocol changes — 100% MikroTik compatible.
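For illustration only, a hedged sketch of the stage-1 split-message check described above; the helper name, callback shape, and constants follow this description and may not match the actual code in client.rs:

```rust
const STATUS_MSG_SIZE: usize = 12;
const STATUS_MSG_TYPE: u8 = 0x07;

/// Check every alignment where a status message could start in `carry`
/// (the <= 11 trailing bytes of the previous read) and finish in `buf`.
fn check_split_status(carry: &[u8], buf: &[u8], mut on_status: impl FnMut(&[u8; STATUS_MSG_SIZE])) {
    debug_assert!(carry.len() < STATUS_MSG_SIZE);
    for start in 0..carry.len() {
        let from_carry = carry.len() - start;        // bytes taken from the carry
        let from_buf = STATUS_MSG_SIZE - from_carry; // bytes needed from the new read
        if from_buf > buf.len() || carry[start] != STATUS_MSG_TYPE {
            continue;
        }
        // Reassemble the candidate message across the read boundary.
        let mut msg = [0u8; STATUS_MSG_SIZE];
        msg[..from_carry].copy_from_slice(&carry[start..]);
        msg[from_carry..].copy_from_slice(&buf[..from_buf]);
        if msg[1] >= 0x80 {
            on_status(&msg);
        }
    }
}
```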
btest-rs Performance Audit
Date: 2026-04-30
Scope: Full codebase (src/, tests/, Cargo.toml)
Methodology: Static code analysis, hot-path tracing, lock/contention review, algorithmic complexity analysis
Executive Summary
The codebase is generally well-structured for a network I/O tool, with good use of atomics in the per-packet hot path and zero-allocation protocol serialization. However, three critical bottlenecks significantly limit throughput and scalability:
- O(n) buffer scan on every TCP read in the client RX loop (up to 256KB scanned per `read()` call)
- Expensive EC curve reconstruction on every authentication (heavy `BigUint` modular arithmetic)
- Single SQLite connection mutex serializing all DB operations in `server_pro`
Additionally, there is no benchmark or profiling infrastructure in the project, making it impossible to measure improvements or catch regressions.
Severity Legend
| Icon | Severity | Impact |
|---|---|---|
| 🔴 | Critical | Direct throughput/latency hit in hot path; fix immediately |
| 🟠 | High | Significant overhead under load; fix in next sprint |
| 🟡 | Medium | Noticeable at scale or under specific conditions |
| 🟢 | Low | Cosmetic / easy wins; batch with other work |
🔴 Critical Bottlenecks
1. O(n) Linear Buffer Scan in tcp_client_rx_loop (src/client.rs:210-216)
Problem: On every TCP read() call (up to 256KB), the client performs a byte-by-byte scan looking for interleaved 12-byte status messages:
for i in 0..=(n - STATUS_MSG_SIZE) {
if buf[i] == STATUS_MSG_TYPE && buf[i + 1] >= 0x80 {
// ...
}
}
Since data packets are all zeros and status bytes are extremely rare, this almost always scans the entire 256KB buffer uselessly. At high bandwidth (many reads/sec), this wastes massive CPU cycles and pollutes cache lines.
Impact: CPU-bound slowdown on the client RX side during bidirectional TCP tests. The compiler may auto-vectorize the simple loop, but it still processes ~256K bytes per read.
Fix Options (pick one):
- Best: Use `memchr` (crate, or `std::slice::memchr` on nightly) to find `0x07` bytes. On all-zero buffers this exits after a few SIMD-width checks (see the sketch after this list).
- Alternative: Since status messages are injected at `write_all` boundaries and data is all zeros, maintain a small 12-byte sliding ring buffer across reads. Process the stream with a tiny state machine instead of scanning the whole buffer.
- Alternative: Track read bytes modulo expected packet size. Status messages are injected between full packets, so they will appear at predictable offsets if the client knows the server's `effective_size`. This requires protocol coordination.
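As an illustration of the memchr option, a hedged sketch of the fast-scan stage; the function name and callback shape are hypothetical and the real client.rs code may differ:

```rust
use memchr::memchr;

const STATUS_MSG_SIZE: usize = 12;
const STATUS_MSG_TYPE: u8 = 0x07;

/// Scan a read buffer for interleaved status messages. memchr's SIMD search
/// skips the all-zero payload; only candidate 0x07 positions are inspected.
fn scan_status_messages(buf: &[u8], mut on_status: impl FnMut(&[u8])) {
    let mut pos = 0;
    while let Some(off) = memchr(STATUS_MSG_TYPE, &buf[pos..]) {
        let i = pos + off;
        if i + STATUS_MSG_SIZE <= buf.len() && buf[i + 1] >= 0x80 {
            on_status(&buf[i..i + STATUS_MSG_SIZE]);
            pos = i + STATUS_MSG_SIZE; // skip past the consumed message
        } else {
            // False hit, or a candidate truncated at the buffer tail (handled
            // by a carry buffer in the real code); keep scanning.
            pos = i + 1;
        }
    }
}
```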
2. WCurve::new() Recomputes Generator on Every Auth (src/ecsrp5.rs:363,499)
Problem: Every EC-SRP5 authentication (client and server) calls WCurve::new(), which performs lift_x(9) → prime_mod_sqrt() — heavy BigUint modular arithmetic to derive the curve generator point.
pub async fn client_authenticate<S: ...>(stream: &mut S, username: &str, password: &str) -> Result<()> {
let w = WCurve::new(); // <-- expensive, same result every time
// ...
}
The curve constants (P, CURVE_ORDER, WEIERSTRASS_A) are already cached as LazyLock statics, but the generator point is not.
Impact: Auth latency spikes, especially on the server under many concurrent connections. Each auth does redundant BigUint allocations and modular square roots.
Fix: Cache WCurve (or at least the generator point) in a global LazyLock:
static WCURVE: std::sync::LazyLock<WCurve> = std::sync::LazyLock::new(WCurve::new);
Then use &*WCURVE in both client_authenticate and server_authenticate.
3. Single SQLite Mutex Serializes All DB Operations (src/server_pro/user_db.rs:15-18)
Problem: The entire server_pro database layer uses a single shared Connection behind a std::sync::Mutex:
pub struct UserDb {
conn: Arc<Mutex<Connection>>,
}
While SQLite WAL mode is enabled (allowing readers to proceed during writes), the Rust mutex still serializes all access to the connection object. Under concurrent load with many tests starting/finishing, this becomes the primary bottleneck.
Critical sub-issue: QuotaManager::remaining_budget() (src/server_pro/quota.rs:387) performs up to 15 separate SQLite queries in sequence, locking the mutex 15+ times per pre-test check.
Impact: Connection setup/teardown latency increases linearly with concurrency. Quota checks and usage recording block each other.
Fix Options:
- Connection pooling: Use `r2d2_sqlite` or `deadpool-sqlite` to maintain a small pool of connections (SQLite handles this well in WAL mode).
- Separate read/write paths: Open a read-only connection for quota checks (`remaining_budget`) and a dedicated write connection for usage recording. SQLite WAL allows this.
- Batch quota checks: Cache quota results for a few seconds per user/IP to avoid redundant queries.
- Channel-based writer: Use a single dedicated DB writer task with an `mpsc` channel so only one task ever touches the connection, eliminating lock contention entirely (see the sketch below).
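A minimal sketch of the channel-based writer option, assuming `rusqlite` and `tokio`; the `DbJob` type and `spawn_db_writer` helper are hypothetical names, not existing code:

```rust
use rusqlite::Connection;
use tokio::sync::mpsc;

// One blocking task owns the Connection; everything else queues work through a channel.
type DbJob = Box<dyn FnOnce(&mut Connection) + Send>;

fn spawn_db_writer(mut conn: Connection) -> mpsc::Sender<DbJob> {
    let (tx, mut rx) = mpsc::channel::<DbJob>(256);
    // spawn_blocking keeps the synchronous SQLite calls off the async worker threads.
    tokio::task::spawn_blocking(move || {
        while let Some(job) = rx.blocking_recv() {
            job(&mut conn);
        }
    });
    tx
}
```

Callers then send `Box::new(|conn| { ... })` jobs, optionally pairing each job with a `oneshot` channel when they need the result back.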
🟠 High Severity Issues
4. 100ms Busy-Poll Wait in Multi-Connection TCP (src/server.rs:313-332)
Problem: When waiting for secondary TCP connections to join a multi-connection session, the primary connection busy-polls the session map every 100ms:
loop {
let count = { let map = sessions.lock().await; ... };
if count + 1 >= conn_count as usize { break; }
tokio::time::sleep(Duration::from_millis(100)).await;
}
This adds up to 100ms of unnecessary latency to every multi-connection test startup. It also hammers the async mutex needlessly.
Fix: Replace with tokio::sync::Notify. When a secondary connection registers itself, it calls notify_one(). The primary waits on notified().await with a timeout, waking instantly when ready.
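For illustration, a hedged sketch of the Notify-based wait; the shared atomic counter, the 10-second join timeout, and the function name are assumptions rather than the existing server.rs structure:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Notify;

/// Each secondary connection bumps `joined` and calls `notify.notify_one()` as it
/// registers; the primary sleeps until woken (or the deadline passes) and re-checks.
async fn wait_for_secondaries(
    joined: Arc<AtomicUsize>,
    notify: Arc<Notify>,
    conn_count: usize,
) -> bool {
    let deadline = tokio::time::Instant::now() + Duration::from_secs(10); // assumed join timeout
    while joined.load(Ordering::Acquire) + 1 < conn_count {
        // Wake immediately when a secondary registers instead of polling every 100ms.
        if tokio::time::timeout_at(deadline, notify.notified()).await.is_err() {
            return false; // timed out waiting for secondaries
        }
    }
    true
}
```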
5. FreeBSD CPU Sampling Spawns Process Every Second (src/cpu.rs:142)
Problem: On FreeBSD, get_cpu_times() spawns sysctl -n kern.cp_time via std::process::Command every second:
fn get_cpu_times() -> (u64, u64) {
if let Ok(output) = std::process::Command::new("sysctl")
.arg("-n").arg("kern.cp_time").output()
{ ... }
}
fork() + exec() is extremely expensive relative to the work being done (reading 5 integers).
Fix: Use libc::sysctl() via FFI, matching the macOS implementation style. Cache the mib array and call the syscall directly.
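One possible shape for that fix, hedged: it uses `sysctlbyname` rather than a cached mib, assumes the classic 5-entry (user, nice, sys, intr, idle) layout of `kern.cp_time`, and returns an `Option<(total, idle)>` for brevity where the real function returns `(u64, u64)`:

```rust
// FreeBSD-only sketch: read kern.cp_time via libc instead of forking `sysctl`.
#[cfg(target_os = "freebsd")]
fn get_cpu_times() -> Option<(u64, u64)> {
    let name = std::ffi::CString::new("kern.cp_time").ok()?;
    let mut times: [libc::c_long; 5] = [0; 5]; // user, nice, sys, intr, idle
    let mut len = std::mem::size_of_val(&times) as libc::size_t;
    let rc = unsafe {
        libc::sysctlbyname(
            name.as_ptr(),
            times.as_mut_ptr() as *mut libc::c_void,
            &mut len,
            std::ptr::null_mut(),
            0,
        )
    };
    if rc != 0 {
        return None; // sysctl failed; caller falls back or skips the sample
    }
    let total: u64 = times.iter().map(|&t| t as u64).sum();
    let idle = times[4] as u64;
    Some((total, idle))
}
```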
6. Per-Call Timer Registration in UDP RX Loops (src/client.rs:393, src/server.rs:925)
Problem: Both UDP RX loops create a new tokio::time::timeout timer on every single recv/recv_from call:
match tokio::time::timeout(Duration::from_secs(5), socket.recv(&mut buf)).await
At high packet rates (e.g., 100K pps), registering and canceling timers on the Tokio timer wheel adds measurable overhead.
Fix: Use tokio::select! with a long-lived tokio::time::sleep future that is reset, or use the socket's built-in SO_RCVTIMEO if available via socket2. Alternatively, since UDP is connectionless, consider whether a 5-second timeout is needed on every call or if the outer test duration timer is sufficient.
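A hedged sketch of the "outer test duration timer" alternative, using a single pinned sleep for the whole window instead of a per-recv timeout; the function name and buffer size are placeholders:

```rust
use std::time::Duration;
use tokio::net::UdpSocket;

// One long-lived deadline for the whole test window; no timer is registered per recv().
async fn udp_rx_loop(socket: UdpSocket, test_duration: Duration) -> std::io::Result<u64> {
    let mut buf = vec![0u8; 65536];
    let mut received = 0u64;
    let deadline = tokio::time::sleep(test_duration);
    tokio::pin!(deadline);

    loop {
        tokio::select! {
            // Fires once, when the test window ends.
            _ = &mut deadline => break,
            res = socket.recv(&mut buf) => {
                received += res? as u64;
            }
        }
    }
    Ok(received)
}
```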
🟡 Medium Severity Issues
7. String Error Matching with Allocation (src/server_pro/enforcer.rs:157-161)
match format!("{}", e).as_str() {
s if s.contains("daily") => ...
}
format!("{}", e) allocates a String from the error just to do substring matching. Use e.to_string().contains(...) or match on error types directly if possible.
8. ip.to_string() Called Repeatedly in Quota Checks (src/server_pro/quota.rs:389)
let ip_str = ip.to_string();
// ... used in 6+ DB calls
This allocates a String on every quota check. Accept &str or IpAddr directly in DB methods, or cache the string.
9. chrono_date_today() Recomputes Calendar from Epoch (src/server_pro/user_db.rs:617-638)
A hand-rolled date calculation loops through years from 1970 and months every time it's called (which is before almost every DB write). The chrono crate is already used indirectly by rusqlite; add it as a direct dependency and replace with chrono::Local::now().format("%Y-%m-%d").
🟢 Low Severity / Easy Wins
10. CSV File Reopened on Every Write (src/csv_output.rs:77)
if let Ok(mut f) = OpenOptions::new().append(true).open(path) {
let _ = writeln!(f, "{}", row);
}
Called once per test, not per-packet, but still suboptimal. Consider keeping a lazily-initialized Mutex<Option<File>> or using std::fs::OpenOptions once at init and storing the handle.
11. Global Syslog Mutex Held During I/O (src/syslog_logger.rs)
static SYSLOG: Mutex<Option<SyslogSender>> = Mutex::new(None);
The global std::sync::Mutex is held while formatting the timestamp (expensive manual calendar math) and sending UDP. Switch to parking_lot::Mutex (faster) or tokio::sync::Mutex if async, and format the message outside the lock. Better yet, use tracing's built-in syslog integration or a structured appender.
12. hash_password() Uses format! + format! in Hex Loop (src/server_pro/user_db.rs:612-614)
hasher.update(format!("{}:{}", username, password).as_bytes());
result.iter().map(|b| format!("{:02x}", b)).collect() // N allocs for N bytes
The hex encoding allocates one String per byte. Use a small fixed buffer or hex crate (already used elsewhere in ecsrp5.rs).
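For reference, a single-allocation hex encoder (or simply `hex::encode(result)` from the already-present `hex` crate); the helper name is hypothetical:

```rust
use std::fmt::Write;

// Hex-encode a digest with one allocation instead of one String per byte.
fn to_hex(digest: &[u8]) -> String {
    let mut out = String::with_capacity(digest.len() * 2);
    for b in digest {
        let _ = write!(out, "{:02x}", b);
    }
    out
}
```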
13. Redundant Instant::now() Calls in TX Loop (src/server.rs:593,606)
if send_status && Instant::now() >= next_status {
// ...
next_status = Instant::now() + Duration::from_secs(1);
}
Two monotonic clock reads per loop iteration. Cache let now = Instant::now(); at the top of the loop.
Architecture Observations
What the Code Does Well
- Zero-allocation protocol layer: `serialize()` returns fixed-size stack arrays (`[u8; 12]`, `[u8; 16]`). Excellent.
- Atomic bandwidth tracking: `BandwidthState` uses `AtomicU64` with `Relaxed` ordering in the per-packet path. No locks in the data plane.
- Buffer reuse: TX/RX loops allocate `vec![0u8; ...]` once before the loop. Good.
- Aggressive release profile: `lto = true`, `codegen-units = 1`, `opt-level = 3`.
Async Runtime Usage
`tokio` with `full` features is used. For a primarily I/O-bound tool, this is appropriate. `tokio::task::yield_now().await` is used in unlimited-rate mode to prevent starving the runtime. This is correct, but consider whether `tokio::task::spawn_blocking` or dedicated CPU pinning is needed for the EC-SRP5 math (which is CPU-bound and currently runs on the async runtime during auth).
Memory Safety
- Several `unwrap()`/`expect()` calls in setup paths (socket binding, address parsing). These are acceptable for config errors but should use `?` propagation where possible to allow graceful degradation.
Missing Performance Infrastructure
| Infrastructure | Status | Recommendation |
|---|---|---|
| Benchmarks | ❌ None | Add criterion + benches/ for BandwidthState, protocol ser/de, and EC-SRP5 auth |
| Profiling hooks | ❌ None | Add optional pprof or dhat dev-deps for heap profiling |
| Throughput regression tests | ⚠️ Partial | Integration tests assert tx > 0 and rx > 0 but don't measure sustained throughput |
| Load tests | ❌ None | Add a benches/load_test.rs that spawns 100+ concurrent tests against a local server |
| CI performance gates | ❌ None | Consider a benchmark action that fails on >5% regression |
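As a starting point for the missing benchmark infrastructure, a minimal Criterion harness could measure the hot-path scan from issue 1; the file name and bench ID below are placeholders:

```rust
// benches/scan.rs — requires criterion (and memchr) as dev-dependencies
// plus a [[bench]] entry with harness = false in Cargo.toml.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Measure the memchr-based status scan on a 256 KiB all-zero buffer,
// the common case described in issue 1.
fn bench_scan(c: &mut Criterion) {
    let buf = vec![0u8; 256 * 1024];
    c.bench_function("memchr_256k_zero", |b| {
        b.iter(|| memchr::memchr(0x07, black_box(&buf)))
    });
}

criterion_group!(benches, bench_scan);
criterion_main!(benches);
```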
Priority Action Plan
Phase 1: Hot-Path Fixes (1-2 days)
- Replace buffer scan with `memchr` or ring-buffer approach in `tcp_client_rx_loop`
- Cache `WCurve` in a global `LazyLock`
- Replace 100ms poll with `tokio::sync::Notify` in multi-conn wait
Phase 2: Scalability (2-3 days)
- Add SQLite connection pooling or channel-based writer in `server_pro`
- Cache `remaining_budget()` results for 5-10 seconds
- Fix FreeBSD CPU sampling to use `libc::sysctl` FFI
Phase 3: Polish & Tooling (1-2 days)
- Replace manual date arithmetic with `chrono`
- Add `criterion` benchmarks for auth and bandwidth state
- Fix low-severity allocation issues (CSV, syslog, hex encoding)
Appendix: File-by-File Quick Reference
| File | Lines | Hot Path? | Key Concern |
|---|---|---|---|
| `src/client.rs` | 531 | ✅ Yes | O(n) 256KB scan per TCP read |
| `src/server.rs` | 1094 | ✅ Yes | 100ms poll wait, status injection timing |
| `src/ecsrp5.rs` | 660 | ✅ Yes (auth) | WCurve::new() recomputed per auth |
| `src/bandwidth.rs` | 263 | ✅ Yes (atomics) | Well-designed; no issues |
| `src/protocol.rs` | 214 | ✅ Yes (ser/de) | Zero-allocation; excellent |
| `src/cpu.rs` | 215 | ⚠️ Periodic | FreeBSD fork+exec every second |
| `src/server_pro/quota.rs` | 470 | ⚠️ Periodic | 15 DB queries per budget check |
| `src/server_pro/user_db.rs` | 641 | ⚠️ All DB ops | Single mutex serializes everything |
| `src/server_pro/server_loop.rs` | 449 | ✅ Yes | DB auth locks during connection setup |
| `src/server_pro/enforcer.rs` | 411 | ⚠️ Periodic | String error matching allocates |
| `src/csv_output.rs` | 86 | ❌ No | File reopen per write |
| `src/syslog_logger.rs` | 154 | ❌ No | Global mutex + manual calendar math |
| `src/auth.rs` | 164 | ⚠️ Auth only | Minor; double MD5 per auth |
| `src/main.rs` | 243 | ❌ No | Entry point only |