This commit fixes the most significant hot-path bottleneck in the
client: the tcp_client_rx_loop was scanning up to 256KB byte-by-byte
on every read() call looking for interleaved 12-byte status messages.
Changes:
- client.rs (tcp_client_rx_loop): Replace the O(n) for-loop scan
with a three-stage approach:
1. Split-message check: An 11-byte carry buffer stores trailing
bytes from the previous read. We check every possible alignment
where a status message (0x07 + cpu_byte) could span the carry
and the start of the current buffer. This fixes a latent bug
where the old code would miss status messages split across TCP
read boundaries.
2. Fast scan: memchr::memchr (AVX2/NEON SIMD) finds 0x07 bytes
in the 256KB buffer. On all-zero data packets this exits in
~4096 SIMD-width operations instead of 262,144 byte compares.
~64x faster scan path.
3. Carry save: Save up to 11 trailing bytes for the next read.
- client.rs (unit tests): Add scan_status_message() helper and
five unit tests covering:
- Status message fully within buffer
- Status message split across reads (5+7 bytes)
- Status message split at boundary (1+11 bytes)
- All-zero buffer (no false positive)
- Short buffer (no panic)
- Cargo.toml / Cargo.lock: Add memchr as an explicit dependency.
Verified against live MikroTik RouterOS (TCP both + receive modes
with EC-SRP5 auth). Status messages detected correctly. No wire
protocol changes — 100% MikroTik compatible.
# btest-rs Performance Audit

**Date:** 2026-04-30

**Scope:** Full codebase (`src/`, `tests/`, `Cargo.toml`)

**Methodology:** Static code analysis, hot-path tracing, lock/contention review, algorithmic complexity analysis

---
## Executive Summary

The codebase is generally well-structured for a network I/O tool, with good use of atomics in the per-packet hot path and zero-allocation protocol serialization. However, **three critical bottlenecks** significantly limit throughput and scalability:

1. **O(n) buffer scan on every TCP read** in the client RX loop (up to 256 KB scanned per `read()` call)
2. **Expensive EC curve reconstruction on every authentication** (heavy `BigUint` modular arithmetic)
3. **Single SQLite connection mutex** serializing all DB operations in `server_pro`

Additionally, there is **no benchmark or profiling infrastructure** in the project, making it impossible to measure improvements or catch regressions.

---
## Severity Legend

| Icon | Severity | Impact |
|------|----------|--------|
| 🔴 | **Critical** | Direct throughput/latency hit in hot path; fix immediately |
| 🟠 | **High** | Significant overhead under load; fix in next sprint |
| 🟡 | **Medium** | Noticeable at scale or under specific conditions |
| 🟢 | **Low** | Cosmetic / easy wins; batch with other work |

---
## 🔴 Critical Bottlenecks
### 1. O(n) Linear Buffer Scan in `tcp_client_rx_loop` (`src/client.rs:210-216`)

**Problem:** On every TCP `read()` call (up to 256 KB), the client performs a byte-by-byte scan looking for interleaved 12-byte status messages:

```rust
for i in 0..=(n - STATUS_MSG_SIZE) {
    if buf[i] == STATUS_MSG_TYPE && buf[i + 1] >= 0x80 {
        // ...
    }
}
```

Since data packets are all zeros and status bytes are extremely rare, this **almost always scans the entire 256 KB buffer** uselessly. At high bandwidth (many reads per second), this wastes substantial CPU and pollutes cache lines.

**Impact:** CPU-bound slowdown on the client RX side during bidirectional TCP tests. The compiler *may* auto-vectorize the simple loop, but it still processes ~256 KB per read.

**Fix Options (pick one):**

- **Best:** Use the `memchr` crate (SIMD-accelerated `memchr::memchr`) to find `0x07` bytes. On all-zero buffers this exits after a few SIMD-width checks.
- **Alternative:** Since status messages are injected at `write_all` boundaries and data is all zeros, maintain a small 12-byte sliding ring buffer across reads and process the stream with a tiny state machine instead of scanning the whole buffer.
- **Alternative:** Track read bytes modulo the expected packet size. Status messages are injected between full packets, so they appear at predictable offsets *if* the client knows the server's `effective_size`. This requires protocol coordination.
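
A minimal stdlib sketch of this fast path, using `Iterator::position` as a stand-in for `memchr::memchr` (`scan_status_messages` and the validation rule are illustrative; the real loop would also carry trailing bytes across reads):

```rust
const STATUS_MSG_TYPE: u8 = 0x07;
const STATUS_MSG_SIZE: usize = 12;

/// Return the start offsets of candidate status messages in `buf`.
/// `position` hunts for the rare 0x07 marker; swapping it for
/// `memchr::memchr(STATUS_MSG_TYPE, &buf[from..])` gives the SIMD path.
fn scan_status_messages(buf: &[u8]) -> Vec<usize> {
    let mut found = Vec::new();
    let mut from = 0;
    while let Some(off) = buf[from..].iter().position(|&b| b == STATUS_MSG_TYPE) {
        let i = from + off;
        // A valid status message fits in the buffer and has a CPU byte >= 0x80.
        if i + STATUS_MSG_SIZE <= buf.len() && buf[i + 1] >= 0x80 {
            found.push(i);
            from = i + STATUS_MSG_SIZE;
        } else {
            from = i + 1;
        }
    }
    found
}

fn main() {
    // All-zero data packet: no marker, so the scan exits almost immediately.
    assert!(scan_status_messages(&[0u8; 4096]).is_empty());

    // One status message embedded at offset 100.
    let mut buf = vec![0u8; 4096];
    buf[100] = STATUS_MSG_TYPE;
    buf[101] = 0x9b; // cpu byte >= 0x80
    assert_eq!(scan_status_messages(&buf), vec![100]);
}
```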
---
### 2. `WCurve::new()` Recomputes Generator on Every Auth (`src/ecsrp5.rs:363,499`)

**Problem:** Every EC-SRP5 authentication (client and server) calls `WCurve::new()`, which performs `lift_x(9)` → `prime_mod_sqrt()`, heavy `BigUint` modular arithmetic, to derive the curve generator point.

```rust
pub async fn client_authenticate<S: ...>(stream: &mut S, username: &str, password: &str) -> Result<()> {
    let w = WCurve::new(); // <-- expensive, same result every time
    // ...
}
```

The curve constants (`P`, `CURVE_ORDER`, `WEIERSTRASS_A`) are already cached as `LazyLock` statics, but the generator point is not.

**Impact:** Auth latency spikes, especially on the server under many concurrent connections. Each auth performs redundant `BigUint` allocations and modular square roots.

**Fix:** Cache `WCurve` (or at least the generator point) in a global `LazyLock`:

```rust
static WCURVE: std::sync::LazyLock<WCurve> = std::sync::LazyLock::new(WCurve::new);
```

Then use `&*WCURVE` in both `client_authenticate` and `server_authenticate`.
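
The effect of the cache can be sketched with a counter standing in for the expensive derivation (the struct body below is illustrative, not the real `WCurve`):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

// Counts how many times the "expensive" constructor actually runs.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

struct WCurve {
    generator_x: u64, // placeholder field for the derived generator point
}

impl WCurve {
    fn new() -> Self {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        // Placeholder for the lift_x(9) / prime_mod_sqrt() work.
        WCurve { generator_x: 9 }
    }
}

static WCURVE: LazyLock<WCurve> = LazyLock::new(WCurve::new);

fn main() {
    // Many "authentications" share the single cached curve.
    for _ in 0..1000 {
        let w = &*WCURVE;
        assert_eq!(w.generator_x, 9);
    }
    // The constructor ran exactly once.
    assert_eq!(INIT_COUNT.load(Ordering::SeqCst), 1);
}
```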
---
### 3. Single SQLite Mutex Serializes All DB Operations (`src/server_pro/user_db.rs:15-18`)

**Problem:** The entire `server_pro` database layer uses a single shared `Connection` behind a `std::sync::Mutex`:

```rust
pub struct UserDb {
    conn: Arc<Mutex<Connection>>,
}
```

While SQLite WAL mode is enabled (allowing readers to proceed during writes), **the Rust mutex still serializes all access to the connection object**. Under concurrent load with many tests starting/finishing, this becomes the primary bottleneck.

**Critical sub-issue:** `QuotaManager::remaining_budget()` (`src/server_pro/quota.rs:387`) performs **up to 15 separate SQLite queries** in sequence, locking the mutex 15+ times per pre-test check.

**Impact:** Connection setup/teardown latency increases linearly with concurrency. Quota checks and usage recording block each other.

**Fix Options:**

- **Connection pooling:** Use `r2d2_sqlite` or `deadpool-sqlite` to maintain a small pool of connections (SQLite handles this well in WAL mode).
- **Separate read/write paths:** Open a read-only connection for quota checks (`remaining_budget`) and a dedicated write connection for usage recording. SQLite WAL allows this.
- **Batch quota checks:** Cache quota results for a few seconds per user/IP to avoid redundant queries.
- **Channel-based writer:** Use a single dedicated DB writer task with an `mpsc` channel so only one task ever touches the connection, eliminating lock contention entirely.
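
A synchronous sketch of the channel-based-writer option using `std::sync::mpsc` (in the real server this would be a Tokio task with `tokio::sync::mpsc`; the `DbWrite` commands are illustrative):

```rust
use std::sync::mpsc;
use std::thread;

// Commands the writer understands; stand-ins for real SQL statements.
enum DbWrite {
    RecordUsage { user: String, bytes: u64 },
    Shutdown,
}

fn main() {
    let (tx, rx) = mpsc::channel::<DbWrite>();

    // Single writer task: the sole owner of the (mock) connection,
    // so no mutex around the connection is needed at all.
    let writer = thread::spawn(move || {
        let mut total = 0u64;
        for cmd in rx {
            match cmd {
                DbWrite::RecordUsage { bytes, .. } => total += bytes, // conn.execute(...) here
                DbWrite::Shutdown => break,
            }
        }
        total
    });

    // Any number of connection handlers can send without blocking each other.
    for i in 0..10 {
        tx.send(DbWrite::RecordUsage { user: format!("u{i}"), bytes: 100 }).unwrap();
    }
    tx.send(DbWrite::Shutdown).unwrap();
    assert_eq!(writer.join().unwrap(), 1000);
}
```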
---
## 🟠 High Severity Issues
### 4. 100ms Busy-Poll Wait in Multi-Connection TCP (`src/server.rs:313-332`)

**Problem:** When waiting for secondary TCP connections to join a multi-connection session, the primary connection busy-polls the session map every 100 ms:

```rust
loop {
    let count = { let map = sessions.lock().await; ... };
    if count + 1 >= conn_count as usize { break; }
    tokio::time::sleep(Duration::from_millis(100)).await;
}
```

This adds up to **100 ms of unnecessary latency** to every multi-connection test startup and hammers the async mutex needlessly.

**Fix:** Replace the poll with `tokio::sync::Notify`. When a secondary connection registers itself, it calls `notify_one()`; the primary waits on `notified().await` with a timeout and wakes instantly when the last connection joins.
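
The same handshake can be sketched synchronously, with `std::sync::Condvar` standing in for `tokio::sync::Notify` (the counts and timeout are illustrative):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let conn_count = 4usize;
    // Shared (joined-count, condvar) pair between primary and secondaries.
    let state = Arc::new((Mutex::new(1usize), Condvar::new()));

    // Secondary connections register themselves and notify; no polling loop.
    for _ in 1..conn_count {
        let state = Arc::clone(&state);
        thread::spawn(move || {
            let (count, cv) = &*state;
            *count.lock().unwrap() += 1;
            cv.notify_one();
        });
    }

    // Primary blocks until the last secondary joins (or the timeout fires),
    // waking immediately on each notification instead of every 100 ms.
    let (count, cv) = &*state;
    let (guard, timeout) = cv
        .wait_timeout_while(count.lock().unwrap(), Duration::from_secs(5), |c| {
            *c < conn_count
        })
        .unwrap();
    assert!(!timeout.timed_out());
    assert_eq!(*guard, conn_count);
}
```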
---
### 5. FreeBSD CPU Sampling Spawns Process Every Second (`src/cpu.rs:142`)

**Problem:** On FreeBSD, `get_cpu_times()` spawns `sysctl -n kern.cp_time` via `std::process::Command` every second:

```rust
fn get_cpu_times() -> (u64, u64) {
    if let Ok(output) = std::process::Command::new("sysctl")
        .arg("-n").arg("kern.cp_time").output()
    { ... }
}
```

A `fork()` + `exec()` is extremely expensive relative to the work being done (reading five integers).

**Fix:** Use `libc::sysctl()` via FFI, matching the macOS implementation style. Cache the `mib` array and call the syscall directly.
---
### 6. Per-Call Timer Registration in UDP RX Loops (`src/client.rs:393`, `src/server.rs:925`)

**Problem:** Both UDP RX loops create a new `tokio::time::timeout` timer on **every single `recv`/`recv_from` call**:

```rust
match tokio::time::timeout(Duration::from_secs(5), socket.recv(&mut buf)).await
```

At high packet rates (e.g., 100K pps), registering and canceling timers on the Tokio timer wheel adds measurable overhead.

**Fix:** Use `tokio::select!` with a long-lived `tokio::time::sleep` future that is periodically reset, or set `SO_RCVTIMEO` on the socket if using blocking I/O (Tokio's nonblocking sockets cannot use it). Alternatively, since UDP is connectionless, consider whether a 5-second timeout is needed on every call or whether the outer test-duration timer is sufficient.
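
For the blocking-socket variant, the kernel-level timeout is a single call on `std::net::UdpSocket` (a sketch; the timeout surfaces as `WouldBlock` on Unix or `TimedOut` on Windows):

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // SO_RCVTIMEO set once: no per-call userspace timer registration at all.
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    socket.set_read_timeout(Some(Duration::from_millis(50)))?;

    let mut buf = [0u8; 1500];
    match socket.recv_from(&mut buf) {
        Ok((n, from)) => println!("got {n} bytes from {from}"),
        // Nothing was sent to this socket, so the timeout fires.
        Err(e) => assert!(matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut)),
    }
    Ok(())
}
```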
---
## 🟡 Medium Severity Issues
### 7. String Error Matching with Allocation (`src/server_pro/enforcer.rs:157-161`)

```rust
match format!("{}", e).as_str() {
    s if s.contains("daily") => ...
}
```

`format!("{}", e)` allocates a `String` from the error just to do substring matching (`e.to_string()` would allocate just the same). Match on typed error variants directly where possible, so no formatting is needed at all.
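
A sketch of the typed-variant alternative (the `QuotaError` enum and `retry_after` helper are hypothetical, standing in for the enforcer's real error type):

```rust
use std::fmt;

// Typed quota errors: matching on variants needs no string formatting.
#[derive(Debug, PartialEq)]
enum QuotaError {
    DailyExceeded,
    MonthlyExceeded,
}

impl fmt::Display for QuotaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            QuotaError::DailyExceeded => write!(f, "daily quota exceeded"),
            QuotaError::MonthlyExceeded => write!(f, "monthly quota exceeded"),
        }
    }
}

fn retry_after(e: &QuotaError) -> &'static str {
    // Zero-allocation dispatch, versus format!("{e}").contains("daily").
    match e {
        QuotaError::DailyExceeded => "tomorrow",
        QuotaError::MonthlyExceeded => "next month",
    }
}

fn main() {
    assert_eq!(retry_after(&QuotaError::DailyExceeded), "tomorrow");
    assert_eq!(retry_after(&QuotaError::MonthlyExceeded), "next month");
}
```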
---
### 8. `ip.to_string()` Called Repeatedly in Quota Checks (`src/server_pro/quota.rs:389`)

```rust
let ip_str = ip.to_string();
// ... used in 6+ DB calls
```

This allocates a `String` on every quota check. Accept `&str` or `IpAddr` directly in the DB methods, or cache the string for the duration of the check.
---
### 9. `chrono_date_today()` Recomputes Calendar from Epoch (`src/server_pro/user_db.rs:617-638`)
A hand-rolled date calculation loops through years from 1970 and then through months on every call, and it is called before almost every DB write. The `chrono` crate is already used indirectly by `rusqlite`; add it as a direct dependency and replace the loop with `chrono::Local::now().format("%Y-%m-%d")`.
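
If adding `chrono` is undesirable, the year/month loop can instead become a constant-time conversion; a sketch following Howard Hinnant's civil_from_days algorithm:

```rust
// O(1) days-since-epoch -> (year, month, day); no year/month loop required.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468;
    let era = if z >= 0 { z } else { z - 146_096 } / 146_097;
    let doe = (z - era * 146_097) as u64;                              // day of era   [0, 146096]
    let yoe = (doe - doe / 1460 + doe / 36_524 - doe / 146_096) / 365; // year of era  [0, 399]
    let y = yoe as i64 + era * 400;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);                 // day of year  [0, 365]
    let mp = (5 * doy + 2) / 153;                                      // month index  [0, 11]
    let d = (doy - (153 * mp + 2) / 5 + 1) as u32;                     // day          [1, 31]
    let m: u32 = if mp < 10 { (mp + 3) as u32 } else { (mp - 9) as u32 }; // month     [1, 12]
    (if m <= 2 { y + 1 } else { y }, m, d)
}

fn main() {
    assert_eq!(civil_from_days(0), (1970, 1, 1));      // Unix epoch
    assert_eq!(civil_from_days(11_017), (2000, 3, 1)); // leap-century check
    assert_eq!(civil_from_days(19_723), (2024, 1, 1));
}
```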
---
## 🟢 Low Severity / Easy Wins
### 10. CSV File Reopened on Every Write (`src/csv_output.rs:77`)

```rust
if let Ok(mut f) = OpenOptions::new().append(true).open(path) {
    let _ = writeln!(f, "{}", row);
}
```

Called once per test, not per packet, but still suboptimal. Keep a lazily-initialized `Mutex<Option<File>>`, or open the file once at init and store the handle.
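
A sketch of the cached-handle approach using `OnceLock` (the `append_row` helper and path handling are illustrative):

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::sync::{Mutex, OnceLock};

// Open the CSV once and reuse the handle; OnceLock defers the open
// until the first row is written. (Single-path usage assumed.)
static CSV_FILE: OnceLock<Mutex<File>> = OnceLock::new();

fn append_row(path: &str, row: &str) -> std::io::Result<()> {
    let file = CSV_FILE.get_or_init(|| {
        Mutex::new(
            OpenOptions::new()
                .create(true)
                .append(true)
                .open(path)
                .expect("failed to open CSV output"),
        )
    });
    writeln!(file.lock().unwrap(), "{row}")
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("btest_results.csv");
    let path = path.to_str().unwrap().to_owned();
    append_row(&path, "timestamp,tx_bps,rx_bps")?;
    append_row(&path, "1764000000,95000000,94000000")?;
    let contents = std::fs::read_to_string(&path)?;
    assert!(contents.contains("tx_bps"));
    std::fs::remove_file(&path)?;
    Ok(())
}
```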
---
### 11. Global Syslog Mutex Held During I/O (`src/syslog_logger.rs`)

```rust
static SYSLOG: Mutex<Option<SyslogSender>> = Mutex::new(None);
```

The global `std::sync::Mutex` is held while formatting the timestamp (expensive manual calendar math) and sending the UDP datagram. Switch to `parking_lot::Mutex` (faster) or `tokio::sync::Mutex` if async, and format the message outside the lock. Better yet, route logging through `tracing` with a syslog-capable subscriber or a structured appender.
---
### 12. `hash_password()` Allocates in Both Input and Hex Paths (`src/server_pro/user_db.rs:612-614`)

```rust
hasher.update(format!("{}:{}", username, password).as_bytes());
result.iter().map(|b| format!("{:02x}", b)).collect() // N allocs for N bytes
```

The hex encoding allocates one `String` per byte. Use a small fixed buffer or the `hex` crate (already used elsewhere in `ecsrp5.rs`).
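
A minimal fixed-table encoder with a single allocation (the `hex_encode` helper is illustrative, roughly what the `hex` crate does internally):

```rust
const HEX_DIGITS: &[u8; 16] = b"0123456789abcdef";

/// Hex-encode with exactly one String allocation, versus one per byte.
fn hex_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_DIGITS[(b >> 4) as usize] as char);
        out.push(HEX_DIGITS[(b & 0x0f) as usize] as char);
    }
    out
}

fn main() {
    assert_eq!(hex_encode(&[0x00, 0xff, 0x1a]), "00ff1a");
    assert_eq!(hex_encode(&[]), "");
}
```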
---
### 13. Redundant `Instant::now()` Calls in TX Loop (`src/server.rs:593,606`)

```rust
if send_status && Instant::now() >= next_status {
    // ...
    next_status = Instant::now() + Duration::from_secs(1);
}
```

Two monotonic clock reads per loop iteration. Cache `let now = Instant::now();` at the top of the loop and reuse it.
---
## Architecture Observations

### What the Code Does Well

- **Zero-allocation protocol layer:** `serialize()` returns fixed-size stack arrays (`[u8; 12]`, `[u8; 16]`). Excellent.
- **Atomic bandwidth tracking:** `BandwidthState` uses `AtomicU64` with `Relaxed` ordering in the per-packet path. No locks in the data plane.
- **Buffer reuse:** TX/RX loops allocate `vec![0u8; ...]` once before the loop. Good.
- **Aggressive release profile:** `lto = true`, `codegen-units = 1`, `opt-level = 3`.

### Async Runtime Usage

- `tokio` with `full` features is used. For a primarily I/O-bound tool, this is appropriate.
- `tokio::task::yield_now().await` is used in unlimited-rate mode to prevent starving the runtime. This is correct, but consider whether `tokio::task::spawn_blocking` or dedicated CPU pinning is warranted for the EC-SRP5 math, which is CPU-bound and currently runs on the async runtime during auth.

### Memory Safety

- Several `unwrap()`/`expect()` calls sit in setup paths (socket binding, address parsing). These are acceptable for config errors but should use `?` propagation where possible to allow graceful degradation.
---
## Missing Performance Infrastructure

| Infrastructure | Status | Recommendation |
|----------------|--------|----------------|
| **Benchmarks** | ❌ None | Add `criterion` + `benches/` for `BandwidthState`, protocol ser/de, and EC-SRP5 auth |
| **Profiling hooks** | ❌ None | Add optional `pprof` or `dhat` dev-deps for heap profiling |
| **Throughput regression tests** | ⚠️ Partial | Integration tests assert `tx > 0` and `rx > 0` but don't measure sustained throughput |
| **Load tests** | ❌ None | Add a `benches/load_test.rs` that spawns 100+ concurrent tests against a local server |
| **CI performance gates** | ❌ None | Consider a benchmark action that fails on >5% regression |
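
Until `criterion` is wired in, a minimal stdlib harness can give rough ns/iter numbers (the `bench` helper is illustrative; criterion adds warmup, statistics, and regression detection on top of this idea):

```rust
use std::hint::black_box;
use std::time::Instant;

/// Time `iters` calls of `f` and report the mean ns per iteration.
fn bench<F: FnMut() -> u64>(name: &str, iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from deleting the workload.
        black_box(f());
    }
    let ns_per_iter = start.elapsed().as_nanos() as f64 / iters as f64;
    println!("{name}: {ns_per_iter:.1} ns/iter");
    ns_per_iter
}

fn main() {
    // Stand-in workload for e.g. protocol serialization or BandwidthState updates.
    let ns = bench("sum_1k", 10_000, || (0u64..1000).sum());
    assert!(ns >= 0.0);
}
```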
---
## Priority Action Plan

### Phase 1: Hot-Path Fixes (1-2 days)

1. Replace the buffer scan with `memchr` or the ring-buffer approach in `tcp_client_rx_loop`
2. Cache `WCurve` in a global `LazyLock`
3. Replace the 100 ms poll with `tokio::sync::Notify` in the multi-connection wait

### Phase 2: Scalability (2-3 days)

4. Add SQLite connection pooling or a channel-based writer in `server_pro`
5. Cache `remaining_budget()` results for 5-10 seconds
6. Fix FreeBSD CPU sampling to use `libc::sysctl` FFI

### Phase 3: Polish & Tooling (1-2 days)

7. Replace manual date arithmetic with `chrono`
8. Add `criterion` benchmarks for auth and bandwidth state
9. Fix low-severity allocation issues (CSV, syslog, hex encoding)
---
## Appendix: File-by-File Quick Reference

| File | Lines | Hot Path? | Key Concern |
|------|-------|-----------|-------------|
| `src/client.rs` | 531 | ✅ Yes | O(n) 256 KB scan per TCP read |
| `src/server.rs` | 1094 | ✅ Yes | 100 ms poll wait, status injection timing |
| `src/ecsrp5.rs` | 660 | ✅ Yes (auth) | `WCurve::new()` recomputed per auth |
| `src/bandwidth.rs` | 263 | ✅ Yes (atomics) | Well-designed; no issues |
| `src/protocol.rs` | 214 | ✅ Yes (ser/de) | Zero-allocation; excellent |
| `src/cpu.rs` | 215 | ⚠️ Periodic | FreeBSD `fork`+`exec` every second |
| `src/server_pro/quota.rs` | 470 | ⚠️ Periodic | 15 DB queries per budget check |
| `src/server_pro/user_db.rs` | 641 | ⚠️ All DB ops | Single mutex serializes everything |
| `src/server_pro/server_loop.rs` | 449 | ✅ Yes | DB auth locks during connection setup |
| `src/server_pro/enforcer.rs` | 411 | ⚠️ Periodic | String error matching allocates |
| `src/csv_output.rs` | 86 | ❌ No | File reopen per write |
| `src/syslog_logger.rs` | 154 | ❌ No | Global mutex + manual calendar math |
| `src/auth.rs` | 164 | ⚠️ Auth only | Minor; double MD5 per auth |
| `src/main.rs` | 243 | ❌ No | Entry point only |