This commit fixes the most significant hot-path bottleneck in the
client: the tcp_client_rx_loop was scanning up to 256KB byte-by-byte
on every read() call looking for interleaved 12-byte status messages.
Changes:
- client.rs (tcp_client_rx_loop): Replace the O(n) for-loop scan
with a three-stage approach:
1. Split-message check: An 11-byte carry buffer stores trailing
bytes from the previous read. We check every possible alignment
where a status message (0x07 + cpu_byte) could span the carry
and the start of the current buffer. This fixes a latent bug
where the old code would miss status messages split across TCP
read boundaries.
2. Fast scan: memchr::memchr (AVX2/NEON SIMD) finds 0x07 bytes
in the 256KB buffer. On all-zero data packets this exits in
~4096 SIMD-width operations instead of 262,144 byte compares.
~64x faster scan path.
3. Carry save: Save up to 11 trailing bytes for the next read.
- client.rs (unit tests): Add scan_status_message() helper and
five unit tests covering:
- Status message fully within buffer
- Status message split across reads (5+7 bytes)
- Status message split at boundary (1+11 bytes)
- All-zero buffer (no false positive)
- Short buffer (no panic)
- Cargo.toml / Cargo.lock: Add memchr as an explicit dependency.
Verified against live MikroTik RouterOS (TCP both + receive modes
with EC-SRP5 auth). Status messages detected correctly. No wire
protocol changes — 100% MikroTik compatible.
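For illustration only, a hedged sketch of the stage-1 split-message check described above; the helper name, callback shape, and constants follow this description and may not match the actual code in client.rs:

```rust
const STATUS_MSG_SIZE: usize = 12;
const STATUS_MSG_TYPE: u8 = 0x07;

/// Check every alignment where a status message could start in `carry`
/// (the <= 11 trailing bytes of the previous read) and finish in `buf`.
fn check_split_status(carry: &[u8], buf: &[u8], mut on_status: impl FnMut(&[u8; STATUS_MSG_SIZE])) {
    debug_assert!(carry.len() < STATUS_MSG_SIZE);
    for start in 0..carry.len() {
        let from_carry = carry.len() - start;        // bytes taken from the carry
        let from_buf = STATUS_MSG_SIZE - from_carry; // bytes needed from the new read
        if from_buf > buf.len() || carry[start] != STATUS_MSG_TYPE {
            continue;
        }
        // Reassemble the candidate message across the read boundary.
        let mut msg = [0u8; STATUS_MSG_SIZE];
        msg[..from_carry].copy_from_slice(&carry[start..]);
        msg[from_carry..].copy_from_slice(&buf[..from_buf]);
        if msg[1] >= 0x80 {
            on_status(&msg);
        }
    }
}
```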
btest-rs Performance Audit
Date: 2026-04-30
Scope: Full codebase (src/, tests/, Cargo.toml)
Methodology: Static code analysis, hot-path tracing, lock/contention review, algorithmic complexity analysis
Executive Summary
The codebase is generally well-structured for a network I/O tool, with good use of atomics in the per-packet hot path and zero-allocation protocol serialization. However, three critical bottlenecks significantly limit throughput and scalability:
- O(n) buffer scan on every TCP read in the client RX loop (up to 256KB scanned per `read()` call)
- Expensive EC curve reconstruction on every authentication (heavy `BigUint` modular arithmetic)
- Single SQLite connection mutex serializing all DB operations in `server_pro`
Additionally, there is no benchmark or profiling infrastructure in the project, making it impossible to measure improvements or catch regressions.
Severity Legend
| Icon | Severity | Impact |
|---|---|---|
| 🔴 | Critical | Direct throughput/latency hit in hot path; fix immediately |
| 🟠 | High | Significant overhead under load; fix in next sprint |
| 🟡 | Medium | Noticeable at scale or under specific conditions |
| 🟢 | Low | Cosmetic / easy wins; batch with other work |
🔴 Critical Bottlenecks
1. O(n) Linear Buffer Scan in tcp_client_rx_loop (src/client.rs:210-216)
Problem: On every TCP read() call (up to 256KB), the client performs a byte-by-byte scan looking for interleaved 12-byte status messages:
for i in 0..=(n - STATUS_MSG_SIZE) {
if buf[i] == STATUS_MSG_TYPE && buf[i + 1] >= 0x80 {
// ...
}
}
Since data packets are all zeros and status bytes are extremely rare, this almost always scans the entire 256KB buffer uselessly. At high bandwidth (many reads/sec), this wastes massive CPU cycles and pollutes cache lines.
Impact: CPU-bound slowdown on the client RX side during bidirectional TCP tests. The compiler may auto-vectorize the simple loop, but it still processes ~256K bytes per read.
Fix Options (pick one):
- Best: Use `memchr` (crate, or `std::slice::memchr` on nightly) to find `0x07` bytes. On all-zero buffers this exits after a few SIMD-width checks (see the sketch after this list).
- Alternative: Since status messages are injected at `write_all` boundaries and data is all zeros, maintain a small 12-byte sliding ring buffer across reads. Process the stream with a tiny state machine instead of scanning the whole buffer.
- Alternative: Track read bytes modulo expected packet size. Status messages are injected between full packets, so they will appear at predictable offsets if the client knows the server's `effective_size`. This requires protocol coordination.
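As an illustration of the memchr option, a hedged sketch of the fast-scan stage; the function name and callback shape are hypothetical and the real client.rs code may differ:

```rust
use memchr::memchr;

const STATUS_MSG_SIZE: usize = 12;
const STATUS_MSG_TYPE: u8 = 0x07;

/// Scan a read buffer for interleaved status messages. memchr's SIMD search
/// skips the all-zero payload; only candidate 0x07 positions are inspected.
fn scan_status_messages(buf: &[u8], mut on_status: impl FnMut(&[u8])) {
    let mut pos = 0;
    while let Some(off) = memchr(STATUS_MSG_TYPE, &buf[pos..]) {
        let i = pos + off;
        if i + STATUS_MSG_SIZE <= buf.len() && buf[i + 1] >= 0x80 {
            on_status(&buf[i..i + STATUS_MSG_SIZE]);
            pos = i + STATUS_MSG_SIZE; // skip past the consumed message
        } else {
            // False hit, or a candidate truncated at the buffer tail (handled
            // by a carry buffer in the real code); keep scanning.
            pos = i + 1;
        }
    }
}
```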
2. WCurve::new() Recomputes Generator on Every Auth (src/ecsrp5.rs:363,499)
Problem: Every EC-SRP5 authentication (client and server) calls WCurve::new(), which performs lift_x(9) → prime_mod_sqrt() — heavy BigUint modular arithmetic to derive the curve generator point.
pub async fn client_authenticate<S: ...>(stream: &mut S, username: &str, password: &str) -> Result<()> {
let w = WCurve::new(); // <-- expensive, same result every time
// ...
}
The curve constants (P, CURVE_ORDER, WEIERSTRASS_A) are already cached as LazyLock statics, but the generator point is not.
Impact: Auth latency spikes, especially on the server under many concurrent connections. Each auth does redundant BigUint allocations and modular square roots.
Fix: Cache WCurve (or at least the generator point) in a global LazyLock:
static WCURVE: std::sync::LazyLock<WCurve> = std::sync::LazyLock::new(WCurve::new);
Then use &*WCURVE in both client_authenticate and server_authenticate.
3. Single SQLite Mutex Serializes All DB Operations (src/server_pro/user_db.rs:15-18)
Problem: The entire server_pro database layer uses a single shared Connection behind a std::sync::Mutex:
pub struct UserDb {
conn: Arc<Mutex<Connection>>,
}
While SQLite WAL mode is enabled (allowing readers to proceed during writes), the Rust mutex still serializes all access to the connection object. Under concurrent load with many tests starting/finishing, this becomes the primary bottleneck.
Critical sub-issue: QuotaManager::remaining_budget() (src/server_pro/quota.rs:387) performs up to 15 separate SQLite queries in sequence, locking the mutex 15+ times per pre-test check.
Impact: Connection setup/teardown latency increases linearly with concurrency. Quota checks and usage recording block each other.
Fix Options:
- Connection pooling: Use `r2d2_sqlite` or `deadpool-sqlite` to maintain a small pool of connections (SQLite handles this well in WAL mode).
- Separate read/write paths: Open a read-only connection for quota checks (`remaining_budget`) and a dedicated write connection for usage recording. SQLite WAL allows this.
- Batch quota checks: Cache quota results for a few seconds per user/IP to avoid redundant queries.
- Channel-based writer: Use a single dedicated DB writer task with an `mpsc` channel so only one task ever touches the connection, eliminating lock contention entirely (see the sketch below).
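A minimal sketch of the channel-based writer option, assuming `rusqlite` and `tokio`; the `DbJob` type and `spawn_db_writer` helper are hypothetical names, not existing code:

```rust
use rusqlite::Connection;
use tokio::sync::mpsc;

// One blocking task owns the Connection; everything else queues work through a channel.
type DbJob = Box<dyn FnOnce(&mut Connection) + Send>;

fn spawn_db_writer(mut conn: Connection) -> mpsc::Sender<DbJob> {
    let (tx, mut rx) = mpsc::channel::<DbJob>(256);
    // spawn_blocking keeps the synchronous SQLite calls off the async worker threads.
    tokio::task::spawn_blocking(move || {
        while let Some(job) = rx.blocking_recv() {
            job(&mut conn);
        }
    });
    tx
}
```

Callers then send `Box::new(|conn| { ... })` jobs, optionally pairing each job with a `oneshot` channel when they need the result back.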
🟠 High Severity Issues
4. 100ms Busy-Poll Wait in Multi-Connection TCP (src/server.rs:313-332)
Problem: When waiting for secondary TCP connections to join a multi-connection session, the primary connection busy-polls the session map every 100ms:
loop {
let count = { let map = sessions.lock().await; ... };
if count + 1 >= conn_count as usize { break; }
tokio::time::sleep(Duration::from_millis(100)).await;
}
This adds up to 100ms of unnecessary latency to every multi-connection test startup. It also hammers the async mutex needlessly.
Fix: Replace with tokio::sync::Notify. When a secondary connection registers itself, it calls notify_one(). The primary waits on notified().await with a timeout, waking instantly when ready.
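For illustration, a hedged sketch of the Notify-based wait; the shared atomic counter, the 10-second join timeout, and the function name are assumptions rather than the existing server.rs structure:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Notify;

/// Each secondary connection bumps `joined` and calls `notify.notify_one()` as it
/// registers; the primary sleeps until woken (or the deadline passes) and re-checks.
async fn wait_for_secondaries(
    joined: Arc<AtomicUsize>,
    notify: Arc<Notify>,
    conn_count: usize,
) -> bool {
    let deadline = tokio::time::Instant::now() + Duration::from_secs(10); // assumed join timeout
    while joined.load(Ordering::Acquire) + 1 < conn_count {
        // Wake immediately when a secondary registers instead of polling every 100ms.
        if tokio::time::timeout_at(deadline, notify.notified()).await.is_err() {
            return false; // timed out waiting for secondaries
        }
    }
    true
}
```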
5. FreeBSD CPU Sampling Spawns Process Every Second (src/cpu.rs:142)
Problem: On FreeBSD, get_cpu_times() spawns sysctl -n kern.cp_time via std::process::Command every second:
fn get_cpu_times() -> (u64, u64) {
if let Ok(output) = std::process::Command::new("sysctl")
.arg("-n").arg("kern.cp_time").output()
{ ... }
}
fork() + exec() is extremely expensive relative to the work being done (reading 5 integers).
Fix: Use libc::sysctl() via FFI, matching the macOS implementation style. Cache the mib array and call the syscall directly.
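One possible shape for that fix, hedged: it uses `sysctlbyname` rather than a cached mib, assumes the classic 5-entry (user, nice, sys, intr, idle) layout of `kern.cp_time`, and returns an `Option<(total, idle)>` for brevity where the real function returns `(u64, u64)`:

```rust
// FreeBSD-only sketch: read kern.cp_time via libc instead of forking `sysctl`.
#[cfg(target_os = "freebsd")]
fn get_cpu_times() -> Option<(u64, u64)> {
    let name = std::ffi::CString::new("kern.cp_time").ok()?;
    let mut times: [libc::c_long; 5] = [0; 5]; // user, nice, sys, intr, idle
    let mut len = std::mem::size_of_val(&times) as libc::size_t;
    let rc = unsafe {
        libc::sysctlbyname(
            name.as_ptr(),
            times.as_mut_ptr() as *mut libc::c_void,
            &mut len,
            std::ptr::null_mut(),
            0,
        )
    };
    if rc != 0 {
        return None; // sysctl failed; caller falls back or skips the sample
    }
    let total: u64 = times.iter().map(|&t| t as u64).sum();
    let idle = times[4] as u64;
    Some((total, idle))
}
```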
6. Per-Call Timer Registration in UDP RX Loops (src/client.rs:393, src/server.rs:925)
Problem: Both UDP RX loops create a new tokio::time::timeout timer on every single recv/recv_from call:
match tokio::time::timeout(Duration::from_secs(5), socket.recv(&mut buf)).await
At high packet rates (e.g., 100K pps), registering and canceling timers on the Tokio timer wheel adds measurable overhead.
Fix: Use tokio::select! with a long-lived tokio::time::sleep future that is reset, or use the socket's built-in SO_RCVTIMEO if available via socket2. Alternatively, since UDP is connectionless, consider whether a 5-second timeout is needed on every call or if the outer test duration timer is sufficient.
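A hedged sketch of the "outer test duration timer" alternative, using a single pinned sleep for the whole window instead of a per-recv timeout; the function name and buffer size are placeholders:

```rust
use std::time::Duration;
use tokio::net::UdpSocket;

// One long-lived deadline for the whole test window; no timer is registered per recv().
async fn udp_rx_loop(socket: UdpSocket, test_duration: Duration) -> std::io::Result<u64> {
    let mut buf = vec![0u8; 65536];
    let mut received = 0u64;
    let deadline = tokio::time::sleep(test_duration);
    tokio::pin!(deadline);

    loop {
        tokio::select! {
            // Fires once, when the test window ends.
            _ = &mut deadline => break,
            res = socket.recv(&mut buf) => {
                received += res? as u64;
            }
        }
    }
    Ok(received)
}
```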
🟡 Medium Severity Issues
7. String Error Matching with Allocation (src/server_pro/enforcer.rs:157-161)
match format!("{}", e).as_str() {
s if s.contains("daily") => ...
}
format!("{}", e) allocates a String from the error just to do substring matching. Use e.to_string().contains(...) or match on error types directly if possible.
8. ip.to_string() Called Repeatedly in Quota Checks (src/server_pro/quota.rs:389)
let ip_str = ip.to_string();
// ... used in 6+ DB calls
This allocates a String on every quota check. Accept &str or IpAddr directly in DB methods, or cache the string.
9. chrono_date_today() Recomputes Calendar from Epoch (src/server_pro/user_db.rs:617-638)
A hand-rolled date calculation loops through years from 1970 and months every time it's called (which is before almost every DB write). The chrono crate is already used indirectly by rusqlite; add it as a direct dependency and replace with chrono::Local::now().format("%Y-%m-%d").
🟢 Low Severity / Easy Wins
10. CSV File Reopened on Every Write (src/csv_output.rs:77)
if let Ok(mut f) = OpenOptions::new().append(true).open(path) {
let _ = writeln!(f, "{}", row);
}
Called once per test, not per-packet, but still suboptimal. Consider keeping a lazily-initialized Mutex<Option<File>> or using std::fs::OpenOptions once at init and storing the handle.
11. Global Syslog Mutex Held During I/O (src/syslog_logger.rs)
static SYSLOG: Mutex<Option<SyslogSender>> = Mutex::new(None);
The global std::sync::Mutex is held while formatting the timestamp (expensive manual calendar math) and sending UDP. Switch to parking_lot::Mutex (faster) or tokio::sync::Mutex if async, and format the message outside the lock. Better yet, use tracing's built-in syslog integration or a structured appender.
12. hash_password() Uses format! + format! in Hex Loop (src/server_pro/user_db.rs:612-614)
hasher.update(format!("{}:{}", username, password).as_bytes());
result.iter().map(|b| format!("{:02x}", b)).collect() // N allocs for N bytes
The hex encoding allocates one String per byte. Use a small fixed buffer or hex crate (already used elsewhere in ecsrp5.rs).
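For reference, a single-allocation hex encoder (or simply `hex::encode(result)` from the already-present `hex` crate); the helper name is hypothetical:

```rust
use std::fmt::Write;

// Hex-encode a digest with one allocation instead of one String per byte.
fn to_hex(digest: &[u8]) -> String {
    let mut out = String::with_capacity(digest.len() * 2);
    for b in digest {
        let _ = write!(out, "{:02x}", b);
    }
    out
}
```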
13. Redundant Instant::now() Calls in TX Loop (src/server.rs:593,606)
if send_status && Instant::now() >= next_status {
// ...
next_status = Instant::now() + Duration::from_secs(1);
}
Two monotonic clock reads per loop iteration. Cache let now = Instant::now(); at the top of the loop.
Architecture Observations
What the Code Does Well
- Zero-allocation protocol layer: `serialize()` returns fixed-size stack arrays (`[u8; 12]`, `[u8; 16]`). Excellent.
- Atomic bandwidth tracking: `BandwidthState` uses `AtomicU64` with `Relaxed` ordering in the per-packet path. No locks in the data plane.
- Buffer reuse: TX/RX loops allocate `vec![0u8; ...]` once before the loop. Good.
- Aggressive release profile: `lto = true`, `codegen-units = 1`, `opt-level = 3`.
Async Runtime Usage
`tokio` with `full` features is used. For a primarily I/O-bound tool, this is appropriate. `tokio::task::yield_now().await` is used in unlimited-rate mode to prevent starving the runtime. This is correct, but consider whether `tokio::task::spawn_blocking` or dedicated CPU pinning is needed for the EC-SRP5 math (which is CPU-bound and currently runs on the async runtime during auth).
Memory Safety
- Several `unwrap()`/`expect()` calls in setup paths (socket binding, address parsing). These are acceptable for config errors but should use `?` propagation where possible to allow graceful degradation.
Missing Performance Infrastructure
| Infrastructure | Status | Recommendation |
|---|---|---|
| Benchmarks | ❌ None | Add criterion + benches/ for BandwidthState, protocol ser/de, and EC-SRP5 auth |
| Profiling hooks | ❌ None | Add optional pprof or dhat dev-deps for heap profiling |
| Throughput regression tests | ⚠️ Partial | Integration tests assert tx > 0 and rx > 0 but don't measure sustained throughput |
| Load tests | ❌ None | Add a benches/load_test.rs that spawns 100+ concurrent tests against a local server |
| CI performance gates | ❌ None | Consider a benchmark action that fails on >5% regression |
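As a starting point for the missing benchmark infrastructure, a minimal Criterion harness could measure the hot-path scan from issue 1; the file name and bench ID below are placeholders:

```rust
// benches/scan.rs — requires criterion (and memchr) as dev-dependencies
// plus a [[bench]] entry with harness = false in Cargo.toml.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Measure the memchr-based status scan on a 256 KiB all-zero buffer,
// the common case described in issue 1.
fn bench_scan(c: &mut Criterion) {
    let buf = vec![0u8; 256 * 1024];
    c.bench_function("memchr_256k_zero", |b| {
        b.iter(|| memchr::memchr(0x07, black_box(&buf)))
    });
}

criterion_group!(benches, bench_scan);
criterion_main!(benches);
```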
Priority Action Plan
Phase 1: Hot-Path Fixes (1-2 days)
- Replace buffer scan with `memchr` or ring-buffer approach in `tcp_client_rx_loop`
- Cache `WCurve` in a global `LazyLock`
- Replace 100ms poll with `tokio::sync::Notify` in multi-conn wait
Phase 2: Scalability (2-3 days)
- Add SQLite connection pooling or channel-based writer in `server_pro`
- Cache `remaining_budget()` results for 5-10 seconds
- Fix FreeBSD CPU sampling to use `libc::sysctl` FFI
Phase 3: Polish & Tooling (1-2 days)
- Replace manual date arithmetic with `chrono`
- Add `criterion` benchmarks for auth and bandwidth state
- Fix low-severity allocation issues (CSV, syslog, hex encoding)
Appendix: File-by-File Quick Reference
| File | Lines | Hot Path? | Key Concern |
|---|---|---|---|
| `src/client.rs` | 531 | ✅ Yes | O(n) 256KB scan per TCP read |
| `src/server.rs` | 1094 | ✅ Yes | 100ms poll wait, status injection timing |
| `src/ecsrp5.rs` | 660 | ✅ Yes (auth) | WCurve::new() recomputed per auth |
| `src/bandwidth.rs` | 263 | ✅ Yes (atomics) | Well-designed; no issues |
| `src/protocol.rs` | 214 | ✅ Yes (ser/de) | Zero-allocation; excellent |
| `src/cpu.rs` | 215 | ⚠️ Periodic | FreeBSD fork+exec every second |
| `src/server_pro/quota.rs` | 470 | ⚠️ Periodic | 15 DB queries per budget check |
| `src/server_pro/user_db.rs` | 641 | ⚠️ All DB ops | Single mutex serializes everything |
| `src/server_pro/server_loop.rs` | 449 | ✅ Yes | DB auth locks during connection setup |
| `src/server_pro/enforcer.rs` | 411 | ⚠️ Periodic | String error matching allocates |
| `src/csv_output.rs` | 86 | ❌ No | File reopen per write |
| `src/syslog_logger.rs` | 154 | ❌ No | Global mutex + manual calendar math |
| `src/auth.rs` | 164 | ⚠️ Auth only | Minor; double MD5 per auth |
| `src/main.rs` | 243 | ❌ No | Entry point only |