Performance Improvement PRDs

Project: btest-rs
Constraint: 100% MikroTik BTest protocol compatibility — no wire-format or behavioral changes visible to MikroTik devices
Date: 2026-04-30


How to Read This Document

The PRDs are sorted in recommended execution order, which balances:

  • Effort (development + review + test time)
  • Risk (probability of regression or compatibility break)
  • Performance Effect (measured or estimated throughput/latency improvement)
  • MikroTik Compatibility Risk (whether the change could affect interoperability)

Sorting rationale: Execute quick wins first to build velocity and reduce risk surface, then tackle high-impact items with full attention.


Summary Matrix

| # | PRD | Effort | Risk | Perf Impact | MikroTik Risk | Tier |
|---|-----|--------|------|-------------|---------------|------|
| 1 | WCurve Global Cache | 30 min | None | Medium | None | Quick Win |
| 2 | Redundant Instant::now() | 15 min | None | Low | None | Quick Win |
| 3 | hash_password Hex Fix | 30 min | None | Low | None | Quick Win |
| 4 | CSV File Handle Cache | 30 min | None | Low | None | Quick Win |
| 5 | Error String Matching | 30 min | None | Low | None | Quick Win |
| 6 | chrono_date_today Replace | 1 hr | Low | Low | None | Quick Win |
| 7 | Syslog Mutex + Timestamp | 1 hr | Low | Low | None | Quick Win |
| 8 | ip.to_string() Cache | 1 hr | Low | Low | None | Quick Win |
| 9 | FreeBSD CPU FFI | 3 hrs | Medium | Medium | None | Platform Fix |
| 10 | Multi-Conn Notify Wake | 2 hrs | Medium | Medium | None | Latency Fix |
| 11 | UDP Timer Reuse | 2 hrs | Medium | Medium | None | Throughput Fix |
| 12 | TCP RX Scan Optimization | 4 hrs | Medium | High | Low | Hot Path Fix |
| 13 | SQLite Connection Pool | 1–2 days | High | High | None | Scalability Fix |

Tier 1: Quick Wins (Do These First)


PRD-001: Cache WCurve in Global LazyLock

Background: WCurve::new() is called on every EC-SRP5 authentication (client and server). It recomputes the Weierstrass curve generator point via lift_x(9) → prime_mod_sqrt(), which performs heavy BigUint modular arithmetic. The result is deterministic and immutable.

MikroTik Compatibility:

  • 100% safe. This is pure internal mathematics. The wire bytes, auth handshake order, and hash outputs are identical. No protocol-visible change.

Objective: Eliminate redundant BigUint modular square root computation per authentication.

Design:

// src/ecsrp5.rs
static WCURVE: std::sync::LazyLock<WCurve> = std::sync::LazyLock::new(WCurve::new);

Replace all call sites:

  • src/ecsrp5.rs:363 (client_authenticate)
  • src/ecsrp5.rs:499 (server_authenticate)

Change let w = WCurve::new(); to let w = &*WCURVE;. Update any WCurve methods that take self to take &self if they don't already.
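
A minimal before/after sketch of the call-site change (surrounding context elided):

// Before: recomputed on every authentication
let w = WCurve::new();

// After: computed once on first deref, shared for the process lifetime
let w: &WCurve = &*WCURVE;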

Acceptance Criteria:

  • ecsrp5_test.rs passes unchanged.
  • full_integration_test.rs EC-SRP5 tests pass unchanged.
  • WCurve::new() is called exactly once per process lifetime.
  • No change to serialized auth bytes on the wire.

Effort: 30 min
Risk: None — stateless deterministic cache
Performance Impact: Medium — reduces per-auth CPU time by ~30-50% (estimated), especially noticeable under concurrent logins.


PRD-002: Deduplicate Instant::now() in tcp_tx_loop_inner

Background: The TCP TX loop calls Instant::now() twice per iteration (status check and interval scheduling). Monotonic clock reads are cheap but not free, and occur in the hottest loop in the system.

MikroTik Compatibility:

  • 100% safe. Timing granularity remains identical.

Objective: Reduce syscalls in the per-packet hot path.

Design:

let now = Instant::now();
if send_status && now >= next_status { ... next_status = now + Duration::from_secs(1); }
// ... reuse `now` for interval math

Acceptance Criteria:

  • TCP send/receive/both integration tests pass.
  • No behavioral change in status injection timing.

Effort: 15 min
Risk: None
Performance Impact: Low — micro-optimization, but trivial.


PRD-003: Fix hash_password() Hex Encoding Allocations

Background: user_db.rs:614 allocates one String per byte when hex-encoding a 32-byte SHA256 hash:

result.iter().map(|b| format!("{:02x}", b)).collect()

MikroTik Compatibility:

  • 100% safe. Output string is identical.

Objective: Replace N-allocation hex encoding with a single-allocation approach.

Design: Use the hex crate (already in the dependency tree via ecsrp5.rs debug logging), or write! each byte into a single String::with_capacity(64).
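
A minimal sketch of the write!-based variant (helper name hypothetical):

use std::fmt::Write;

fn hash_to_hex(digest: &[u8; 32]) -> String {
    // One allocation up front; write! appends with no further allocations
    let mut s = String::with_capacity(64);
    for b in digest {
        let _ = write!(s, "{:02x}", b);
    }
    s
}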

Acceptance Criteria:

  • Same hex string output for all inputs.
  • pro feature tests pass.

Effort: 30 min
Risk: None
Performance Impact: Low — removes 32 allocations per password hash.


PRD-004: Cache CSV File Handle

Background: csv_output::write_result() re-opens the file via OpenOptions::new().append(true).open(path) on every call (once per test). Safe but wasteful.

MikroTik Compatibility:

  • 100% safe. No protocol involvement.

Objective: Hold the file handle open for the process lifetime.

Design: Change static CSV_FILE: Mutex<Option<String>> to Mutex<Option<(String, std::fs::File)>>, or open once during init() and store Mutex<Option<File>>.
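
A minimal sketch of the open-once variant (init/write_result signatures assumed from the PRD, details hypothetical):

use std::fs::{File, OpenOptions};
use std::io::Write;
use std::sync::Mutex;

static CSV_FILE: Mutex<Option<File>> = Mutex::new(None);

fn init(path: &str) -> std::io::Result<()> {
    // Open (and create) once; write headers here if the file is new
    let file = OpenOptions::new().create(true).append(true).open(path)?;
    *CSV_FILE.lock().unwrap() = Some(file);
    Ok(())
}

fn write_result(line: &str) -> std::io::Result<()> {
    // Reuse the cached handle: no per-test open() syscall
    if let Some(f) = CSV_FILE.lock().unwrap().as_mut() {
        writeln!(f, "{}", line)?;
    }
    Ok(())
}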

Acceptance Criteria:

  • CSV tests in full_integration_test.rs pass.
  • File is created with headers on init().
  • Multiple write_result calls append correctly.

Effort: 30 min
Risk: None
Performance Impact: Low — removes one open() syscall per test.


PRD-005: Remove Allocating Error String Matching

Background: src/server_pro/enforcer.rs:157-161 does:

match format!("{}", e).as_str() {
    s if s.contains("daily") => ...
}

This allocates a String from the error just for substring matching.

MikroTik Compatibility:

  • 100% safe. Server-pro internal logic only.

Objective: Match without allocation.

Design: Use e.to_string().contains("daily") (still allocates but clearer) or, better, downcast the rusqlite::Error or match on structured error variants. If the error is anyhow::Error, use .downcast_ref::<rusqlite::Error>().
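
A hedged sketch of the structured-variant option, assuming the quota layer is refactored to return a dedicated error enum (all names hypothetical):

enum QuotaError {
    DailyExceeded,
    TotalExceeded,
    Db(rusqlite::Error),
}

fn handle_enforcement(result: Result<(), QuotaError>) {
    // Match on variants: no String allocation, no substring search
    match result {
        Err(QuotaError::DailyExceeded) => { /* disconnect: daily quota hit */ }
        Err(QuotaError::TotalExceeded) => { /* disconnect: total quota hit */ }
        Err(QuotaError::Db(_e)) => { /* log and continue */ }
        Ok(()) => {}
    }
}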

Acceptance Criteria:

  • Quota enforcement behavior unchanged.
  • Enforcer tests pass.

Effort: 30 min
Risk: None
Performance Impact: Low — removes one allocation per enforcer tick.


PRD-006: Replace chrono_date_today() with chrono Crate

Background: user_db.rs:617-638 contains a hand-rolled Gregorian calendar converter that loops from 1970 to compute today's date. Called before almost every DB write. The chrono crate is already pulled in transitively by rusqlite.

MikroTik Compatibility:

  • 100% safe. No protocol involvement.

Objective: Replace 30 lines of error-prone manual date math with one chrono call.

Design: Add chrono = { version = "0.4", optional = true } gated behind pro feature (or use the transitive dep directly). Replace chrono_date_today() with:

chrono::Local::now().format("%Y-%m-%d").to_string()

Acceptance Criteria:

  • pro feature compiles.
  • Date strings match format YYYY-MM-DD.
  • DB write tests pass.

Effort: 1 hr
Risk: Low — adds explicit dep that already exists transitively
Performance Impact: Low — eliminates loop overhead, but called infrequently.


PRD-007: Optimize Syslog Mutex and Timestamp Formatting

Background: syslog_logger.rs holds a global std::sync::Mutex while formatting a timestamp (manual calendar math) and sending UDP. std::sync::Mutex is relatively slow, and the timestamp logic duplicates chrono_date_today() issues.

MikroTik Compatibility:

  • 100% safe. No protocol involvement.

Objective: Reduce lock contention and allocation in logging path.

Design:

  1. Use parking_lot::Mutex (faster, no poisoning), or keep std::sync::Mutex but clone the SyslogSender config outside the lock (see the sketch after this list).
  2. Replace bsd_timestamp() with chrono::Local::now().format("%b %e %H:%M:%S").
  3. Pre-allocate the String with with_capacity(256).
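
A minimal sketch combining the three points (SyslogTarget and its fields are hypothetical; the real SyslogSender lives in syslog_logger.rs):

use std::fmt::Write;
use std::net::{SocketAddr, UdpSocket};
use std::sync::Mutex;

#[derive(Clone)]
struct SyslogTarget {
    server: SocketAddr,
    hostname: String,
}

static TARGET: Mutex<Option<SyslogTarget>> = Mutex::new(None);

fn log(socket: &UdpSocket, pri: u8, msg: &str) {
    // Snapshot the config under the lock, then release it before
    // formatting or sending anything
    let t = match TARGET.lock().unwrap().clone() {
        Some(t) => t,
        None => return,
    };
    let mut line = String::with_capacity(256); // point 3: pre-allocate
    let _ = write!(
        line,
        "<{}>{} {} btest: {}",
        pri,
        chrono::Local::now().format("%b %e %H:%M:%S"), // point 2: chrono timestamp
        t.hostname,
        msg,
    );
    let _ = socket.send_to(line.as_bytes(), t.server);
}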

Acceptance Criteria:

  • Syslog output format remains RFC 3164 compliant.
  • test_syslog_events in full_integration_test.rs passes.

Effort: 1 hr
Risk: Low
Performance Impact: Low — logging is not a hot path, but reduces global lock hold time.


PRD-008: Cache ip.to_string() in Quota Checks

Background: quota.rs:389 calls ip.to_string() and then passes &ip_str to multiple DB methods, allocating a new String on every remaining_budget() call.

MikroTik Compatibility:

  • 100% safe. Server-pro internal logic.

Objective: Eliminate redundant IP stringification.

Design: Change DB methods to accept &std::net::IpAddr directly and stringify inside only when needed for SQL parameter binding (which rusqlite may already handle via ToSql). Alternatively, pass ip_str: &str from a single to_string() call and avoid re-stringifying in sub-calls.
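
A minimal before/after sketch of the single-stringify variant (DB method names hypothetical):

// Before: each sub-call allocates its own String
// let daily = db.daily_usage(&ip.to_string())?;
// let total = db.total_usage(&ip.to_string())?;

// After: stringify once, pass &str down to every sub-call
let ip_str = ip.to_string();
let daily = db.daily_usage(&ip_str)?;
let total = db.total_usage(&ip_str)?;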

Acceptance Criteria:

  • Quota checks return identical results.
  • pro feature tests pass.

Effort: 1 hr
Risk: Low
Performance Impact: Low — one allocation removed per quota check.


Tier 2: Moderate Fixes (Platform & Latency)


PRD-009: FreeBSD CPU Sampling via libc::sysctl FFI

Background: On FreeBSD, cpu.rs spawns sysctl -n kern.cp_time as a child process every second. fork() + exec() is orders of magnitude slower than a direct syscall.

MikroTik Compatibility:

  • 100% safe. No protocol involvement. Platform-specific internal code.

Objective: Replace subprocess with direct sysctl(3) syscall.

Design:

#[cfg(target_os = "freebsd")]
fn get_cpu_times() -> (u64, u64) {
    // kern.cp_time is a dynamic OID on FreeBSD (there is no static KERN_*
    // MIB constant for it), so resolve it by name via sysctlbyname(3).
    // Indexes: CP_USER=0, CP_NICE=1, CP_SYS=2, CP_INTR=3, CP_IDLE=4.
    let mut cp_time: [libc::c_ulong; 5] = [0; 5];
    let mut len = std::mem::size_of_val(&cp_time);
    unsafe {
        if libc::sysctlbyname(
            b"kern.cp_time\0".as_ptr() as *const libc::c_char,
            cp_time.as_mut_ptr() as *mut libc::c_void,
            &mut len,
            std::ptr::null(),
            0,
        ) == 0
        {
            let total: u64 = cp_time.iter().map(|&t| t as u64).sum();
            return (total, cp_time[4] as u64); // (total, idle)
        }
    }
    (0, 0)
}

Acceptance Criteria:

  • Compiles on FreeBSD.
  • Returns same values as previous sysctl command approach.
  • No child process spawned (verify with ktrace or ps).

Effort: 3 hrs
Risk: Medium — requires FreeBSD test environment; FFI is unsafe
Performance Impact: Medium — eliminates 1 fork/exec per second on FreeBSD.


PRD-010: Replace 100ms Poll with tokio::sync::Notify

Background: In server.rs:313-332, the primary connection of a multi-connection TCP test busy-polls the session map every 100ms waiting for secondary connections to join.

MikroTik Compatibility:

  • 100% safe. This is internal server-side coordination. The wire behavior (waiting for connections, then starting the test) is unchanged. MikroTik clients will not observe a difference except potentially faster test startup.

Objective: Eliminate polling latency and unnecessary mutex acquisitions.

Design:

  1. Add a tokio::sync::Notify to TcpSession:
struct TcpSession {
    peer_ip: IpAddr,
    streams: Vec<OwnedTcpStream>,
    expected: u8,
    notify: tokio::sync::Notify,
}
  2. In the secondary connection handler, after pushing to streams, call session.notify.notify_one().
  3. In the primary wait loop, replace the sleep loop with:
// Fast path: all secondary connections have already joined
let count = { /* lock session map, read stream count, drop lock */ };
if count + 1 < conn_count {
    // Wait for notifications, with a 10s overall deadline
    let timeout = tokio::time::sleep(Duration::from_secs(10));
    tokio::pin!(timeout);

    loop {
        tokio::select! {
            _ = session.notify.notified() => {
                let count = { /* lock, read stream count, drop lock */ };
                if count + 1 >= conn_count { break; }
            }
            _ = &mut timeout => { break; }
        }
    }
}

Because Notify::notify_one() stores a permit when no task is waiting, a wake that lands between the count check and notified() is not lost.

Acceptance Criteria:

  • Multi-connection TCP tests pass.
  • Test startup latency is ≤ 1ms after last connection joins (was up to 100ms).
  • No deadlock under concurrent multi-connection tests.

Effort: 2 hrs
Risk: Medium — concurrency change; must carefully manage lock/notify ordering to avoid races
Performance Impact: Medium — improves multi-conn test startup latency by up to 100ms per test.


PRD-011: Reuse UDP RX Timer Instead of Per-Call Timeout

Background: Both client and server UDP RX loops create a new tokio::time::timeout on every recv/recv_from call:

tokio::time::timeout(Duration::from_secs(5), socket.recv(&mut buf)).await

At high packet rates, this registers and cancels timers on Tokio's timer wheel constantly.

MikroTik Compatibility:

  • 100% safe. Internal async timing only. UDP packet processing is unchanged.

Objective: Reduce timer wheel churn in high-rate UDP RX loops.

Design: Option A — tokio::select! with a pinned sleep future:

let mut timeout = tokio::time::sleep(Duration::from_secs(5));
tokio::pin!(timeout);

loop {
    tokio::select! {
        biased; // prioritize recv
        res = socket.recv(&mut buf) => { /* handle */ timeout.as_mut().reset(tokio::time::Instant::now() + Duration::from_secs(5)); }
        _ = &mut timeout => { tracing::debug!("UDP RX timeout"); }
    }
}

Option B — Use socket2 to set SO_RCVTIMEO on the underlying socket and do blocking recvs on a dedicated thread, removing Tokio timeouts entirely. This moves timeout handling into the kernel, which is even cheaper. Note, however, that SO_RCVTIMEO only affects blocking reads; Tokio's non-blocking sockets ignore it, so this option means restructuring the RX loop around a blocking socket.

Recommendation: Start with Option A (pure Tokio, no platform risk). Option B can be a follow-up.

Acceptance Criteria:

  • UDP send/receive/both tests pass.
  • UDP RX still times out correctly when no packets arrive.
  • No change to packet parsing or sequence tracking.

Effort: 2 hrs
Risk: Medium — changes timeout behavior; must ensure test abortion still works correctly
Performance Impact: Medium — reduces timer wheel registration overhead, noticeable at >50K pps.


Tier 3: High Impact (Do These With Full Focus)


PRD-012: Optimize TCP Client RX Status Message Scan

Background: tcp_client_rx_loop (client.rs:210-216) scans up to 256KB byte-by-byte on every read() call looking for a 12-byte status marker (0x07 + 0x80|cpu). Since data is all zeros, this is almost always a full scan.

MikroTik Compatibility Consideration:

  • High confidence of safety. The protocol is: MikroTik injects 12-byte status messages into the TCP stream. Our client must detect them. Changing how we detect them (faster scan) does not change:
    • What bytes are sent on the wire
    • What bytes we expect
    • How we respond to status messages
  • One edge case to handle: TCP is a stream. A status message may be split across two read() calls. The current code does not handle this correctly (it scans each buffer independently). The optimized version should handle split messages to be strictly more correct than the current implementation.

Objective: Replace O(n) byte-by-byte scan with SIMD-accelerated or state-machine-based detection, while correctly handling split messages.

Design — Recommended: Carry Buffer Approach

Since status messages are 12 bytes and all other bytes are zeros, maintain an 11-byte carry buffer of trailing bytes across reads, and check every alignment where a message could span the read boundary:

const STATUS_MSG_SIZE: usize = 12;

// Assumed in scope (as elsewhere in client.rs): STATUS_MSG_TYPE == 0x07,
// tokio::io::AsyncReadExt, OwnedReadHalf, Arc, Ordering, BandwidthState.
async fn tcp_client_rx_loop(mut reader: OwnedReadHalf, state: Arc<BandwidthState>) {
    let mut buf = vec![0u8; 256 * 1024];
    let mut carry = [0u8; STATUS_MSG_SIZE - 1]; // up to 11 bytes from previous read(s)
    let mut carry_len = 0usize;

    while state.running.load(Ordering::Relaxed) {
        match reader.read(&mut buf).await {
            Ok(0) | Err(_) => break,
            Ok(n) => {
                state.rx_bytes.fetch_add(n as u64, Ordering::Relaxed);

                // 1. Split-message check: a status message may start at any
                //    offset inside the carry and finish in this buffer.
                for k in 0..carry_len {
                    if carry[k] != STATUS_MSG_TYPE {
                        continue;
                    }
                    let from_carry = carry_len - k;
                    let needed = STATUS_MSG_SIZE - from_carry;
                    if n < needed {
                        break; // later alignments need even more new bytes
                    }
                    let mut candidate = [0u8; STATUS_MSG_SIZE];
                    candidate[..from_carry].copy_from_slice(&carry[k..carry_len]);
                    candidate[from_carry..].copy_from_slice(&buf[..needed]);
                    if candidate[1] >= 0x80 {
                        state.remote_cpu.store(candidate[1] & 0x7F, Ordering::Relaxed);
                        break;
                    }
                }

                // 2. Fast scan within buf: data bytes are all zeros, so
                //    memchr (SIMD) finds 0x07 candidates quickly.
                if n >= STATUS_MSG_SIZE {
                    let search_end = n - STATUS_MSG_SIZE + 1;
                    let mut offset = 0;
                    while let Some(pos) = memchr::memchr(STATUS_MSG_TYPE, &buf[offset..search_end]) {
                        let i = offset + pos;
                        if buf[i + 1] >= 0x80 {
                            state.remote_cpu.store(buf[i + 1] & 0x7F, Ordering::Relaxed);
                            break;
                        }
                        offset = i + 1;
                        if offset >= search_end {
                            break;
                        }
                    }
                }

                // 3. Carry save: keep the last 11 bytes seen, topping up with
                //    old carry bytes on short reads so multi-read splits survive.
                if n >= STATUS_MSG_SIZE - 1 {
                    carry_len = STATUS_MSG_SIZE - 1;
                    carry.copy_from_slice(&buf[n - carry_len..n]);
                } else {
                    let keep_old = (STATUS_MSG_SIZE - 1 - n).min(carry_len);
                    carry.copy_within(carry_len - keep_old..carry_len, 0);
                    carry[keep_old..keep_old + n].copy_from_slice(&buf[..n]);
                    carry_len = keep_old + n;
                }
            }
        }
    }
}

Alternative: memchr crate only. If we determine split messages are extremely rare and the current behavior is "good enough," simply replace the for loop with:

if n >= STATUS_MSG_SIZE { // guard: the slice below would underflow on short reads
    if let Some(pos) = memchr::memchr(STATUS_MSG_TYPE, &buf[..n - STATUS_MSG_SIZE + 1]) {
        if buf[pos + 1] >= 0x80 { /* ... */ }
    }
}

This is a handful-of-lines change with a massive speedup (SIMD scan). However, the carry buffer approach is strictly more correct and not much more complex.

Acceptance Criteria:

  • TCP bidirectional tests pass.
  • Remote CPU reporting still works.
  • Status messages split across reads are correctly detected (unit test for this).
  • memchr crate added to deps (very lightweight).
  • No change to wire bytes or server behavior.

Effort: 4 hrs
Risk: Medium — hot path change; must be carefully reviewed and tested
Performance Impact: High — eliminates 256KB byte scan per read. At 10K reads/sec, saves ~2.5GB of memory scanning per second.


PRD-013: SQLite Connection Pool / Channel-Based Writer

Background: server_pro uses a single Arc<Mutex<Connection>>. All quota checks, usage recordings, and auth lookups serialize through one lock. remaining_budget() issues 15 queries, locking 15+ times. This is the primary scalability bottleneck for the pro server.

MikroTik Compatibility:

  • 100% safe. Server-side infrastructure only. No protocol change.

Objective: Enable concurrent quota checks and usage recording without mutex contention.

Design — Option A: Connection Pool (Recommended for reads). Use r2d2_sqlite or deadpool-sqlite:

  1. Open a pool of ~4-8 connections to the same SQLite file (WAL mode supports this).
  2. Read-only operations (remaining_budget, get_user, check_user) borrow a connection from the pool.
  3. Write operations (record_usage, record_session) also borrow from the pool (WAL allows concurrent readers + one writer).

Design — Option B: Channel-Based Writer (Recommended for writes; sketched after this list)

  1. Keep one dedicated Connection owned by a single Tokio task.
  2. Expose an mpsc::channel where other tasks send write requests (RecordUsage { user, tx, rx }).
  3. The writer task batches or sequentially executes writes without any mutex.
  4. Reads use a separate read-only connection or pool.
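
A minimal sketch of the Option B writer task (request enum, table, and column names hypothetical):

use tokio::sync::mpsc;

enum WriteRequest {
    RecordUsage { user: String, tx_bytes: u64, rx_bytes: u64 },
}

fn spawn_db_writer(conn: rusqlite::Connection) -> mpsc::UnboundedSender<WriteRequest> {
    let (tx, mut rx) = mpsc::unbounded_channel();
    // The spawned task is the sole owner of the write connection:
    // writes serialize through the channel, so no Mutex is ever taken.
    tokio::task::spawn_blocking(move || {
        while let Some(req) = rx.blocking_recv() {
            match req {
                WriteRequest::RecordUsage { user, tx_bytes, rx_bytes } => {
                    let _ = conn.execute(
                        "UPDATE usage SET tx = tx + ?1, rx = rx + ?2 WHERE user = ?3",
                        rusqlite::params![tx_bytes as i64, rx_bytes as i64, user],
                    );
                }
            }
        }
    });
    tx
}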

Hybrid Recommendation:

  • Reads: Small connection pool (4 connections) for quota checks and auth lookups.
  • Writes: Single dedicated async task with an mpsc::unbounded_channel for usage recording.
  • Cache: Add a 5-second TTL cache for remaining_budget() results per user+IP to avoid redundant DB hits during test setup (sketched below).
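
A minimal sketch of the TTL cache (names hypothetical; the real remaining_budget() lives in quota.rs):

use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::{LazyLock, Mutex};
use std::time::{Duration, Instant};

static BUDGET_CACHE: LazyLock<Mutex<HashMap<(String, IpAddr), (u64, Instant)>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

fn cached_remaining_budget(user: &str, ip: IpAddr) -> u64 {
    let key = (user.to_string(), ip);
    if let Some(&(budget, at)) = BUDGET_CACHE.lock().unwrap().get(&key) {
        if at.elapsed() < Duration::from_secs(5) {
            return budget; // fresh enough: skip the DB round trip
        }
    }
    let budget = remaining_budget_db(user, ip);
    BUDGET_CACHE.lock().unwrap().insert(key, (budget, Instant::now()));
    budget
}

// Stub standing in for the real DB-backed query
fn remaining_budget_db(_user: &str, _ip: IpAddr) -> u64 {
    0
}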

Acceptance Criteria:

  • pro feature compiles and all tests pass.
  • Concurrent test launches scale linearly up to at least 50 concurrent sessions.
  • Quota enforcement remains correct (no over-quota usage).
  • Session logging and interval recording remain accurate.
  • No SQLite "database is locked" errors under load.

Effort: 1–2 days
Risk: High — touches every DB interaction in server_pro; potential for data races, quota leaks, or connection exhaustion
Performance Impact: High — enables horizontal scaling of concurrent tests; removes the primary pro server bottleneck.


Execution Roadmap

Sprint 1: Quick Wins + Foundation (1 day)

  • PRD-001: WCurve cache
  • PRD-002: Instant::now() dedup
  • PRD-003: hash_password hex fix
  • PRD-004: CSV file handle cache
  • PRD-005: Error string matching
  • PRD-006: chrono date replacement
  • PRD-007: Syslog optimization
  • PRD-008: ip.to_string() cache

Deliverable: Low-risk PR with 8 clean commits. Run full integration tests.

Sprint 2: Platform & Async Fixes (1 day)

  • PRD-009: FreeBSD CPU FFI
  • PRD-010: Multi-conn Notify wake
  • PRD-011: UDP timer reuse

Deliverable: PR with platform + latency improvements.

Sprint 3: Hot Path Optimization (1–2 days)

  • PRD-012: TCP RX scan optimization
  • Add unit test for split status messages
  • Benchmark before/after with criterion (or manual throughput test)

Deliverable: PR with benchmark numbers proving improvement.

Sprint 4: Scalability (2–3 days)

  • PRD-013: SQLite connection pool / channel writer
  • Load test: 50 concurrent tests, verify no DB lock contention
  • Add remaining_budget cache

Deliverable: PR with load test results.


Testing Requirements for All PRDs

Since no wire protocol changes are made, the existing integration test suite is the primary validation tool. However, for PRD-012 and PRD-013, additional tests are required:

New Tests to Add

  1. Split Status Message Unit Test (for PRD-012)

    #[test]
    fn test_status_message_split_across_reads() {
        // Feed first 5 bytes, then remaining 7 bytes
        // Assert CPU value is extracted correctly
    }
    
  2. Concurrent Quota Load Test (for PRD-013)

    #[tokio::test]
    async fn test_concurrent_quota_checks() {
        // Spawn 50 tasks doing remaining_budget() + record_usage()
        // Assert no panics, no SQLite locked errors
    }
    
  3. FreeBSD CPU Parity Test (for PRD-009) Manual verification on FreeBSD that FFI sysctl returns same values as command.


Appendix: MikroTik Compatibility Checklist

For every PRD, verify:

  • No change to Command or StatusMessage struct layouts or serialization
  • No change to MD5 challenge-response handshake order
  • No change to EC-SRP5 handshake order or byte values
  • No change to TCP packet sizes or UDP payload format
  • No change to status injection timing (1-second interval)
  • No change to NAT probe behavior
  • Client can still authenticate against stock RouterOS btest server
  • Server can still accept connections from stock RouterOS btest client

All PRDs in this document satisfy the above checklist by construction.