# Performance Improvement PRDs

**Project:** btest-rs

**Constraint:** 100% MikroTik BTest protocol compatibility (no wire-format or behavioral changes visible to MikroTik devices)

**Date:** 2026-04-30

---

## How to Read This Document

Each PRD is sorted by **recommended execution order**, which balances:

- **Effort** (development + review + test time)
- **Risk** (probability of regression or compatibility break)
- **Performance Effect** (measured or estimated throughput/latency improvement)
- **MikroTik Compatibility Risk** (whether the change could affect interoperability)

**Sorting rationale:** Execute *quick wins* first to build velocity and reduce risk surface, then tackle *high-impact* items with full attention.

---

## Summary Matrix

| # | PRD | Effort | Risk | Perf Impact | MikroTik Risk | Tier |
|---|-----|--------|------|-------------|---------------|------|
| 1 | WCurve Global Cache | 30 min | None | Medium | None | Quick Win |
| 2 | Redundant `Instant::now()` | 15 min | None | Low | None | Quick Win |
| 3 | `hash_password` Hex Fix | 30 min | None | Low | None | Quick Win |
| 4 | CSV File Handle Cache | 30 min | None | Low | None | Quick Win |
| 5 | Error String Matching | 30 min | None | Low | None | Quick Win |
| 6 | `chrono_date_today` Replace | 1 hr | Low | Low | None | Quick Win |
| 7 | Syslog Mutex + Timestamp | 1 hr | Low | Low | None | Quick Win |
| 8 | `ip.to_string()` Cache | 1 hr | Low | Low | None | Quick Win |
| 9 | FreeBSD CPU FFI | 3 hrs | Medium | Medium | None | Platform Fix |
| 10 | Multi-Conn Notify Wake | 2 hrs | Medium | Medium | None | Latency Fix |
| 11 | UDP Timer Reuse | 2 hrs | Medium | Medium | None | Throughput Fix |
| 12 | TCP RX Scan Optimization | 4 hrs | Medium | **High** | Low | Hot Path Fix |
| 13 | SQLite Connection Pool | 1–2 days | High | **High** | None | Scalability Fix |

---

## Tier 1: Quick Wins (Do These First)

---

### PRD-001: Cache `WCurve` in Global `LazyLock`

**Background:** `WCurve::new()` is called on every
EC-SRP5 authentication (client and server). It recomputes the Weierstrass curve generator point via `lift_x(9)` → `prime_mod_sqrt()`, which performs heavy `BigUint` modular arithmetic. The result is deterministic and immutable.

**MikroTik Compatibility:**

- **100% safe.** This is pure internal mathematics. The wire bytes, auth handshake order, and hash outputs are identical. No protocol-visible change.

**Objective:** Eliminate the redundant `BigUint` modular square root computation on every authentication.

**Design:**

```rust
// src/ecsrp5.rs
use std::sync::LazyLock; // stable since Rust 1.80

// Computed on first use, then shared read-only by every authentication.
// `WCurve` must be `Send + Sync` to live in a static.
static WCURVE: LazyLock<WCurve> = LazyLock::new(WCurve::new);
```

Replace all call sites:

- `src/ecsrp5.rs:363` (`client_authenticate`)
- `src/ecsrp5.rs:499` (`server_authenticate`)

Change `let w = WCurve::new();` to `let w = &*WCURVE;`. Update any `WCurve` methods that take `self` to take `&self` if they don't already.

**Acceptance Criteria:**

- [ ] `ecsrp5_test.rs` passes unchanged.
- [ ] `full_integration_test.rs` EC-SRP5 tests pass unchanged.
- [ ] `WCurve::new()` is called at most once per process lifetime.
- [ ] No change to serialized auth bytes on the wire.

**Effort:** 30 min

**Risk:** None (stateless deterministic cache)

**Performance Impact:** Medium. Reduces per-auth CPU time by ~30-50% (estimated), especially noticeable under concurrent logins.

---

### PRD-002: Deduplicate `Instant::now()` in `tcp_tx_loop_inner`

**Background:** The TCP TX loop calls `Instant::now()` twice per iteration (status check and interval scheduling). Monotonic clock reads are cheap but not free, and occur in the hottest loop in the system.

**MikroTik Compatibility:**

- **100% safe.** Timing granularity remains identical.

**Objective:** Halve the number of clock reads in the per-packet hot path.

**Design:**

```rust
// Read the monotonic clock once per iteration and reuse the value.
let now = Instant::now();
if send_status && now >= next_status {
    // ...
    next_status = now + Duration::from_secs(1);
}
// ... reuse `now` for interval math
```

**Acceptance Criteria:**

- [ ] TCP send/receive/both integration tests pass.
- [ ] No behavioral change in status injection timing.

**Effort:** 15 min

**Risk:** None

**Performance Impact:** Low. A micro-optimization, but trivial to apply.

---

### PRD-003: Fix `hash_password()` Hex Encoding Allocations

**Background:** `user_db.rs:614` allocates one `String` per byte when hex-encoding a 32-byte SHA256 hash:

```rust
result.iter().map(|b| format!("{:02x}", b)).collect()
```

**MikroTik Compatibility:**

- **100% safe.** Output string is identical.

**Objective:** Replace N-allocation hex encoding with a single-allocation approach.

**Design:** Use the `hex` crate (already in the dependency tree via `ecsrp5.rs` debug logging), or `write!` each byte as `{:02x}` into a single `String::with_capacity(64)`.

**Acceptance Criteria:**

- [ ] Same hex string output for all inputs.
- [ ] `pro` feature tests pass.

**Effort:** 30 min

**Risk:** None

**Performance Impact:** Low. Removes 32 temporary allocations per password hash.

---

### PRD-004: Cache CSV File Handle

**Background:** `csv_output::write_result()` re-opens the file via `OpenOptions::new().append(true).open(path)` on every call (once per test). Safe but wasteful.

**MikroTik Compatibility:**

- **100% safe.** No protocol involvement.

**Objective:** Hold the file handle open for the process lifetime.

**Design:** Change the `static CSV_FILE` mutex so that it guards an open file handle instead of re-opening by path on each call, or open the file once during `init()` and store the handle behind the mutex.

**Acceptance Criteria:**

- [ ] CSV tests in `full_integration_test.rs` pass.
- [ ] File is created with headers on `init()`.
- [ ] Multiple `write_result` calls append correctly.

**Effort:** 30 min

**Risk:** None

**Performance Impact:** Low. Removes one `open()` syscall per test.

---

### PRD-005: Remove Allocating Error String Matching

**Background:** `src/server_pro/enforcer.rs:157-161` does:

```rust
match format!("{}", e).as_str() {
    s if s.contains("daily") => ...
}
```

This allocates a `String` from the error just for substring matching.

**MikroTik Compatibility:**

- **100% safe.** Server-pro internal logic only.

**Objective:** Match without allocation.

**Design:** Use `e.to_string().contains("daily")` (still allocates but clearer) or, better, downcast the `rusqlite::Error` or match on structured error variants. If the error is `anyhow::Error`, use `.downcast_ref::<E>()` with the concrete error type `E`.

**Acceptance Criteria:**

- [ ] Quota enforcement behavior unchanged.
- [ ] Enforcer tests pass.

**Effort:** 30 min

**Risk:** None

**Performance Impact:** Low. Removes one allocation per enforcer tick.

---

### PRD-006: Replace `chrono_date_today()` with `chrono` Crate

**Background:** `user_db.rs:617-638` contains a hand-rolled Gregorian calendar converter that loops from 1970 to compute today's date. Called before almost every DB write. The `chrono` crate is reportedly already pulled in transitively by `rusqlite`; confirm with `cargo tree -i chrono`, since `rusqlite` depends on it only when its optional `chrono` feature is enabled.

**MikroTik Compatibility:**

- **100% safe.** No protocol involvement.

**Objective:** Replace 30 lines of error-prone manual date math with one `chrono` call.

**Design:** Add `chrono = { version = "0.4", optional = true }` gated behind the `pro` feature. The explicit entry is needed even if `chrono` is already in the dependency tree, because Cargo only exposes crates declared as direct dependencies. Replace `chrono_date_today()` with:

```rust
chrono::Local::now().format("%Y-%m-%d").to_string()
```

If the existing helper derives the date from UTC epoch seconds, use `chrono::Utc::now()` instead of `Local` so that day boundaries (and anything keyed on the date string) do not shift.

**Acceptance Criteria:**

- [ ] `pro` feature compiles.
- [ ] Date strings match format `YYYY-MM-DD`.
- [ ] DB write tests pass.

**Effort:** 1 hr

**Risk:** Low. Adds an explicit dependency that may already exist transitively.

**Performance Impact:** Low. Eliminates the loop overhead, but DB writes are infrequent compared with the packet path.

---

### PRD-007: Optimize Syslog Mutex and Timestamp Formatting

**Background:** `syslog_logger.rs` holds a global `std::sync::Mutex` while formatting a timestamp (manual calendar math) and sending UDP, so concurrent loggers serialize on both formatting and I/O. The timestamp logic also duplicates the `chrono_date_today()` issues.

**MikroTik Compatibility:**

- **100% safe.** No protocol involvement.

**Objective:** Reduce lock contention and allocation in the logging path.

**Design:**

1. Use `parking_lot::Mutex` (no poisoning; confirm any speed gain by measurement) OR keep `std::sync::Mutex` but clone the `SyslogSender` config outside the lock so that formatting and sending happen without holding it.
2. Replace `bsd_timestamp()` with `chrono::Local::now().format("%b %e %H:%M:%S")`.
3. Pre-allocate the `String` with `with_capacity(256)`.

**Acceptance Criteria:**

- [ ] Syslog output format remains RFC 3164 compliant.
- [ ] `test_syslog_events` in `full_integration_test.rs` passes.

**Effort:** 1 hr

**Risk:** Low

**Performance Impact:** Low. Logging is not a hot path, but the change reduces global lock hold time.

---

### PRD-008: Cache `ip.to_string()` in Quota Checks

**Background:** `quota.rs:389` calls `ip.to_string()` and then passes `&ip_str` to multiple DB methods, allocating a new `String` on every `remaining_budget()` call.

**MikroTik Compatibility:**

- **100% safe.** Server-pro internal logic.

**Objective:** Eliminate redundant IP stringification.

**Design:** Change DB methods to accept `&std::net::IpAddr` directly and stringify inside only when needed for SQL parameter binding (which `rusqlite` may already handle via `ToSql`). Alternatively, pass `ip_str: &str` from a single `to_string()` call and avoid re-stringifying in sub-calls.

**Acceptance Criteria:**

- [ ] Quota checks return identical results.
- [ ] `pro` feature tests pass.

**Effort:** 1 hr

**Risk:** Low

**Performance Impact:** Low. One allocation removed per quota check.

---

## Tier 2: Moderate Fixes (Platform & Latency)

---

### PRD-009: FreeBSD CPU Sampling via `libc::sysctl` FFI

**Background:** On FreeBSD, `cpu.rs` spawns `sysctl -n kern.cp_time` as a child process every second. `fork()` + `exec()` is orders of magnitude slower than a direct syscall.

**MikroTik Compatibility:**

- **100% safe.** No protocol involvement. Platform-specific internal code.

**Objective:** Replace subprocess with direct `sysctl(3)` syscall.

**Design:** FreeBSD assigns `kern.cp_time` its OID dynamically, so there is no fixed `KERN_CP_TIME` MIB constant to pass to `sysctl`; look the value up by name with `sysctlbyname(3)`:

```rust
#[cfg(target_os = "freebsd")]
fn get_cpu_times() -> (u64, u64) {
    // kern.cp_time holds CPUSTATES (5) tick counters: user, nice, sys, intr, idle.
    let mut cp_time: [libc::c_long; 5] = [0; 5];
    let mut len = std::mem::size_of_val(&cp_time);
    // SAFETY: the name is NUL-terminated and `len` matches the buffer size.
    let rc = unsafe {
        libc::sysctlbyname(
            b"kern.cp_time\0".as_ptr() as *const libc::c_char,
            cp_time.as_mut_ptr() as *mut libc::c_void,
            &mut len,
            std::ptr::null_mut::<libc::c_void>(), // read-only query: no new value
            0,
        )
    };
    if rc == 0 {
        let total: u64 = cp_time.iter().map(|&t| t as u64).sum();
        return (total, cp_time[4] as u64); // (total ticks, idle ticks)
    }
    (0, 0)
}
```

**Acceptance Criteria:**

- [ ] Compiles on FreeBSD.
- [ ] Returns same values as previous `sysctl` command approach.
- [ ] No child process spawned (verify with `ktrace` or `ps`).

**Effort:** 3 hrs

**Risk:** Medium. Requires a FreeBSD test environment; FFI is unsafe.

**Performance Impact:** Medium. Eliminates 1 fork/exec per second on FreeBSD.

---

### PRD-010: Replace 100ms Poll with `tokio::sync::Notify`

**Background:** In `server.rs:313-332`, the primary connection of a multi-connection TCP test polls the session map every 100ms while waiting for secondary connections to join.

**MikroTik Compatibility:**

- **100% safe.** This is internal server-side coordination. The wire behavior (waiting for connections, then starting the test) is unchanged. MikroTik clients will not observe a difference except potentially faster test startup.

**Objective:** Eliminate polling latency and unnecessary mutex acquisitions.

**Design:**

1. Add a `tokio::sync::Notify` to `TcpSession`:

   ```rust
   struct TcpSession {
       peer_ip: IpAddr,
       streams: Vec<TcpStream>, // element type assumed; keep the existing one
       expected: u8,
       // Arc so the primary can clone it out and await without holding the map lock.
       notify: Arc<tokio::sync::Notify>,
   }
   ```

2. In the secondary connection handler, after pushing to `streams`, call `session.notify.notify_one()`.
3. In the primary wait loop, replace the sleep loop with:

   ```rust
   // `notify` is the Arc<Notify> cloned from the session while the lock was held.
   // One 10 s deadline bounds the whole wait.
   let deadline = tokio::time::sleep(Duration::from_secs(10));
   tokio::pin!(deadline);
   loop {
       let count = { /* lock, get count, drop lock */ };
       if count + 1 >= conn_count {
           break;
       }
       tokio::select! {
           _ = notify.notified() => {}  // re-check the count at the top of the loop
           _ = &mut deadline => break,  // deadline hit: stop waiting
       }
   }
   ```

   `notify_one()` stores a permit when no task is waiting, so a secondary that joins between the count check and the `notified()` await is not lost.

**Acceptance Criteria:**

- [ ] Multi-connection TCP tests pass.
- [ ] Test startup latency is ≤ 1ms after last connection joins (was up to 100ms).
- [ ] No deadlock under concurrent multi-connection tests.

**Effort:** 2 hrs

**Risk:** Medium. This is a concurrency change; lock/notify ordering must be managed carefully to avoid races.

**Performance Impact:** Medium. Improves multi-conn test startup latency by up to 100ms per test.

---

### PRD-011: Reuse UDP RX Timer Instead of Per-Call Timeout

**Background:** Both client and server UDP RX loops create a new `tokio::time::timeout` on every `recv`/`recv_from` call:

```rust
tokio::time::timeout(Duration::from_secs(5), socket.recv(&mut buf)).await
```

At high packet rates, this registers and cancels timers on Tokio's timer wheel constantly.

**MikroTik Compatibility:**

- **100% safe.** Internal async timing only. UDP packet processing is unchanged.

**Objective:** Reduce timer wheel churn in high-rate UDP RX loops.

**Design:**

Option A: `tokio::select!` with a pinned sleep future:

```rust
use tokio::time::{sleep, Duration, Instant}; // `reset` takes tokio's Instant

const RX_TIMEOUT: Duration = Duration::from_secs(5);

let timeout = sleep(RX_TIMEOUT);
tokio::pin!(timeout);
loop {
    tokio::select! {
        biased; // prioritize recv
        res = socket.recv(&mut buf) => {
            /* handle */
            timeout.as_mut().reset(Instant::now() + RX_TIMEOUT);
        }
        _ = &mut timeout => {
            tracing::debug!("UDP RX timeout");
            // Re-arm (or `break`, matching the current loop); a completed
            // sleep would otherwise fire again immediately.
            timeout.as_mut().reset(Instant::now() + RX_TIMEOUT);
        }
    }
}
```

Option B: Use `socket2` to set `SO_RCVTIMEO` on the underlying socket and drop the Tokio timeouts. This moves timeout handling into the kernel, which is even cheaper. Note that `SO_RCVTIMEO` only affects blocking receives; it has no effect on Tokio's non-blocking sockets, so this option implies running the RX loop on a dedicated blocking thread.
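
For Option B, the standard library already exposes the kernel receive timeout: on Unix, `UdpSocket::set_read_timeout` sets `SO_RCVTIMEO`. The following is a minimal std-only sketch, assuming the RX loop runs on a blocking thread rather than on a Tokio socket. The helper names `recv_with_kernel_timeout` and `demo_kernel_timeout` are hypothetical and not part of btest-rs.

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;
use std::time::Duration;

/// Hypothetical helper: one blocking receive bounded by the kernel timeout.
/// Returns Ok(Some(n)) when a datagram arrived, Ok(None) when the timeout fired.
fn recv_with_kernel_timeout(socket: &UdpSocket, buf: &mut [u8]) -> std::io::Result<Option<usize>> {
    match socket.recv(buf) {
        Ok(n) => Ok(Some(n)),
        // A timed-out read surfaces as WouldBlock on Unix and TimedOut on Windows.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => Ok(None),
        Err(e) => Err(e),
    }
}

/// Shows both outcomes on loopback: a timeout first, then a 12-byte datagram.
fn demo_kernel_timeout() -> std::io::Result<(Option<usize>, Option<usize>)> {
    let rx = UdpSocket::bind("127.0.0.1:0")?;
    // Set once for the socket's lifetime; no per-recv timer is created.
    rx.set_read_timeout(Some(Duration::from_millis(100)))?;
    let mut buf = [0u8; 1500];

    // Nothing has been sent yet, so the kernel timeout fires.
    let idle = recv_with_kernel_timeout(&rx, &mut buf)?;

    // Once a datagram is queued, recv returns its length.
    let tx = UdpSocket::bind("127.0.0.1:0")?;
    tx.send_to(&[0u8; 12], rx.local_addr()?)?;
    let got = recv_with_kernel_timeout(&rx, &mut buf)?;

    Ok((idle, got))
}
```

The trade-off is that the receive loop now occupies a thread while it waits, which is why Option A remains the lower-risk starting point inside an async runtime.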

**Recommendation:** Start with Option A (pure Tokio, no platform risk). Option B can be a follow-up.

**Acceptance Criteria:**

- [ ] UDP send/receive/both tests pass.
- [ ] UDP RX still times out correctly when no packets arrive.
- [ ] No change to packet parsing or sequence tracking.

**Effort:** 2 hrs

**Risk:** Medium. Changes timeout behavior; must ensure test abortion still works correctly.

**Performance Impact:** Medium. Reduces timer wheel registration overhead, noticeable at >50K pps.

---

## Tier 3: High Impact (Do These With Full Focus)

---

### PRD-012: Optimize TCP Client RX Status Message Scan

**Background:** `tcp_client_rx_loop` (`client.rs:210-216`) scans up to 256KB byte-by-byte on every `read()` call looking for a 12-byte status marker (`0x07` + `0x80|cpu`). Since data is all zeros, this is almost always a full scan.

**MikroTik Compatibility Consideration:**

- **High confidence of safety.** The protocol is: MikroTik injects 12-byte status messages into the TCP stream. Our client must detect them. Changing *how* we detect them (faster scan) does not change:
  - What bytes are sent on the wire
  - What bytes we expect
  - How we respond to status messages
- **One edge case to handle:** TCP is a stream. A status message may be split across two `read()` calls. The current code does **not** handle this correctly (it scans each buffer independently). The optimized version *should* handle split messages to be strictly more correct than the current implementation.

**Objective:** Replace the scalar byte-by-byte scan with SIMD-accelerated (`memchr`) or state-machine-based detection, while correctly handling split messages.
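
The split-read edge case can be exercised without a socket. Below is a minimal std-only sketch of a stateful scanner that is fed arbitrary chunks and remembers the unconsumed tail of the stream between calls. `StatusScanner`, `feed`, and `status_msg` are hypothetical names; the marker values (`0x07`, high bit set on the CPU byte, 12-byte length) come from the description above, and a plain scalar search stands in for `memchr` so the example has no dependencies.

```rust
const STATUS_MSG_TYPE: u8 = 0x07;
const STATUS_MSG_SIZE: usize = 12;

/// Hypothetical scanner that carries the stream tail across reads.
#[derive(Default)]
struct StatusScanner {
    tail: Vec<u8>, // never more than STATUS_MSG_SIZE - 1 bytes
}

impl StatusScanner {
    /// Feed the bytes of one read; returns the last CPU value seen, if any.
    fn feed(&mut self, chunk: &[u8]) -> Option<u8> {
        // Simplest correct form: treat tail + chunk as one contiguous stream.
        // (A production version would copy only the first 11 bytes of `chunk`
        // next to the tail and scan the rest of the read in place.)
        let mut stream = std::mem::take(&mut self.tail);
        stream.extend_from_slice(chunk);

        let mut cpu = None;
        let mut i = 0;
        while i + STATUS_MSG_SIZE <= stream.len() {
            if stream[i] == STATUS_MSG_TYPE && stream[i + 1] >= 0x80 {
                cpu = Some(stream[i + 1] & 0x7F);
                i += STATUS_MSG_SIZE; // skip the message body to avoid false hits inside it
            } else {
                i += 1;
            }
        }
        // Fewer than 12 bytes remain; keep them for the next call.
        self.tail = stream.split_off(i);
        cpu
    }
}

/// Builds a 12-byte status message carrying `cpu` (remaining bytes zeroed).
fn status_msg(cpu: u8) -> [u8; STATUS_MSG_SIZE] {
    let mut m = [0u8; STATUS_MSG_SIZE];
    m[0] = STATUS_MSG_TYPE;
    m[1] = 0x80 | cpu;
    m
}
```

Feeding the first 5 bytes of `status_msg(42)` returns `None`, and feeding the remaining 7 bytes returns `Some(42)`, which is the behavior the split-message acceptance test needs to pin down.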

**Design (Recommended): Ring Buffer Approach**

Since status messages are 12 bytes and all other bytes are zeros, carry the last 11 bytes of the stream from one read into the next and check every possible start offset in that carried tail:

```rust
const STATUS_MSG_SIZE: usize = 12;
const CARRY_MAX: usize = STATUS_MSG_SIZE - 1;

// True if `msg` holds a full status message: 0x07, then 0x80 | cpu.
fn is_status(msg: &[u8]) -> bool {
    msg.len() >= STATUS_MSG_SIZE && msg[0] == STATUS_MSG_TYPE && msg[1] >= 0x80
}

// `State` stands in for the existing shared-state type (name assumed).
async fn tcp_client_rx_loop(mut reader: OwnedReadHalf, state: Arc<State>) {
    let mut buf = vec![0u8; 256 * 1024];
    // Carried tail (up to 11 bytes) followed by the head of the current read.
    let mut window = [0u8; 2 * CARRY_MAX];
    let mut carry_len = 0usize;

    while state.running.load(Ordering::Relaxed) {
        match reader.read(&mut buf).await {
            Ok(0) | Err(_) => break,
            Ok(n) => {
                state.rx_bytes.fetch_add(n as u64, Ordering::Relaxed);

                // 1. Messages that straddle the previous read: try every start in the tail.
                let head = n.min(CARRY_MAX);
                window[carry_len..carry_len + head].copy_from_slice(&buf[..head]);
                let win_len = carry_len + head;
                for start in 0..carry_len {
                    if is_status(&window[start..win_len]) {
                        state.remote_cpu.store(window[start + 1] & 0x7F, Ordering::Relaxed);
                    }
                }

                // 2. Messages fully inside buf: data is zeros, so memchr finds 0x07 candidates fast.
                if n >= STATUS_MSG_SIZE {
                    let search_end = n - STATUS_MSG_SIZE + 1;
                    let mut offset = 0;
                    while let Some(pos) = memchr::memchr(STATUS_MSG_TYPE, &buf[offset..search_end]) {
                        let i = offset + pos;
                        if buf[i + 1] >= 0x80 {
                            state.remote_cpu.store(buf[i + 1] & 0x7F, Ordering::Relaxed);
                            break;
                        }
                        offset = i + 1;
                    }
                }

                // 3. Keep the last (up to) 11 stream bytes for the next read.
                if n >= CARRY_MAX {
                    window[..CARRY_MAX].copy_from_slice(&buf[n - CARRY_MAX..n]);
                    carry_len = CARRY_MAX;
                } else {
                    // Short read: window already holds the old tail plus the whole read.
                    let keep = win_len.min(CARRY_MAX);
                    window.copy_within(win_len - keep..win_len, 0);
                    carry_len = keep;
                }
            }
        }
    }
}
```

**Alternative: `memchr` crate only**

If we determine split messages are extremely rare and the current behavior is "good enough," simply replace the `for` loop with:

```rust
// Guard the subtraction: reads shorter than one message cannot contain one.
if n >= STATUS_MSG_SIZE {
    if let Some(pos) = memchr::memchr(STATUS_MSG_TYPE, &buf[..n - STATUS_MSG_SIZE + 1]) {
        if buf[pos + 1] >= 0x80 { /* ... */ }
    }
}
```

This is a change of only a few lines with a large speedup (SIMD scan).
However, the ring buffer approach is strictly more correct and not much more complex.

**Acceptance Criteria:**

- [ ] TCP bidirectional tests pass.
- [ ] Remote CPU reporting still works.
- [ ] Status messages split across reads are correctly detected (unit test for this).
- [ ] `memchr` crate added to deps (very lightweight).
- [ ] No change to wire bytes or server behavior.

**Effort:** 4 hrs

**Risk:** Medium. This is a hot path change; it must be carefully reviewed and tested.

**Performance Impact:** **High**. Replaces a scalar scan of up to 256KB per read with a vectorized one. At 10K full-buffer reads/sec, the scalar loop currently walks roughly 2.5GB per second byte by byte.

---

### PRD-013: SQLite Connection Pool / Channel-Based Writer

**Background:** `server_pro` uses a single `Arc<Mutex<Connection>>`. All quota checks, usage recordings, and auth lookups serialize through one lock. `remaining_budget()` issues 15 queries, locking 15+ times. This is the primary scalability bottleneck for the pro server.

**MikroTik Compatibility:**

- **100% safe.** Server-side infrastructure only. No protocol change.

**Objective:** Enable concurrent quota checks and usage recording without mutex contention.

**Design, Option A: Connection Pool (Recommended for reads)**

Use `r2d2_sqlite` or `deadpool-sqlite`:

1. Open a pool of ~4-8 connections to the same SQLite file (WAL mode supports this).
2. Read-only operations (`remaining_budget`, `get_user`, `check_user`) borrow a connection from the pool.
3. Write operations (`record_usage`, `record_session`) also borrow from the pool. WAL allows concurrent readers plus one writer at a time, so set a busy timeout on each connection so that a second writer waits instead of failing with `SQLITE_BUSY`.

**Design, Option B: Channel-Based Writer (Recommended for writes)**

1. Keep one dedicated `Connection` owned by a single Tokio task (or a dedicated thread, since `rusqlite` calls block).
2. Expose an `mpsc::channel` where other tasks send write requests (`RecordUsage { user, tx, rx }`).
3. The writer task batches or sequentially executes writes without any mutex.
4. Reads use a separate read-only connection or pool.
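
The channel-based writer in Option B can be sketched with the standard library alone. This is an illustration under stated assumptions, not the proposed implementation: a `std::thread` and `std::sync::mpsc` stand in for the Tokio task and channel, a `HashMap` stands in for the SQLite connection, and `WriteReq`, `spawn_writer`, and `demo_writer` are hypothetical names (only the `RecordUsage { user, tx, rx }` shape comes from the text above).

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

type Usage = HashMap<String, (u64, u64)>;

/// Hypothetical requests understood by the single writer.
enum WriteReq {
    RecordUsage { user: String, tx: u64, rx: u64 },
    /// Ask for a copy of the current totals, answered on the given channel.
    Snapshot(mpsc::Sender<Usage>),
}

/// Spawns the writer. Only this thread touches the store (the stand-in for
/// the SQLite connection), so no mutex is involved.
fn spawn_writer() -> (mpsc::Sender<WriteReq>, thread::JoinHandle<()>) {
    let (sender, receiver) = mpsc::channel::<WriteReq>();
    let handle = thread::spawn(move || {
        let mut usage = Usage::new();
        // Requests are applied in arrival order; the loop ends once every sender is dropped.
        for req in receiver {
            match req {
                WriteReq::RecordUsage { user, tx, rx } => {
                    let entry = usage.entry(user).or_insert((0, 0));
                    entry.0 += tx;
                    entry.1 += rx;
                }
                WriteReq::Snapshot(reply) => {
                    let _ = reply.send(usage.clone());
                }
            }
        }
    });
    (sender, handle)
}

/// Eight producers record usage concurrently; returns the totals for "alice".
fn demo_writer() -> (u64, u64) {
    let (sender, handle) = spawn_writer();
    let producers: Vec<_> = (0..8)
        .map(|_| {
            let s = sender.clone();
            thread::spawn(move || {
                for _ in 0..100 {
                    s.send(WriteReq::RecordUsage { user: "alice".into(), tx: 10, rx: 1 })
                        .unwrap();
                }
            })
        })
        .collect();
    for p in producers {
        p.join().unwrap();
    }

    let (reply_tx, reply_rx) = mpsc::channel();
    sender.send(WriteReq::Snapshot(reply_tx)).unwrap();
    let snapshot = reply_rx.recv().unwrap();

    drop(sender); // last sender gone: the writer loop exits
    handle.join().unwrap();
    snapshot["alice"]
}
```

Because every write passes through one queue, ordering is deterministic per sender and lock contention disappears; the cost is that callers needing a result (such as the snapshot here) must wait for a reply message.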

**Hybrid Recommendation:**

- **Reads:** Small connection pool (4 connections) for quota checks and auth lookups.
- **Writes:** Single dedicated async task with an `mpsc::unbounded_channel` for usage recording.
- **Cache:** Add a 5-second TTL cache for `remaining_budget()` results per user+IP to avoid redundant DB hits during test setup.

**Acceptance Criteria:**

- [ ] `pro` feature compiles and all tests pass.
- [ ] Concurrent test launches scale linearly up to at least 50 concurrent sessions.
- [ ] Quota enforcement remains correct (no over-quota usage).
- [ ] Session logging and interval recording remain accurate.
- [ ] No SQLite "database is locked" errors under load.

**Effort:** 1–2 days

**Risk:** High. Touches every DB interaction in `server_pro`; potential for data races, quota leaks, or connection exhaustion.

**Performance Impact:** **High**. Lets concurrent tests proceed without serializing on a single lock; removes the primary pro server bottleneck.

---

## Execution Roadmap

### Sprint 1: Quick Wins + Foundation (1 day)

- [ ] PRD-001: WCurve cache
- [ ] PRD-002: `Instant::now()` dedup
- [ ] PRD-003: `hash_password` hex fix
- [ ] PRD-004: CSV file handle cache
- [ ] PRD-005: Error string matching
- [ ] PRD-006: `chrono` date replacement
- [ ] PRD-007: Syslog optimization
- [ ] PRD-008: `ip.to_string()` cache

**Deliverable:** Low-risk PR with 8 clean commits. Run full integration tests.

### Sprint 2: Platform & Async Fixes (1 day)

- [ ] PRD-009: FreeBSD CPU FFI
- [ ] PRD-010: Multi-conn Notify wake
- [ ] PRD-011: UDP timer reuse

**Deliverable:** PR with platform + latency improvements.

### Sprint 3: Hot Path Optimization (1–2 days)

- [ ] PRD-012: TCP RX scan optimization
- [ ] Add unit test for split status messages
- [ ] Benchmark before/after with `criterion` (or manual throughput test)

**Deliverable:** PR with benchmark numbers proving improvement.

### Sprint 4: Scalability (2–3 days)

- [ ] PRD-013: SQLite connection pool / channel writer
- [ ] Load test: 50 concurrent tests, verify no DB lock contention
- [ ] Add `remaining_budget` cache

**Deliverable:** PR with load test results.

---

## Testing Requirements for All PRDs

Since **no wire protocol changes** are made, the existing integration test suite is the primary validation tool. However, for PRD-009, PRD-012, and PRD-013, additional tests are required:

### New Tests to Add

1. **Split Status Message Unit Test (for PRD-012)**

   ```rust
   #[test]
   fn test_status_message_split_across_reads() {
       // Feed first 5 bytes, then remaining 7 bytes
       // Assert CPU value is extracted correctly
   }
   ```

2. **Concurrent Quota Load Test (for PRD-013)**

   ```rust
   #[tokio::test]
   async fn test_concurrent_quota_checks() {
       // Spawn 50 tasks doing remaining_budget() + record_usage()
       // Assert no panics, no SQLite locked errors
   }
   ```

3. **FreeBSD CPU Parity Test (for PRD-009)**

   Manual verification on FreeBSD that FFI `sysctl` returns same values as command.

---

## Appendix: MikroTik Compatibility Checklist

For every PRD, verify:

- [ ] No change to `Command` or `StatusMessage` struct layouts or serialization
- [ ] No change to MD5 challenge-response handshake order
- [ ] No change to EC-SRP5 handshake order or byte values
- [ ] No change to TCP packet sizes or UDP payload format
- [ ] No change to status injection timing (1-second interval)
- [ ] No change to NAT probe behavior
- [ ] Client can still authenticate against stock RouterOS `btest` server
- [ ] Server can still accept connections from stock RouterOS `btest` client

All PRDs in this document satisfy the above checklist by construction.