This commit fixes the most significant hot-path bottleneck in the
client: the tcp_client_rx_loop was scanning up to 256KB byte-by-byte
on every read() call looking for interleaved 12-byte status messages.
Changes:
- client.rs (tcp_client_rx_loop): Replace the byte-by-byte for-loop
  scan with a three-stage approach:
  1. Split-message check: An 11-byte carry buffer stores trailing
     bytes from the previous read. We check every possible alignment
     where a status message (0x07 + cpu_byte) could span the carry
     and the start of the current buffer. This fixes a latent bug
     where the old code would miss status messages split across TCP
     read boundaries.
  2. Fast scan: memchr::memchr (AVX2/NEON SIMD) finds 0x07 bytes
     in the 256KB buffer. On all-zero data packets this exits in
     ~4096 SIMD steps (assuming 64 bytes per step) instead of
     262,144 byte compares: an estimated ~64x faster scan path.
  3. Carry save: Save up to 11 trailing bytes for the next read.
- client.rs (unit tests): Add scan_status_message() helper and
  five unit tests covering:
  - Status message fully within buffer
  - Status message split across reads (5+7 bytes)
  - Status message split at boundary (1+11 bytes)
  - All-zero buffer (no false positive)
  - Short buffer (no panic)
- Cargo.toml / Cargo.lock: Add memchr as an explicit dependency.
Verified against live MikroTik RouterOS (TCP both + receive modes
with EC-SRP5 auth). Status messages detected correctly. No wire
protocol changes — 100% MikroTik compatible.
Performance Improvement PRDs
Project: btest-rs
Constraint: 100% MikroTik BTest protocol compatibility — no wire-format or behavioral changes visible to MikroTik devices
Date: 2026-04-30
How to Read This Document
The PRDs are listed in recommended execution order, which balances:
- Effort (development + review + test time)
- Risk (probability of regression or compatibility break)
- Performance Effect (measured or estimated throughput/latency improvement)
- MikroTik Compatibility Risk (whether the change could affect interoperability)
Sorting rationale: Execute quick wins first to build velocity and reduce risk surface, then tackle high-impact items with full attention.
Summary Matrix
| # | PRD | Effort | Risk | Perf Impact | MikroTik Risk | Tier |
|---|---|---|---|---|---|---|
| 1 | WCurve Global Cache | 30 min | None | Medium | None | Quick Win |
| 2 | Redundant Instant::now() | 15 min | None | Low | None | Quick Win |
| 3 | hash_password Hex Fix | 30 min | None | Low | None | Quick Win |
| 4 | CSV File Handle Cache | 30 min | None | Low | None | Quick Win |
| 5 | Error String Matching | 30 min | None | Low | None | Quick Win |
| 6 | chrono_date_today Replace | 1 hr | Low | Low | None | Quick Win |
| 7 | Syslog Mutex + Timestamp | 1 hr | Low | Low | None | Quick Win |
| 8 | ip.to_string() Cache | 1 hr | Low | Low | None | Quick Win |
| 9 | FreeBSD CPU FFI | 3 hrs | Medium | Medium | None | Platform Fix |
| 10 | Multi-Conn Notify Wake | 2 hrs | Medium | Medium | None | Latency Fix |
| 11 | UDP Timer Reuse | 2 hrs | Medium | Medium | None | Throughput Fix |
| 12 | TCP RX Scan Optimization | 4 hrs | Medium | High | Low | Hot Path Fix |
| 13 | SQLite Connection Pool | 1–2 days | High | High | None | Scalability Fix |
Tier 1: Quick Wins (Do These First)
PRD-001: Cache WCurve in Global LazyLock
Background:
WCurve::new() is called on every EC-SRP5 authentication (client and server). It recomputes the Weierstrass curve generator point via lift_x(9) → prime_mod_sqrt(), which performs heavy BigUint modular arithmetic. The result is deterministic and immutable.
MikroTik Compatibility:
- 100% safe. This is pure internal mathematics. The wire bytes, auth handshake order, and hash outputs are identical. No protocol-visible change.
Objective:
Eliminate redundant BigUint modular square root computation per authentication.
Design:
// src/ecsrp5.rs
static WCURVE: std::sync::LazyLock<WCurve> = std::sync::LazyLock::new(WCurve::new);
Replace all call sites:
- src/ecsrp5.rs:363 (client_authenticate)
- src/ecsrp5.rs:499 (server_authenticate)
Change let w = WCurve::new(); to let w = &*WCURVE;. Update any WCurve methods that take self to take &self if they don't already.
Acceptance Criteria:
- ecsrp5_test.rs passes unchanged.
- full_integration_test.rs EC-SRP5 tests pass unchanged.
- WCurve::new() is called exactly once per process lifetime.
- No change to serialized auth bytes on the wire.
Effort: 30 min
Risk: None — stateless deterministic cache
Performance Impact: Medium — reduces per-auth CPU time by ~30-50% (estimated), especially noticeable under concurrent logins.
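A minimal sketch of the caching pattern, assuming Rust 1.80+ for std::sync::LazyLock. `Curve`, `INIT_CALLS`, and `authenticate` are hypothetical stand-ins for illustration; the real WCurve holds BigUint parameters that are not shown in this document. The counter exists only to demonstrate that the constructor runs once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

// Counts constructor runs (demonstration only, not part of the proposed change).
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for WCurve.
struct Curve {
    generator_x: u64,
}

impl Curve {
    fn new() -> Self {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        // Placeholder for the expensive lift_x(9) / prime_mod_sqrt() work.
        Curve { generator_x: 9 }
    }

    // Takes &self so it can be called through the shared static.
    fn generator_x(&self) -> u64 {
        self.generator_x
    }
}

// Initialized on first access, then shared by every caller.
static CURVE: LazyLock<Curve> = LazyLock::new(Curve::new);

// Stand-in for a call site such as client_authenticate.
fn authenticate() -> u64 {
    let w: &Curve = &CURVE; // deref coercion from &LazyLock<Curve>
    w.generator_x()
}
```

Calling `authenticate()` any number of times leaves `INIT_CALLS` at 1, which is one way to check the "called exactly once per process lifetime" criterion in a test.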
PRD-002: Deduplicate Instant::now() in tcp_tx_loop_inner
Background:
The TCP TX loop calls Instant::now() twice per iteration (status check and interval scheduling). Monotonic clock reads are cheap but not free, and occur in the hottest loop in the system.
MikroTik Compatibility:
- 100% safe. Timing granularity remains identical.
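Objective: Remove the redundant clock read from the per-packet hot path (Instant::now() is usually a vDSO call on Linux rather than a full syscall, so the saving is small).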
Design:
let now = Instant::now();
if send_status && now >= next_status { ... next_status = now + Duration::from_secs(1); }
// ... reuse `now` for interval math
Acceptance Criteria:
- TCP send/receive/both integration tests pass.
- No behavioral change in status injection timing.
Effort: 15 min
Risk: None
Performance Impact: Low — micro-optimization, but trivial.
PRD-003: Fix hash_password() Hex Encoding Allocations
Background:
user_db.rs:614 allocates one String per byte when hex-encoding a 32-byte SHA256 hash:
result.iter().map(|b| format!("{:02x}", b)).collect()
MikroTik Compatibility:
- 100% safe. Output string is identical.
Objective: Replace N-allocation hex encoding with a single-allocation approach.
Design:
Use the hex crate (already in the dependency tree via ecsrp5.rs debug logging), or write each byte with write!(s, "{:02x}", b) into a String::with_capacity(64).
Acceptance Criteria:
- Same hex string output for all inputs.
- pro feature tests pass.
Effort: 30 min
Risk: None
Performance Impact: Low — removes 32 allocations per password hash.
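A minimal sketch of the single-allocation variant using only std. The function name `hex_encode` is an assumption for illustration; the output is intended to match the existing `format!("{:02x}", b)` approach byte for byte.

```rust
use std::fmt::Write;

/// Hex-encodes `bytes` as lowercase, two digits per byte, into one pre-sized String.
fn hex_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        // Writing into a String cannot fail, so the Result is ignored.
        let _ = write!(out, "{:02x}", b);
    }
    out
}
```

For a 32-byte SHA256 digest this reserves 64 bytes up front and never reallocates.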
PRD-004: Cache CSV File Handle
Background:
csv_output::write_result() re-opens the file via OpenOptions::new().append(true).open(path) on every call (once per test). Safe but wasteful.
MikroTik Compatibility:
- 100% safe. No protocol involvement.
Objective: Hold the file handle open for the process lifetime.
Design:
Change static CSV_FILE: Mutex<Option<String>> to Mutex<Option<(String, std::fs::File)>>, or open once during init() and store Mutex<Option<File>>.
Acceptance Criteria:
- CSV tests in full_integration_test.rs pass.
- File is created with headers on init().
- Multiple write_result calls append correctly.
Effort: 30 min
Risk: None
Performance Impact: Low — removes one open() syscall per test.
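A sketch of the "open once during init()" variant, assuming a `Mutex<Option<File>>` static. The signatures of `init` and `write_result` here are simplified guesses (a header string and a preformatted row); the real functions in csv_output are not shown in this document.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;
use std::sync::Mutex;

// Holds the open CSV handle for the process lifetime; None until init() runs.
static CSV_FILE: Mutex<Option<File>> = Mutex::new(None);

/// Opens (or creates) the CSV file once and writes the header if the file is empty.
fn init(path: &Path, header: &str) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    if file.metadata()?.len() == 0 {
        writeln!(file, "{header}")?;
    }
    *CSV_FILE.lock().unwrap() = Some(file);
    Ok(())
}

/// Appends one row through the cached handle; does nothing if init() was never called.
fn write_result(row: &str) -> io::Result<()> {
    if let Some(file) = CSV_FILE.lock().unwrap().as_mut() {
        writeln!(file, "{row}")?;
    }
    Ok(())
}
```

The lock is still taken per write, but the open() call happens once instead of once per test.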
PRD-005: Remove Allocating Error String Matching
Background:
src/server_pro/enforcer.rs:157-161 does:
match format!("{}", e).as_str() {
s if s.contains("daily") => ...
}
This allocates a String from the error just for substring matching.
MikroTik Compatibility:
- 100% safe. Server-pro internal logic only.
Objective: Match without allocation.
Design:
Use e.to_string().contains("daily") (still allocates but clearer) or, better, downcast the rusqlite::Error or match on structured error variants. If the error is anyhow::Error, use .downcast_ref::<rusqlite::Error>().
Acceptance Criteria:
- Quota enforcement behavior unchanged.
- Enforcer tests pass.
Effort: 30 min
Risk: None
Performance Impact: Low — removes one allocation per enforcer tick.
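A sketch of matching on structured variants instead of message text. `QuotaError` and `Action` are hypothetical types invented for illustration; the enforcer's actual error type is not shown in this document and may instead require a downcast as described in the design.

```rust
use std::fmt;

// Hypothetical error type; the variant names are assumptions.
#[derive(Debug)]
enum QuotaError {
    DailyExceeded,
    MonthlyExceeded,
    Database(String),
}

// Display is kept for logging; it no longer drives control flow.
impl fmt::Display for QuotaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            QuotaError::DailyExceeded => write!(f, "daily quota exceeded"),
            QuotaError::MonthlyExceeded => write!(f, "monthly quota exceeded"),
            QuotaError::Database(msg) => write!(f, "database error: {msg}"),
        }
    }
}

#[derive(Debug, PartialEq)]
enum Action {
    StopDaily,
    StopMonthly,
    LogAndContinue,
}

// Matches on the variant: no String is built, and rewording a message cannot change behavior.
fn classify(e: &QuotaError) -> Action {
    match e {
        QuotaError::DailyExceeded => Action::StopDaily,
        QuotaError::MonthlyExceeded => Action::StopMonthly,
        QuotaError::Database(_) => Action::LogAndContinue,
    }
}
```

The compiler also checks the match for exhaustiveness, so a newly added variant cannot be silently ignored the way an unmatched substring would be.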
PRD-006: Replace chrono_date_today() with chrono Crate
Background:
user_db.rs:617-638 contains a hand-rolled Gregorian calendar converter that loops from 1970 to compute today's date. Called before almost every DB write. The chrono crate is already pulled in transitively by rusqlite.
MikroTik Compatibility:
- 100% safe. No protocol involvement.
Objective:
Replace 30 lines of error-prone manual date math with one chrono call.
Design:
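Add chrono = { version = "0.4", optional = true } gated behind the pro feature; a transitive dependency cannot be imported without declaring it in Cargo.toml. Replace chrono_date_today() with the call below. If the existing function derives the date from UTC epoch seconds, use chrono::Utc::now() instead of chrono::Local::now() so stored dates do not shift: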
chrono::Local::now().format("%Y-%m-%d").to_string()
Acceptance Criteria:
- pro feature compiles.
- Date strings match format YYYY-MM-DD.
- DB write tests pass.
Effort: 1 hr
Risk: Low — adds explicit dep that already exists transitively
Performance Impact: Low — eliminates loop overhead, but called infrequently.
PRD-007: Optimize Syslog Mutex and Timestamp Formatting
Background:
syslog_logger.rs holds a global std::sync::Mutex while formatting a timestamp (manual calendar math) and sending UDP, so concurrent loggers serialize behind both the formatting and the network send. The timestamp logic also duplicates the chrono_date_today() issues.
MikroTik Compatibility:
- 100% safe. No protocol involvement.
Objective: Reduce lock contention and allocation in logging path.
Design:
- Use parking_lot::Mutex (faster, no poisoning) OR keep std::sync::Mutex but clone the SyslogSender config outside the lock.
- Replace bsd_timestamp() with chrono::Local::now().format("%b %e %H:%M:%S").
- Pre-allocate the String with with_capacity(256).
Acceptance Criteria:
- Syslog output format remains RFC 3164 compliant.
- test_syslog_events in full_integration_test.rs passes.
Effort: 1 hr
Risk: Low
Performance Impact: Low — logging is not a hot path, but reduces global lock hold time.
PRD-008: Cache ip.to_string() in Quota Checks
Background:
quota.rs:389 calls ip.to_string() and then passes &ip_str to multiple DB methods, allocating a new String on every remaining_budget() call.
MikroTik Compatibility:
- 100% safe. Server-pro internal logic.
Objective: Eliminate redundant IP stringification.
Design:
Change DB methods to accept &std::net::IpAddr directly and stringify inside only when needed for SQL parameter binding (which rusqlite may already handle via ToSql). Alternatively, pass ip_str: &str from a single to_string() call and avoid re-stringifying in sub-calls.
Acceptance Criteria:
- Quota checks return identical results.
- pro feature tests pass.
Effort: 1 hr
Risk: Low
Performance Impact: Low — one allocation removed per quota check.
Tier 2: Moderate Fixes (Platform & Latency)
PRD-009: FreeBSD CPU Sampling via libc::sysctl FFI
Background:
On FreeBSD, cpu.rs spawns sysctl -n kern.cp_time as a child process every second. fork() + exec() is orders of magnitude slower than a direct syscall.
MikroTik Compatibility:
- 100% safe. No protocol involvement. Platform-specific internal code.
Objective:
Replace subprocess with direct sysctl(3) syscall.
Design:
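#[cfg(target_os = "freebsd")]
fn get_cpu_times() -> (u64, u64) {
    // kern.cp_time is a dynamically numbered OID on FreeBSD (there is no
    // KERN_CP_TIME MIB constant), so look it up by name.
    let name = b"kern.cp_time\0";
    // CPUSTATES = 5: user, nice, system, interrupt, idle.
    let mut cp_time: [libc::c_ulong; 5] = [0; 5];
    let mut len = std::mem::size_of_val(&cp_time);
    // SAFETY: `name` is NUL-terminated and `len` is the exact size of `cp_time`.
    let rc = unsafe {
        libc::sysctlbyname(
            name.as_ptr() as *const libc::c_char,
            cp_time.as_mut_ptr() as *mut libc::c_void,
            &mut len,
            std::ptr::null_mut::<libc::c_void>(),
            0,
        )
    };
    if rc == 0 {
        let total: libc::c_ulong = cp_time.iter().sum();
        return (total as u64, cp_time[4] as u64); // index 4 is idle time
    }
    (0, 0)
}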
Acceptance Criteria:
- Compiles on FreeBSD.
- Returns same values as previous sysctl command approach.
- No child process spawned (verify with ktrace or ps).
Effort: 3 hrs
Risk: Medium — requires FreeBSD test environment; FFI is unsafe
Performance Impact: Medium — eliminates 1 fork/exec per second on FreeBSD.
PRD-010: Replace 100ms Poll with tokio::sync::Notify
Background:
In server.rs:313-332, the primary connection of a multi-connection TCP test polls the session map every 100ms (sleeping between checks) while waiting for secondary connections to join.
MikroTik Compatibility:
- 100% safe. This is internal server-side coordination. The wire behavior (waiting for connections, then starting the test) is unchanged. MikroTik clients will not observe a difference except potentially faster test startup.
Objective: Eliminate polling latency and unnecessary mutex acquisitions.
Design:
- Add a tokio::sync::Notify to TcpSession, wrapped in an Arc so the waiter can hold it without keeping the session map locked:
struct TcpSession {
    peer_ip: IpAddr,
    streams: Vec<OwnedTcpStream>,
    expected: u8,
    notify: Arc<tokio::sync::Notify>,
}
- In the secondary connection handler, after pushing to streams, call session.notify.notify_one().
- In the primary wait loop, replace the sleep loop with:
let notify = { /* lock, clone session.notify, drop lock */ };
// One 10s deadline for the whole wait, not one per wakeup
let deadline = tokio::time::sleep(Duration::from_secs(10));
tokio::pin!(deadline);
loop {
    let count = { /* lock, get count, drop lock */ };
    if count + 1 >= conn_count { break; }
    tokio::select! {
        // notify_one() stores a permit, so a wakeup sent before we await is not lost;
        // the count is re-checked at the top of the loop
        _ = notify.notified() => {}
        _ = &mut deadline => { break; }
    }
}
Acceptance Criteria:
- Multi-connection TCP tests pass.
- Test startup latency is ≤ 1ms after last connection joins (was up to 100ms).
- No deadlock under concurrent multi-connection tests.
Effort: 2 hrs
Risk: Medium — concurrency change; must carefully manage lock/notify ordering to avoid races
Performance Impact: Medium — improves multi-conn test startup latency by up to 100ms per test.
PRD-011: Reuse UDP RX Timer Instead of Per-Call Timeout
Background:
Both client and server UDP RX loops create a new tokio::time::timeout on every recv/recv_from call:
tokio::time::timeout(Duration::from_secs(5), socket.recv(&mut buf)).await
At high packet rates, this registers and cancels timers on Tokio's timer wheel constantly.
MikroTik Compatibility:
- 100% safe. Internal async timing only. UDP packet processing is unchanged.
Objective: Reduce timer wheel churn in high-rate UDP RX loops.
Design:
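Option A: tokio::select! with a pinned sleep future:
let timeout = tokio::time::sleep(Duration::from_secs(5));
tokio::pin!(timeout);
loop {
    tokio::select! {
        biased; // prioritize recv
        res = socket.recv(&mut buf) => {
            /* handle */
            // reset() takes a tokio::time::Instant, not a std::time::Instant
            timeout.as_mut().reset(tokio::time::Instant::now() + Duration::from_secs(5));
        }
        _ = &mut timeout => {
            tracing::debug!("UDP RX timeout");
            // Re-arm (or break): a completed sleep would otherwise fire on every iteration
            timeout.as_mut().reset(tokio::time::Instant::now() + Duration::from_secs(5));
        }
    }
}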
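Option B: Use socket2 to set SO_RCVTIMEO on the underlying socket and receive with a blocking recv on a dedicated thread, without Tokio timeouts. SO_RCVTIMEO has no effect on the non-blocking sockets Tokio uses, so this option means leaving the async recv path. It moves timeout handling into the kernel, which may be cheaper still.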
Recommendation: Start with Option A (pure Tokio, no platform risk). Option B can be a follow-up.
Acceptance Criteria:
- UDP send/receive/both tests pass.
- UDP RX still times out correctly when no packets arrive.
- No change to packet parsing or sequence tracking.
Effort: 2 hrs
Risk: Medium — changes timeout behavior; must ensure test abortion still works correctly
Performance Impact: Medium — reduces timer wheel registration overhead, noticeable at >50K pps.
Tier 3: High Impact (Do These With Full Focus)
PRD-012: Optimize TCP Client RX Status Message Scan
Background:
tcp_client_rx_loop (client.rs:210-216) scans up to 256KB byte-by-byte on every read() call looking for a 12-byte status marker (0x07 + 0x80|cpu). Since data is all zeros, this is almost always a full scan.
MikroTik Compatibility Consideration:
- High confidence of safety. The protocol is: MikroTik injects 12-byte status messages into the TCP stream. Our client must detect them. Changing how we detect them (faster scan) does not change:
  - What bytes are sent on the wire
  - What bytes we expect
  - How we respond to status messages
- One edge case to handle: TCP is a stream. A status message may be split across two read() calls. The current code does not handle this correctly (it scans each buffer independently). The optimized version should handle split messages to be strictly more correct than the current implementation.
Objective: Replace O(n) byte-by-byte scan with SIMD-accelerated or state-machine-based detection, while correctly handling split messages.
Design — Recommended: Ring Buffer Approach
Since status messages are 12 bytes and all other bytes are zeros, carry the last 11 bytes (STATUS_MSG_SIZE - 1) of the stream over from one read to the next:
const STATUS_MSG_SIZE: usize = 12;
const CARRY_MAX: usize = STATUS_MSG_SIZE - 1;

async fn tcp_client_rx_loop(mut reader: OwnedReadHalf, state: Arc<BandwidthState>) {
    let mut buf = vec![0u8; 256 * 1024];
    let mut carry = [0u8; CARRY_MAX]; // up to 11 bytes from previous reads
    let mut carry_len = 0usize;
    while state.running.load(Ordering::Relaxed) {
        match reader.read(&mut buf).await {
            Ok(0) | Err(_) => break,
            Ok(n) => {
                state.rx_bytes.fetch_add(n as u64, Ordering::Relaxed);
                // Stage 1: check every alignment where a message starts in the carry
                // and ends in buf. The window is carry + the first (up to) 11 bytes of buf,
                // so every complete 12-byte candidate in it starts inside the carry.
                let head = n.min(CARRY_MAX);
                let mut window = [0u8; 2 * CARRY_MAX];
                window[..carry_len].copy_from_slice(&carry[..carry_len]);
                window[carry_len..carry_len + head].copy_from_slice(&buf[..head]);
                let win_len = carry_len + head;
                for k in 0..(win_len + 1).saturating_sub(STATUS_MSG_SIZE) {
                    if window[k] == STATUS_MSG_TYPE && window[k + 1] >= 0x80 {
                        state.remote_cpu.store(window[k + 1] & 0x7F, Ordering::Relaxed);
                        break;
                    }
                }
                // Stage 2: scan within buf for status messages.
                // Since data is zeros, use memchr to find 0x07 candidates.
                if n >= STATUS_MSG_SIZE {
                    let search_end = n - STATUS_MSG_SIZE + 1;
                    let mut offset = 0;
                    while let Some(pos) = memchr::memchr(STATUS_MSG_TYPE, &buf[offset..search_end]) {
                        let i = offset + pos;
                        if buf[i + 1] >= 0x80 {
                            state.remote_cpu.store(buf[i + 1] & 0x7F, Ordering::Relaxed);
                            break;
                        }
                        offset = i + 1;
                        if offset >= search_end { break; }
                    }
                }
                // Stage 3: keep the last 11 bytes of (carry + buf) for the next read.
                if n >= CARRY_MAX {
                    carry.copy_from_slice(&buf[n - CARRY_MAX..n]);
                    carry_len = CARRY_MAX;
                } else {
                    // Short read: slide the old carry down and append the new bytes.
                    let keep = carry_len.min(CARRY_MAX - n);
                    carry.copy_within(carry_len - keep..carry_len, 0);
                    carry[keep..keep + n].copy_from_slice(&buf[..n]);
                    carry_len = keep + n;
                }
            }
        }
    }
}
Alternative: memchr crate only
If we determine split messages are extremely rare and the current behavior is "good enough," simply replace the for loop with:
if n >= STATUS_MSG_SIZE { // guard: n - STATUS_MSG_SIZE would underflow on short reads
    if let Some(pos) = memchr::memchr(STATUS_MSG_TYPE, &buf[..n - STATUS_MSG_SIZE + 1]) {
        if buf[pos + 1] >= 0x80 { /* ... */ }
    }
}
This is a small change that moves the scan onto memchr's SIMD path. However, the ring buffer approach is strictly more correct and not much more complex.
Acceptance Criteria:
- TCP bidirectional tests pass.
- Remote CPU reporting still works.
- Status messages split across reads are correctly detected (unit test for this).
- memchr crate added to deps (very lightweight).
- No change to wire bytes or server behavior.
Effort: 4 hrs
Risk: Medium — hot path change; must be carefully reviewed and tested
Performance Impact: High (estimated). The byte-by-byte walk over up to 256KB per read becomes a SIMD scan; at 10K reads/sec the old loop inspects ~2.5GB per second one byte at a time.
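The carry logic can be exercised without a socket by pulling it into a pure helper. The sketch below is an assumption-laden illustration, not the project's scan_status_message(): `StatusScanner`, `scan_complete`, and `find_byte` are hypothetical names, and `find_byte` is a std-only stand-in for memchr::memchr so the block builds without extra crates. The detection rule (0x07 followed by a byte >= 0x80, with all 12 bytes received) follows the description above, and the first match wins.

```rust
const STATUS_MSG_SIZE: usize = 12;
const STATUS_MSG_TYPE: u8 = 0x07;
const CARRY_MAX: usize = STATUS_MSG_SIZE - 1;

/// Stand-in for `memchr::memchr` so the sketch needs only std.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Returns the CPU value of the first complete status message in `data`, if any.
fn scan_complete(data: &[u8]) -> Option<u8> {
    if data.len() < STATUS_MSG_SIZE {
        return None;
    }
    let search_end = data.len() - STATUS_MSG_SIZE + 1;
    let mut offset = 0;
    while let Some(pos) = find_byte(STATUS_MSG_TYPE, &data[offset..search_end]) {
        let i = offset + pos;
        if data[i + 1] >= 0x80 {
            return Some(data[i + 1] & 0x7F);
        }
        offset = i + 1;
    }
    None
}

/// Carries the last 11 bytes of the stream between reads.
struct StatusScanner {
    carry: [u8; CARRY_MAX],
    carry_len: usize,
}

impl StatusScanner {
    fn new() -> Self {
        Self { carry: [0; CARRY_MAX], carry_len: 0 }
    }

    /// Feeds one read's worth of bytes; returns a CPU value if a message completed.
    fn feed(&mut self, buf: &[u8]) -> Option<u8> {
        // Stage 1: carry + first (up to) 11 bytes of buf covers every message
        // that starts in the carry and ends in this read.
        let head = buf.len().min(CARRY_MAX);
        let mut window = [0u8; 2 * CARRY_MAX];
        window[..self.carry_len].copy_from_slice(&self.carry[..self.carry_len]);
        window[self.carry_len..self.carry_len + head].copy_from_slice(&buf[..head]);
        let split = scan_complete(&window[..self.carry_len + head]);

        // Stage 2: messages that lie entirely inside this read.
        let found = split.or_else(|| scan_complete(buf));

        // Stage 3: keep the last 11 bytes of (carry + buf).
        if buf.len() >= CARRY_MAX {
            self.carry.copy_from_slice(&buf[buf.len() - CARRY_MAX..]);
            self.carry_len = CARRY_MAX;
        } else {
            let keep = self.carry_len.min(CARRY_MAX - buf.len());
            self.carry.copy_within(self.carry_len - keep..self.carry_len, 0);
            self.carry[keep..keep + buf.len()].copy_from_slice(buf);
            self.carry_len = keep + buf.len();
        }
        found
    }
}
```

Because a candidate is only reported once all 12 bytes are present, and the carry never holds a start position whose message was already complete, a message is not reported twice across reads. Returning the value instead of storing it into shared state keeps the helper easy to unit test.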
PRD-013: SQLite Connection Pool / Channel-Based Writer
Background:
server_pro uses a single Arc<Mutex<Connection>>. All quota checks, usage recordings, and auth lookups serialize through one lock. remaining_budget() issues 15 queries, locking 15+ times. This is the primary scalability bottleneck for the pro server.
MikroTik Compatibility:
- 100% safe. Server-side infrastructure only. No protocol change.
Objective: Enable concurrent quota checks and usage recording without mutex contention.
Design — Option A: Connection Pool (Recommended for reads)
Use r2d2_sqlite or deadpool-sqlite:
- Open a pool of ~4-8 connections to the same SQLite file (WAL mode supports this).
- Read-only operations (remaining_budget, get_user, check_user) borrow a connection from the pool.
- Write operations (record_usage, record_session) also borrow from the pool (WAL allows concurrent readers + one writer).
Design — Option B: Channel-Based Writer (Recommended for writes)
- Keep one dedicated Connection owned by a single Tokio task.
- Expose an mpsc::channel where other tasks send write requests (RecordUsage { user, tx, rx }).
- The writer task batches or sequentially executes writes without any mutex.
- Reads use a separate read-only connection or pool.
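The single-owner writer pattern can be sketched with std alone. This is an illustration under stated assumptions: it uses std::sync::mpsc and an OS thread instead of a Tokio task, and an in-memory `UsageTable` stands in for the SQLite Connection. `spawn_writer` and the field types of `RecordUsage` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Hypothetical write request; field types are assumptions.
struct RecordUsage {
    user: String,
    tx: u64,
    rx: u64,
}

// Stand-in for the SQLite connection: total bytes per user.
type UsageTable = HashMap<String, u64>;

/// Starts the writer and returns the sending side plus a handle to collect the result.
fn spawn_writer() -> (mpsc::Sender<RecordUsage>, thread::JoinHandle<UsageTable>) {
    let (sender, receiver) = mpsc::channel::<RecordUsage>();
    let handle = thread::spawn(move || {
        // The table is owned by this thread only, so no mutex is needed.
        let mut table = UsageTable::new();
        // The loop ends once every Sender has been dropped.
        for req in receiver {
            *table.entry(req.user).or_insert(0) += req.tx + req.rx;
        }
        table
    });
    (sender, handle)
}
```

Producers clone the `Sender` and never block on a database lock; writes are applied in arrival order by one owner. With an unbounded channel, a slow writer shows up as queue growth rather than contention, which is worth monitoring under load.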
Hybrid Recommendation:
- Reads: Small connection pool (4 connections) for quota checks and auth lookups.
- Writes: Single dedicated async task with an mpsc::unbounded_channel for usage recording.
- Cache: Add a 5-second TTL cache for remaining_budget() results per user+IP to avoid redundant DB hits during test setup.
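A sketch of the TTL cache idea, keyed by user and IP. `BudgetCache` and `get_or_load` are hypothetical names, the budget is simplified to a single u64, and the current time is passed in so the expiry logic can be tested without sleeping. Cached values can be up to one TTL stale, so enforcement during a running test should still consult the database.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

// Hypothetical cache for remaining_budget() results.
struct BudgetCache {
    ttl: Duration,
    entries: HashMap<(String, IpAddr), (Instant, u64)>,
}

impl BudgetCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns the cached value while it is fresh; otherwise runs `load`
    /// (standing in for the DB query) and stores the result.
    fn get_or_load(
        &mut self,
        user: &str,
        ip: IpAddr,
        now: Instant,
        load: impl FnOnce() -> u64,
    ) -> u64 {
        let key = (user.to_string(), ip);
        if let Some((stored_at, value)) = self.entries.get(&key) {
            if now.duration_since(*stored_at) < self.ttl {
                return *value;
            }
        }
        let value = load();
        self.entries.insert(key, (now, value));
        value
    }
}
```

In the server this would sit behind whatever synchronization the quota path already uses; entries should also be dropped when usage is recorded for the same user and IP, otherwise a burst of test launches could read a budget that no longer exists.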
Acceptance Criteria:
- pro feature compiles and all tests pass.
- Concurrent test launches scale linearly up to at least 50 concurrent sessions.
- Quota enforcement remains correct (no over-quota usage).
- Session logging and interval recording remain accurate.
- No SQLite "database is locked" errors under load.
Effort: 1–2 days
Risk: High — touches every DB interaction in server_pro; potential for data races, quota leaks, or connection exhaustion
Performance Impact: High. Concurrent tests no longer serialize on a single connection lock, which removes the primary pro server bottleneck.
Execution Roadmap
Sprint 1: Quick Wins + Foundation (1 day)
- PRD-001: WCurve cache
- PRD-002: Instant::now() dedup
- PRD-003: hash_password hex fix
- PRD-004: CSV file handle cache
- PRD-005: Error string matching
- PRD-006: chrono date replacement
- PRD-007: Syslog optimization
- PRD-008: ip.to_string() cache
Deliverable: Low-risk PR with 8 clean commits. Run full integration tests.
Sprint 2: Platform & Async Fixes (1 day)
- PRD-009: FreeBSD CPU FFI
- PRD-010: Multi-conn Notify wake
- PRD-011: UDP timer reuse
Deliverable: PR with platform + latency improvements.
Sprint 3: Hot Path Optimization (1–2 days)
- PRD-012: TCP RX scan optimization
- Add unit test for split status messages
- Benchmark before/after with criterion (or manual throughput test)
Deliverable: PR with benchmark numbers proving improvement.
Sprint 4: Scalability (2–3 days)
- PRD-013: SQLite connection pool / channel writer
- Load test: 50 concurrent tests, verify no DB lock contention
- Add remaining_budget cache
Deliverable: PR with load test results.
Testing Requirements for All PRDs
Since no wire protocol changes are made, the existing integration test suite is the primary validation tool. However, for PRD-012 and PRD-013, additional tests are required:
New Tests to Add
- Split Status Message Unit Test (for PRD-012)
  #[test]
  fn test_status_message_split_across_reads() {
      // Feed first 5 bytes, then remaining 7 bytes
      // Assert CPU value is extracted correctly
  }
- Concurrent Quota Load Test (for PRD-013)
  #[tokio::test]
  async fn test_concurrent_quota_checks() {
      // Spawn 50 tasks doing remaining_budget() + record_usage()
      // Assert no panics, no SQLite locked errors
  }
- FreeBSD CPU Parity Test (for PRD-009): Manual verification on FreeBSD that FFI sysctl returns same values as command.
Appendix: MikroTik Compatibility Checklist
For every PRD, verify:
- No change to Command or StatusMessage struct layouts or serialization
- No change to MD5 challenge-response handshake order
- No change to EC-SRP5 handshake order or byte values
- No change to TCP packet sizes or UDP payload format
- No change to status injection timing (1-second interval)
- No change to NAT probe behavior
- Client can still authenticate against stock RouterOS btest server
- Server can still accept connections from stock RouterOS btest client
All PRDs in this document satisfy the above checklist by construction.