Performance Improvement PRDs

Project: btest-rs
Constraint: 100% MikroTik BTest protocol compatibility — no wire-format or behavioral changes visible to MikroTik devices
Date: 2026-04-30


How to Read This Document

The PRDs are sorted in recommended execution order, which balances:

  • Effort (development + review + test time)
  • Risk (probability of regression or compatibility break)
  • Performance Effect (measured or estimated throughput/latency improvement)
  • MikroTik Compatibility Risk (whether the change could affect interoperability)

Sorting rationale: Execute quick wins first to build velocity and reduce risk surface, then tackle high-impact items with full attention.


Summary Matrix

| # | PRD | Effort | Risk | Perf Impact | MikroTik Risk | Tier |
|---|-----|--------|------|-------------|---------------|------|
| 1 | WCurve Global Cache | 30 min | None | Medium | None | Quick Win |
| 2 | Redundant Instant::now() | 15 min | None | Low | None | Quick Win |
| 3 | hash_password Hex Fix | 30 min | None | Low | None | Quick Win |
| 4 | CSV File Handle Cache | 30 min | None | Low | None | Quick Win |
| 5 | Error String Matching | 30 min | None | Low | None | Quick Win |
| 6 | chrono_date_today Replace | 1 hr | Low | Low | None | Quick Win |
| 7 | Syslog Mutex + Timestamp | 1 hr | Low | Low | None | Quick Win |
| 8 | ip.to_string() Cache | 1 hr | Low | Low | None | Quick Win |
| 9 | FreeBSD CPU FFI | 3 hrs | Medium | Medium | None | Platform Fix |
| 10 | Multi-Conn Notify Wake | 2 hrs | Medium | Medium | None | Latency Fix |
| 11 | UDP Timer Reuse | 2 hrs | Medium | Medium | None | Throughput Fix |
| 12 | TCP RX Scan Optimization | 4 hrs | Medium | High | Low | Hot Path Fix |
| 13 | SQLite Connection Pool | 1–2 days | High | High | None | Scalability Fix |

Tier 1: Quick Wins (Do These First)


PRD-001: Cache WCurve in Global LazyLock

Background: WCurve::new() is called on every EC-SRP5 authentication (client and server). It recomputes the Weierstrass curve generator point via lift_x(9) → prime_mod_sqrt(), which performs heavy BigUint modular arithmetic. The result is deterministic and immutable.

MikroTik Compatibility:

  • 100% safe. This is pure internal mathematics. The wire bytes, auth handshake order, and hash outputs are identical. No protocol-visible change.

Objective: Eliminate redundant BigUint modular square root computation per authentication.

Design:

// src/ecsrp5.rs
static WCURVE: std::sync::LazyLock<WCurve> = std::sync::LazyLock::new(WCurve::new);

Replace all call sites:

  • src/ecsrp5.rs:363 (client_authenticate)
  • src/ecsrp5.rs:499 (server_authenticate)

Change let w = WCurve::new(); to let w = &*WCURVE;. Update any WCurve methods that take self to take &self if they don't already.
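
A minimal before/after sketch of the call-site change (surrounding context elided):

// Before: recomputed on every authentication
let w = WCurve::new();

// After: computed once on first deref, shared for the process lifetime
let w: &WCurve = &*WCURVE;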

Acceptance Criteria:

  • ecsrp5_test.rs passes unchanged.
  • full_integration_test.rs EC-SRP5 tests pass unchanged.
  • WCurve::new() is called exactly once per process lifetime.
  • No change to serialized auth bytes on the wire.

Effort: 30 min
Risk: None — stateless deterministic cache
Performance Impact: Medium — reduces per-auth CPU time by ~30-50% (estimated), especially noticeable under concurrent logins.


PRD-002: Deduplicate Instant::now() in tcp_tx_loop_inner

Background: The TCP TX loop calls Instant::now() twice per iteration (status check and interval scheduling). Monotonic clock reads are cheap but not free, and occur in the hottest loop in the system.

MikroTik Compatibility:

  • 100% safe. Timing granularity remains identical.

Objective: Reduce syscalls in the per-packet hot path.

Design:

let now = Instant::now();
if send_status && now >= next_status { ... next_status = now + Duration::from_secs(1); }
// ... reuse `now` for interval math

Acceptance Criteria:

  • TCP send/receive/both integration tests pass.
  • No behavioral change in status injection timing.

Effort: 15 min
Risk: None
Performance Impact: Low — micro-optimization, but trivial.


PRD-003: Fix hash_password() Hex Encoding Allocations

Background: user_db.rs:614 allocates one String per byte when hex-encoding a 32-byte SHA256 hash:

result.iter().map(|b| format!("{:02x}", b)).collect()

MikroTik Compatibility:

  • 100% safe. Output string is identical.

Objective: Replace N-allocation hex encoding with a single-allocation approach.

Design: Use the hex crate (already in the dependency tree via ecsrp5.rs debug logging), or write! each byte into a single String::with_capacity(64).
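
A minimal sketch of the write!-based variant (helper name hypothetical):

use std::fmt::Write;

fn hash_to_hex(digest: &[u8; 32]) -> String {
    // One allocation up front; write! appends with no further allocations
    let mut s = String::with_capacity(64);
    for b in digest {
        let _ = write!(s, "{:02x}", b);
    }
    s
}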

Acceptance Criteria:

  • Same hex string output for all inputs.
  • pro feature tests pass.

Effort: 30 min
Risk: None
Performance Impact: Low — removes 32 allocations per password hash.


PRD-004: Cache CSV File Handle

Background: csv_output::write_result() re-opens the file via OpenOptions::new().append(true).open(path) on every call (once per test). Safe but wasteful.

MikroTik Compatibility:

  • 100% safe. No protocol involvement.

Objective: Hold the file handle open for the process lifetime.

Design: Change static CSV_FILE: Mutex<Option<String>> to Mutex<Option<(String, std::fs::File)>>, or open once during init() and store Mutex<Option<File>>.
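
A minimal sketch of the open-once variant (init/write_result signatures assumed from the PRD, details hypothetical):

use std::fs::{File, OpenOptions};
use std::io::Write;
use std::sync::Mutex;

static CSV_FILE: Mutex<Option<File>> = Mutex::new(None);

fn init(path: &str) -> std::io::Result<()> {
    // Open (and create) once; write headers here if the file is new
    let file = OpenOptions::new().create(true).append(true).open(path)?;
    *CSV_FILE.lock().unwrap() = Some(file);
    Ok(())
}

fn write_result(line: &str) -> std::io::Result<()> {
    // Reuse the cached handle: no per-test open() syscall
    if let Some(f) = CSV_FILE.lock().unwrap().as_mut() {
        writeln!(f, "{}", line)?;
    }
    Ok(())
}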

Acceptance Criteria:

  • CSV tests in full_integration_test.rs pass.
  • File is created with headers on init().
  • Multiple write_result calls append correctly.

Effort: 30 min
Risk: None
Performance Impact: Low — removes one open() syscall per test.


PRD-005: Remove Allocating Error String Matching

Background: src/server_pro/enforcer.rs:157-161 does:

match format!("{}", e).as_str() {
    s if s.contains("daily") => ...
}

This allocates a String from the error just for substring matching.

MikroTik Compatibility:

  • 100% safe. Server-pro internal logic only.

Objective: Match without allocation.

Design: Use e.to_string().contains("daily") (still allocates but clearer) or, better, downcast the rusqlite::Error or match on structured error variants. If the error is anyhow::Error, use .downcast_ref::<rusqlite::Error>().
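
A hedged sketch of the structured-variant option, assuming the quota layer is refactored to return a dedicated error enum (all names hypothetical):

enum QuotaError {
    DailyExceeded,
    TotalExceeded,
    Db(rusqlite::Error),
}

fn handle_enforcement(result: Result<(), QuotaError>) {
    // Match on variants: no String allocation, no substring search
    match result {
        Err(QuotaError::DailyExceeded) => { /* disconnect: daily quota hit */ }
        Err(QuotaError::TotalExceeded) => { /* disconnect: total quota hit */ }
        Err(QuotaError::Db(_e)) => { /* log and continue */ }
        Ok(()) => {}
    }
}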

Acceptance Criteria:

  • Quota enforcement behavior unchanged.
  • Enforcer tests pass.

Effort: 30 min
Risk: None
Performance Impact: Low — removes one allocation per enforcer tick.


PRD-006: Replace chrono_date_today() with chrono Crate

Background: user_db.rs:617-638 contains a hand-rolled Gregorian calendar converter that loops from 1970 to compute today's date. Called before almost every DB write. The chrono crate is already pulled in transitively by rusqlite.

MikroTik Compatibility:

  • 100% safe. No protocol involvement.

Objective: Replace 30 lines of error-prone manual date math with one chrono call.

Design: Add chrono = { version = "0.4", optional = true } gated behind pro feature (or use the transitive dep directly). Replace chrono_date_today() with:

chrono::Local::now().format("%Y-%m-%d").to_string()

Acceptance Criteria:

  • pro feature compiles.
  • Date strings match format YYYY-MM-DD.
  • DB write tests pass.

Effort: 1 hr
Risk: Low — adds explicit dep that already exists transitively
Performance Impact: Low — eliminates loop overhead, but called infrequently.


PRD-007: Optimize Syslog Mutex and Timestamp Formatting

Background: syslog_logger.rs holds a global std::sync::Mutex while formatting a timestamp (manual calendar math) and sending UDP. std::sync::Mutex is relatively slow, and the timestamp logic duplicates chrono_date_today() issues.

MikroTik Compatibility:

  • 100% safe. No protocol involvement.

Objective: Reduce lock contention and allocation in logging path.

Design:

  1. Use parking_lot::Mutex (faster, no poisoning), or keep std::sync::Mutex but clone the SyslogSender config outside the lock (see the sketch after this list).
  2. Replace bsd_timestamp() with chrono::Local::now().format("%b %e %H:%M:%S").
  3. Pre-allocate the String with with_capacity(256).
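
A minimal sketch combining the three points (SyslogTarget and its fields are hypothetical; the real SyslogSender lives in syslog_logger.rs):

use std::fmt::Write;
use std::net::{SocketAddr, UdpSocket};
use std::sync::Mutex;

#[derive(Clone)]
struct SyslogTarget {
    server: SocketAddr,
    hostname: String,
}

static TARGET: Mutex<Option<SyslogTarget>> = Mutex::new(None);

fn log(socket: &UdpSocket, pri: u8, msg: &str) {
    // Snapshot the config under the lock, then release it before
    // formatting or sending anything
    let t = match TARGET.lock().unwrap().clone() {
        Some(t) => t,
        None => return,
    };
    let mut line = String::with_capacity(256); // point 3: pre-allocate
    let _ = write!(
        line,
        "<{}>{} {} btest: {}",
        pri,
        chrono::Local::now().format("%b %e %H:%M:%S"), // point 2: chrono timestamp
        t.hostname,
        msg,
    );
    let _ = socket.send_to(line.as_bytes(), t.server);
}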

Acceptance Criteria:

  • Syslog output format remains RFC 3164 compliant.
  • test_syslog_events in full_integration_test.rs passes.

Effort: 1 hr
Risk: Low
Performance Impact: Low — logging is not a hot path, but reduces global lock hold time.


PRD-008: Cache ip.to_string() in Quota Checks

Background: quota.rs:389 calls ip.to_string() and then passes &ip_str to multiple DB methods, allocating a new String on every remaining_budget() call.

MikroTik Compatibility:

  • 100% safe. Server-pro internal logic.

Objective: Eliminate redundant IP stringification.

Design: Change DB methods to accept &std::net::IpAddr directly and stringify inside only when needed for SQL parameter binding (which rusqlite may already handle via ToSql). Alternatively, pass ip_str: &str from a single to_string() call and avoid re-stringifying in sub-calls.
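
A minimal before/after sketch of the single-stringify variant (DB method names hypothetical):

// Before: each sub-call allocates its own String
// let daily = db.daily_usage(&ip.to_string())?;
// let total = db.total_usage(&ip.to_string())?;

// After: stringify once, pass &str down to every sub-call
let ip_str = ip.to_string();
let daily = db.daily_usage(&ip_str)?;
let total = db.total_usage(&ip_str)?;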

Acceptance Criteria:

  • Quota checks return identical results.
  • pro feature tests pass.

Effort: 1 hr
Risk: Low
Performance Impact: Low — one allocation removed per quota check.


Tier 2: Moderate Fixes (Platform & Latency)


PRD-009: FreeBSD CPU Sampling via libc::sysctl FFI

Background: On FreeBSD, cpu.rs spawns sysctl -n kern.cp_time as a child process every second. fork() + exec() is orders of magnitude slower than a direct syscall.

MikroTik Compatibility:

  • 100% safe. No protocol involvement. Platform-specific internal code.

Objective: Replace subprocess with direct sysctl(3) syscall.

Design:

#[cfg(target_os = "freebsd")]
fn get_cpu_times() -> (u64, u64) {
    // kern.cp_time is a dynamic OID on FreeBSD (there is no static KERN_*
    // MIB constant for it), so resolve it by name via sysctlbyname(3).
    // Indexes: CP_USER=0, CP_NICE=1, CP_SYS=2, CP_INTR=3, CP_IDLE=4.
    let mut cp_time: [libc::c_ulong; 5] = [0; 5];
    let mut len = std::mem::size_of_val(&cp_time);
    unsafe {
        if libc::sysctlbyname(
            b"kern.cp_time\0".as_ptr() as *const libc::c_char,
            cp_time.as_mut_ptr() as *mut libc::c_void,
            &mut len,
            std::ptr::null(),
            0,
        ) == 0
        {
            let total: u64 = cp_time.iter().map(|&t| t as u64).sum();
            return (total, cp_time[4] as u64); // (total, idle)
        }
    }
    (0, 0)
}

Acceptance Criteria:

  • Compiles on FreeBSD.
  • Returns same values as previous sysctl command approach.
  • No child process spawned (verify with ktrace or ps).

Effort: 3 hrs
Risk: Medium — requires FreeBSD test environment; FFI is unsafe
Performance Impact: Medium — eliminates 1 fork/exec per second on FreeBSD.


PRD-010: Replace 100ms Poll with tokio::sync::Notify

Background: In server.rs:313-332, the primary connection of a multi-connection TCP test busy-polls the session map every 100ms waiting for secondary connections to join.

MikroTik Compatibility:

  • 100% safe. This is internal server-side coordination. The wire behavior (waiting for connections, then starting the test) is unchanged. MikroTik clients will not observe a difference except potentially faster test startup.

Objective: Eliminate polling latency and unnecessary mutex acquisitions.

Design:

  1. Add a tokio::sync::Notify to TcpSession:
struct TcpSession {
    peer_ip: IpAddr,
    streams: Vec<OwnedTcpStream>,
    expected: u8,
    notify: tokio::sync::Notify,
}
  2. In the secondary connection handler, after pushing to streams, call session.notify.notify_one().
  3. In the primary wait loop, replace the sleep loop with:
// Fast path: all secondary connections have already joined
let count = { /* lock session map, read stream count, drop lock */ };
if count + 1 < conn_count {
    // Wait for notifications, with a 10s overall deadline
    let timeout = tokio::time::sleep(Duration::from_secs(10));
    tokio::pin!(timeout);

    loop {
        tokio::select! {
            _ = session.notify.notified() => {
                let count = { /* lock, read stream count, drop lock */ };
                if count + 1 >= conn_count { break; }
            }
            _ = &mut timeout => { break; }
        }
    }
}

Because Notify::notify_one() stores a permit when no task is waiting, a wake that lands between the count check and notified() is not lost.

Acceptance Criteria:

  • Multi-connection TCP tests pass.
  • Test startup latency is ≤ 1ms after last connection joins (was up to 100ms).
  • No deadlock under concurrent multi-connection tests.

Effort: 2 hrs
Risk: Medium — concurrency change; must carefully manage lock/notify ordering to avoid races
Performance Impact: Medium — improves multi-conn test startup latency by up to 100ms per test.


PRD-011: Reuse UDP RX Timer Instead of Per-Call Timeout

Background: Both client and server UDP RX loops create a new tokio::time::timeout on every recv/recv_from call:

tokio::time::timeout(Duration::from_secs(5), socket.recv(&mut buf)).await

At high packet rates, this registers and cancels timers on Tokio's timer wheel constantly.

MikroTik Compatibility:

  • 100% safe. Internal async timing only. UDP packet processing is unchanged.

Objective: Reduce timer wheel churn in high-rate UDP RX loops.

Design: Option A — tokio::select! with a pinned sleep future:

let mut timeout = tokio::time::sleep(Duration::from_secs(5));
tokio::pin!(timeout);

loop {
    tokio::select! {
        biased; // prioritize recv
        res = socket.recv(&mut buf) => { /* handle */ timeout.as_mut().reset(tokio::time::Instant::now() + Duration::from_secs(5)); }
        _ = &mut timeout => { tracing::debug!("UDP RX timeout"); }
    }
}

Option B — Use socket2 to set SO_RCVTIMEO on the underlying socket and do blocking recvs on a dedicated thread, removing Tokio timeouts entirely. This moves timeout handling into the kernel, which is even cheaper. Note, however, that SO_RCVTIMEO only affects blocking reads; Tokio's non-blocking sockets ignore it, so this option means restructuring the RX loop around a blocking socket.

Recommendation: Start with Option A (pure Tokio, no platform risk). Option B can be a follow-up.

Acceptance Criteria:

  • UDP send/receive/both tests pass.
  • UDP RX still times out correctly when no packets arrive.
  • No change to packet parsing or sequence tracking.

Effort: 2 hrs
Risk: Medium — changes timeout behavior; must ensure test abortion still works correctly
Performance Impact: Medium — reduces timer wheel registration overhead, noticeable at >50K pps.


Tier 3: High Impact (Do These With Full Focus)


PRD-012: Optimize TCP Client RX Status Message Scan

Background: tcp_client_rx_loop (client.rs:210-216) scans up to 256KB byte-by-byte on every read() call looking for a 12-byte status marker (0x07 + 0x80|cpu). Since data is all zeros, this is almost always a full scan.

MikroTik Compatibility Consideration:

  • High confidence of safety. The protocol is: MikroTik injects 12-byte status messages into the TCP stream. Our client must detect them. Changing how we detect them (faster scan) does not change:
    • What bytes are sent on the wire
    • What bytes we expect
    • How we respond to status messages
  • One edge case to handle: TCP is a stream. A status message may be split across two read() calls. The current code does not handle this correctly (it scans each buffer independently). The optimized version should handle split messages to be strictly more correct than the current implementation.

Objective: Replace O(n) byte-by-byte scan with SIMD-accelerated or state-machine-based detection, while correctly handling split messages.

Design — Recommended: Carry Buffer Approach

Since status messages are 12 bytes and all other bytes are zeros, maintain an 11-byte carry buffer of trailing bytes across reads, and check every alignment where a message could span the read boundary:

const STATUS_MSG_SIZE: usize = 12;

// Assumed in scope (as elsewhere in client.rs): STATUS_MSG_TYPE == 0x07,
// tokio::io::AsyncReadExt, OwnedReadHalf, Arc, Ordering, BandwidthState.
async fn tcp_client_rx_loop(mut reader: OwnedReadHalf, state: Arc<BandwidthState>) {
    let mut buf = vec![0u8; 256 * 1024];
    let mut carry = [0u8; STATUS_MSG_SIZE - 1]; // up to 11 bytes from previous read(s)
    let mut carry_len = 0usize;

    while state.running.load(Ordering::Relaxed) {
        match reader.read(&mut buf).await {
            Ok(0) | Err(_) => break,
            Ok(n) => {
                state.rx_bytes.fetch_add(n as u64, Ordering::Relaxed);

                // 1. Split-message check: a status message may start at any
                //    offset inside the carry and finish in this buffer.
                for k in 0..carry_len {
                    if carry[k] != STATUS_MSG_TYPE {
                        continue;
                    }
                    let from_carry = carry_len - k;
                    let needed = STATUS_MSG_SIZE - from_carry;
                    if n < needed {
                        break; // later alignments need even more new bytes
                    }
                    let mut candidate = [0u8; STATUS_MSG_SIZE];
                    candidate[..from_carry].copy_from_slice(&carry[k..carry_len]);
                    candidate[from_carry..].copy_from_slice(&buf[..needed]);
                    if candidate[1] >= 0x80 {
                        state.remote_cpu.store(candidate[1] & 0x7F, Ordering::Relaxed);
                        break;
                    }
                }

                // 2. Fast scan within buf: data bytes are all zeros, so
                //    memchr (SIMD) finds 0x07 candidates quickly.
                if n >= STATUS_MSG_SIZE {
                    let search_end = n - STATUS_MSG_SIZE + 1;
                    let mut offset = 0;
                    while let Some(pos) = memchr::memchr(STATUS_MSG_TYPE, &buf[offset..search_end]) {
                        let i = offset + pos;
                        if buf[i + 1] >= 0x80 {
                            state.remote_cpu.store(buf[i + 1] & 0x7F, Ordering::Relaxed);
                            break;
                        }
                        offset = i + 1;
                        if offset >= search_end {
                            break;
                        }
                    }
                }

                // 3. Carry save: keep the last 11 bytes seen, topping up with
                //    old carry bytes on short reads so multi-read splits survive.
                if n >= STATUS_MSG_SIZE - 1 {
                    carry_len = STATUS_MSG_SIZE - 1;
                    carry.copy_from_slice(&buf[n - carry_len..n]);
                } else {
                    let keep_old = (STATUS_MSG_SIZE - 1 - n).min(carry_len);
                    carry.copy_within(carry_len - keep_old..carry_len, 0);
                    carry[keep_old..keep_old + n].copy_from_slice(&buf[..n]);
                    carry_len = keep_old + n;
                }
            }
        }
    }
}

Alternative: memchr crate only. If we determine split messages are extremely rare and the current behavior is "good enough," simply replace the for loop with:

if n >= STATUS_MSG_SIZE { // guard: the slice below would underflow on short reads
    if let Some(pos) = memchr::memchr(STATUS_MSG_TYPE, &buf[..n - STATUS_MSG_SIZE + 1]) {
        if buf[pos + 1] >= 0x80 { /* ... */ }
    }
}

This is a handful-of-lines change with a massive speedup (SIMD scan). However, the carry buffer approach is strictly more correct and not much more complex.

Acceptance Criteria:

  • TCP bidirectional tests pass.
  • Remote CPU reporting still works.
  • Status messages split across reads are correctly detected (unit test for this).
  • memchr crate added to deps (very lightweight).
  • No change to wire bytes or server behavior.

Effort: 4 hrs
Risk: Medium — hot path change; must be carefully reviewed and tested
Performance Impact: High — eliminates 256KB byte scan per read. At 10K reads/sec, saves ~2.5GB of memory scanning per second.


PRD-013: SQLite Connection Pool / Channel-Based Writer

Background: server_pro uses a single Arc<Mutex<Connection>>. All quota checks, usage recordings, and auth lookups serialize through one lock. remaining_budget() issues 15 queries, locking 15+ times. This is the primary scalability bottleneck for the pro server.

MikroTik Compatibility:

  • 100% safe. Server-side infrastructure only. No protocol change.

Objective: Enable concurrent quota checks and usage recording without mutex contention.

Design — Option A: Connection Pool (Recommended for reads). Use r2d2_sqlite or deadpool-sqlite:

  1. Open a pool of ~4-8 connections to the same SQLite file (WAL mode supports this).
  2. Read-only operations (remaining_budget, get_user, check_user) borrow a connection from the pool.
  3. Write operations (record_usage, record_session) also borrow from the pool (WAL allows concurrent readers + one writer).

Design — Option B: Channel-Based Writer (Recommended for writes; sketched after this list)

  1. Keep one dedicated Connection owned by a single Tokio task.
  2. Expose an mpsc::channel where other tasks send write requests (RecordUsage { user, tx, rx }).
  3. The writer task batches or sequentially executes writes without any mutex.
  4. Reads use a separate read-only connection or pool.
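
A minimal sketch of the Option B writer task (request enum, table, and column names hypothetical):

use tokio::sync::mpsc;

enum WriteRequest {
    RecordUsage { user: String, tx_bytes: u64, rx_bytes: u64 },
}

fn spawn_db_writer(conn: rusqlite::Connection) -> mpsc::UnboundedSender<WriteRequest> {
    let (tx, mut rx) = mpsc::unbounded_channel();
    // The spawned task is the sole owner of the write connection:
    // writes serialize through the channel, so no Mutex is ever taken.
    tokio::task::spawn_blocking(move || {
        while let Some(req) = rx.blocking_recv() {
            match req {
                WriteRequest::RecordUsage { user, tx_bytes, rx_bytes } => {
                    let _ = conn.execute(
                        "UPDATE usage SET tx = tx + ?1, rx = rx + ?2 WHERE user = ?3",
                        rusqlite::params![tx_bytes as i64, rx_bytes as i64, user],
                    );
                }
            }
        }
    });
    tx
}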

Hybrid Recommendation:

  • Reads: Small connection pool (4 connections) for quota checks and auth lookups.
  • Writes: Single dedicated async task with an mpsc::unbounded_channel for usage recording.
  • Cache: Add a 5-second TTL cache for remaining_budget() results per user+IP to avoid redundant DB hits during test setup (sketched below).
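
A minimal sketch of the TTL cache (names hypothetical; the real remaining_budget() lives in quota.rs):

use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::{LazyLock, Mutex};
use std::time::{Duration, Instant};

static BUDGET_CACHE: LazyLock<Mutex<HashMap<(String, IpAddr), (u64, Instant)>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

fn cached_remaining_budget(user: &str, ip: IpAddr) -> u64 {
    let key = (user.to_string(), ip);
    if let Some(&(budget, at)) = BUDGET_CACHE.lock().unwrap().get(&key) {
        if at.elapsed() < Duration::from_secs(5) {
            return budget; // fresh enough: skip the DB round trip
        }
    }
    let budget = remaining_budget_db(user, ip);
    BUDGET_CACHE.lock().unwrap().insert(key, (budget, Instant::now()));
    budget
}

// Stub standing in for the real DB-backed query
fn remaining_budget_db(_user: &str, _ip: IpAddr) -> u64 {
    0
}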

Acceptance Criteria:

  • pro feature compiles and all tests pass.
  • Concurrent test launches scale linearly up to at least 50 concurrent sessions.
  • Quota enforcement remains correct (no over-quota usage).
  • Session logging and interval recording remain accurate.
  • No SQLite "database is locked" errors under load.

Effort: 1–2 days
Risk: High — touches every DB interaction in server_pro; potential for data races, quota leaks, or connection exhaustion
Performance Impact: High — enables horizontal scaling of concurrent tests; removes the primary pro server bottleneck.


Execution Roadmap

Sprint 1: Quick Wins + Foundation (1 day)

  • PRD-001: WCurve cache
  • PRD-002: Instant::now() dedup
  • PRD-003: hash_password hex fix
  • PRD-004: CSV file handle cache
  • PRD-005: Error string matching
  • PRD-006: chrono date replacement
  • PRD-007: Syslog optimization
  • PRD-008: ip.to_string() cache

Deliverable: Low-risk PR with 8 clean commits. Run full integration tests.

Sprint 2: Platform & Async Fixes (1 day)

  • PRD-009: FreeBSD CPU FFI
  • PRD-010: Multi-conn Notify wake
  • PRD-011: UDP timer reuse

Deliverable: PR with platform + latency improvements.

Sprint 3: Hot Path Optimization (1–2 days)

  • PRD-012: TCP RX scan optimization
  • Add unit test for split status messages
  • Benchmark before/after with criterion (or manual throughput test)

Deliverable: PR with benchmark numbers proving improvement.

Sprint 4: Scalability (2–3 days)

  • PRD-013: SQLite connection pool / channel writer
  • Load test: 50 concurrent tests, verify no DB lock contention
  • Add remaining_budget cache

Deliverable: PR with load test results.


Testing Requirements for All PRDs

Since no wire protocol changes are made, the existing integration test suite is the primary validation tool. However, for PRD-012 and PRD-013, additional tests are required:

New Tests to Add

  1. Split Status Message Unit Test (for PRD-012)

    #[test]
    fn test_status_message_split_across_reads() {
        // Feed first 5 bytes, then remaining 7 bytes
        // Assert CPU value is extracted correctly
    }
    
  2. Concurrent Quota Load Test (for PRD-013)

    #[tokio::test]
    async fn test_concurrent_quota_checks() {
        // Spawn 50 tasks doing remaining_budget() + record_usage()
        // Assert no panics, no SQLite locked errors
    }
    
  3. FreeBSD CPU Parity Test (for PRD-009) Manual verification on FreeBSD that FFI sysctl returns same values as command.


Appendix: MikroTik Compatibility Checklist

For every PRD, verify:

  • No change to Command or StatusMessage struct layouts or serialization
  • No change to MD5 challenge-response handshake order
  • No change to EC-SRP5 handshake order or byte values
  • No change to TCP packet sizes or UDP payload format
  • No change to status injection timing (1-second interval)
  • No change to NAT probe behavior
  • Client can still authenticate against stock RouterOS btest server
  • Server can still accept connections from stock RouterOS btest client

All PRDs in this document satisfy the above checklist by construction.