feat(p2p): Phase 5 — single-socket architecture (Nebula-style)

Before Phase 5, WarzonePhone used THREE separate UDP sockets per client:

1. Signal endpoint (register_signal, client-only)
2. Reflect probe endpoints (one fresh socket per relay probe)
3. Dual-path race endpoint (fresh per call setup)

This broke two things in production on port-preserving NATs (MikroTik masquerade, most consumer routers):

a. Phase 2 NAT detection was WRONG. Each probe used a fresh internal port, so MikroTik mapped each one to a different external port, and the classifier saw "different port per relay" and labeled it SymmetricPort. The real NAT was cone-like, but measurement via fresh sockets hid that.

b. Phase 3.5 dual-path P2P race was BROKEN. The reflex addr we advertised in DirectCallOffer was observed by the signal endpoint's socket. The actual dual-path race listened on a DIFFERENT fresh socket, on a different internal (and therefore external) port. Peers dialed the advertised addr and hit MikroTik's mapping for the signal socket, which forwarded to the signal endpoint, a client-only endpoint that doesn't accept incoming connections. Direct path silently failed, relay always won the race.

Nebula-style fix: one socket for everything. The signal endpoint is now dual-purpose (client + server_config), and both the reflect probes and the dual-path race reuse it instead of creating fresh ones. MikroTik's port-preservation then gives us a stable external port across all flows → classifier correctly sees Cone NAT → advertised reflex addr is the actual listening port → direct dials from peers land on the right socket → `endpoint.accept()` in the A-role branch of the dual-path race picks up the incoming connection.

## Changes

### `register_signal` (desktop/src-tauri/src/lib.rs)

- Endpoint now created with `Some(server_config())` instead of `None`. The socket can now accept incoming QUIC connections as well as dial outbound.
- Every code path that previously read `sig.endpoint` for the relay-dial reuse benefits automatically: the same socket is now ALSO listening for peer dials.

### `probe_reflect_addr` (wzp-client/src/reflect.rs)

- New `existing_endpoint: Option<Endpoint>` arg. `Some` reuses the caller's socket (production: pass the signal endpoint). `None` creates a fresh one (tests + pre-registration).
- Removed the `drop(endpoint)` at the end. It was correct for fresh endpoints (explicit early socket close) but incorrect for shared ones. End-of-scope drop does the right thing in both cases via Arc semantics.

### `detect_nat_type` (wzp-client/src/reflect.rs)

- New `shared_endpoint: Option<Endpoint>` arg, forwarded to every probe in the JoinSet fan-out. One shared socket means the classifier sees the true NAT type.

### `detect_nat_type` Tauri command (desktop/src-tauri/src/lib.rs)

- Reads `state.signal.endpoint` and passes it as the shared endpoint. Falls back to None when not registered. NAT detection now produces accurate classifications against MikroTik / most consumer NATs.

### `dual_path::race` (wzp-client/src/dual_path.rs)

- New `shared_endpoint: Option<Endpoint>` arg.
- A-role: when `Some`, reuses it for `accept()`. This is the critical change: the reflex addr advertised to peers is now the address listening for incoming direct dials.
- D-role: when `Some`, reuses it for the outbound direct dial. MikroTik keeps the same external port for the dial as for the signal flow → direct dial through a cone-mapped NAT.
- Relay path: also reuses the shared endpoint so MikroTik has a single consistent mapping across the whole call (saves one extra external port and makes firewall traces cleaner).
- When `None`, falls back to fresh per-role endpoints as before.

### `connect` Tauri command (desktop/src-tauri/src/lib.rs)

- Reads `state.signal.endpoint` once when acquiring own reflex addr and passes it through to `dual_path::race`.

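To make the measurement artifact from (a) concrete, here is a minimal sketch of a port-preserving NAT and a two-outcome classifier. Everything in it is a hypothetical stand-in (`PortPreservingNat`, `classify`, `Observed`, the port numbers, and the documentation-range IP). It is not the code in `reflect.rs`, and it assumes the NAT always preserves the source port, which a real router only does while the external port is free.

```rust
use std::collections::HashMap;
use std::net::{Ipv4Addr, SocketAddrV4};

/// Toy classifier outcome; only the two cases discussed above.
#[derive(Debug, PartialEq)]
enum Observed {
    ConeLike,
    SymmetricPort,
}

/// Hypothetical port-preserving NAT: an internal source port maps to the
/// same-numbered external port, independent of the destination relay.
struct PortPreservingNat {
    public_ip: Ipv4Addr,
    mappings: HashMap<u16, u16>,
}

impl PortPreservingNat {
    fn new(public_ip: Ipv4Addr) -> Self {
        Self { public_ip, mappings: HashMap::new() }
    }

    /// Reflexive address a relay would report for a flow from `internal_port`.
    fn reflex(&mut self, internal_port: u16) -> SocketAddrV4 {
        let external = *self.mappings.entry(internal_port).or_insert(internal_port);
        SocketAddrV4::new(self.public_ip, external)
    }
}

/// Same external port at every relay => cone-like; otherwise symmetric.
fn classify(observed: &[SocketAddrV4]) -> Observed {
    if observed.windows(2).all(|w| w[0].port() == w[1].port()) {
        Observed::ConeLike
    } else {
        Observed::SymmetricPort
    }
}

fn main() {
    let mut nat = PortPreservingNat::new(Ipv4Addr::new(203, 0, 113, 7));

    // Pre-Phase-5: each relay probe binds a fresh socket, so a fresh internal port.
    let fresh: Vec<SocketAddrV4> =
        [50001u16, 50002, 50003].iter().map(|&p| nat.reflex(p)).collect();
    assert_eq!(classify(&fresh), Observed::SymmetricPort);

    // Phase 5: all three probes source from the one shared socket.
    let shared: Vec<SocketAddrV4> = (0..3).map(|_| nat.reflex(40000)).collect();
    assert_eq!(classify(&shared), Observed::ConeLike);

    println!(
        "fresh sockets: {:?}, shared socket: {:?}",
        classify(&fresh),
        classify(&shared)
    );
}
```

The NAT model is identical in both runs; only the number of source sockets changes, and that alone flips the toy classification.
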
### Tests

- `wzp-client/tests/dual_path.rs` and `wzp-relay/tests/multi_reflect.rs` updated to pass `None` for the new endpoint arg. Tests use fresh sockets and that's fine because the loopback harness doesn't care about port-preserving NAT behavior.

Full workspace test: 423 passing (no regressions).

## Expected behavior after this commit on real hardware

Behind MikroTik + Starlink-bypass (the reporter's setup):

- Phase 2 NAT detect → **Cone NAT** (was SymmetricPort, a false positive from the measurement artifact)
- Phase 3.5 direct-P2P dial → succeeds for both cone-cone and cone-CGNAT cases where the remote side was previously blocked by our own socket mismatch
- LTE ↔ LTE cross-carrier → still likely relay fallback; that's genuinely strict symmetric and needs Phase 5.5 port prediction.

## Phase 5.5 (next, separate PRD)

Multi-candidate port prediction + ICE-style candidate aggregation for truly strict symmetric NATs. Not needed for the 95% case: Phase 5 alone fixes most consumer-router setups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 19:47:20 +04:00
parent 05ec926317
commit 1618ff6c9d
5 changed files with 183 additions and 38 deletions
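
The commit message relies on "Arc semantics" to justify dropping the explicit `drop(endpoint)`. The sketch below illustrates that reasoning with a hypothetical stand-in handle; `Socket`, `Endpoint`, `probe`, the `CLOSED` counter, and the port numbers are all assumptions made for illustration, not the `wzp_transport` or quinn types. It assumes only that the real endpoint is a cheap-clone, reference-counted handle, as the diff's own comment describes.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Counts how many stand-in sockets have been closed (dropped).
static CLOSED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for the UDP socket owned by an endpoint.
struct Socket {
    local_port: u16,
}

impl Drop for Socket {
    fn drop(&mut self) {
        CLOSED.fetch_add(1, Ordering::SeqCst);
    }
}

/// Cheap-clone handle: clones share one `Socket` through an `Arc`.
#[derive(Clone)]
struct Endpoint {
    inner: Arc<Socket>,
}

impl Endpoint {
    fn bind(local_port: u16) -> Self {
        Self { inner: Arc::new(Socket { local_port }) }
    }

    fn local_port(&self) -> u16 {
        self.inner.local_port
    }
}

/// Same shape as the new argument: reuse the caller's handle or bind fresh.
/// Returns the source port the "probe" used; no explicit `drop` is needed.
fn probe(existing: Option<Endpoint>) -> u16 {
    let endpoint = match existing {
        Some(ep) => ep,
        None => Endpoint::bind(50001), // placeholder for an ephemeral port
    };
    endpoint.local_port()
} // `endpoint` drops here: the last handle closes the socket, a clone does not.

fn main() {
    let signal = Endpoint::bind(40000);
    let before = CLOSED.load(Ordering::SeqCst);

    // Shared: the probe sources from the signal port and the socket survives.
    assert_eq!(probe(Some(signal.clone())), 40000);
    assert_eq!(CLOSED.load(Ordering::SeqCst), before);
    assert_eq!(Arc::strong_count(&signal.inner), 1);

    // Fresh: the probe owned the only handle, so its socket closes on return.
    assert_eq!(probe(None), 50001);
    assert_eq!(CLOSED.load(Ordering::SeqCst), before + 1);
}
```

Under that assumption, end-of-scope drop handles both cases: a fresh endpoint releases its socket when the probe returns, while a shared one only gives back its reference.
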
--- a/crates/wzp-client/src/reflect.rs
+++ b/crates/wzp-client/src/reflect.rs
@@ -67,22 +67,45 @@ pub enum NatType {
    Unknown,
 }

-/// Probe a single relay with a throwaway QUIC connection.
+/// Probe a single relay with a QUIC connection.
 ///
-/// Each call creates a fresh `quinn::Endpoint` so the OS hands out a
-/// fresh ephemeral source port — essential for NAT-type detection
-/// because a shared socket would produce the same mapping against
-/// every relay and mask symmetric NAT.
+/// # Endpoint reuse (Phase 5 — Nebula-style architecture)
+///
+/// If `existing_endpoint` is `Some`, the probe uses that socket
+/// instead of creating a fresh one. This is the desired mode in
+/// production: a port-preserving NAT (MikroTik masquerade, most
+/// consumer routers) gives a **stable** external port for the
+/// one socket, so the reflex addr observed by ANY relay is the
+/// SAME addr and matches what a peer would see on a direct dial.
+/// Pass the signal endpoint here.
+///
+/// If `None`, creates a fresh one-shot endpoint. Kept for:
+/// - tests that spin up isolated probes
+/// - the "I'm not registered yet" case where there's no signal
+///   endpoint to reuse
+///
+/// NOTE on NAT-type detection: the pre-Phase-5 behavior of
+/// forcing a fresh endpoint per probe was wrong — it made every
+/// port-preserving NAT look symmetric because the classifier saw
+/// a different external port for each fresh source port. With
+/// one shared socket, the classifier reflects the REAL NAT
+/// behavior.
 pub async fn probe_reflect_addr(
    relay: SocketAddr,
    timeout_ms: u64,
+    existing_endpoint: Option<wzp_transport::Endpoint>,
 ) -> Result<(SocketAddr, u32), String> {
    // Install rustls provider idempotently — a second install on the
    // same thread is a no-op.
    let _ = rustls::crypto::ring::default_provider().install_default();

-    let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
-    let endpoint = create_endpoint(bind, None).map_err(|e| format!("endpoint: {e}"))?;
+    let endpoint = match existing_endpoint {
+        Some(ep) => ep,
+        None => {
+            let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
+            create_endpoint(bind, None).map_err(|e| format!("endpoint: {e}"))?
+        }
+    };

    let start = Instant::now();
    let probe = async {
@@ -153,9 +176,10 @@ pub async fn probe_reflect_addr(
        .await
        .map_err(|_| format!("probe timeout ({timeout_ms}ms)"))??;

-    // Drop the endpoint explicitly AFTER the probe finishes so the
-    // UDP socket is released before we return.
-    drop(endpoint);
+    // `endpoint` is a quinn::Endpoint clone — an Arc under the
+    // hood. Letting it drop at end-of-scope is correct whether it
+    // was fresh (last ref → socket closes) or shared (ref count
+    // decrements, socket stays alive for the signal loop).
    Ok(out)
 }

@@ -163,17 +187,32 @@ pub async fn probe_reflect_addr(
 /// classifying the returned addresses. Never errors — failing
 /// probes surface via `NatProbeResult.error`; aggregate is always
 /// returned.
+///
+/// # Endpoint reuse (Phase 5)
+///
+/// If `shared_endpoint` is `Some`, every probe reuses it. This is
+/// the PRODUCTION behavior: all probes source from the same UDP
+/// port, so port-preserving NATs map them to the same external
+/// port, and the classifier reflects the real NAT type. Pass the
+/// signal endpoint.
+///
+/// If `None`, each probe creates its own fresh endpoint — useful
+/// in tests that don't have a signal endpoint, but produces
+/// spurious `SymmetricPort` classifications against NATs that
+/// would otherwise look cone-like.
 pub async fn detect_nat_type(
    relays: Vec<(String, SocketAddr)>,
    timeout_ms: u64,
+    shared_endpoint: Option<wzp_transport::Endpoint>,
 ) -> NatDetection {
    // Parallel probes via tokio::task::JoinSet so the wall-clock is
    // bounded by the slowest probe, not the sum. JoinSet keeps the
    // dep surface at just tokio — we already depend on it.
    let mut set = tokio::task::JoinSet::new();
    for (name, addr) in relays {
+        let ep = shared_endpoint.clone();
        set.spawn(async move {
-            let result = probe_reflect_addr(addr, timeout_ms).await;
+            let result = probe_reflect_addr(addr, timeout_ms, ep).await;
            (name, addr, result)
        });
    }