feat(p2p): Phase 5 — single-socket architecture (Nebula-style)

Before Phase 5, WarzonePhone used THREE separate UDP sockets per
client:

  1. Signal endpoint         (register_signal, client-only)
  2. Reflect probe endpoints (one fresh socket per relay probe)
  3. Dual-path race endpoint (fresh per call setup)

This broke two things in production on port-preserving NATs
(MikroTik masquerade, most consumer routers):

  a. Phase 2 NAT detection was WRONG. Each probe used a fresh
     internal port, so MikroTik mapped each one to a different
     external port, and the classifier saw "different port per
     relay" and labeled it SymmetricPort. The real NAT was
     cone-like but measurement via fresh sockets hid that.

  b. Phase 3.5 dual-path P2P race was BROKEN. The reflex addr
     we advertised in DirectCallOffer was observed by the signal
     endpoint's socket. The actual dual-path race listened on a
     DIFFERENT fresh socket, on a different internal (and
     therefore external) port. Peers dialed the advertised addr
     and hit MikroTik's mapping for the signal socket, which
     forwarded to the signal endpoint — a client-only endpoint
     that doesn't accept incoming connections. The direct path
     silently failed, so the relay always won the race.
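
The measurement artifact in (a) can be reproduced with a toy model. This is a minimal sketch, not the project's classifier: `classify`, `port_preserving_nat`, the `Unknown` variant and the port numbers are hypothetical, and the only rule modelled is "same external port at every relay means Cone, otherwise SymmetricPort".

```rust
use std::collections::HashSet;
use std::net::{IpAddr, SocketAddr};

/// Labels borrowed from the commit message; `Unknown` is an assumption.
#[derive(Debug, PartialEq, Eq)]
enum NatType {
    Cone,
    SymmetricPort,
    Unknown,
}

/// Hypothetical classifier: compare the external ports that several
/// relays reported back for our probes.
fn classify(reflex_addrs: &[SocketAddr]) -> NatType {
    if reflex_addrs.len() < 2 {
        return NatType::Unknown; // one sample cannot distinguish anything
    }
    let ports: HashSet<u16> = reflex_addrs.iter().map(|a| a.port()).collect();
    if ports.len() == 1 {
        NatType::Cone
    } else {
        NatType::SymmetricPort
    }
}

/// Toy port-preserving NAT: the external port equals the internal
/// source port of the socket that sent the probe.
fn port_preserving_nat(public_ip: IpAddr, internal_port: u16) -> SocketAddr {
    SocketAddr::new(public_ip, internal_port)
}
```

In this model, three probes sent from three fresh sockets (say internal ports 50001, 50002 and 50003) classify as `SymmetricPort`, while three probes sent from one socket on port 50001 classify as `Cone`, even though the NAT itself behaves identically in both runs.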

Nebula-style fix: one socket for everything. The signal endpoint
is now dual-purpose (client + server_config), and both the
reflect probes and the dual-path race reuse it instead of
creating fresh ones. MikroTik's port-preservation then gives us
a stable external port across all flows → classifier correctly
sees Cone NAT → advertised reflex addr is the actual listening
port → direct dials from peers land on the right socket →
`endpoint.accept()` in the A-role branch of the dual-path race
picks up the incoming connection.
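
The source-port effect behind this fix can be observed on loopback with plain UDP. The sketch below is an illustration under assumptions, not project code: all function names are made up, there is no NAT involved, and the local source port stands in for what a port-preserving NAT would expose externally.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

/// Bind a UDP socket on loopback with an OS-assigned port.
fn bind_loopback() -> io::Result<UdpSocket> {
    let sock = UdpSocket::bind("127.0.0.1:0")?;
    // Avoid hanging forever if a datagram is lost.
    sock.set_read_timeout(Some(Duration::from_secs(2)))?;
    Ok(sock)
}

/// Send one datagram from `src` to `dst`; return the source address `dst` saw.
fn observed_source(src: &UdpSocket, dst: &UdpSocket) -> io::Result<SocketAddr> {
    src.send_to(b"probe", dst.local_addr()?)?;
    let mut buf = [0u8; 16];
    let (_len, from) = dst.recv_from(&mut buf)?;
    Ok(from)
}

/// Ports seen by two stand-in relays when each probe uses a fresh socket.
fn ports_with_fresh_sockets() -> io::Result<(u16, u16)> {
    let (relay_a, relay_b) = (bind_loopback()?, bind_loopback()?);
    let (probe_a, probe_b) = (bind_loopback()?, bind_loopback()?);
    Ok((
        observed_source(&probe_a, &relay_a)?.port(),
        observed_source(&probe_b, &relay_b)?.port(),
    ))
}

/// Ports seen by two stand-in relays when both probes share one socket.
fn ports_with_shared_socket() -> io::Result<(u16, u16)> {
    let (relay_a, relay_b) = (bind_loopback()?, bind_loopback()?);
    let shared = bind_loopback()?;
    Ok((
        observed_source(&shared, &relay_a)?.port(),
        observed_source(&shared, &relay_b)?.port(),
    ))
}
```

`ports_with_shared_socket` returns the same port twice, while `ports_with_fresh_sockets` returns two different ports because both probe sockets are bound at the same time. A port-preserving NAT carries exactly this difference to the outside.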

## Changes

### `register_signal` (desktop/src-tauri/src/lib.rs)
- Endpoint now created with `Some(server_config())` instead of
  `None`. The socket can now accept incoming QUIC connections as
  well as dial outbound.
- Every code path that already reads `sig.endpoint` to reuse it
  for relay dials benefits automatically: the same socket now
  ALSO listens for peer dials.

### `probe_reflect_addr` (wzp-client/src/reflect.rs)
- New `existing_endpoint: Option<Endpoint>` arg. `Some` reuses
  the caller's socket (production: pass the signal endpoint).
  `None` creates a fresh one (tests + pre-registration).
- Removed the `drop(endpoint)` at the end — was correct for
  fresh endpoints (explicit early socket close) but incorrect
  for shared ones. End-of-scope drop does the right thing in
  both cases via Arc semantics.
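
A sketch of the reuse-or-create pattern and why no explicit `drop` is needed. The `Endpoint` and `Socket` types here are stand-ins that assume the real endpoint handle is a cheap, reference-counted clone; `probe` and its flag parameter are hypothetical and exist only to make the socket's lifetime observable.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for the underlying socket: records when it is closed.
struct Socket {
    closed: Arc<AtomicBool>,
}

impl Drop for Socket {
    fn drop(&mut self) {
        self.closed.store(true, Ordering::SeqCst);
    }
}

/// Stand-in for a cloneable endpoint handle (assumed Arc-backed).
#[derive(Clone)]
struct Endpoint {
    inner: Arc<Socket>,
}

impl Endpoint {
    fn fresh(closed: Arc<AtomicBool>) -> Endpoint {
        Endpoint { inner: Arc::new(Socket { closed }) }
    }
}

/// Hypothetical probe: reuse the caller's endpoint when given, otherwise
/// create one. Returns how many handles to the socket were alive during
/// the probe. The local handle is released when the function returns.
fn probe(existing: Option<Endpoint>, fresh_flag: Arc<AtomicBool>) -> usize {
    let endpoint = existing.unwrap_or_else(|| Endpoint::fresh(fresh_flag));
    // ... a real probe would send and receive on `endpoint` here ...
    Arc::strong_count(&endpoint.inner)
}
```

With `Some(signal.clone())` the probe sees two live handles and the shared socket stays open after it returns, because the caller still holds one. With `None` the probe holds the only handle, so the fresh socket closes as soon as `probe` returns.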

### `detect_nat_type` (wzp-client/src/reflect.rs)
- New `shared_endpoint: Option<Endpoint>` arg, forwarded to
  every probe in the JoinSet fan-out. One shared socket means
  the classifier sees the true NAT type.

### `detect_nat_type` Tauri command (desktop/src-tauri/src/lib.rs)
- Reads `state.signal.endpoint` and passes it as the shared
  endpoint. Falls back to None when not registered. NAT detection
  now produces accurate classifications against MikroTik / most
  consumer NATs.

### `dual_path::race` (wzp-client/src/dual_path.rs)
- New `shared_endpoint: Option<Endpoint>` arg.
- A-role: when `Some`, reuses it for `accept()`. This is the
  critical change — the reflex addr advertised to peers is now
  the address listening for incoming direct dials.
- D-role: when `Some`, reuses it for the outbound direct dial.
  MikroTik keeps the same external port for the dial as for
  the signal flow → direct dial through a cone-mapped NAT.
- Relay path: also reuses the shared endpoint so MikroTik has
  a single consistent mapping across the whole call (saves one
  extra external port and makes firewall traces cleaner).
- When `None`, falls back to fresh per-role endpoints as before.
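
For readers unfamiliar with the race itself, here is a thread-based sketch of "first successful path wins". It is not the async implementation in `dual_path.rs`: the closures stand in for the direct and relay connection attempts, and the `Direct`/`Relay` variant names of `WinningPath` are assumed.

```rust
use std::sync::mpsc;
use std::thread;

/// Variant names are an assumption for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WinningPath {
    Direct,
    Relay,
}

/// Run both attempts concurrently and return the first one that succeeds.
/// A failed attempt never wins; `None` means both attempts failed.
fn race<D, R>(direct: D, relay: R) -> Option<WinningPath>
where
    D: FnOnce() -> Result<(), String> + Send + 'static,
    R: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let tx_relay = tx.clone();
    thread::spawn(move || {
        let _ = tx.send((WinningPath::Direct, direct()));
    });
    thread::spawn(move || {
        let _ = tx_relay.send((WinningPath::Relay, relay()));
    });
    // Both senders now live inside the threads, so the receiver's
    // iterator ends once each attempt has reported its outcome.
    rx.iter()
        .find(|(_, outcome)| outcome.is_ok())
        .map(|(path, _)| path)
}
```

A direct attempt that fails (as it did before Phase 5, when peers dialed a socket that was not listening) makes `race` return `Some(WinningPath::Relay)` whenever the relay attempt succeeds, which matches the "relay always won" symptom.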

### `connect` Tauri command (desktop/src-tauri/src/lib.rs)
- Reads `state.signal.endpoint` once when acquiring own reflex
  addr and passes it through to `dual_path::race`.

### Tests
- `wzp-client/tests/dual_path.rs` and
  `wzp-relay/tests/multi_reflect.rs` updated to pass `None` for
  the new endpoint arg — tests use fresh sockets and that's
  fine because the loopback harness doesn't care about
  port-preserving NAT behavior.

Full workspace test: 423 passing (no regressions).

## Expected behavior after this commit on real hardware

Behind MikroTik + Starlink-bypass (the reporter's setup):
- Phase 2 NAT detect → **Cone NAT** (was SymmetricPort — false
  positive from the measurement artifact)
- Phase 3.5 direct-P2P dial → succeeds for both cone-cone and
  cone-CGNAT cases where the remote side was previously blocked
  by our own socket mismatch
- LTE ↔ LTE cross-carrier → still likely relay fallback; that's
  genuinely strict symmetric and needs Phase 5.5 port prediction.

## Phase 5.5 (next, separate PRD)

Multi-candidate port prediction + ICE-style candidate aggregation
for truly strict symmetric NATs. Not needed for the 95% case —
Phase 5 alone fixes most consumer-router setups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Siavash Sameni
2026-04-11 19:47:20 +04:00
parent 05ec926317
commit 1618ff6c9d
5 changed files with 183 additions and 38 deletions


@@ -59,6 +59,22 @@ pub async fn race(
     relay_addr: SocketAddr,
     room_sni: String,
     call_sni: String,
+    // Phase 5: when `Some`, reuse this endpoint for BOTH the
+    // direct-path branch AND the relay dial. This is critical
+    // for hole-punching through port-preserving NATs — the
+    // advertised reflex addr only matches what peers can dial if
+    // the listening socket is the SAME one that registered with
+    // the relay. Pass the signal endpoint here.
+    //
+    // The endpoint MUST have been created with a server config
+    // (`create_endpoint(bind, Some(server_config()))`) if the
+    // A-role branch is going to run, otherwise `accept()` will
+    // return None immediately.
+    //
+    // When `None`, falls back to the pre-Phase-5 behavior of
+    // creating fresh endpoints per role. Used by tests and by
+    // paths where we're not registered to a relay.
+    shared_endpoint: Option<wzp_transport::Endpoint>,
 ) -> anyhow::Result<(Arc<QuinnTransport>, WinningPath)> {
     // Rustls provider must be installed before any quinn endpoint
     // is created. Install attempt is idempotent.
@@ -75,18 +91,37 @@ pub async fn race(
     match role {
         Role::Acceptor => {
-            let (sc, _cert_der) = wzp_transport::server_config();
-            let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
-            let ep = wzp_transport::create_endpoint(bind, Some(sc))?;
-            tracing::info!(
-                local_addr = ?ep.local_addr().ok(),
-                "dual_path: A-role endpoint up, awaiting peer dial"
-            );
+            let ep = match shared_endpoint.clone() {
+                Some(ep) => {
+                    tracing::info!(
+                        local_addr = ?ep.local_addr().ok(),
+                        "dual_path: A-role reusing shared endpoint for accept"
+                    );
+                    ep
+                }
+                None => {
+                    let (sc, _cert_der) = wzp_transport::server_config();
+                    let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
+                    let fresh = wzp_transport::create_endpoint(bind, Some(sc))?;
+                    tracing::info!(
+                        local_addr = ?fresh.local_addr().ok(),
+                        "dual_path: A-role fresh endpoint up, awaiting peer dial"
+                    );
+                    fresh
+                }
+            };
             let ep_for_fut = ep.clone();
             direct_fut = Box::pin(async move {
                 // `wzp_transport::accept` wraps the same
                 // `endpoint.accept().await?.await?` dance we want
                 // and maps errors into TransportError for us.
+                //
+                // If `ep_for_fut` is the shared signal endpoint,
+                // this accept pulls the NEXT incoming connection
+                // — normally that's the peer's direct-P2P dial.
+                // Signal recv is done via the existing signal
+                // CONNECTION (accept_bi), not the endpoint, so
+                // there's no conflict.
                 let conn = wzp_transport::accept(&ep_for_fut)
                     .await
                     .map_err(|e| anyhow::anyhow!("direct accept: {e}"))?;
@@ -95,13 +130,26 @@ pub async fn race(
             direct_ep = ep;
         }
         Role::Dialer => {
-            let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
-            let ep = wzp_transport::create_endpoint(bind, None)?;
-            tracing::info!(
-                local_addr = ?ep.local_addr().ok(),
-                %peer_direct_addr,
-                "dual_path: D-role endpoint up, dialing peer"
-            );
+            let ep = match shared_endpoint.clone() {
+                Some(ep) => {
+                    tracing::info!(
+                        local_addr = ?ep.local_addr().ok(),
+                        %peer_direct_addr,
+                        "dual_path: D-role reusing shared endpoint to dial peer"
+                    );
+                    ep
+                }
+                None => {
+                    let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
+                    let fresh = wzp_transport::create_endpoint(bind, None)?;
+                    tracing::info!(
+                        local_addr = ?fresh.local_addr().ok(),
+                        %peer_direct_addr,
+                        "dual_path: D-role fresh endpoint up, dialing peer"
+                    );
+                    fresh
+                }
+            };
             let ep_for_fut = ep.clone();
             let client_cfg = wzp_transport::client_config();
             let sni = call_sni.clone();
@@ -116,9 +164,17 @@ pub async fn race(
         }
     }
-    // Relay path: classic dial to the relay's media room.
-    let relay_bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
-    let relay_ep = wzp_transport::create_endpoint(relay_bind, None)?;
+    // Relay path: classic dial to the relay's media room. Phase 5:
+    // reuse the shared endpoint here too so MikroTik-style NATs
+    // keep a stable external port across all flows from this
+    // client. Falls back to a fresh endpoint when not shared.
+    let relay_ep = match shared_endpoint.clone() {
+        Some(ep) => ep,
+        None => {
+            let relay_bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
+            wzp_transport::create_endpoint(relay_bind, None)?
+        }
+    };
     let relay_ep_for_fut = relay_ep.clone();
     let relay_client_cfg = wzp_transport::client_config();
     let relay_sni = room_sni.clone();
@@ -185,11 +241,16 @@ pub async fn race(
         }
     };
-    // Drop both endpoints once the winner is stored in result. The
-    // winning transport owns its own connection so dropping the
-    // endpoint won't kill it.
-    drop(direct_ep);
-    drop(relay_ep);
+    // Let both endpoint clones drop at end-of-scope. With the
+    // Phase 5 shared-endpoint path, these clones are Arc<Endpoint>
+    // clones of the signal endpoint — dropping them just decrements
+    // the ref count, the socket stays alive for the signal loop +
+    // any further direct-P2P attempts. With the fresh-endpoint
+    // fallback, the drops are the last refs so the sockets close
+    // promptly. Either way the winning transport already owns its
+    // own quinn::Connection reference which is independent of the
+    // Endpoint lifetime.
+    let _ = (direct_ep, relay_ep);
     result
 }
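
A note on `let _ = (direct_ep, relay_ep);` in the last hunk: in Rust the tuple is a temporary that the `_` pattern does not bind, so both handles are released at the end of that statement, not at the closing brace of the function. Since the statement sits directly before `result`, the timing difference is negligible here. The std-only sketch below makes the ordering visible; `Noisy` and `drop_order` are illustrative names, not project code.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Appends its name to a shared log when dropped.
struct Noisy {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Records the order of events around `let _ = (a, b);`.
fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let direct_ep = Noisy { name: "direct_ep dropped", log: log.clone() };
        let relay_ep = Noisy { name: "relay_ep dropped", log: log.clone() };
        // Both values move into a temporary tuple that `_` does not bind,
        // so they are dropped when this statement ends.
        let _ = (direct_ep, relay_ep);
        log.borrow_mut().push("after let");
    }
    log.borrow_mut().push("after scope");
    let events = log.borrow().clone();
    events
}
```

`drop_order()` yields both "dropped" entries before "after let". Binding to a named variable such as `_eps` instead of `_` would defer the drops to the end of the enclosing block.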