Before Phase 5, WarzonePhone used THREE separate UDP sockets per
client:
  1. Signal endpoint (register_signal, client-only)
  2. Reflect probe endpoints (one fresh socket per relay probe)
  3. Dual-path race endpoint (fresh per call setup)
This broke two things in production on port-preserving NATs
(MikroTik masquerade, most consumer routers):
  a. Phase 2 NAT detection was WRONG. Each probe used a fresh
     internal port, so MikroTik mapped each one to a different
     external port, and the classifier saw "different port per
     relay" and labeled it SymmetricPort. The real NAT was
     cone-like, but measuring through fresh sockets hid that.
  b. Phase 3.5 dual-path P2P race was BROKEN. The reflex addr
     we advertised in DirectCallOffer was observed by the signal
     endpoint's socket. The actual dual-path race listened on a
     DIFFERENT fresh socket, on a different internal (and
     therefore external) port. Peers dialed the advertised addr
     and hit MikroTik's mapping for the signal socket, which
     forwarded to the signal endpoint, a client-only endpoint
     that doesn't accept incoming connections. The direct path
     silently failed and the relay always won the race.
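The misclassification in (a) can be illustrated with a small classifier sketch. This is not the wzp-client classifier: the `NatGuess` enum and `classify` function are hypothetical names, and the rule (same external IP, compare external ports across relays) is an assumption based on the description above.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Simplified NAT classes. The variant names mirror the commit text,
/// but this enum is hypothetical and not the real wzp-client type.
#[derive(Debug, PartialEq, Eq)]
pub enum NatGuess {
    /// Every relay observed the same external ip:port.
    Cone,
    /// Same external IP, but a different external port per relay.
    SymmetricPort,
    /// Too few observations, or the external IP itself changed.
    Unknown,
}

/// Classify from the reflex addresses that each relay reported back.
pub fn classify(observed: &[SocketAddr]) -> NatGuess {
    if observed.len() < 2 {
        return NatGuess::Unknown;
    }
    let ips: HashSet<_> = observed.iter().map(|a| a.ip()).collect();
    let ports: HashSet<_> = observed.iter().map(|a| a.port()).collect();
    match (ips.len(), ports.len()) {
        (1, 1) => NatGuess::Cone,
        (1, _) => NatGuess::SymmetricPort,
        _ => NatGuess::Unknown,
    }
}
```

With a fresh socket per probe, a port-preserving NAT hands each probe a different external port, so a rule like this reports `SymmetricPort` even though a single socket would have produced identical observations and a `Cone` result.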
Nebula-style fix: one socket for everything. The signal endpoint
is now dual-purpose (client + server_config), and both the
reflect probes and the dual-path race reuse it instead of
creating fresh ones. MikroTik's port-preservation then gives us
a stable external port across all flows → classifier correctly
sees Cone NAT → advertised reflex addr is the actual listening
port → direct dials from peers land on the right socket →
`endpoint.accept()` in the A-role branch of the dual-path race
picks up the incoming connection.
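The effect of sharing one socket can be reproduced on loopback with plain `std::net::UdpSocket`, without QUIC. The sketch below is illustrative only: the function names are hypothetical, the "relays" are bare UDP sockets, and loopback has no NAT, so the observed source port stands in for the external port a port-preserving NAT would assign.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

/// Send one datagram from `client` to `relay` and return the source
/// address the relay saw. On loopback this is the client's local
/// address; a port-preserving NAT would keep the same port outside.
fn observed_source(client: &UdpSocket, relay: &UdpSocket) -> io::Result<SocketAddr> {
    relay.set_read_timeout(Some(Duration::from_secs(2)))?;
    client.send_to(b"probe", relay.local_addr()?)?;
    let mut buf = [0u8; 16];
    let (_len, src) = relay.recv_from(&mut buf)?;
    Ok(src)
}

/// Old behavior: one fresh socket per relay probe.
pub fn probe_with_fresh_sockets() -> io::Result<(u16, u16)> {
    let relay_a = UdpSocket::bind("127.0.0.1:0")?;
    let relay_b = UdpSocket::bind("127.0.0.1:0")?;
    // Both probe sockets are alive at once, so the OS must give
    // them different local ports.
    let probe_a = UdpSocket::bind("127.0.0.1:0")?;
    let probe_b = UdpSocket::bind("127.0.0.1:0")?;
    Ok((
        observed_source(&probe_a, &relay_a)?.port(),
        observed_source(&probe_b, &relay_b)?.port(),
    ))
}

/// New behavior: every probe goes out through the same socket.
pub fn probe_with_shared_socket() -> io::Result<(u16, u16)> {
    let relay_a = UdpSocket::bind("127.0.0.1:0")?;
    let relay_b = UdpSocket::bind("127.0.0.1:0")?;
    let shared = UdpSocket::bind("127.0.0.1:0")?;
    Ok((
        observed_source(&shared, &relay_a)?.port(),
        observed_source(&shared, &relay_b)?.port(),
    ))
}
```

The fresh-socket variant yields two different source ports, while the shared-socket variant yields the same port at both relays, which is the stable mapping the classifier needs to see.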
## Changes
### `register_signal` (desktop/src-tauri/src/lib.rs)
- Endpoint now created with `Some(server_config())` instead of
`None`. The socket can now accept incoming QUIC connections as
well as dial outbound.
- Every code path that already reads `sig.endpoint` to reuse the
  socket for relay dials benefits automatically: that same socket
  is now ALSO listening for peer dials.
### `probe_reflect_addr` (wzp-client/src/reflect.rs)
- New `existing_endpoint: Option<Endpoint>` arg. `Some` reuses
the caller's socket (production: pass the signal endpoint).
`None` creates a fresh one (tests + pre-registration).
- Removed the `drop(endpoint)` at the end — was correct for
fresh endpoints (explicit early socket close) but incorrect
for shared ones. End-of-scope drop does the right thing in
both cases via Arc semantics.
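The reuse-or-create pattern and the end-of-scope drop can be sketched with a stand-in type. The `Endpoint` below is hypothetical: a cloneable handle around an `Arc<UdpSocket>`, assumed to behave like the reference-counted endpoint handle the commit text implies. `probe_local_port` is likewise an illustrative name, not the real `probe_reflect_addr` signature.

```rust
use std::io;
use std::net::UdpSocket;
use std::sync::Arc;

/// Hypothetical stand-in for a QUIC endpoint: a cheaply cloneable
/// handle around one UDP socket.
#[derive(Clone)]
pub struct Endpoint {
    socket: Arc<UdpSocket>,
}

impl Endpoint {
    pub fn bind_fresh() -> io::Result<Self> {
        Ok(Self { socket: Arc::new(UdpSocket::bind("127.0.0.1:0")?) })
    }

    pub fn local_port(&self) -> io::Result<u16> {
        Ok(self.socket.local_addr()?.port())
    }

    /// Number of live handles that share this socket.
    pub fn handle_count(&self) -> usize {
        Arc::strong_count(&self.socket)
    }
}

/// Reuse the caller's endpoint when given one, otherwise bind a
/// fresh socket. Returns the local port the probe would use.
pub fn probe_local_port(existing: Option<Endpoint>) -> io::Result<u16> {
    let endpoint = match existing {
        Some(shared) => shared,          // production: the signal endpoint
        None => Endpoint::bind_fresh()?, // tests and pre-registration
    };
    endpoint.local_port()
    // No explicit drop: when `endpoint` goes out of scope, a fresh
    // socket is closed, while a shared one only loses this handle.
}
```

After `probe_local_port(Some(signal.clone()))` returns, the caller's `signal` handle is still usable and is once again the only handle to the socket.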
### `detect_nat_type` (wzp-client/src/reflect.rs)
- New `shared_endpoint: Option<Endpoint>` arg, forwarded to
every probe in the JoinSet fan-out. One shared socket means
the classifier sees the true NAT type.
### `detect_nat_type` Tauri command (desktop/src-tauri/src/lib.rs)
- Reads `state.signal.endpoint` and passes it as the shared
endpoint. Falls back to None when not registered. NAT detection
now produces accurate classifications against MikroTik / most
consumer NATs.
### `dual_path::race` (wzp-client/src/dual_path.rs)
- New `shared_endpoint: Option<Endpoint>` arg.
- A-role: when `Some`, reuses it for `accept()`. This is the
critical change — the reflex addr advertised to peers is now
the address listening for incoming direct dials.
- D-role: when `Some`, reuses it for the outbound direct dial.
MikroTik keeps the same external port for the dial as for
the signal flow → direct dial through a cone-mapped NAT.
- Relay path: also reuses the shared endpoint so MikroTik has
a single consistent mapping across the whole call (saves one
extra external port and makes firewall traces cleaner).
- When `None`, falls back to fresh per-role endpoints as before.
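Why the A-role change matters can be shown with a loopback sketch using plain UDP datagrams in place of QUIC handshakes. All names here are hypothetical. A "relay" socket observes the reflex address of the signal socket, a "peer" dials that address, and the function reports whether the dial reached the socket that is actually waiting for it.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

fn bind_loopback() -> io::Result<UdpSocket> {
    let s = UdpSocket::bind("127.0.0.1:0")?;
    s.set_read_timeout(Some(Duration::from_millis(500)))?;
    Ok(s)
}

/// Returns true if a peer datagram sent to the advertised reflex
/// address reaches `listener`, the socket waiting for the dial.
pub fn peer_dial_lands(signal: &UdpSocket, listener: &UdpSocket) -> io::Result<bool> {
    let relay = bind_loopback()?;
    let peer = bind_loopback()?;

    // Signal flow: the relay observes the reflex address of `signal`.
    signal.send_to(b"register", relay.local_addr()?)?;
    let mut buf = [0u8; 16];
    let (_n, reflex): (usize, SocketAddr) = relay.recv_from(&mut buf)?;

    // The peer dials the advertised reflex address.
    peer.send_to(b"dial", reflex)?;

    // Did the dial reach the socket that is waiting for it?
    match listener.recv_from(&mut buf) {
        Ok((n, _from)) => Ok(&buf[..n] == b"dial"),
        Err(e) if matches!(e.kind(), io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut) => {
            Ok(false)
        }
        Err(e) => Err(e),
    }
}

/// Pre-Phase-5 layout: the race listens on its own fresh socket.
pub fn old_layout_lands() -> io::Result<bool> {
    let signal = bind_loopback()?;
    let race = bind_loopback()?;
    peer_dial_lands(&signal, &race)
}

/// Phase 5 layout: the signal socket is also the listening socket.
pub fn new_layout_lands() -> io::Result<bool> {
    let shared = bind_loopback()?;
    peer_dial_lands(&shared, &shared)
}
```

In the old layout the dial is delivered to the signal socket, where nothing reads it, so the listener times out. In the new layout the advertised address and the listening socket are the same, so the dial arrives.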
### `connect` Tauri command (desktop/src-tauri/src/lib.rs)
- Reads `state.signal.endpoint` once when acquiring own reflex
addr and passes it through to `dual_path::race`.
### Tests
- `wzp-client/tests/dual_path.rs` and
`wzp-relay/tests/multi_reflect.rs` updated to pass `None` for
the new endpoint arg — tests use fresh sockets and that's
fine because the loopback harness doesn't care about
port-preserving NAT behavior.
Full workspace test: 423 passing (no regressions).
## Expected behavior after this commit on real hardware
Behind MikroTik + Starlink-bypass (the reporter's setup):
- Phase 2 NAT detect → **Cone NAT** (was SymmetricPort — false
positive from the measurement artifact)
- Phase 3.5 direct-P2P dial → succeeds for both cone-cone and
cone-CGNAT cases where the remote side was previously blocked
by our own socket mismatch
- LTE ↔ LTE cross-carrier → still likely relay fallback; that's
genuinely strict symmetric and needs Phase 5.5 port prediction.
## Phase 5.5 (next, separate PRD)
Multi-candidate port prediction + ICE-style candidate aggregation
for truly strict symmetric NATs. Not needed for the 95% case —
Phase 5 alone fixes most consumer-router setups.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
203 lines
8.2 KiB
Rust
//! Phase 3.5 integration tests for the dual-path QUIC race.
//!
//! The race takes a role (Acceptor or Dialer), a peer_direct_addr,
//! a relay_addr, and two SNI strings, then returns whichever QUIC
//! handshake completes first wrapped in a `QuinnTransport`. These
//! tests validate that:
//!
//! 1. On loopback with two real clients playing A + D roles, the
//!    direct path wins (fewer hops than relay).
//! 2. When the direct peer is dead (nothing listening) but the
//!    relay is up, the relay wins within the fallback window.
//! 3. When both paths are dead, the race errors cleanly rather
//!    than hanging forever.
//!
//! The "relay" in these tests is a minimal mock that just accepts
//! an incoming QUIC connection and drops it; we don't need any
//! protocol handling, just a TCP-ish listen-and-accept.

use std::net::{Ipv4Addr, SocketAddr};
use std::time::Duration;

use wzp_client::dual_path::{race, WinningPath};
use wzp_client::reflect::Role;
use wzp_transport::{create_endpoint, server_config};

/// Spin up a "relay-ish" mock server on loopback that accepts
/// incoming QUIC connections and does nothing with them. Used to
/// give the relay branch of the race a real target to dial.
/// Returns the bound address + a join handle (kept alive to keep
/// the endpoint up).
async fn spawn_mock_relay() -> (SocketAddr, tokio::task::JoinHandle<()>) {
    let _ = rustls::crypto::ring::default_provider().install_default();
    let (sc, _cert_der) = server_config();
    let bind: SocketAddr = (Ipv4Addr::LOCALHOST, 0).into();
    let ep = create_endpoint(bind, Some(sc)).expect("relay endpoint");
    let addr = ep.local_addr().expect("local_addr");

    let handle = tokio::spawn(async move {
        // Accept loop: hold the connection alive for a short
        // while so the race result isn't killed by the peer
        // closing before the winning transport is returned.
        while let Some(incoming) = ep.accept().await {
            if let Ok(_conn) = incoming.await {
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        }
    });
    (addr, handle)
}

// -----------------------------------------------------------------------
// Test 1: direct path wins when both sides are up
// -----------------------------------------------------------------------
//
// Spawn a mock relay, then set up a two-client test where one
// client plays the Acceptor role and the other plays the Dialer
// role. The Dialer's `peer_direct_addr` is the Acceptor's listen
// address. Because the direct path is a single loopback hop and
// the relay dial also terminates on loopback, both complete
// essentially instantly, so the `biased` tokio::select in race()
// should pick direct.

#[tokio::test(flavor = "multi_thread", worker_threads = 4)]
async fn dual_path_direct_wins_on_loopback() {
    let _ = rustls::crypto::ring::default_provider().install_default();
    let (relay_addr, _relay_handle) = spawn_mock_relay().await;

    // Acceptor task: run race(Role::Acceptor, peer_addr_placeholder, ...).
    // Since the acceptor doesn't dial, the peer_direct_addr arg is
    // unused on the direct branch but we still pass a placeholder
    // because the API takes one. Use a stub addr that would error
    // if it were ever dialed, proving the Acceptor really doesn't
    // reach it.
    let unused_addr: SocketAddr = "127.0.0.1:2".parse().unwrap();

    // We can't race both sides in the same task because each race
    // call has its own direct endpoint that needs to talk to the
    // OTHER side's endpoint. So spawn the Acceptor in a task and
    // let it expose its listen addr via a oneshot back to the test,
    // then run the Dialer in the test's main task.
    //
    // There's a chicken-and-egg issue: the Acceptor's listen addr
    // is only known after race() creates its endpoint. To avoid
    // reaching into race()'s internals, we instead play a slight
    // trick: create the Acceptor's endpoint ourselves (outside
    // race()) to learn its addr, spin up an accept loop on it
    // ourselves, and pass THAT addr as the Dialer's peer addr.
    // This tests the Dialer->Acceptor handshake end-to-end without
    // running the full race() on both sides.

    let (sc, _cert_der) = server_config();
    let acceptor_bind: SocketAddr = (Ipv4Addr::LOCALHOST, 0).into();
    let acceptor_ep = create_endpoint(acceptor_bind, Some(sc)).expect("acceptor ep");
    let acceptor_listen_addr = acceptor_ep.local_addr().expect("acceptor addr");

    // Drop the external acceptor after the test finishes, not
    // before: spawn a dedicated accept task.
    let acceptor_accept_task = tokio::spawn(async move {
        // Accept one connection and hold it for a while so the
        // Dialer side can complete its QUIC handshake.
        if let Some(incoming) = acceptor_ep.accept().await {
            if let Ok(_conn) = incoming.await {
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        }
    });

    // Now run the Dialer in the race, with peer_direct_addr set to
    // the acceptor's listen addr. The relay is the mock from above.
    // Direct path should win.
    let result = race(
        Role::Dialer,
        acceptor_listen_addr,
        relay_addr,
        "test-room".into(),
        "call-test".into(),
        None, // Phase 5: tests use fresh endpoints (no shared signal)
    )
    .await
    .expect("race must succeed");

    assert_eq!(result.1, WinningPath::Direct, "direct should win on loopback");

    // Cancel the acceptor accept task so the test finishes.
    acceptor_accept_task.abort();
    // Suppress unused-var warning for the placeholder.
    let _ = unused_addr;
}

// -----------------------------------------------------------------------
// Test 2: relay wins when the direct peer is dead
// -----------------------------------------------------------------------
//
// Dialer role, peer_direct_addr = a port nothing is listening on,
// relay is the working mock. Direct dial will sit waiting for a
// QUIC handshake that never comes; the 2s direct timeout kicks in
// and the relay path wins the fallback.

#[tokio::test(flavor = "multi_thread", worker_threads = 4)]
async fn dual_path_relay_wins_when_direct_is_dead() {
    let _ = rustls::crypto::ring::default_provider().install_default();
    let (relay_addr, _relay_handle) = spawn_mock_relay().await;

    // A port that nothing is listening on: the dead direct target.
    // Port 1 on loopback is almost never bound and UDP packets to
    // it will be dropped silently, so the QUIC handshake times out.
    let dead_peer: SocketAddr = "127.0.0.1:1".parse().unwrap();

    let result = race(
        Role::Dialer,
        dead_peer,
        relay_addr,
        "test-room".into(),
        "call-test".into(),
        None, // Phase 5: tests use fresh endpoints (no shared signal)
    )
    .await
    .expect("race must succeed via relay fallback");

    assert_eq!(
        result.1,
        WinningPath::Relay,
        "relay should win when direct dial has nowhere to land"
    );
}

// -----------------------------------------------------------------------
// Test 3: race errors cleanly when both paths are dead
// -----------------------------------------------------------------------
//
// Dialer role, peer_direct_addr = dead, relay_addr = dead.
// Expected: race returns an Err within ~7s (2s direct timeout +
// 5s relay timeout fallback).

#[tokio::test(flavor = "multi_thread", worker_threads = 4)]
async fn dual_path_errors_cleanly_when_both_paths_dead() {
    let _ = rustls::crypto::ring::default_provider().install_default();

    let dead_peer: SocketAddr = "127.0.0.1:1".parse().unwrap();
    let dead_relay: SocketAddr = "127.0.0.1:2".parse().unwrap();

    let start = std::time::Instant::now();
    let result = race(
        Role::Dialer,
        dead_peer,
        dead_relay,
        "test-room".into(),
        "call-test".into(),
        None, // Phase 5: tests use fresh endpoints (no shared signal)
    )
    .await;
    let elapsed = start.elapsed();

    assert!(result.is_err(), "both-dead must return Err");
    // Upper bound: direct 2s timeout + relay 5s fallback + small
    // slack for scheduling. If this blows, something is looping.
    assert!(
        elapsed < Duration::from_secs(10),
        "race took too long to give up: {:?}",
        elapsed
    );
}