Same-LAN P2P was failing because MikroTik masquerade (like most
consumer NATs) doesn't support NAT hairpinning — the advertised
WAN reflex addr is unreachable from a peer on the same LAN as
the advertiser. Phase 5 got us Cone NAT classification and fixed
the measurement artifact, but same-LAN direct dials still had
nowhere to land.
Phase 5.5 adds ICE-style host candidates: each client enumerates
its LAN-local network interface addresses, includes them in the
DirectCallOffer/Answer alongside the reflex addr, and the
dual-path race fans out to ALL peer candidates in parallel.
Same-LAN peers find each other via their RFC1918 IPv4 + ULA /
global-unicast IPv6 addresses without touching the NAT at all.
Dual-stack IPv6 is in scope from the start — on modern ISPs
(including Starlink) the v6 path often works even when v4
hairpinning doesn't, because there's no NAT on the v6 side.
## Changes
### `wzp_client::reflect::local_host_candidates(port)` (new)
Enumerates network interfaces via `if-addrs` and returns
SocketAddrs paired with the caller's port. Filters:
- IPv4: RFC1918 (10/8, 172.16/12, 192.168/16) + CGNAT (100.64/10)
- IPv6: global unicast (2000::/3) + ULA (fc00::/7)
- Skipped: loopback, link-local (169.254, fe80::), public v4
(already covered by reflex-addr), unspecified
Safe from any thread, one `getifaddrs(3)` syscall.
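A minimal sketch of the filter logic (function names here are illustrative, not the real API — in the actual code the addresses come from `if_addrs::get_if_addrs()` rather than a plain list):

```rust
use std::net::{IpAddr, SocketAddr};

/// Does `ip` qualify as a Phase 5.5 host candidate?
fn is_host_candidate(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            let o = v4.octets();
            // 100.64.0.0/10 — second octet's top two bits are 01.
            let cgnat = o[0] == 100 && (o[1] & 0xC0) == 0x40;
            !v4.is_loopback()
                && !v4.is_link_local()        // 169.254/16
                && !v4.is_unspecified()
                && (v4.is_private() || cgnat) // RFC1918 + CGNAT only; public v4 is excluded
        }
        IpAddr::V6(v6) => {
            let s = v6.segments();
            let global = (s[0] & 0xE000) == 0x2000;     // 2000::/3
            let ula = (s[0] & 0xFE00) == 0xFC00;        // fc00::/7
            let link_local = (s[0] & 0xFFC0) == 0xFE80; // fe80::/10
            !v6.is_loopback() && !v6.is_unspecified() && !link_local && (global || ula)
        }
    }
}

/// Pair every qualifying interface address with the caller's port.
fn to_candidates(ips: impl IntoIterator<Item = IpAddr>, port: u16) -> Vec<SocketAddr> {
    ips.into_iter()
        .filter(|ip| is_host_candidate(*ip))
        .map(|ip| SocketAddr::new(ip, port))
        .collect()
}
```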
### Wire protocol (wzp-proto/packet.rs)
Three new `#[serde(default, skip_serializing_if = "Vec::is_empty")]`
fields, backward-compat with pre-5.5 clients/relays by
construction:
- `DirectCallOffer.caller_local_addrs: Vec<String>`
- `DirectCallAnswer.callee_local_addrs: Vec<String>`
- `CallSetup.peer_local_addrs: Vec<String>`
### Call registry (wzp-relay/call_registry.rs)
`DirectCall` gains `caller_local_addrs` + `callee_local_addrs`
`Vec<String>` fields, with new `set_caller_local_addrs` /
`set_callee_local_addrs` setters that follow the same pattern
as the reflex-addr fields.
### Relay cross-wiring (wzp-relay/main.rs)
Both the local-call and cross-relay-federation paths now track
the local_addrs through the registry and inject them into the
CallSetup's peer_local_addrs. Cross-wiring is identical to the
existing peer_direct_addr logic — each party's CallSetup
carries the OTHER party's LAN candidates.
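The swap is easy to get backwards, so here is the shape of it as a sketch (struct shapes are hypothetical minimums — the real `DirectCall` / `CallSetup` carry many more fields):

```rust
// Hypothetical minimal shapes for illustration only.
#[derive(Default)]
struct DirectCall {
    caller_local_addrs: Vec<String>,
    callee_local_addrs: Vec<String>,
}

struct CallSetup {
    peer_local_addrs: Vec<String>,
}

/// Cross-wire: each party's CallSetup carries the OTHER party's
/// LAN candidates, mirroring the existing peer_direct_addr logic.
fn cross_wire(call: &DirectCall) -> (CallSetup, CallSetup) {
    (
        // Sent to the caller: the callee's candidates.
        CallSetup { peer_local_addrs: call.callee_local_addrs.clone() },
        // Sent to the callee: the caller's candidates.
        CallSetup { peer_local_addrs: call.caller_local_addrs.clone() },
    )
}
```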
### Client side (desktop/src-tauri/lib.rs)
- `place_call`: gathers local host candidates via
`local_host_candidates(signal_endpoint.local_addr().port())`
and includes them in `DirectCallOffer.caller_local_addrs`.
The port match is critical — it's the Phase 5 shared signal
socket, so incoming dials to these addrs land on the same
endpoint that's already listening.
- `answer_call`: same, AcceptTrusted only (privacy mode keeps
LAN addrs hidden too, for consistency with the reflex addr).
- `connect` Tauri command: new `peer_local_addrs: Vec<String>`
arg. Builds a `PeerCandidates` bundle and passes it to the
dual-path race.
- Recv loop's CallSetup handler: destructures + forwards the
new field to JS via the signal-event payload.
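On the receiving side the wire strings have to become a candidate bundle before the race starts. A sketch of how the `connect` command might assemble it (the skip-malformed-entries policy and the helper name are assumptions; `PeerCandidates` here is a local stand-in mirroring the one in `dual_path.rs`):

```rust
use std::net::SocketAddr;

// Local stand-in mirroring `dual_path::PeerCandidates`.
#[derive(Debug, Default)]
struct PeerCandidates {
    reflexive: Option<SocketAddr>,
    local: Vec<SocketAddr>,
}

/// Build the candidate bundle from the CallSetup's wire strings.
/// Malformed entries are skipped rather than failing the call — a
/// peer on a newer build must not brick an older dialer.
fn candidates_from_wire(
    peer_direct_addr: Option<&str>,
    peer_local_addrs: &[String],
) -> PeerCandidates {
    PeerCandidates {
        reflexive: peer_direct_addr.and_then(|s| s.parse().ok()),
        local: peer_local_addrs.iter().filter_map(|s| s.parse().ok()).collect(),
    }
}
```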
### `dual_path::race` (wzp-client/dual_path.rs)
Signature change: takes `PeerCandidates` (reflex + local Vec)
instead of a single SocketAddr. The D-role branch now fans out
N parallel dials via `tokio::task::JoinSet` — one per candidate
— and the first successful dial wins (losers are aborted
immediately via `set.abort_all()`). Only when ALL candidates
have failed do we return Err; individual candidate failures are
just traced at debug level and the race waits for the others.
LAN host candidates are tried BEFORE the reflex addr in
`PeerCandidates::dial_order()` — they're faster when they work,
and the reflex addr is the fallback for the not-on-same-LAN
case.
### JS side (desktop/main.ts)
`connect` invoke now passes `peerLocalAddrs: data.peer_local_addrs ?? []`
alongside the existing `peerDirectAddr`.
### Tests
All existing test callsites updated for the new Vec<String>
fields (defaults to Vec::new() in tests — they don't exercise
the multi-candidate path). `dual_path.rs` integration tests
wrap the single `dead_peer` / `acceptor_listen_addr` in a
`PeerCandidates { reflexive: Some(_), local: Vec::new() }`.
Full workspace test: 423 passing (same as before 5.5).
## Expected behavior on the reporter's setup
Two phones behind MikroTik, both on the same LAN:
```
place_call:host_candidates {"local_addrs": ["192.168.88.21:XXX", "2001:...:YY:XXX"]}
recv:DirectCallAnswer {"callee_local_addrs": ["192.168.88.22:ZZZ", "2001:...:WW:ZZZ"]}
recv:CallSetup {"peer_direct_addr":"150.228.49.65:NN",
                "peer_local_addrs":["192.168.88.22:ZZZ","2001:...:WW:ZZZ"]}
connect:dual_path_race_start {"peer_reflex":"...","peer_local":[...]}
dual_path: direct dial succeeded on candidate 0   ← LAN v4 wins
connect:dual_path_race_won {"path":"Direct"}
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### `wzp-client/dual_path.rs` — 375 lines · 16 KiB · Rust
```rust
//! Phase 3.5 — dual-path QUIC connect race for P2P hole-punching.
//!
//! When both peers advertised reflex addrs in the
//! DirectCallOffer/Answer flow, the relay cross-wires them into
//! `CallSetup.peer_direct_addr`. This module races a direct QUIC
//! handshake against the existing relay dial and returns whichever
//! completes first — with automatic drop of the loser via
//! `tokio::select!`. Phase 5.5 widens the direct side: the Dialer
//! fans out to every peer candidate (LAN host addrs + reflex addr)
//! in parallel and the first successful dial wins.
//!
//! Role determination is deterministic and symmetric
//! (`wzp_client::reflect::determine_role`): whichever peer has the
//! lexicographically smaller reflex addr becomes the **Acceptor**
//! (listens on a server-capable endpoint), the other becomes the
//! **Dialer** (dials the peer's addr). Because the rule is
//! identical on both sides, the Acceptor's inbound QUIC session
//! and the Dialer's outbound are the SAME connection — no
//! negotiation needed, no two-conns-per-call confusion.
//!
//! Timeout policy:
//! - Direct path: 2s from the start of `race`. Cone-NAT hole-punch
//!   typically completes in < 500ms on a LAN; 2s gives us tolerance
//!   for a single QUIC Initial retry on unreliable networks.
//! - Relay path: 10s (existing behavior elsewhere in the codebase).
//! - Overall: `tokio::select!` returns as soon as either succeeds.

use std::net::SocketAddr;
use std::sync::Arc;
use std::time::Duration;

use crate::reflect::Role;
use wzp_transport::QuinnTransport;

/// Which path won the race. Used by the `connect` command for
/// logging + (in the future) metrics.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WinningPath {
    Direct,
    Relay,
}

/// Phase 5.5 candidate bundle — full ICE-ish candidate list for
/// the peer. The race tries them all in parallel alongside the
/// relay path. At minimum this should contain the peer's
/// server-reflexive address; `local` carries LAN host
/// candidates gathered from their physical interfaces.
///
/// Empty is valid: the D-role has nothing to dial and the race
/// reduces to "relay only" + (if A-role) accepting on the
/// shared endpoint.
#[derive(Debug, Clone, Default)]
pub struct PeerCandidates {
    /// Peer's server-reflexive address (Phase 3). `None` if the
    /// peer didn't advertise one.
    pub reflexive: Option<SocketAddr>,
    /// Peer's LAN host addresses (Phase 5.5). Tried first on
    /// same-LAN pairs — direct dials to these bypass the NAT
    /// entirely.
    pub local: Vec<SocketAddr>,
}

impl PeerCandidates {
    /// Flatten into the list of addrs the D-role should dial.
    /// Order: LAN host candidates first (fastest when they
    /// work), then reflexive (covers the non-LAN case).
    pub fn dial_order(&self) -> Vec<SocketAddr> {
        let mut out = Vec::with_capacity(self.local.len() + 1);
        out.extend(self.local.iter().copied());
        if let Some(a) = self.reflexive {
            // Only add if it's not already in the list (some
            // edge cases on same-LAN could have the same addr
            // in both).
            if !out.contains(&a) {
                out.push(a);
            }
        }
        out
    }

    /// Is there anything for the D-role to dial? If not, the
    /// race reduces to relay-only.
    pub fn is_empty(&self) -> bool {
        self.reflexive.is_none() && self.local.is_empty()
    }
}

/// Attempt a direct QUIC connection to the peer in parallel with
/// the relay dial and return the winning `QuinnTransport`.
///
/// `role` selects the direction of the direct attempt:
/// - `Role::Acceptor` creates a server-capable endpoint and waits
///   for the peer to dial in.
/// - `Role::Dialer` creates a client-only endpoint and dials
///   every addr in `peer_candidates.dial_order()` in parallel.
///
/// The relay path is always attempted in parallel as a fallback so
/// the race ALWAYS produces a working transport unless both paths
/// genuinely fail (network partition). Returns
/// `Err(anyhow::anyhow!(...))` if both paths fail within the
/// timeout.
#[allow(clippy::too_many_arguments)]
pub async fn race(
    role: Role,
    peer_candidates: PeerCandidates,
    relay_addr: SocketAddr,
    room_sni: String,
    call_sni: String,
    // Phase 5: when `Some`, reuse this endpoint for BOTH the
    // direct-path branch AND the relay dial. Pass the signal
    // endpoint. The endpoint MUST be server-capable (created
    // with a server config) for the A-role accept branch to
    // work.
    //
    // When `None`, falls back to fresh endpoints per role.
    // Used by tests.
    shared_endpoint: Option<wzp_transport::Endpoint>,
) -> anyhow::Result<(Arc<QuinnTransport>, WinningPath)> {
    // Rustls provider must be installed before any quinn endpoint
    // is created. Install attempt is idempotent.
    let _ = rustls::crypto::ring::default_provider().install_default();

    // Build the direct-path endpoint + future based on role.
    //
    // A-role: one accept future on the shared endpoint. The
    // first incoming QUIC connection wins — we don't care
    // which peer candidate the dialer used to reach us.
    //
    // D-role: N parallel dial futures, one per peer candidate
    // (all LAN host addrs + the reflex addr), consolidated
    // into a single direct_fut via a `tokio::task::JoinSet`
    // with "first Ok wins" semantics. The first successful
    // dial becomes the direct path; the losers are aborted
    // (quinn tears down their in-flight handshakes with the
    // dropped Connecting futures).
    //
    // Either way, direct_fut resolves to a single QuinnTransport
    // (or an error) and is raced against the relay_fut by the
    // outer tokio::select!.
    let direct_ep: wzp_transport::Endpoint;
    let direct_fut: std::pin::Pin<
        Box<dyn std::future::Future<Output = anyhow::Result<QuinnTransport>> + Send>,
    >;

    match role {
        Role::Acceptor => {
            let ep = match shared_endpoint.clone() {
                Some(ep) => {
                    tracing::info!(
                        local_addr = ?ep.local_addr().ok(),
                        "dual_path: A-role reusing shared endpoint for accept"
                    );
                    ep
                }
                None => {
                    let (sc, _cert_der) = wzp_transport::server_config();
                    let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
                    let fresh = wzp_transport::create_endpoint(bind, Some(sc))?;
                    tracing::info!(
                        local_addr = ?fresh.local_addr().ok(),
                        "dual_path: A-role fresh endpoint up, awaiting peer dial"
                    );
                    fresh
                }
            };
            let ep_for_fut = ep.clone();
            direct_fut = Box::pin(async move {
                // `wzp_transport::accept` wraps the same
                // `endpoint.accept().await?.await?` dance we want.
                // If `ep_for_fut` is the shared signal endpoint,
                // this pulls the NEXT incoming connection —
                // normally that's the peer's direct-P2P dial.
                // Signal recv is done via the signal CONNECTION
                // (accept_bi), not the endpoint, so no conflict.
                let conn = wzp_transport::accept(&ep_for_fut)
                    .await
                    .map_err(|e| anyhow::anyhow!("direct accept: {e}"))?;
                Ok(QuinnTransport::new(conn))
            });
            direct_ep = ep;
        }
        Role::Dialer => {
            let ep = match shared_endpoint.clone() {
                Some(ep) => {
                    tracing::info!(
                        local_addr = ?ep.local_addr().ok(),
                        candidates = ?peer_candidates.dial_order(),
                        "dual_path: D-role reusing shared endpoint to dial peer candidates"
                    );
                    ep
                }
                None => {
                    let bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
                    let fresh = wzp_transport::create_endpoint(bind, None)?;
                    tracing::info!(
                        local_addr = ?fresh.local_addr().ok(),
                        candidates = ?peer_candidates.dial_order(),
                        "dual_path: D-role fresh endpoint up, dialing peer candidates"
                    );
                    fresh
                }
            };
            let ep_for_fut = ep.clone();
            let dial_order = peer_candidates.dial_order();
            let sni = call_sni.clone();
            direct_fut = Box::pin(async move {
                if dial_order.is_empty() {
                    // No candidates — the race reduces to
                    // relay-only. Park on a future that never
                    // resolves so the outer select's relay
                    // branch wins naturally, without a spurious
                    // "direct failed" warning.
                    std::future::pending::<anyhow::Result<QuinnTransport>>().await
                } else {
                    // Fan out N parallel dials via JoinSet. First
                    // `Ok` wins; `Err` from a single candidate is
                    // not fatal — we wait for the others. Only
                    // when ALL have failed do we return Err.
                    let mut set = tokio::task::JoinSet::new();
                    for (idx, candidate) in dial_order.iter().enumerate() {
                        let ep = ep_for_fut.clone();
                        let client_cfg = wzp_transport::client_config();
                        let sni = sni.clone();
                        let candidate = *candidate;
                        set.spawn(async move {
                            let result = wzp_transport::connect(
                                &ep,
                                candidate,
                                &sni,
                                client_cfg,
                            )
                            .await;
                            (idx, candidate, result)
                        });
                    }
                    let mut last_err: Option<String> = None;
                    while let Some(join_res) = set.join_next().await {
                        let (idx, candidate, dial_res) = match join_res {
                            Ok(t) => t,
                            Err(e) => {
                                last_err = Some(format!("join {e}"));
                                continue;
                            }
                        };
                        match dial_res {
                            Ok(conn) => {
                                tracing::info!(
                                    %candidate,
                                    candidate_idx = idx,
                                    "dual_path: direct dial succeeded on candidate"
                                );
                                // Abort the remaining in-flight
                                // dials so they don't complete
                                // and leak QUIC sessions.
                                set.abort_all();
                                return Ok(QuinnTransport::new(conn));
                            }
                            Err(e) => {
                                tracing::debug!(
                                    %candidate,
                                    candidate_idx = idx,
                                    error = %e,
                                    "dual_path: direct dial failed, trying others"
                                );
                                last_err = Some(format!("candidate {candidate}: {e}"));
                            }
                        }
                    }
                    Err(anyhow::anyhow!(
                        "all {} direct candidates failed; last: {}",
                        dial_order.len(),
                        last_err.unwrap_or_else(|| "n/a".into())
                    ))
                }
            });
            direct_ep = ep;
        }
    }

    // Relay path: classic dial to the relay's media room. Phase 5:
    // reuse the shared endpoint here too so MikroTik-style NATs
    // keep a stable external port across all flows from this
    // client. Falls back to a fresh endpoint when not shared.
    let relay_ep = match shared_endpoint.clone() {
        Some(ep) => ep,
        None => {
            let relay_bind: SocketAddr = "0.0.0.0:0".parse().unwrap();
            wzp_transport::create_endpoint(relay_bind, None)?
        }
    };
    let relay_ep_for_fut = relay_ep.clone();
    let relay_client_cfg = wzp_transport::client_config();
    let relay_sni = room_sni.clone();
    let relay_fut = async move {
        let conn =
            wzp_transport::connect(&relay_ep_for_fut, relay_addr, &relay_sni, relay_client_cfg)
                .await
                .map_err(|e| anyhow::anyhow!("relay dial: {e}"))?;
        Ok::<_, anyhow::Error>(QuinnTransport::new(conn))
    };

    // Race the two with a shared 2s ceiling on the direct attempt.
    // Pin both so we can poll them from multiple branches of the
    // select without moving the futures — the "direct failed, wait
    // for relay" and "relay failed, wait for direct" fallback paths
    // below need to await the OPPOSITE future after the winning
    // branch fires. Without pinning, tokio::select! moves the
    // future out and we can't touch it again.
    tracing::info!(
        ?role,
        candidates = ?peer_candidates.dial_order(),
        %relay_addr,
        "dual_path: racing direct vs relay"
    );
    let direct_timed = tokio::time::timeout(Duration::from_secs(2), direct_fut);
    tokio::pin!(direct_timed, relay_fut);

    let result = tokio::select! {
        biased; // prefer direct win if both arrive in the same tick
        direct_result = &mut direct_timed => {
            match direct_result {
                Ok(Ok(transport)) => {
                    tracing::info!("dual_path: direct WON");
                    Ok((Arc::new(transport), WinningPath::Direct))
                }
                Ok(Err(e)) => {
                    // Direct failed — fall back to waiting for relay.
                    tracing::warn!(error = %e, "dual_path: direct failed, awaiting relay");
                    match tokio::time::timeout(Duration::from_secs(5), &mut relay_fut).await {
                        Ok(Ok(transport)) => Ok((Arc::new(transport), WinningPath::Relay)),
                        Ok(Err(e2)) => Err(anyhow::anyhow!("both paths failed: direct={e}, relay={e2}")),
                        Err(_) => Err(anyhow::anyhow!("both paths failed: direct={e}, relay=timeout(5s)")),
                    }
                }
                Err(_elapsed) => {
                    tracing::warn!("dual_path: direct timed out (2s), awaiting relay");
                    match tokio::time::timeout(Duration::from_secs(5), &mut relay_fut).await {
                        Ok(Ok(transport)) => Ok((Arc::new(transport), WinningPath::Relay)),
                        Ok(Err(e2)) => Err(anyhow::anyhow!("direct timeout + relay failed: {e2}")),
                        Err(_) => Err(anyhow::anyhow!("direct timeout + relay timeout")),
                    }
                }
            }
        }
        relay_result = &mut relay_fut => {
            match relay_result {
                Ok(transport) => {
                    tracing::info!("dual_path: relay WON (direct still pending)");
                    Ok((Arc::new(transport), WinningPath::Relay))
                }
                Err(e) => {
                    tracing::warn!(error = %e, "dual_path: relay failed, awaiting direct remainder");
                    match tokio::time::timeout(Duration::from_millis(1500), &mut direct_timed).await {
                        Ok(Ok(Ok(transport))) => Ok((Arc::new(transport), WinningPath::Direct)),
                        _ => Err(anyhow::anyhow!("relay failed + direct unavailable: {e}")),
                    }
                }
            }
        }
    };

    // Let both endpoint clones drop at end-of-scope. With the
    // Phase 5 shared-endpoint path, these clones are Arc<Endpoint>
    // clones of the signal endpoint — dropping them just decrements
    // the ref count, the socket stays alive for the signal loop +
    // any further direct-P2P attempts. With the fresh-endpoint
    // fallback, the drops are the last refs so the sockets close
    // promptly. Either way the winning transport already owns its
    // own quinn::Connection reference which is independent of the
    // Endpoint lifetime.
    let _ = (direct_ep, relay_ep);

    result
}
```