1618ff6c9d6402ec5fb693d80aa79bfb93df2210
2 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
1618ff6c9d |
feat(p2p): Phase 5 — single-socket architecture (Nebula-style)
Before Phase 5 WarzonePhone used THREE separate UDP sockets per
client:
1. Signal endpoint (register_signal, client-only)
2. Reflect probe endpoints (one fresh socket per relay probe)
3. Dual-path race endpoint (fresh per call setup)
This broke two things in production on port-preserving NATs
(MikroTik masquerade, most consumer routers):
a. Phase 2 NAT detection was WRONG. Each probe used a fresh
internal port, so MikroTik mapped each one to a different
external port, and the classifier saw "different port per
relay" and labeled it SymmetricPort. The real NAT was
cone-like but measurement via fresh sockets hid that.
b. Phase 3.5 dual-path P2P race was BROKEN. The reflex addr
we advertised in DirectCallOffer was observed by the signal
endpoint's socket. The actual dual-path race listened on a
DIFFERENT fresh socket, on a different internal (and
therefore external) port. Peers dialed the advertised addr
and hit MikroTik's mapping for the signal socket, which
forwarded to the signal endpoint — a client-only endpoint
that doesn't accept incoming connections. Direct path
silently failed, relay always won the race.
Nebula-style fix: one socket for everything. The signal endpoint
is now dual-purpose (client + server_config), and both the
reflect probes and the dual-path race reuse it instead of
creating fresh ones. MikroTik's port-preservation then gives us
a stable external port across all flows → classifier correctly
sees Cone NAT → advertised reflex addr is the actual listening
port → direct dials from peers land on the right socket →
`endpoint.accept()` in the A-role branch of the dual-path race
picks up the incoming connection.
## Changes
### `register_signal` (desktop/src-tauri/src/lib.rs)
- Endpoint now created with `Some(server_config())` instead of
`None`. The socket can now accept incoming QUIC connections as
well as dial outbound.
- Every code path that previously read `sig.endpoint` for the
relay-dial reuse benefits automatically — same socket is now
ALSO listening for peer dials.
### `probe_reflect_addr` (wzp-client/src/reflect.rs)
- New `existing_endpoint: Option<Endpoint>` arg. `Some` reuses
the caller's socket (production: pass the signal endpoint).
`None` creates a fresh one (tests + pre-registration).
- Removed the `drop(endpoint)` at the end — was correct for
fresh endpoints (explicit early socket close) but incorrect
for shared ones. End-of-scope drop does the right thing in
both cases via Arc semantics.
### `detect_nat_type` (wzp-client/src/reflect.rs)
- New `shared_endpoint: Option<Endpoint>` arg, forwarded to
every probe in the JoinSet fan-out. One shared socket means
the classifier sees the true NAT type.
### `detect_nat_type` Tauri command (desktop/src-tauri/src/lib.rs)
- Reads `state.signal.endpoint` and passes it as the shared
endpoint. Falls back to None when not registered. NAT detection
now produces accurate classifications against MikroTik / most
consumer NATs.
### `dual_path::race` (wzp-client/src/dual_path.rs)
- New `shared_endpoint: Option<Endpoint>` arg.
- A-role: when `Some`, reuses it for `accept()`. This is the
critical change — the reflex addr advertised to peers is now
the address listening for incoming direct dials.
- D-role: when `Some`, reuses it for the outbound direct dial.
MikroTik keeps the same external port for the dial as for
the signal flow → direct dial through a cone-mapped NAT.
- Relay path: also reuses the shared endpoint so MikroTik has
a single consistent mapping across the whole call (saves one
extra external port and makes firewall traces cleaner).
- When `None`, falls back to fresh per-role endpoints as before.
### `connect` Tauri command (desktop/src-tauri/src/lib.rs)
- Reads `state.signal.endpoint` once when acquiring own reflex
addr and passes it through to `dual_path::race`.
### Tests
- `wzp-client/tests/dual_path.rs` and
`wzp-relay/tests/multi_reflect.rs` updated to pass `None` for
the new endpoint arg — tests use fresh sockets and that's
fine because the loopback harness doesn't care about
port-preserving NAT behavior.
Full workspace test: 423 passing (no regressions).
## Expected behavior after this commit on real hardware
Behind MikroTik + Starlink-bypass (the reporter's setup):
- Phase 2 NAT detect → **Cone NAT** (was SymmetricPort — false
positive from the measurement artifact)
- Phase 3.5 direct-P2P dial → succeeds for both cone-cone and
cone-CGNAT cases where the remote side was previously blocked
by our own socket mismatch
- LTE ↔ LTE cross-carrier → still likely relay fallback; that's
genuinely strict symmetric and needs Phase 5.5 port prediction.
## Phase 5.5 (next, separate PRD)
Multi-candidate port prediction + ICE-style candidate aggregation
for truly strict symmetric NATs. Not needed for the 95% case —
Phase 5 alone fixes most consumer-router setups.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
59ce52f8e8 |
feat(p2p): Phase 3.5 dual-path QUIC race + GUI call-flow debug logs
Two features in one commit because they ship and test together:
Phase 3.5 closes the hole-punching loop and the call-flow debug
logs give the user live visibility into every step of a call so
real-hardware testing of the new P2P path is debuggable.
## Phase 3.5 — dual-path QUIC connect race
Completes the hole-punching work Phase 3 scaffolded. On receiving
a CallSetup with peer_direct_addr, the client now actually races a
direct QUIC handshake against the relay dial and uses whichever
completes first. Symmetric role assignment avoids the two-conns-
per-call problem:
- Both peers compare `own_reflex_addr` vs `peer_reflex_addr`
lexicographically.
- Smaller addr → **Acceptor** (A-role): builds a server-capable
dual endpoint, awaits an incoming QUIC session. Does NOT dial.
- Larger addr → **Dialer** (D-role): builds a client-only
endpoint, dials the peer's addr with `call-<id>` SNI. Does NOT
listen.
- Both sides always dial the relay in parallel as fallback.
- `tokio::select!` with `biased` preference for direct, `tokio::pin!`
so each branch can await the losing opposite as fallback.
- Direct timeout 2s, relay fallback timeout 5s (so 7s worst case
from CallSetup to "no media path" error).
New crate module `wzp_client::dual_path::{race, WinningPath}`
(moved here from desktop/src-tauri so it's testable from a
workspace test). `determine_role` in `wzp_client::reflect` is
pure-function and unit-tested.
### CallEngine integration
- New `pre_connected_transport: Option<Arc<QuinnTransport>>` arg
on both android + desktop `CallEngine::start` branches. Skips
the internal wzp_transport::connect step when Some. Backward-
compat: None keeps Phase 0 relay-only behavior.
- `connect` Tauri command reads own_reflex_addr from SignalState,
computes role, runs the race, passes the winning transport
into CallEngine. If ANY input is missing (no peer addr, no own
addr, equal addrs), falls back to classic relay path —
identical to pre-Phase-3.5 behavior.
### Tests (9 new, all passing)
- 6 unit tests for `determine_role` truth table in
`wzp-client/src/reflect.rs` (smaller=Acceptor, larger=Dialer,
port-only diff, equal, missing-side, symmetry)
- 3 integration tests in `crates/wzp-client/tests/dual_path.rs`:
* `dual_path_direct_wins_on_loopback` — two-endpoint test
rig, Dialer wins direct path vs loopback mock relay
* `dual_path_relay_wins_when_direct_is_dead` — dead peer
port, 2s direct timeout, relay fallback wins
* `dual_path_errors_cleanly_when_both_paths_dead` — <10s
error, no hang
## GUI call-flow debug logs
Runtime-toggled structured events at every step of a call so the
user can see where a call progressed or stalled on real hardware.
Modeled on the existing DRED_VERBOSE_LOGS pattern.
### Rust side
- `static CALL_DEBUG_LOGS: AtomicBool` + `emit_call_debug(&app,
step, details)` helper. Always logs via `tracing::info!`
(logcat always has a copy); GUI Tauri `call-debug-log` event
only fires when the flag is on.
- Tauri commands `set_call_debug_logs` / `get_call_debug_logs`.
### Instrumented steps (24 emit_call_debug sites)
- `register_signal`: start, identity loaded, endpoint created,
connect failed/ok, RegisterPresence sent, ack received/failed,
recv loop spawning
- Recv loop: CallRinging, DirectCallOffer (w/ caller_reflexive_addr),
DirectCallAnswer (w/ callee_reflexive_addr), CallSetup (w/
peer_direct_addr), Hangup
- `place_call`: start, reflect query start/ok/none, offer sent,
send failed
- `answer_call`: start, reflect query start/ok/none or privacy
skip, answer sent, send failed
- `connect`: start, dual_path_race_start (w/ role), won (w/
path), failed, skipped (w/ reasons), call_engine_starting/
started/failed
### JS side
- New `callDebugLogs: boolean` field on Settings type.
- Boot-time hydrate of the Rust flag from localStorage so the
choice survives restarts (like `dredDebugLogs`).
- Settings panel: new "Call flow debug logs" checkbox alongside
the DRED toggle.
- New "Call Debug Log" section that ONLY shows when the flag is
on. Rolling in-memory buffer of the last 200 events, rendered
as monospace `HH:MM:SS.mmm step {details}` lines with auto-
scroll and a Clear button.
- `listen("call-debug-log", ...)` subscribed at app startup,
appends to the buffer, re-renders on every event.
Full workspace test goes from 404 → 413 passing. Clippy clean
on touched crates.
PRD: .taskmaster/docs/prd_phase35_dual_path_race.txt
Tasks: 61-69 all completed
Next: APK + desktop build carrying everything — Phase 2 NAT
detect, Phase 3 advertising, Phase 3.5 dual-path + call debug
logs, plus the earlier Android first-join diagnostics — so the
user can validate the P2P path on real hardware with live
per-step visibility into where any failures happen.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|