Files
wz-phone/docs/bugs/001-android-join-voice-hang.md
Siavash Sameni 2aa6582585 fix(android): call-debug instrumentation for audio startup path
Add emit_call_debug events at every step of the Android connect/audio
path so failures are visible in the Settings debug log without needing
adb logcat:

- connect:handshake_start/done/failed (with timing)
- connect:android_audio_preflight (wzp_native loaded + RECORD_AUDIO
  permission check via new has_record_audio_permission() JNI helper)
- connect:audio_stop_start/done
- connect:audio_mode_start/done/failed
- connect:audio_start_start/failed/panic/done (with oboe error code)
- connect:reuse_endpoint (endpoint reuse diagnostic)

Also adds has_record_audio_permission() to android_audio.rs — used in
the preflight event to confirm the OS has granted mic access before
wzp_oboe_start is called.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 07:38:38 +04:00

11 KiB
Raw Permalink Blame History

BUG-001: Android "Connecting…" Hangs / Join Voice Never Completes

Severity: P0 — renders the app non-functional for room joins on a fresh install Status: Partially mitigated (5a13f12), narrowed by static review; Android repro/logcat still needed Branch: experimental-ui Last investigated: 2026-05-25 Device confirmed affected: Nothing Phone A059 (Android 15)


Symptom

User taps "Join Voice". Button changes to "Connecting…" and stays there indefinitely. No error toast, no drawer, no progress. The only recovery is force-quitting the app.

2026-05-25 Static Review Update

The exact indefinite "Connecting…" symptom most likely came from an APK older than 5a13f12, because current desktop/src/main.ts has a 15s JS-side timeout for manual room joins. The current branch can still produce closely related failures:

  1. Native Oboe start can report false success when Android leaves capture/playout in Starting for 2s. That manifests as "joined but silent/dead audio", not a true JS hang.
  2. First-run microphone permission can still race the first openStream(Direction::Input), especially when the user joins immediately after granting permission.
  3. Direct-call auto-connect did not have the 15s JS timeout even after 5a13f12.
  4. Toasts used ${e}, so object-shaped Tauri errors could appear as [object Object].

Working-tree diagnostic changes applied during this investigation:

  • crates/wzp-native/cpp/oboe_bridge.cpp: return -6 if both streams do not reach Started before the 2s poll deadline. This turns Oboe false-success into a visible Rust/JS error.
  • desktop/src/main.ts: shared connectWithTimeout() for room joins and direct-call auto-connect; shared errorMessage() for useful toast text.
  • desktop/src-tauri/src/engine.rs: emit connect:handshake_*, connect:android_audio_preflight, connect:audio_* markers around each Android-only join step.
  • desktop/src-tauri/src/lib.rs: emit connect:reuse_endpoint so we can see whether the room join is sharing the signal QUIC endpoint.

Next Android repro should distinguish:

Toast / log Meaning
Join failed: wzp_native_audio_start failed: code -2 mic permission / capture open failure
Join failed: wzp_native_audio_start failed: code -6 Oboe streams opened/requested start, but HAL never transitioned both to Started
Join failed: transport: timeout after 10000ms or similar after connect:handshake_start QUIC connected, but relay media handshake did not return CallAnswer
Join failed: connect timed out (15s) - check audio permissions Tauri command did not resolve to JS; collect Rust/Tauri logs around connect:call_engine_starting

Root Cause Chain

The invoke("connect") Tauri command runs the full CallEngine::start coroutine on Android. Execution order:

  1. Parse relay address → QUIC dial → crypto handshake (~200ms, works — relay logs confirm room join succeeds)
  2. audio_stop() (no-op on first launch)
  3. tokio::time::sleep(50ms)
  4. set_audio_mode_communication() (JNI into Kotlin)
  5. tokio::task::spawn_blocking(crate::wzp_native::audio_start) ← primary hang point

audio_start calls wzp_oboe_start() (C++ FFI in crates/wzp-native/cpp/oboe_bridge.cpp), which:

  • Opens capture stream (captureBuilder.openStream)
  • Opens playout stream (playoutBuilder.openStream)
  • g_capture_stream->requestStart()
  • g_playout_stream->requestStart()
  • Polls up to 2 seconds in a std::this_thread::sleep_for(10ms) busy-wait loop waiting for both streams to reach Started state (oboe_bridge.cpp:404423)

Before the working-tree -6 diagnostic change, if the HAL never transitioned to Started, wzp_oboe_start returned 0 (success!) after the 2s timeout even though streams were not functional. Rust saw ret == 0, considered it success, and CallEngine::start returned Ok.

The invoke("connect") promise resolves successfully, enterVoice(false) is called, the voice drawer appears — but audio streams are dead. The send task reads silence, the playout ring never drains.

However, relay log evidence shows the connection is established and then dropped 166ms later with forwarded=0, which means CallEngine::start did return to the connect command. If the user still sees "Connecting…" at that point, the JS await connectRace is not resolving — suggesting either the Rust command returned an error (which should show as a toast) or the invoke promise is hanging for a different reason.


Evidence

Relay log (pangolin, session at 06:40:04 UTC):

room "general" join accepted
crypto handshake complete  t=+184ms
connection dropped          t=+350ms  forwarded=0

The relay sees a clean connection that self-terminates in ~350ms total. forwarded=0 means no media was exchanged. Consistent with audio_start failing or the call task throwing before media loops start.

Four rapid connects at 06:40:04 in the relay log suggest multiple taps (no connectPending guard in the APK installed at that time, or user was on an older build).


Fixes Applied in 5a13f12

# Problem Fix File
1 wzp_oboe_start called directly on tokio worker thread → froze entire runtime including timeouts Changed to spawn_blocking desktop/src-tauri/src/engine.rs:609
2 No JS-side timeout → "Connecting…" hangs forever if Rust never returns Added 15s Promise.race desktop/src/main.ts:338
3 No error feedback to user Added showToast() in catch block desktop/src/main.ts:352
4 Button disappeared on click Changed to disabled + "Connecting…" text desktop/src/main.ts:335
5 Handshake could hang forever waiting for CallAnswer Added 10s tokio::time::timeout crates/wzp-client/src/handshake.rs:105

Open Issues (Not Yet Fixed)

Issue A: g_running flag race between audio_stop and audio_start

Current status: likely fixed in current branch. crates/wzp-native/cpp/oboe_bridge.cpp:430 now clears g_running at the top of wzp_oboe_stop.

oboe_bridge.cpp:244 checks g_running.load() at entry to wzp_oboe_start. The engine calls audio_stop() then waits 50ms then calls audio_start(). If wzp_oboe_stop does not synchronously clear g_running before returning, the next wzp_oboe_start sees g_running == true and returns -1 immediately (line 246247).

With 5a13f12, Rust now propagates this as "wzp_native_audio_start failed: code -1" → toast. Confirm via logcat.

Issue B: Mic permission granted at runtime causes audio HAL delay

After clearing app data, Android prompts for mic permission. The OS grants it but the audio HAL may not immediately honor it. The first openStream(Direction::Input) within ~1s of permission grant can fail with ErrorPermissionDenied → Oboe returns -2.

With 5a13f12 this should surface as toast: "Join failed: wzp_native_audio_start failed: code -2".

Issue C: wzp_oboe_start 2s poll timeout returns 0 (false success)

oboe_bridge.cpp:404423: if streams don't reach Started state within 2s, the poll loop exits with no error — wzp_oboe_start returns 0. Rust treats this as success. The drawer appears but audio is dead. This is the "joined but silent" failure mode, distinct from "stuck on Connecting…".

Fix: return a distinct error code (e.g. -6) from wzp_oboe_start when the poll times out without both streams reaching Started.

Working-tree status: implemented as -6; needs Android NDK/device validation.

Issue D: Error object serialization in JS toast

The connect command returns Result<String, String>. Tauri wraps the Err as a JS exception. If e in the catch block is a Tauri error object rather than a plain string, ${e} renders as "[object Object]". Should use e?.message ?? String(e) for robust stringification.

Working-tree status: implemented via errorMessage(e).


wzp_oboe_start Return Codes Reference

Code Meaning
0 Success
-1 Already running (g_running == true at entry)
-2 captureBuilder.openStream failed
-3 playoutBuilder.openStream failed
-4 g_capture_stream->requestStart() failed
-5 g_playout_stream->requestStart() failed
-6 streams failed to reach Started before poll timeout

Reproduction Steps

  1. Fresh install (or clear app data) on Nothing Phone A059
  2. Grant microphone permission when prompted
  3. Configure relay 193.180.213.68:4433, room general
  4. Tap "Join Voice"
  5. Observe: button shows "Connecting…" indefinitely

Diagnostic Steps

We have never captured adb logcat from a failing connect. This is the single highest-value diagnostic:

adb logcat -s "wzp-native" "wzp-desktop" "RustStd" | grep -E "audio|oboe|start|handshake|connect"

Key log lines to look for:

Log line Diagnosis
connect:reuse_endpoint Whether media is sharing the existing signal endpoint
connect:handshake_start followed by 10s timeout Relay media handshake is stuck before Android audio starts
connect:handshake_done Network/relay handshake succeeded; continue to audio diagnostics
connect:android_audio_preflight Shows wzp-native load state and RECORD_AUDIO permission
connect:audio_start_start with no done/failed Native Oboe call is hanging
wzp_oboe_start: already running Issue A — g_running not cleared
Failed to open capture stream: ErrorPermissionDenied Issue B — mic permission delay
Failed to start capture / Failed to start playout Oboe HAL error, code -4 or -5
both streams Started after N polls audio_start succeeded
audio_start task panic spawn_blocking panic (shouldn't happen)
wzp_native_audio_start failed: code X Rust caught it, toast should be visible

Alternatively: enable Call debug logs in Settings, reproduce, use the share button to extract logs without USB.


Proposed Fixes (Prioritized)

  1. Validate -6 from wzp_oboe_start on poll timeout on Android builder/device — eliminates silent false-success
  2. Add mic permission pre-check in Kotlin before calling into Rust — surface a cleaner error if permission is not yet effective
  3. If -6 reproduces on Nothing A059, test startup sequencing: request/start capture before MODE_IN_COMMUNICATION, add a short post-permission delay, or retry once after a full wzp_oboe_stop

  • crates/wzp-native/cpp/oboe_bridge.cppwzp_oboe_start implementation
  • crates/wzp-native/src/lib.rs:238audio_start_inner (Rust FFI wrapper)
  • desktop/src-tauri/src/engine.rs:576635CallEngine::start audio section
  • desktop/src/main.ts:328360joinVoiceBtn click handler
  • crates/wzp-client/src/handshake.rs:105 — handshake timeout