From 6597b5bd8612ad6ab4c5da8fea1d8c28ec1e5446 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 6 Apr 2026 09:21:35 +0000 Subject: [PATCH] docs: incident report + fix spec for capture thread use-after-free crash MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit SIGSEGV on hangup: capture thread calls writeAudio() via JNI after teardown() has freed the native engine handle. TOCTOU race between the nativeHandle==0L check and destroy() on the ViewModel thread. Fix: CountDownLatch(2) — audio threads count down after exiting loops, teardown() awaits before destroy(). 2 Kotlin files, no Rust changes. Co-Authored-By: Claude Opus 4.6 (1M context) --- ...026-04-06-capture-thread-use-after-free.md | 175 ++++++++++++++++++ docs/android/fix-capture-thread-crash.md | 149 +++++++++++++++ 2 files changed, 324 insertions(+) create mode 100644 debug/INCIDENT-2026-04-06-capture-thread-use-after-free.md create mode 100644 docs/android/fix-capture-thread-crash.md diff --git a/debug/INCIDENT-2026-04-06-capture-thread-use-after-free.md b/debug/INCIDENT-2026-04-06-capture-thread-use-after-free.md new file mode 100644 index 0000000..0c4162f --- /dev/null +++ b/debug/INCIDENT-2026-04-06-capture-thread-use-after-free.md @@ -0,0 +1,175 @@ +# Incident Report: Native Crash in Capture Thread — Use-After-Free on Engine Handle + +**Date:** 2026-04-06 +**Severity:** Critical — app crash (SIGSEGV) on call hangup +**Status:** Root-caused, fix pending +**Affects:** Android client only + +## Summary + +The app crashes with a native SIGSEGV during or shortly after call hangup. The crash occurs in JIT-compiled code inside `AudioPipeline.runCapture()`. The root cause is a use-after-free: the capture thread calls `engine.writeAudio()` via JNI after the engine's native handle has been freed by `teardown()` on the ViewModel thread. + +## Crash Stacktrace + +``` +04-06 13:05:42.707 F DEBUG: #09 pc 000000000250696c /memfd:jit-cache (deleted) (com.wzp.audio.AudioPipeline.runCapture+3228) +04-06 13:05:42.707 F DEBUG: #14 pc 0000000000005270 (com.wzp.audio.AudioPipeline.start$lambda$0+0) +04-06 13:05:42.708 F DEBUG: #19 pc 00000000000044cc (com.wzp.audio.AudioPipeline.$r8$lambda$0rYcivupwvyN4SgBXhsroKmTlo8+0) +04-06 13:05:42.708 F DEBUG: #24 pc 00000000000042e4 (com.wzp.audio.AudioPipeline$$ExternalSyntheticLambda0.run+0) +``` + +This is a tombstone (signal crash), not a Java exception. The `F DEBUG` tag indicates a native crash handler (debuggerd) captured the signal. + +## Root Cause + +### The Race Condition + +Two threads operate on the engine concurrently without synchronization: + +**Thread 1: `wzp-capture` (AudioRecord thread, MAX_PRIORITY)** +```kotlin +// AudioPipeline.runCapture() — runs in a tight loop +while (running) { + val read = recorder.read(pcm, 0, FRAME_SAMPLES) + if (read > 0) { + engine.writeAudio(pcm) // <-- JNI call to native engine + } +} +``` + +**Thread 2: ViewModel/UI thread (normal priority)** +```kotlin +// CallViewModel.teardown() +stopAudio() // sets AudioPipeline.running = false +engine?.stopCall() // tells Rust to stop +engine?.destroy() // frees native memory, sets nativeHandle = 0L +engine = null +``` + +### The Kotlin Guard is Insufficient + +`WzpEngine.writeAudio()` has a guard: +```kotlin +fun writeAudio(pcm: ShortArray): Int { + if (nativeHandle == 0L) return 0 // check + return nativeWriteAudio(nativeHandle, pcm) // use +} +``` + +This is a **TOCTOU (time-of-check/time-of-use) race**: +1. Capture thread checks `nativeHandle != 0L` → true +2. ViewModel thread calls `destroy()`, which calls `nativeDestroy(handle)` then sets `nativeHandle = 0L` +3. Capture thread calls `nativeWriteAudio(handle, pcm)` with the now-freed handle +4. The JNI function dereferences `handle` as a pointer → **SIGSEGV** + +The same race exists for `readAudio()` on the `wzp-playout` thread. + +### Why `stopAudio()` Doesn't Prevent This + +`AudioPipeline.stop()` sets `running = false` but does **NOT join or wait** for the threads: +```kotlin +fun stop() { + running = false + // Don't join — threads are parked as daemons to avoid native TLS crash + captureThread = null + playoutThread = null +} +``` + +The threads are intentionally not joined because of a separate bug: exiting a JNI-calling thread triggers a `SIGSEGV in OPENSSL_free` due to libcrypto TLS destructors on Android. The threads instead "park" with `Thread.sleep(Long.MAX_VALUE)` after the loop exits. + +But the problem is the **window between `running = false` and the thread actually checking it**. The capture thread may be blocked in `recorder.read()` (which blocks for 20ms per frame) or in the middle of `engine.writeAudio()` when `destroy()` is called. + +### Timeline of the Crash + +``` +T=0ms ViewModel: stopAudio() → sets running=false +T=0ms ViewModel: stopStatsPolling() +T=0ms ViewModel: engine.stopCall() — Rust stops internal tasks +T=1ms ViewModel: engine.destroy() — frees native memory + ↑ nativeHandle = 0L + +T=0-20ms Capture thread: still in recorder.read() or writeAudio() + → if in writeAudio(), the nativeHandle check passed BEFORE destroy() + → JNI dereferences freed pointer → SIGSEGV +``` + +## Affected Code + +### Files with the race + +| File | Line(s) | Issue | +|------|---------|-------| +| `android/.../WzpEngine.kt` | 107-108, 116-117 | TOCTOU on `nativeHandle` in `writeAudio()` / `readAudio()` | +| `android/.../CallViewModel.kt` | 257-262 | `stopAudio()` + `destroy()` without waiting for audio threads to quiesce | +| `android/.../AudioPipeline.kt` | 80-82 | `stop()` doesn't synchronize with running threads | + +### Files with the thread parking workaround + +| File | Line(s) | Context | +|------|---------|---------| +| `android/.../AudioPipeline.kt` | 57-58, 69-70 | Threads parked after loop exit to avoid libcrypto TLS crash | +| `android/.../AudioPipeline.kt` | 96-101 | `parkThread()` — `Thread.sleep(Long.MAX_VALUE)` | + +## Constraints for the Fix + +1. **Cannot join audio threads** — joining triggers a separate SIGSEGV in `OPENSSL_free` when the thread's TLS destructors fire (documented in `AudioPipeline.kt` comments). The parking workaround must be preserved. + +2. **Must guarantee no JNI calls after `destroy()`** — the native handle is a raw pointer; any dereference after free is undefined behavior. + +3. **Must not add blocking waits on the UI thread** — `teardown()` runs on the ViewModel thread which must remain responsive. + +4. **The `@Volatile running` flag is necessary but not sufficient** — it prevents new loop iterations but doesn't help with in-flight JNI calls. + +5. **Both `writeAudio` and `readAudio` have the same race** — the fix must cover both the capture and playout paths. + +## Reproduction + +The crash is timing-dependent. It's most likely to occur when: +- The capture thread is in the middle of a `writeAudio()` JNI call when `destroy()` is called +- More likely on slower devices or under CPU pressure (GC, thermal throttling) +- Can happen on every hangup, but only crashes ~10-30% of the time due to the timing window + +## Analysis of Possible Fix Approaches + +### Approach A: Add a synchronization gate in the JNI bridge + +Use a `ReentrantReadWriteLock` or `AtomicBoolean` in `WzpEngine.kt`: +- Audio threads acquire a read lock / check the flag before JNI calls +- `destroy()` acquires a write lock / sets the flag and waits for in-flight calls to drain + +**Pro:** Clean, solves the race directly. +**Con:** Adding a lock to the audio hot path (every 20ms). `ReentrantReadWriteLock` is not lock-free. However, the read-lock path is uncontended 99.99% of the time (write-lock only during destroy), so contention is negligible. + +### Approach B: Defer `destroy()` until audio threads have stopped + +Instead of calling `destroy()` in `teardown()`, set a flag and have the audio threads call `destroy()` after they exit the loop (before parking). + +**Pro:** No locks on hot path. +**Con:** Complex lifecycle — which thread calls destroy? What if both threads race to destroy? Need a `CountDownLatch` or similar. + +### Approach C: Make the JNI handle atomically invalidated + +Use `AtomicLong` for `nativeHandle` and use `compareAndExchange` in `destroy()` + `getAndCheck` pattern in audio calls. + +**Pro:** Lock-free. +**Con:** Still has a TOCTOU window — the thread can load the handle, then it gets CAS'd to 0, then the thread uses the stale handle. Doesn't fully solve the race without combining with a reference count or epoch. + +### Approach D: Introduce a destroy latch + +Add a `CountDownLatch(1)` that audio threads wait on before parking. `teardown()` sets `running=false`, then `await`s the latch (with timeout), then calls `destroy()`. Each audio thread counts down the latch after exiting the loop. + +Actually this needs a `CountDownLatch(2)` — one for each thread (capture + playout). + +**Pro:** Guarantees no in-flight JNI calls at destroy time. No locks on hot path. +**Con:** `teardown()` blocks for up to one frame duration (~20ms) waiting for threads to exit their loops. Acceptable for a hangup path. + +### Recommendation + +**Approach D (destroy latch)** is the cleanest. The 20ms worst-case wait is imperceptible on the hangup path, and it provides a hard guarantee that no JNI calls are in flight when `destroy()` runs. Combined with the existing `running` volatile flag, the audio threads exit their loops within one frame and count down the latch. + +If the latch times out (e.g., AudioRecord.read() is stuck), `destroy()` proceeds anyway — the `panic::catch_unwind` in the JNI bridge will catch the invalid access as a panic rather than a SIGSEGV (though this is best-effort; a true SIGSEGV from freed memory is not catchable). + +## Data Files + +The crash was captured from the Nothing A059 device at 13:05:42 on 2026-04-06. The tombstone is in the device's `/data/tombstones/` directory. The logcat output shows the crash frames. diff --git a/docs/android/fix-capture-thread-crash.md b/docs/android/fix-capture-thread-crash.md new file mode 100644 index 0000000..7f2926d --- /dev/null +++ b/docs/android/fix-capture-thread-crash.md @@ -0,0 +1,149 @@ +# Fix: Capture/Playout Thread Use-After-Free on Hangup + +## Problem + +App crashes (SIGSEGV) when hanging up a call. The capture thread (`wzp-capture`) calls `engine.writeAudio()` via JNI after `teardown()` has freed the native engine handle. Same race exists for the playout thread's `readAudio()`. + +**Root cause:** TOCTOU race between the `nativeHandle == 0L` check in `WzpEngine.writeAudio()`/`readAudio()` and `destroy()` freeing the native memory on the ViewModel thread. Audio threads can't be joined (libcrypto TLS destructor crash), so there's no synchronization between `stopAudio()` and `destroy()`. + +**Full forensics:** `debug/INCIDENT-2026-04-06-capture-thread-use-after-free.md` + +--- + +## Solution: Destroy Latch + +Add a `CountDownLatch(2)` that both audio threads count down after exiting their loops. `teardown()` awaits the latch (with timeout) before calling `destroy()`, guaranteeing no in-flight JNI calls. + +--- + +## Implementation Steps + +### Step 1: Add a drain latch to `AudioPipeline` + +**File:** `android/app/src/main/java/com/wzp/audio/AudioPipeline.kt` + +Add a `CountDownLatch` field: + +```kotlin +import java.util.concurrent.CountDownLatch +import java.util.concurrent.TimeUnit + +class AudioPipeline(private val context: Context) { + // ... existing fields ... + + /** Latch counted down by each audio thread after exiting its loop. + * stop() does NOT wait on this — teardown waits via awaitDrain(). */ + private var drainLatch: CountDownLatch? = null +``` + +In `start()`, create the latch before spawning threads: + +```kotlin +fun start(engine: WzpEngine) { + if (running) return + running = true + drainLatch = CountDownLatch(2) // one for capture, one for playout + + captureThread = Thread({ + runCapture(engine) + drainLatch?.countDown() // signal: capture loop exited + parkThread() + }, "wzp-capture").apply { ... } + + playoutThread = Thread({ + runPlayout(engine) + drainLatch?.countDown() // signal: playout loop exited + parkThread() + }, "wzp-playout").apply { ... } + // ... +} +``` + +Add `awaitDrain()` — called by ViewModel before `destroy()`: + +```kotlin +/** Block until both audio threads have exited their loops (max 200ms). + * After this returns, no more JNI calls to the engine will be made. */ +fun awaitDrain(): Boolean { + return drainLatch?.await(200, TimeUnit.MILLISECONDS) ?: true +} +``` + +`stop()` remains unchanged (non-blocking, sets `running = false`). + +### Step 2: Update `CallViewModel.teardown()` to await drain + +**File:** `android/app/src/main/java/com/wzp/ui/call/CallViewModel.kt` + +Change teardown to wait for audio threads before destroying: + +```kotlin +private fun teardown(stopService: Boolean = true) { + Log.i(TAG, "teardown: stopping audio, stopService=$stopService") + val hadCall = audioStarted + CallService.onStopFromNotification = null + stopAudio() // sets running=false (non-blocking) + stopStatsPolling() + + // Wait for audio threads to exit their loops before destroying the engine. + // This guarantees no in-flight JNI calls to writeAudio/readAudio. + val drained = audioPipeline?.awaitDrain() ?: true + if (!drained) { + Log.w(TAG, "teardown: audio threads did not drain in time") + } + audioPipeline = null + + Log.i(TAG, "teardown: stopping engine") + try { engine?.stopCall() } catch (e: Exception) { Log.w(TAG, "stopCall err: $e") } + try { engine?.destroy() } catch (e: Exception) { Log.w(TAG, "destroy err: $e") } + engine = null + engineInitialized = false + // ... rest unchanged +} +``` + +**Key change:** `awaitDrain()` is called AFTER `stopAudio()` (which sets `running=false`) but BEFORE `engine?.destroy()`. The latch guarantees both threads have exited their `while(running)` loops and will never call `writeAudio`/`readAudio` again. + +Also move `audioPipeline = null` to after `awaitDrain()` to keep the reference alive for the latch call. + +### Step 3: Move `stopAudio()` pipeline nulling + +**File:** `android/app/src/main/java/com/wzp/ui/call/CallViewModel.kt` + +In `stopAudio()`, do NOT null out the pipeline — let `teardown()` handle it after drain: + +```kotlin +private fun stopAudio() { + if (!audioStarted) return + audioPipeline?.stop() // sets running=false + // DON'T null audioPipeline here — teardown() needs it for awaitDrain() + audioRouteManager?.unregister() + audioRouteManager?.setSpeaker(false) + _isSpeaker.value = false + audioStarted = false +} +``` + +--- + +## Files to Modify + +| File | What changes | +|------|-------------| +| `android/.../audio/AudioPipeline.kt` | Add `CountDownLatch`, `countDown()` in threads, `awaitDrain()` method | +| `android/.../ui/call/CallViewModel.kt` | `teardown()` calls `awaitDrain()` before `destroy()`; `stopAudio()` doesn't null pipeline | + +## What Does NOT Change + +- `WzpEngine.kt` — the `nativeHandle == 0L` guard stays as defense-in-depth +- `jni_bridge.rs` — `panic::catch_unwind` stays as last resort +- `AudioPipeline.stop()` — remains non-blocking +- Thread parking — still needed to avoid libcrypto TLS crash + +## Verification + +1. Build APK, install on test device +2. Make a call, hang up — verify no crash in logcat (`adb logcat -s AndroidRuntime:E DEBUG:F`) +3. Rapid call/hangup/call/hangup cycles — stress the teardown path +4. Check logcat for `teardown: audio threads did not drain in time` — should never appear under normal conditions +5. Verify debug report still works after hangup (latch doesn't interfere with report collection)