Files
wz-phone/docs/BRANCH-android-rewrite.md
Siavash Sameni 75bc72a884 docs: add BRANCH-android-rewrite.md and update ARCH/ADMIN/USER_GUIDE
Documents the android-rewrite branch story end-to-end:
- Why the Kotlin+JNI stack was abandoned (stack overflow, libcrypto
  TLS race, __init_tcb TCB leak, ring runtime reuse crash)
- The Tauri 2.x Mobile pivot that reuses the desktop codebase verbatim
- Android-specific pieces: wzp-native standalone cdylib loaded via
  libloading, android_audio.rs JVM routing, Oboe audio config quirks
- Build pipeline via build-tauri-android.sh + wzp-android-builder image
- Known quirks (API 34/36 coexistence, NDK path absolutes, etc.)

Also appends shared-doc sections (identical on both branches):
- ARCHITECTURE.md: "Audio Backend Architecture (Platform Matrix)"
  covering CPAL / VPIO / WASAPI / Oboe backends, selection matrix,
  the wzp-native cdylib rationale, and the vendored audiopus_sys fix.
- ADMINISTRATION.md: "Build Pipelines" with Docker images
  (wzp-android-builder, wzp-windows-builder), per-pipeline usage
  (Android APK, Linux x86_64, Windows .exe), the Hetzner Cloud
  alternative, ntfy/rustypaste integration, and credential locations.
- USER_GUIDE.md: "Direct 1:1 Calling (Desktop + Android)" covering
  history + recent contacts + deregister UI, and "Windows AEC
  Variants" explaining the AEC vs noAEC builds and driver caveats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 15:20:12 +04:00

140 lines
11 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Branch: `android-rewrite`
Pivot away from the legacy Kotlin + JNI Android client to a pure-Rust **Tauri 2.x Mobile** app that shares the same frontend and backend code as the desktop client.
## Why this branch exists
The Kotlin + JNI stack was a crash factory. Every failure mode we hit was at the Kotlin ↔ Rust boundary, and each fix uncovered the next layer of the onion:
| Symptom | Root cause | Fix |
|---|---|---|
| App crashed on launch before `onCreate` returned | `__init_tcb` / `pthread_create` bionic private symbols leaking out of `libwzp_android.so` because the Rust crate used `crate-type = ["cdylib", "staticlib"]`. rust-lang/rust#104707 documents that staticlib alongside cdylib leaks non-exported symbols from the staticlib into the cdylib, and Bionic's private internal pthread symbols got bound LOCALLY inside our `.so` instead of resolved against `libc.so` at `dlopen` time | Dropped `staticlib` from the crate-type list. `crate-type = ["cdylib", "rlib"]` only. |
| Stack overflow on `place_call` | `Dispatchers.IO` threads have a ~512 KB stack, too small for the Rust signal-connect path that does TLS handshake + quinn setup inside one closure | Launched JNI calls from a dedicated `java.lang.Thread` with an explicit 8 MB stack |
| `ring` / `libcrypto` TLS reuse crash on second call | tokio runtime got dropped between calls, but `ring` keeps a TLS-stored SSL context that is invalidated when the runtime thread is reused by a new runtime — `ring` sees stale context and segfaults | Single long-lived tokio runtime for the entire signal client lifetime; split `start()` into an inline `connect+register` path and a `run()` path on a separate thread to avoid the `thread::spawn` closure's stack overflow |
| Null dereference on register with fresh install | Identity seed file empty when it existed-but-was-blank, Rust side deref'd the zero-length slice | Generate seed if empty on register |
Every fix kept the app limping along but the fundamental design problem remained: **state management was split across a Kotlin ViewModel and a Rust engine, with a hand-rolled JNI bridge in between that had to be perfect to not crash**. The working desktop Tauri client (with the same Rust backend) had none of these problems because it spoke to the Rust code via in-process `invoke()` from a WebView, not JNI.
So: rewrite the Android app as a **Tauri 2.x Mobile app**, reusing the entire desktop codebase verbatim (`main.ts`, `style.css`, `index.html`, `main.rs`, `engine.rs` — everything). Tauri Mobile added Android support in v2, it's production-ready, and it eliminates the JNI boundary entirely.
The incident postmortem lives at [`docs/incident-tauri-android-init-tcb.md`](incident-tauri-android-init-tcb.md).
## Architecture
```
┌─────────────────────────────────────────────────┐
│ Tauri 2.x Mobile │
│ │
│ Android WebView ────────── HTML/JS/CSS │ ← Shared with desktop
│ │ (main.ts) │
│ │ │
│ invoke() ─────────────── Rust Commands │ ← Shared with desktop
│ (main.rs) │
│ │ │
│ ┌───────────────┼────────────┐ │
│ │ │ │ │
│ SignalMgr CallEngine Identity │ ← Shared crates
│ (signal_hub) (wzp-client) (wzp-crypto)│
│ │ │ │
│ │ │ │
│ ▼ ▼ │
│ QUIC to relay Oboe audio (Android) │
│ via wzp-native cdylib │
└─────────────────────────────────────────────────┘
```
**What is reused from desktop verbatim** (zero rewrite):
- `desktop/src/main.ts` — entire frontend
- `desktop/src/style.css` — all styling
- `desktop/src/identicon.ts` — identicon rendering
- `desktop/index.html` — HTML structure
- `desktop/src-tauri/src/main.rs` — all Tauri commands (`connect`, `disconnect`, `register_signal`, `place_call`, …)
- `desktop/src-tauri/src/engine.rs``CallEngine` wrapper
**What is Android-specific**:
- `desktop/src-tauri/src/android_audio.rs` — JVM-side audio routing (`AudioManager.setSpeakerphoneOn` for earpiece/speaker toggle). Runs from Tauri's existing JNI context — no hand-rolled bridge, Tauri owns the JVM hookup.
- `desktop/src-tauri/src/wzp_native.rs` — runtime `dlopen` of `libwzp_native.so`, a standalone cdylib crate (`crates/wzp-native`) that owns all C++ (Oboe bridge). Kept in its own crate so its C/C++ static archives never get statically linked into `wzp-desktop`'s `.so`, which would re-trigger the `__init_tcb` / pthread leak.
- `crates/wzp-native/` — the standalone C++/Oboe bridge cdylib. Loaded via `libloading` at runtime from `wzp_native.rs`. Provides capture + playout streams using Oboe's `Usage::VoiceCommunication` + `MODE_IN_COMMUNICATION` combo.
- Android-specific target dependencies in `desktop/src-tauri/Cargo.toml` (`jni`, `ndk-context`, `libloading`) — no CPAL, no VPIO.
## Key architectural decisions
### 1. `wzp-native` as a standalone cdylib loaded via `libloading`
The alternative — linking `wzp-native` as a regular Rust dep with C++ static archives — would cause the same `__init_tcb` crash that killed the Kotlin version. By making `wzp-native` its own cdylib and `dlopen`-ing it at runtime, Bionic's `libc.so` resolves every symbol at load time the way it's supposed to, and no private TCB symbols leak.
### 2. `crate-type = ["cdylib", "rlib"]` only (no `staticlib`)
Same reason. The `rlib` output is needed so the `wzp-desktop` binary target can link against the library; `cdylib` is needed for Android's `System.loadLibrary`; `staticlib` would reintroduce the symbol-leak bug.
### 3. Oboe audio config
`Usage::VoiceCommunication` + Java-side `MODE_IN_COMMUNICATION`. **Never** call `setAudioApi(AAudio)` explicitly — on some devices (Nothing Phone in particular) it causes Oboe to open the wrong stream type and audio goes silent. Let Oboe pick the audio API automatically. This is documented in the auto-memory `project_tauri_android_audio.md`.
### 4. Speaker/earpiece toggle uses `tokio::task::spawn_blocking`
Oboe's `stop()` + `start()` cycle is synchronous and can block for 50200 ms. Calling it on the tokio executor stalls every other async task (including the QUIC datagram loop), dropping audio packets. Wrapping the toggle in `spawn_blocking` isolates it to a dedicated thread pool. Fixed in commit `76a4c53`.
## Build pipeline
Docker on SepehrHomeserverdk, same pattern as the Android legacy pipeline and the Windows pipeline:
```
./scripts/build-tauri-android.sh # Full: pull + build + ntfy + rustypaste
./scripts/build-tauri-android.sh --pull # Explicit git pull (default)
./scripts/build-tauri-android.sh --clean # Blow away the Rust target cache
```
**Image**: `wzp-android-builder` (shared with the legacy Kotlin pipeline). The Dockerfile was extended to install Node.js 20 LTS, Android API level 36, build-tools 35.0.0, tauri-cli 2.x, and all four Android Rust targets on top of the legacy NDK 26.1 + cargo-ndk + Gradle setup. Both pipelines coexist in the same image.
**Output**: `wzp-release.apk` uploaded to rustypaste, URL delivered via `ntfy.sh/wzp`.
## Known quirks (Tauri Mobile specific)
1. **tauri-cli `android init` writes absolute paths** into `gradle.properties` for the NDK path. Those paths are local to wherever `android init` was run, so they break any cross-machine build unless overridden with `ANDROID_NDK_HOME` at build time. The build script exports `ANDROID_NDK_HOME` explicitly to work around this.
2. **API 36 vs API 34 coexistence**: the legacy Kotlin pipeline targets API 34, Tauri Mobile 2.x wants compileSdk 36. The shared Docker image installs both SDK levels so neither pipeline needs to reinstall.
3. **Identity seed lives in Android-specific app data dir**: `/data/data/com.wzp.phone/files/.wzp/identity` instead of `$HOME/.wzp/identity`. The shared `load_or_create_seed()` function in `desktop/src-tauri/src/lib.rs` uses Tauri's `app_data_dir()` which resolves correctly on both Android and desktop — no per-platform code needed.
4. **Direct calls on macOS previously hit an identity mismatch bug** — the `CallEngine` was using `$HOME/.wzp/identity` directly while `register_signal` used Tauri's `app_data_dir()`. Fixed by routing both through `load_or_create_seed()` (commit `2fd9465`). This was important for cross-platform consistency.
## Current state (snapshot)
What works:
- Tauri Mobile scaffold builds and runs on Android
- Signal hub connect + register works
- Room mode (SFU group calls) works with Oboe audio
- Direct 1:1 calls work with full parity to desktop
- Speaker/earpiece toggle works without stalling the audio pipeline
- Call history, recent contacts, deregister UI all present (inherited from desktop)
What remains (task list refs in parens):
- Background service for keeping signal alive when app is backgrounded (#19)
- Proper permission requests (microphone, notifications) on first launch (#19)
- Incoming call notification while backgrounded (#19)
- App icon + splash screen (#19)
## Testing
- **Build**: `./scripts/build-tauri-android.sh` — verify the APK lands on rustypaste and installs on device.
- **Smoke test**: Install → open app → Register → Place call → Receive call. No crashes, audio flows both ways.
- **Speaker toggle**: During a call, toggle speaker/earpiece several times in rapid succession. Audio should never stop, and the toggle should respond within ~200 ms.
- **Stress test**: Call for 10+ minutes continuous. No memory growth, no packet loss beyond what's attributable to the network.
## Files of interest
| Path | Purpose |
|---|---|
| `desktop/src-tauri/src/lib.rs` | Shared Tauri commands (desktop + Android) |
| `desktop/src-tauri/src/android_audio.rs` | JVM-side speaker/earpiece routing |
| `desktop/src-tauri/src/wzp_native.rs` | Runtime dlopen of libwzp_native.so |
| `crates/wzp-native/` | Standalone C++/Oboe cdylib, loaded at runtime |
| `scripts/build-tauri-android.sh` | Remote Docker build pipeline |
| `scripts/Dockerfile.android-builder` | Shared Android Docker image (legacy + Tauri) |
| `docs/incident-tauri-android-init-tcb.md` | Postmortem of the Kotlin+JNI crash cascade |