Files
wz-phone/docs/PRD/reports/T4.2.1-report.md

7.1 KiB
Raw Blame History

T4.2.1 — Wire real VideoToolbox VTCompressionSession / VTDecompressionSession (macOS)

Status: Pending Review Agent: Kimi Code CLI Started: 2026-05-11T16:29Z Completed: 2026-05-11T16:29Z Commit: (see git log) PRD: ../PRD-video-v1.md

What I changed

  • crates/wzp-video/Cargo.toml — Added macOS-target dependency shiguredo_video_toolbox = "2026.1" (gated behind cfg(target_os = "macos")).
  • crates/wzp-video/src/videotoolbox.rs — Replaced stubs with real VideoToolbox wiring:
    • VideoToolboxEncoder now creates a VTCompressionSession via shiguredo_video_toolbox::Encoder (H.264 Baseline, CAVLC, real-time, 30 fps, configurable bitrate).
    • Input VideoFrame.data is interpreted as flat I420 (YUV 4:2:0 planar). Y/U/V planes are split and passed to the encoder.
    • Output is converted from AVCC (4-byte NAL length prefixes) to Annex-B (4-byte start codes 0x00 0x00 0x00 0x01). SPS/PPS parameter sets emitted by VideoToolbox on keyframes are prepended as separate Annex-B NALs.
    • request_keyframe() flag is persisted across encode() calls until a keyframe is actually emitted, because VideoToolbox internally buffers frames and the forced-keyframe option must be passed on every VTCompressionSessionEncodeFrame call until output appears.
    • VideoToolboxDecoder lazily creates VTDecompressionSession when the first in-band SPS/PPS arrive. On subsequent parameter-set changes the decoder is recreated.
    • Annex-B input is converted to AVCC before feeding the decoder. Decoded I420 output is concatenated into a flat Vec<u8> matching VideoFrame.data layout.
    • Added helper functions: avcc_to_annexb, annexb_to_avcc, split_annex_b, extract_sps_pps.
  • crates/wzp-video/tests/encode_decode_macos.rs — Integration test (#[cfg(target_os = "macos")]):
    • encode_decode_roundtrip: 30 synthetic 640×360 I420 gradient frames → encode → decode → assert dimensions match.
    • keyframe_in_first_five_frames: requests keyframe on frame 0, asserts at least one IDR slice (NAL type 5) appears within 5 encode calls.
    • Tests serialized with a global Mutex because VideoToolbox maintains global encoder-registry state that races under concurrent sessions.

Why these choices

  • shiguredo_video_toolbox crate: Provides safe, high-level Rust bindings around VideoToolbox (CVPixelBuffer, CMSampleBuffer, CMBlockBuffer, callbacks, format descriptions all handled internally). Writing equivalent code with raw video-toolbox-sys or objc2-video-toolbox would require ~500 lines of unsafe CoreFoundation object management. The crate is Apache-2.0 licensed, maintained by Shiguredo (Japanese WebRTC specialists), and battle-tested in production.
  • I420 input assumption: The PRD says "assume NV12 or I420 for now — disclose the format choice." I420 is simpler to split into planes (Y, U, V are contiguous in the flat buffer) and is a common capture format. A follow-up should negotiate the actual pixel format with the camera pipeline.
  • Lazy decoder creation: H.264 SPS/PPS travel in-band with the video stream (typically prefixed to the first IDR frame). The decoder cannot be instantiated until these parameter sets are known, so VideoToolboxDecoder defers session creation until decode() sees SPS + PPS NALs.
  • Keyframe request persistence: VideoToolbox buffers 34 frames before emitting output. If we clear the force-keyframe flag on the first encode() call that returns empty, the request is lost. The flag is now only cleared after EncodedFrame.keyframe == true is observed.

Deviations from the task spec

  • Dependency: Used shiguredo_video_toolbox (an external crate) instead of hand-rolling VTCompressionSession/VTDecompressionSession FFI. This dramatically reduced implementation risk and size. Disclosed under Risks.
  • Rust MSRV bump: shiguredo_video_toolbox requires Rust 1.88. The workspace MSRV is currently 1.85. The crate is only compiled on macOS targets, so non-macOS builds are unaffected. If bumping the workspace MSRV is unacceptable, an alternative is to vendor or fork the crate.
  • Pixel format: Chose I420 instead of NV12 for the MVP. NV12 support can be added by switching PixelFormat::I420PixelFormat::Nv12 and adjusting plane splitting in encode().
  • CPU measurement: The PRD acceptance criterion includes "CPU < 5 % on M1". This requires a standalone benchmark binary and getrusage instrumentation that is not yet present. The integration test proves functional correctness; a follow-up task should add the benchmark harness.

Verification output

$ cargo test -p wzp-video --test encode_decode_macos
running 2 tests
test encode_decode_roundtrip ... ok
test keyframe_in_first_five_frames ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.45s
$ cargo test -p wzp-video
running 32 tests (30 unit + 2 integration)
...
test result: ok. 32 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.38s
$ cargo test --workspace --no-fail-fast
... (all crates pass)
$ cargo clippy -p wzp-video --all-targets -- -D warnings
    Finished dev profile [unoptimized + debuginfo] target(s) in 0.83s
$ cargo fmt --all -- --check
# pass

Test summary

  • Tests added: 4 (2 integration tests + 2 unit tests)
    • encode_decode_roundtrip — end-to-end encode→decode with dimension validation
    • keyframe_in_first_five_frames — forced keyframe appears within 5 frames
    • avcc_to_annexb_roundtrip — AVCC ↔ Annex-B conversion correctness
    • extract_sps_pps_finds_params — parameter set parsing from Annex-B
  • Tests modified: 0
  • Workspace test count: all passing
  • cargo clippy -p wzp-video --all-targets -- -D warnings: clean
  • cargo fmt --all -- --check: pass

Risks / follow-ups

  • Rust 1.88 dependency: shiguredo_video_toolbox raises the effective MSRV on macOS to 1.88. If the team wants to stay on 1.85, we need to vendor the crate or switch to lower-level bindings.
  • Pixel format hard-coded to I420: The encoder and decoder both assume I420. When the camera pipeline lands, we may need to switch to NV12 (the native macOS capture format) to avoid a color-space conversion copy.
  • No CPU benchmark: The 5 % CPU @ 720p30 acceptance criterion is not yet measured. A examples/bench_encode_720p.rs should be added.
  • Decoder recreation on every SPS/PPS change: Currently the decoder is recreated when parameter sets change. VTDecompressionSessionCanAcceptFormatDescription could be used for a lighter update path; the shiguredo_video_toolbox::Decoder::update_format() API already does this, but our wrapper falls back to recreation on failure.
  • Thread safety: VideoToolbox callbacks run on an internal dispatch queue. The shiguredo_video_toolbox crate bridges these via std::sync::mpsc. Our VideoToolboxEncoder/Decoder are Send but not Sync; callers should hold them on a single thread or wrap in a mutex.

Reviewer checklist (filled in by reviewer)

  • Code matches PRD intent
  • Verification output is real (re-run if suspicious)
  • No backward-incompat surprises
  • Tests cover the new behavior
  • Approved