T4.2.1 — Wire real VideoToolbox VTCompressionSession / VTDecompressionSession (macOS)

Status: Pending Review Agent: Kimi Code CLI Started: 2026-05-11T16:29Z Completed: 2026-05-11T16:29Z Commit: (see git log) PRD: ../PRD-video-v1.md

What I changed

crates/wzp-video/Cargo.toml — Added macOS-target dependency shiguredo_video_toolbox = "2026.1" (gated behind cfg(target_os = "macos")).
crates/wzp-video/src/videotoolbox.rs — Replaced stubs with real VideoToolbox wiring:
- VideoToolboxEncoder now creates a VTCompressionSession via shiguredo_video_toolbox::Encoder (H.264 Baseline, CAVLC, real-time, 30 fps, configurable bitrate).
- Input VideoFrame.data is interpreted as flat I420 (YUV 4:2:0 planar). Y/U/V planes are split and passed to the encoder.
- Output is converted from AVCC (4-byte NAL length prefixes) to Annex-B (4-byte start codes 0x00 0x00 0x00 0x01). SPS/PPS parameter sets emitted by VideoToolbox on keyframes are prepended as separate Annex-B NALs.
- request_keyframe() flag is persisted across encode() calls until a keyframe is actually emitted, because VideoToolbox internally buffers frames and the forced-keyframe option must be passed on every VTCompressionSessionEncodeFrame call until output appears.
- VideoToolboxDecoder lazily creates VTDecompressionSession when the first in-band SPS/PPS arrive. On subsequent parameter-set changes the decoder is recreated.
- Annex-B input is converted to AVCC before feeding the decoder. Decoded I420 output is concatenated into a flat Vec<u8> matching VideoFrame.data layout.
- Added helper functions: avcc_to_annexb, annexb_to_avcc, split_annex_b, extract_sps_pps.
crates/wzp-video/tests/encode_decode_macos.rs — Integration test (#[cfg(target_os = "macos")]):
- encode_decode_roundtrip: 30 synthetic 640×360 I420 gradient frames → encode → decode → assert dimensions match.
- keyframe_in_first_five_frames: requests keyframe on frame 0, asserts at least one IDR slice (NAL type 5) appears within 5 encode calls.
- Tests serialized with a global Mutex because VideoToolbox maintains global encoder-registry state that races under concurrent sessions.

Why these choices

shiguredo_video_toolbox crate: Provides safe, high-level Rust bindings around VideoToolbox (CVPixelBuffer, CMSampleBuffer, CMBlockBuffer, callbacks, format descriptions all handled internally). Writing equivalent code with raw video-toolbox-sys or objc2-video-toolbox would require ~500 lines of unsafe CoreFoundation object management. The crate is Apache-2.0 licensed, maintained by Shiguredo (Japanese WebRTC specialists), and battle-tested in production.
I420 input assumption: The PRD says "assume NV12 or I420 for now — disclose the format choice." I420 is simpler to split into planes (Y, U, V are contiguous in the flat buffer) and is a common capture format. A follow-up should negotiate the actual pixel format with the camera pipeline.
Lazy decoder creation: H.264 SPS/PPS travel in-band with the video stream (typically prefixed to the first IDR frame). The decoder cannot be instantiated until these parameter sets are known, so VideoToolboxDecoder defers session creation until decode() sees SPS + PPS NALs.
Keyframe request persistence: VideoToolbox buffers 3–4 frames before emitting output. If we clear the force-keyframe flag on the first encode() call that returns empty, the request is lost. The flag is now only cleared after EncodedFrame.keyframe == true is observed.

Deviations from the task spec

Dependency: Used shiguredo_video_toolbox (an external crate) instead of hand-rolling VTCompressionSession/VTDecompressionSession FFI. This dramatically reduced implementation risk and size. Disclosed under Risks.
Rust MSRV bump: shiguredo_video_toolbox requires Rust 1.88. The workspace MSRV is currently 1.85. The crate is only compiled on macOS targets, so non-macOS builds are unaffected. If bumping the workspace MSRV is unacceptable, an alternative is to vendor or fork the crate.
Pixel format: Chose I420 instead of NV12 for the MVP. NV12 support can be added by switching PixelFormat::I420 → PixelFormat::Nv12 and adjusting plane splitting in encode().
CPU measurement: The PRD acceptance criterion includes "CPU < 5 % on M1". This requires a standalone benchmark binary and getrusage instrumentation that is not yet present. The integration test proves functional correctness; a follow-up task should add the benchmark harness.

Verification output

$ cargo test -p wzp-video --test encode_decode_macos
running 2 tests
test encode_decode_roundtrip ... ok
test keyframe_in_first_five_frames ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.45s

$ cargo test -p wzp-video
running 32 tests (30 unit + 2 integration)
...
test result: ok. 32 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.38s

$ cargo test --workspace --no-fail-fast
... (all crates pass)

$ cargo clippy -p wzp-video --all-targets -- -D warnings
    Finished dev profile [unoptimized + debuginfo] target(s) in 0.83s

$ cargo fmt --all -- --check
# pass

Test summary

Tests added: 4 (2 integration tests + 2 unit tests)
- encode_decode_roundtrip — end-to-end encode→decode with dimension validation
- keyframe_in_first_five_frames — forced keyframe appears within 5 frames
- avcc_to_annexb_roundtrip — AVCC ↔ Annex-B conversion correctness
- extract_sps_pps_finds_params — parameter set parsing from Annex-B
Tests modified: 0
Workspace test count: all passing
cargo clippy -p wzp-video --all-targets -- -D warnings: clean
cargo fmt --all -- --check: pass

Risks / follow-ups

Rust 1.88 dependency: shiguredo_video_toolbox raises the effective MSRV on macOS to 1.88. If the team wants to stay on 1.85, we need to vendor the crate or switch to lower-level bindings.
Pixel format hard-coded to I420: The encoder and decoder both assume I420. When the camera pipeline lands, we may need to switch to NV12 (the native macOS capture format) to avoid a color-space conversion copy.
No CPU benchmark: The 5 % CPU @ 720p30 acceptance criterion is not yet measured. A examples/bench_encode_720p.rs should be added.
Decoder recreation on every SPS/PPS change: Currently the decoder is recreated when parameter sets change. VTDecompressionSessionCanAcceptFormatDescription could be used for a lighter update path; the shiguredo_video_toolbox::Decoder::update_format() API already does this, but our wrapper falls back to recreation on failure.
Thread safety: VideoToolbox callbacks run on an internal dispatch queue. The shiguredo_video_toolbox crate bridges these via std::sync::mpsc. Our VideoToolboxEncoder/Decoder are Send but not Sync; callers should hold them on a single thread or wrap in a mutex.

Reviewer checklist (filled in by reviewer)

Code matches PRD intent
Verification output is real (re-run if suspicious)
No backward-incompat surprises
Tests cover the new behavior
Approved

7.1 KiB Raw Blame History Unescape Escape

T4.2.1 — Wire real VideoToolbox VTCompressionSession / VTDecompressionSession (macOS)

What I changed

Why these choices

Deviations from the task spec

Verification output

Test summary

Risks / follow-ups

Reviewer checklist (filled in by reviewer)

7.1 KiB

Raw Blame History