[next] [5] Rework stream architecture with lifecycle state machine and latency optimizations #593

Draft
joaoantoniocardoso wants to merge 16 commits into mavlink:master from joaoantoniocardoso:pr/stream-architecture-overhaul
Conversation

@joaoantoniocardoso
Collaborator

@joaoantoniocardoso joaoantoniocardoso commented Mar 26, 2026

CI note: The test-webrtc-thread-leak check fails because the MavlinkCamera persistence (new in this PR) creates a long-lived heartbeat thread that master's thread leak detection interprets as a leak. This is resolved by #594 which updates the detection to use snapshot-based filtering that correctly excludes infrastructure threads.

Note: #594 (integration tests) depends on this PR.

Summary

Major rework of the stream pipeline architecture, introducing a lifecycle state machine for lazy pipeline management, WebRTC latency optimizations, and supporting infrastructure.

Commits (review progression)

  1. cargo: Add gstreamer-rtp dependency
  2. helper/threads: Add thread priority helpers for Linux real-time scheduling
  3. cli: Add --rtsp-port, --enable-dot, --disable-onvif, --enable-realtime-threads flags
  4. server: Remove redundant middleware, gate /dot endpoint
  5. settings: Adapt for new stream configuration fields
  6. custom: Update BlueROV and test configurations
  7. pipeline: Rework architecture — RTSP through video_tee, shm→appsink, depay split, rtspsrc tuning
  8. RTSP server+sink: Appsrc bridge, persistent state, FKU on connect
  9. lifecycle: State machine (Idle/Waking/Running/Draining), lazy idle with valve flow control, consumer tracking, MavlinkCamera persistence
  10. sink interface: Adapt to lifecycle, harden appsink across sinks
  11. webrtc_sink: FEC/RED excision, queue bypass, clocksync disable, codec-preferences, playout-delay RTP extension, double-free fix
  12. gst/utils: Element excision rework, jitterbuffer bypass, probe reliability
  13. webrtc signalling+TURN: Orphan session cleanup, faster disconnect detection, scheduling priority
  14. main: Thread priority initialization, explicit Tokio runtime builder

Test plan

  • Verify RTSP streaming works end-to-end
  • Verify WebRTC session establishment and teardown
  • Verify lazy idle: pipeline suspends after 5s with no consumers, resumes on demand
  • Verify MavlinkCamera heartbeats persist across idle/wake cycles
  • Verify no RTSP freeze after WebRTC client disconnection
  • Verify thumbnail capture works during active streaming
  • Measure WebRTC latency improvement vs master

@joaoantoniocardoso joaoantoniocardoso force-pushed the pr/stream-architecture-overhaul branch 2 times, most recently from ea151a8 to fa8166b Compare March 26, 2026 17:25
@joaoantoniocardoso joaoantoniocardoso changed the title [next] Rework stream architecture with lifecycle state machine and latency optimizations [next] [5] Rework stream architecture with lifecycle state machine and latency optimizations Mar 26, 2026
@joaoantoniocardoso joaoantoniocardoso force-pushed the pr/stream-architecture-overhaul branch from fa8166b to 1e5a93d Compare March 27, 2026 02:32
Add gstreamer-rtp for RTP header extension manipulation (playout-delay
negotiation in WebRTC send path). Update .gitignore for new build
artifacts.

Add lower_thread_priority() (nice 10) and lower_to_background_priority()
(SCHED_OTHER, nice 19) to allow non-GStreamer threads and auxiliary
pipelines to yield CPU to realtime video stream threads.

Add --rtsp-port for configurable RTSP server port (enables parallel
test instances), --enable-dot to gate pipeline graph endpoint,
--disable-onvif to skip ONVIF discovery, and --enable-realtime-threads
to opt in to SCHED_RR scheduling. Fix Args::parse under nextest
per-process execution by using parse_from with empty argv in cfg(test).

Remove the redundant debug wrap_fn trace and actix_web Logger middleware
(already covered by TracingLogger). Gate the /dot WebSocket pipeline
graph endpoint behind --enable-dot CLI flag so it returns 404 in
production.

Update settings manager to handle the extended stream configuration
fields (disable_lazy) used by the lifecycle state machine.

Update BlueROV defaults to use the runtime RTSP port from CLI args.
Set disable_lazy: true on test streams so the pipeline stays
continuously running during thread leak detection tests.

Route RTSP sink through video_tee, eliminating redundant depay/repay.
Replace shmsink/shmsrc with appsink/appsrc for direct buffer passing.
Split depay into separate instances so RTP path gets NAL-level
forwarding while video path keeps AU alignment. Set rtspsrc
buffer-mode=none to bypass jitterbuffer, udp-buffer-size=2.5MB to
prevent socket drops, and disable retransmission. Add source-info and
perfect-rtptime to RTP depay/pay elements. Add background priority
mode for auxiliary pipeline runners. Add redirect pipeline for testing.

Update RTSP server to detect stream format from video caps structure
names instead of RTP encoding-name fields. Configure appsrc caps and
PTS offset reset on each new connection via media-configure callback.
Add persistent RTSP sink state with FKU-on-connect and a swappable
RtspFlowHandle valve so connected clients survive pipeline recreations.
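
As a rough illustration of the caps-based format detection described above, the sketch below maps a caps structure name (the media type) to an encoding. The `VideoEncoding` enum, both function names, and the set of media types covered are assumptions made for this example and are not taken from the PR; the real code presumably inspects GStreamer caps objects rather than serialized strings.

```rust
/// Video encodings this sketch distinguishes (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VideoEncoding {
    H264,
    H265,
    Mjpg,
    Raw,
}

/// Extracts the structure name from a serialized caps string such as
/// "video/x-h264, stream-format=(string)byte-stream".
pub fn structure_name(caps: &str) -> &str {
    caps.split(',').next().unwrap_or("").trim()
}

/// Maps a caps structure name to an encoding.
/// RTP caps ("application/x-rtp") yield `None`: their structure name
/// says nothing about the codec, which lives in `encoding-name`.
pub fn encoding_from_structure_name(name: &str) -> Option<VideoEncoding> {
    match name {
        "video/x-h264" => Some(VideoEncoding::H264),
        "video/x-h265" => Some(VideoEncoding::H265),
        "image/jpeg" => Some(VideoEncoding::Mjpg),
        "video/x-raw" => Some(VideoEncoding::Raw),
        _ => None,
    }
}
```

Keying on the structure name only works once the stream reaching the RTSP sink carries video caps rather than RTP caps, which fits the move to route the RTSP sink through video_tee.
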
Add StreamStatusState enum (Running/Idle/Stopped) and lifecycle module
with states: Idle, Waking, Running, Draining. Rework Stream to use
lifecycle state machine with per-stream consumer tracking and idle
watcher that suspends pipelines after a 5s grace period with zero
consumers. RTSP flow managed through valve element that drops buffers
when no clients are connected. Wake idle pipelines before WebRTC
session setup and thumbnail capture. Differentiate idle vs active
wakeup to avoid 30s timeouts on live rtspsrc pipelines. Persist
MavlinkCamera across idle/wake cycles. Add thumbnail cooldown to
keep pipeline alive after capture. Fix consumer leak in
remove_session. Use spawn_blocking for pipeline teardown.
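
The lifecycle described above can be sketched as a small state machine. This is a minimal model, assuming the four states named in the commit message and a configurable grace period; the `Lifecycle` type, its method names, and the exact transition rules are hypothetical and only meant to show how consumer tracking and the idle watcher could interact.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Idle,
    Waking,
    Running,
    Draining,
}

/// Hypothetical per-stream lifecycle tracker.
pub struct Lifecycle {
    state: State,
    consumers: usize,
    /// When the consumer count last dropped to zero.
    zero_since: Option<Instant>,
    grace: Duration,
}

impl Lifecycle {
    pub fn new(grace: Duration) -> Self {
        Self { state: State::Idle, consumers: 0, zero_since: None, grace }
    }

    pub fn state(&self) -> State {
        self.state
    }

    /// A new consumer (RTSP client, WebRTC session, thumbnail request)
    /// wakes an idle pipeline and cancels any pending idle timer.
    pub fn add_consumer(&mut self) {
        self.consumers += 1;
        self.zero_since = None;
        if self.state == State::Idle {
            self.state = State::Waking;
        }
    }

    /// Removing the last consumer starts the grace period.
    pub fn remove_consumer(&mut self, now: Instant) {
        self.consumers = self.consumers.saturating_sub(1);
        if self.consumers == 0 {
            self.zero_since = Some(now);
        }
    }

    /// Called once the pipeline has reached the playing state.
    pub fn pipeline_ready(&mut self) {
        if self.state == State::Waking {
            self.state = State::Running;
        }
    }

    /// Idle watcher: start draining after the grace period with zero consumers.
    pub fn tick(&mut self, now: Instant) {
        if self.state != State::Running || self.consumers > 0 {
            return;
        }
        if let Some(since) = self.zero_since {
            if now.saturating_duration_since(since) >= self.grace {
                self.state = State::Draining;
            }
        }
    }

    /// Called once teardown finished; wake again if someone arrived meanwhile.
    pub fn pipeline_stopped(&mut self) {
        if self.state == State::Draining {
            self.state = if self.consumers > 0 { State::Waking } else { State::Idle };
        }
    }
}
```

Passing `now` in explicitly keeps the sketch testable without sleeping: a stream with a 5s grace period stays `Running` at 4s without consumers and moves to `Draining` at 5s.
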
…ppsink

Adapt SinkInterface for lifecycle state machine with improved unlink.
Set leaky-type=downstream and silent=true on appsinks so they always
keep the freshest buffer. Add async=false and enable-last-sample=false
on ZenohSink appsink to match conventions.
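
To illustrate what "always keep the freshest buffer" means, the following sketch models a bounded queue that drops its oldest entry when full, which is the behaviour GStreamer's downstream leaky mode provides. `LeakyQueue` is a hypothetical stand-in written for this example; it is not the appsink implementation and not code from the PR.

```rust
use std::collections::VecDeque;

/// Bounded FIFO that discards the oldest item when full.
pub struct LeakyQueue<T> {
    buf: VecDeque<T>,
    capacity: usize,
    dropped: u64,
}

impl<T> LeakyQueue<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self { buf: VecDeque::with_capacity(capacity), capacity, dropped: 0 }
    }

    /// Never blocks the producer: stale data is dropped instead.
    pub fn push(&mut self, item: T) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front();
            self.dropped += 1;
        }
        self.buf.push_back(item);
    }

    pub fn pop(&mut self) -> Option<T> {
        self.buf.pop_front()
    }

    /// Number of items discarded so far.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

With a capacity of one, a slow consumer always receives the most recent frame, trading completeness for latency.
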
…down

Excise unused FEC/RED elements from send path after peer connection.
Bypass AIMD queue (made default, AIMD removed). Disable clocksync
pacing and sync on all webrtcbin internal sinks. Reduce RTP storage
to 100ms. Use truthful H.264 level in SDP. Set codec-preferences for
immediate SDP offer. Negotiate playout-delay RTP header extension
(min=0/max=0) to eliminate browser jitter-buffer smoothing.

Fix webrtcbin double-free by releasing request pad in Drop impl.
Fix spurious EOS from sink removal by making eos() a no-op.
Handle webrtcbin elements separately in unlink (EOS while linked,
then set to Null directly).
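
For reference, the playout-delay header extension negotiated above carries two 12-bit fields (minimum and maximum delay) in units of 10 ms, packed into three bytes. The sketch below encodes that payload; the function name and the validation rules are assumptions for illustration, and the PR itself relies on gstreamer-rtp rather than hand-packing bytes.

```rust
/// Granularity of both delay fields, in milliseconds.
const GRANULARITY_MS: u32 = 10;
/// Each field is 12 bits wide.
const FIELD_MAX: u32 = 0x0FFF;

/// Packs min/max playout delay into the 3-byte extension payload.
/// Returns `None` if a value does not fit in 12 bits or min exceeds max.
pub fn encode_playout_delay(min_ms: u32, max_ms: u32) -> Option<[u8; 3]> {
    let min = min_ms / GRANULARITY_MS;
    let max = max_ms / GRANULARITY_MS;
    if min > FIELD_MAX || max > FIELD_MAX || min > max {
        return None;
    }
    Some([
        (min >> 4) as u8,                         // upper 8 bits of min
        (((min & 0x0F) << 4) | (max >> 8)) as u8, // lower 4 of min, upper 4 of max
        (max & 0xFF) as u8,                       // lower 8 bits of max
    ])
}
```

With min=0/max=0 the payload is three zero bytes, which asks the receiver to render frames as soon as they are decodable.
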
Rework element excision for jitterbuffer bypass with parent state
guards. Add try_set_property helper for safe property setting on
older GStreamer versions. Add dump_bin_elements() for recursive
pipeline element enumeration (gated behind --enable-dot). Improve
probe pipeline reliability: remove close-socket=false from udpsrc,
add async=false to fakesink, reduce state transition timeouts.

Track active sessions per WebSocket connection to clean up orphans on
disconnect. Replace tokio::join with tokio::select so WebSocket
receiver exit immediately triggers cleanup. Reduce ping interval
from 30s to 5s for faster network-loss detection. Enable ICE
keepalive-conncheck for STUN peer loss detection. Forward EndSession
to client via WebSocket before server-side cleanup. Filter stopped
streams from producer list. Lower scheduling priority on TURN server
runtime threads.
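
The orphan cleanup described above amounts to remembering which sessions each WebSocket connection started. A minimal sketch of such bookkeeping follows; `SessionTracker`, the integer ID types, and the method names are hypothetical and chosen only for this example.

```rust
use std::collections::{HashMap, HashSet};

pub type ConnectionId = u64;
pub type SessionId = u64;

/// Tracks which sessions were started by which WebSocket connection.
#[derive(Default)]
pub struct SessionTracker {
    by_connection: HashMap<ConnectionId, HashSet<SessionId>>,
}

impl SessionTracker {
    pub fn start_session(&mut self, conn: ConnectionId, session: SessionId) {
        self.by_connection.entry(conn).or_default().insert(session);
    }

    /// Regular teardown: forget the session, and the connection entry
    /// once it has no sessions left.
    pub fn end_session(&mut self, conn: ConnectionId, session: SessionId) {
        let now_empty = match self.by_connection.get_mut(&conn) {
            Some(sessions) => {
                sessions.remove(&session);
                sessions.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.by_connection.remove(&conn);
        }
    }

    /// Connection lost: return every session still open so the caller
    /// can tear each one down. Sorted for deterministic output.
    pub fn disconnect(&mut self, conn: ConnectionId) -> Vec<SessionId> {
        let mut orphans: Vec<SessionId> = self
            .by_connection
            .remove(&conn)
            .unwrap_or_default()
            .into_iter()
            .collect();
        orphans.sort_unstable();
        orphans
    }
}
```

Calling `disconnect` as soon as the WebSocket receiver exits is what makes the switch from tokio::join to tokio::select matter: cleanup no longer waits for the sender half to finish.
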
Replace the #[tokio::main] attribute macro with an explicit runtime
builder so on_thread_start can set nice=10 on all Tokio worker threads.
Set nice=10 on the main thread via setpriority(). This ensures
GStreamer pipeline threads are always preferred by the OS scheduler,
reducing latency spikes on resource-constrained systems.

Tests don't need ONVIF camera discovery and it adds unnecessary
network overhead. Pass --disable-onvif from the test harness.

Request an upstream ForceKeyUnit event before waiting for a snapshot
frame. The identity filter drops all non-keyframes (DELTA_UNIT), so
without this the appsink must wait for the encoder's next natural
keyframe which can take 8+ seconds with default key-int-max settings.

Also increase the timeout from 2s to 15s as a safety net for slow
pipelines.
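
The "8+ seconds" figure follows from simple arithmetic: without a forced keyframe, the worst-case wait is one full keyframe interval. The sketch below computes it. The example values of 250 frames and 30 fps are assumptions for illustration only; the commit message does not state the encoder settings, and the function name is hypothetical.

```rust
use std::time::Duration;

/// Worst-case time until the next natural keyframe, given the maximum
/// keyframe interval in frames and the frame rate.
/// Returns `None` for a zero frame rate.
pub fn worst_case_keyframe_wait(key_int_max: u32, fps: u32) -> Option<Duration> {
    if fps == 0 {
        return None;
    }
    Some(Duration::from_secs_f64(f64::from(key_int_max) / f64::from(fps)))
}
```

Under those assumed values the wait is roughly 8.3 seconds, well past the previous 2s timeout and comfortably inside the new 15s one.
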
@joaoantoniocardoso joaoantoniocardoso force-pushed the pr/stream-architecture-overhaul branch from 1e5a93d to 487a6e4 Compare April 1, 2026 04:15