perf(aec): 2x streaming throughput via ONNX session tuning + tensor reuse#5071
ComputelessComputer wants to merge 1 commit into `main`.
Conversation
Baseline: 457k samples/sec → 920k samples/sec (+101%, ~57x real-time at 16kHz). Stacked wins in `crates/aec/src/onnx/mod.rs`, all validated by the existing snapshot tests (`test_aec_hyprnote`/`doubletalk`/`theo` — bit-identical audio output).

Key changes:

- Direct `Session::builder()` over `hypr_onnx::load_model_from_bytes` to unlock intra-op threading (was hardcoded to 1). Asymmetric config: `session_1=2`, `session_2=4` intra-threads matches the asymmetric model sizes (1.9MB vs 5.0MB).
- `with_independent_thread_pool` on both sessions — eliminates scheduling contention between the two inferences that run per block.
- `with_log_level(LogLevel::Fatal)` — suppresses per-run log-severity checks.
- Thread-local `TensorCache`: 4 `Tensor<f32>` + cached raw pointers reused across `_process_internal` calls. Eliminates ~16 C FFI calls per block (tensor alloc + `try_extract_tensor_mut`) × 7500 blocks per 30s audio ≈ 120k calls saved per pass. `ProcessingContext` stays stack-local so the hot loop keeps its register allocation.
- In-place IFFT into `ctx.estimated_block` (removes a 512-f32 copy per block × 7500 blocks per pass) and `#[inline(always)]` on `calculate_fft_magnitude`.

Discovered with evo (https://github.com/evo-hq/evo) — tree-search optimization over 113 experiments across 12 rounds on an M4 Pro. All experiments preserved the snapshot-test gate; the improvement comes from better ORT scheduling and eliminated allocations, not algorithmic changes.

Verified: `cargo test -p aec --no-default-features --features onnx,128` passes.
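The session setup described above might look roughly like this. This is a sketch only: `with_intra_threads`, `with_independent_thread_pool`, `with_log_level`, and the `ort` builder flow are as named in this PR, but exact import paths and method signatures vary across `ort` versions, and `build_sessions`/the model-byte parameters are hypothetical names.

```rust
// Hypothetical sketch of the PR's session configuration, not the actual code.
// Import paths depend on the `ort` crate version in use.
use ort::{GraphOptimizationLevel, LogLevel, Session};

fn build_sessions(model_1: &[u8], model_2: &[u8]) -> ort::Result<(Session, Session)> {
    // Smaller model (1.9MB): 2 intra-op threads.
    let session_1 = Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .with_intra_threads(2)?
        .with_independent_thread_pool()? // avoid contention with session_2
        .with_log_level(LogLevel::Fatal)? // skip per-run log-severity checks
        .commit_from_memory(model_1)?;

    // Larger model (5.0MB): 4 intra-op threads.
    let session_2 = Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .with_intra_threads(4)?
        .with_independent_thread_pool()?
        .with_log_level(LogLevel::Fatal)?
        .commit_from_memory(model_2)?;

    Ok((session_1, session_2))
}
```

The asymmetric thread counts mirror the asymmetric model sizes; independent pools matter because both sessions run back-to-back on every block.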
✅ Deploy Preview for hyprnote ready!
❌ Deploy Preview for unsigned-char failed.
✅ Deploy Preview for char-cli-web canceled.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit a7958b3.
```rust
// Cell<*mut> has zero locking overhead. ProcessingContext remains stack-local.
thread_local! {
    static TENSOR_CACHE: Cell<*mut TensorCache> = Cell::new(std::ptr::null_mut());
}
```
Thread-local TensorCache leaks memory on thread exit
Low Severity
`TENSOR_CACHE` is a `Cell<*mut TensorCache>`. When a thread exits, the `Cell` is dropped, but raw pointers have no destructor — so the `Box<TensorCache>` stored via `Box::into_raw` is never freed, leaking the four ORT tensors per thread. Using `Cell<Option<Box<TensorCache>>>` would give identical zero-locking performance while properly dropping on thread exit.
Additional Locations (1)


Summary
AEC streaming throughput: 457k → 920k samples/sec (+101%, ~57× real-time at 16kHz).
Snapshot tests pass bit-identically (`test_aec_hyprnote`, `test_aec_doubletalk`, `test_aec_theo`), so audio output is unchanged — this is pure ORT scheduling and allocation work.

What changed
Stacked wins in `crates/aec/src/onnx/mod.rs`:

- `hypr_onnx::load_model_from_bytes` hardcodes `with_intra_threads(1)`, `with_inter_threads(1)`. AEC needs its own `Session::builder()` path to set thread counts. Asymmetric sizing to match model asymmetry — `session_1=2`, `session_2=4` (1.9MB vs 5.0MB models).
- `with_independent_thread_pool` on both sessions. The two sessions run back-to-back per block; sharing the global ORT pool caused scheduling contention. Independent pools removed it.
- `with_log_level(LogLevel::Fatal)`. Suppresses per-run log-severity checks.
- Thread-local `TensorCache` with cached raw pointers. The 4 input tensors (`in_mag`, `lpb_mag`, `estimated_block`, `in_lpb`) are now allocated once and reused. `ProcessingContext` stays stack-local so the hot loop keeps its register allocation; only the tensor objects live in the thread-local. Eliminates ~16 C FFI calls per block × 7500 blocks per 30s audio ≈ 120k calls saved per pass.
- In-place IFFT into `ctx.estimated_block`. Removes one 512-f32 copy per block × 7500 blocks.
- `#[inline(always)]` on `calculate_fft_magnitude` (hot inner loop).

Safety
- The `Send` impl on `TensorCache` is scoped to single-threaded use; raw pointers point into ORT-allocated stable memory that outlives the `TensorCache` (pointer → tensor lifetime coupling is enforced in the struct).
- `crates/aec/src/onnx/tests` still pass — if any of these optimizations had perturbed the spectral output, they would fail.

How this was found
Experimental optimization loop via evo — tree-search over 113 experiments across 12 rounds on an M4 Pro. Round 1 showed every hand-written Rust micro-optimization regressed vs baseline; the orchestrator pivoted to the ONNX execution layer in round 2, which is where the real wins were hiding. The chosen experiment (`exp_0097`) is a composition of the best points on the frontier.

Test plan
`cargo test -p aec --no-default-features --features onnx,128`

All 3 snapshot tests pass locally on M4 Pro (macOS 15).
Notes
- No changes to the public AEC API.
- The new session-builder path lives in the AEC crate rather than in `hypr_onnx` — this keeps `hypr_onnx`'s other callers unaffected.
- The benchmark harness (`benchmark.py`, `time_aec.rs`) is intentionally omitted from this PR; it lives in the evo experiment branches.

Note
Medium Risk
Touches AEC's ONNX runtime integration and introduces `unsafe` raw-pointer tensor writes plus new session threading/pool settings, which could cause subtle correctness or performance regressions under different runtimes/targets despite unchanged APIs.

Overview
Improves AEC ONNX execution performance by replacing `hypr_onnx::load_model_from_bytes` with explicit `Session::builder()` configuration (custom intra/inter threads, Level3 optimizations, independent thread pools, and fatal-only logging) for both models.

Refactors the streaming hot path to stop using `ndarray` allocations for inputs/streaming state: model states move to ORT `Tensor<f32>`, and a thread-local `TensorCache` reuses the four per-block input tensors. `ProcessingContext` now carries cached raw pointers/lengths and writes magnitudes/blocks directly into ORT tensor buffers before `run_model_1`/`run_model_2`, reducing per-block allocation/FFI overhead.