
perf(aec): 2x streaming throughput via ONNX session tuning + tensor reuse#5071

Open
ComputelessComputer wants to merge 1 commit into main from perf/aec-onnx-throughput

Conversation

ComputelessComputer (Collaborator) commented Apr 16, 2026

Summary

AEC streaming throughput: 457k → 920k samples/sec (+101%, ~57× real-time at 16kHz).

Snapshot tests pass bit-identically (test_aec_hyprnote, test_aec_doubletalk, test_aec_theo), so audio output is unchanged — this is pure ORT scheduling and allocation work.
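As a sanity check, the headline figures are mutually consistent. A minimal stdlib-only sketch (the 64-sample hop is inferred from "7500 blocks per 30s" and is an assumption, not something stated in the diff):

```rust
fn main() {
    // Before/after streaming throughput, in samples per second.
    let (before, after) = (457_000.0_f64, 920_000.0_f64);

    // +101% speedup, and ~57x real-time at 16 kHz.
    assert!(((after / before - 1.0) * 100.0 - 101.0).abs() < 1.0);
    assert!((after / 16_000.0 - 57.5).abs() < 0.1);

    // 30 s of 16 kHz audio split into 7,500 blocks implies a 64-sample hop.
    assert_eq!((30 * 16_000) / 7_500, 64);

    // ~16 FFI calls per block x 7,500 blocks = ~120k calls saved per pass.
    assert_eq!(16 * 7_500, 120_000);

    println!("all figures consistent");
}
```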

What changed

Stacked wins in crates/aec/src/onnx/mod.rs:

  1. Unlock ONNX intra-op threading. hypr_onnx::load_model_from_bytes hardcodes with_intra_threads(1) and with_inter_threads(1), so AEC gets its own Session::builder() path to set thread counts. Sizing is asymmetric to match the models: session_1=2 and session_2=4 intra-op threads (1.9MB vs 5.0MB).
  2. with_independent_thread_pool on both sessions. The two sessions run back-to-back per block; sharing the global ORT pool caused scheduling contention. Independent pools removed it.
  3. with_log_level(LogLevel::Fatal). Suppresses per-run log-severity checks.
  4. Thread-local TensorCache with cached raw pointers. The 4 input tensors (in_mag, lpb_mag, estimated_block, in_lpb) are now allocated once and reused. ProcessingContext stays stack-local so the hot loop keeps its register allocation; only the tensor objects live in the thread-local. Eliminates ~16 C FFI calls per block × 7500 blocks per 30s audio ≈ 120k calls saved per pass.
  5. In-place IFFT into ctx.estimated_block. Removes one 512-f32 copy per block × 7500 blocks.
  6. #[inline(always)] on calculate_fft_magnitude (hot inner loop).

Safety

  • Send impl on TensorCache is scoped to single-threaded use; raw pointers point into ORT-allocated stable memory that outlives the TensorCache (pointer → tensor lifetime coupling is enforced in the struct).
  • All three snapshot tests in crates/aec/src/onnx/tests still pass — if any of these optimizations had perturbed the spectral output, they would fail.
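The pointer-to-tensor lifetime coupling can be illustrated with a plain heap buffer standing in for an ORT tensor (a hypothetical single-field `TensorCache`; the real struct caches four tensors):

```rust
/// The Vec's heap allocation is stable for the life of the struct (it is
/// never resized), so a raw pointer into it stays valid exactly as long
/// as the owning field does -- the coupling the PR relies on.
struct TensorCache {
    in_mag: Vec<f32>,     // owns the buffer (stand-in for an ORT tensor)
    in_mag_ptr: *mut f32, // cached pointer into `in_mag`; never outlives it
}

impl TensorCache {
    fn new(len: usize) -> Box<Self> {
        let mut cache = Box::new(TensorCache {
            in_mag: vec![0.0; len],
            in_mag_ptr: std::ptr::null_mut(),
        });
        cache.in_mag_ptr = cache.in_mag.as_mut_ptr();
        cache
    }
}

// Sound only because the cache is used from one thread at a time; the
// Send impl just lets the thread-local machinery own it.
unsafe impl Send for TensorCache {}

fn main() {
    let cache = TensorCache::new(512);
    unsafe { *cache.in_mag_ptr = 3.0 };
    assert_eq!(cache.in_mag[0], 3.0);
}
```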

How this was found

Experimental optimization loop via evo — tree-search over 113 experiments across 12 rounds on an M4 Pro. Round 1 showed every hand-written Rust micro-optimization regressed vs baseline; the orchestrator pivoted to the ONNX execution layer in round 2, which is where the real wins were hiding. The chosen experiment (exp_0097) is a composition of the best points on the frontier.

Test plan

cargo test -p aec --no-default-features --features onnx,128

All 3 snapshot tests pass locally on M4 Pro (macOS 15).

Notes

  • No changes to the public AEC API.
  • No changes to hypr_onnx — this keeps hypr_onnx's other callers unaffected.
  • Benchmark scaffolding (benchmark.py, time_aec.rs) intentionally omitted from this PR; it lives in the evo experiment branches.

Note

Medium Risk
Touches AEC’s ONNX runtime integration and introduces unsafe raw-pointer tensor writes plus new session threading/pool settings, which could cause subtle correctness or performance regressions under different runtimes/targets despite unchanged APIs.

Overview
Improves AEC ONNX execution performance by replacing hypr_onnx::load_model_from_bytes with explicit Session::builder() configuration (custom intra/inter threads, Level3 optimizations, independent thread pools, and fatal-only logging) for both models.

Refactors the streaming hot path to stop using ndarray allocations for inputs/streaming state: model states move to ORT Tensor<f32>, and a thread-local TensorCache reuses the four per-block input tensors. ProcessingContext now carries cached raw pointers/lengths and writes magnitudes/blocks directly into ORT tensor buffers before run_model_1/run_model_2, reducing per-block allocation/FFI overhead.


perf(aec): 2x streaming throughput via ONNX session tuning + tensor reuse

Baseline: 457k samples/sec → 920k samples/sec (+101%, ~57x real-time at 16kHz).

Stacked wins in crates/aec/src/onnx/mod.rs, all validated by the existing
snapshot tests (test_aec_hyprnote/doubletalk/theo — bit-identical audio output).

Key changes:
- Direct Session::builder() over hypr_onnx::load_model_from_bytes to unlock
  intra-op threading (was hardcoded to 1). Asymmetric config: session_1=2,
  session_2=4 intra-threads matches the asymmetric model sizes (1.9MB vs 5.0MB).
- with_independent_thread_pool on both sessions — eliminates scheduling
  contention between the two inferences that run per block.
- with_log_level(LogLevel::Fatal) — suppresses per-run log-severity checks.
- Thread-local TensorCache: 4 Tensor<f32> + cached raw pointers reused across
  _process_internal calls. Eliminates ~16 C FFI calls per block (tensor alloc
  + try_extract_tensor_mut) × 7500 blocks per 30s audio = ~120k calls saved
  per pass. ProcessingContext stays stack-local so the hot loop keeps its
  register allocation.
- In-place IFFT → ctx.estimated_block (removes a 512-f32 copy per block × 7500
  blocks per pass) and #[inline(always)] on calculate_fft_magnitude.

Discovered with evo (https://github.com/evo-hq/evo) — tree-search optimization
over 113 experiments across 12 rounds on an M4 Pro. All experiments preserved
the snapshot-test gate; the improvement is from better ORT scheduling and
eliminated allocations, not algorithmic changes.

Verified: cargo test -p aec --no-default-features --features onnx,128 passes.

netlify Bot commented Apr 16, 2026

Deploy Preview for hyprnote ready!

Name Link
🔨 Latest commit a7958b3
🔍 Latest deploy log https://app.netlify.com/projects/hyprnote/deploys/69e1049242d7230008d33b97
😎 Deploy Preview https://deploy-preview-5071--hyprnote.netlify.app


netlify Bot commented Apr 16, 2026

Deploy Preview for unsigned-char failed.

Name Link
🔨 Latest commit a7958b3
🔍 Latest deploy log https://app.netlify.com/projects/unsigned-char/deploys/69e1049292242a00084ca62d


netlify Bot commented Apr 16, 2026

Deploy Preview for char-cli-web canceled.

Name Link
🔨 Latest commit a7958b3
🔍 Latest deploy log https://app.netlify.com/projects/char-cli-web/deploys/69e1049246dcaf00097c2587


cursor Bot left a comment

Cursor Bugbot reviewed commit a7958b3 and found 1 potential issue.

    // Cell<*mut> has zero locking overhead. ProcessingContext remains stack-local.
    thread_local! {
        static TENSOR_CACHE: Cell<*mut TensorCache> = Cell::new(std::ptr::null_mut());
    }

Thread-local TensorCache leaks memory on thread exit

Low Severity

TENSOR_CACHE is a Cell<*mut TensorCache>. When a thread exits, the Cell is dropped, but raw pointers have no destructor — so the Box<TensorCache> stored via Box::into_raw is never freed, leaking the four ORT tensors per thread. Using Cell<Option<Box<TensorCache>>> would give identical zero-locking performance while properly dropping on thread exit.
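Bugbot's suggested shape can be sketched with the standard library alone; a stub `TensorCache` with a drop counter stands in for the struct that owns the four ORT tensors (all names here are illustrative):

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

static DROPS: AtomicUsize = AtomicUsize::new(0);

struct TensorCache; // stub: the real struct owns four ORT input tensors

impl Drop for TensorCache {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

thread_local! {
    // Same zero-locking access pattern as Cell<*mut TensorCache>, but the
    // Box destructor runs when the thread-local is torn down on thread exit.
    static TENSOR_CACHE: Cell<Option<Box<TensorCache>>> = Cell::new(None);
}

fn main() {
    std::thread::spawn(|| {
        TENSOR_CACHE.with(|c| c.set(Some(Box::new(TensorCache))));
        TENSOR_CACHE.with(|c| {
            let cache = c.take().expect("initialized above");
            // ... hot loop would write into the cached tensors here ...
            c.set(Some(cache)); // put it back for the next block
        });
    })
    .join()
    .unwrap();

    // The cache was dropped automatically when the worker thread exited.
    assert_eq!(DROPS.load(Ordering::SeqCst), 1);
}
```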

