
perf(aec): 2x streaming throughput via ONNX session tuning + tensor reuse#5071

Open
ComputelessComputer wants to merge 1 commit into main from perf/aec-onnx-throughput

Conversation

ComputelessComputer (Collaborator) commented Apr 16, 2026

Summary

AEC streaming throughput: 457k → 920k samples/sec (+101%, ~57× real-time at 16kHz).

Snapshot tests pass bit-identically (test_aec_hyprnote, test_aec_doubletalk, test_aec_theo), so audio output is unchanged — this is pure ORT scheduling and allocation work.
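As a sanity check, the headline figures are mutually consistent. A minimal stdlib-only sketch (the 64-sample hop is inferred from "7500 blocks per 30s" and is an assumption, not something stated in the diff):

```rust
fn main() {
    // Before/after streaming throughput, in samples per second.
    let (before, after) = (457_000.0_f64, 920_000.0_f64);

    // +101% speedup, and ~57x real-time at 16 kHz.
    assert!(((after / before - 1.0) * 100.0 - 101.0).abs() < 1.0);
    assert!((after / 16_000.0 - 57.5).abs() < 0.1);

    // 30 s of 16 kHz audio split into 7,500 blocks implies a 64-sample hop.
    assert_eq!((30 * 16_000) / 7_500, 64);

    // ~16 FFI calls per block x 7,500 blocks = ~120k calls saved per pass.
    assert_eq!(16 * 7_500, 120_000);

    println!("all figures consistent");
}
```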

What changed

Stacked wins in crates/aec/src/onnx/mod.rs:

  1. Unlock ONNX intra-op threading. hypr_onnx::load_model_from_bytes hardcodes with_intra_threads(1) and with_inter_threads(1), so AEC gets its own Session::builder() path to set thread counts. Sizing is asymmetric to match the models: session_1=2 and session_2=4 intra-op threads (1.9MB vs 5.0MB).
  2. with_independent_thread_pool on both sessions. The two sessions run back-to-back per block; sharing the global ORT pool caused scheduling contention. Independent pools removed it.
  3. with_log_level(LogLevel::Fatal). Suppresses per-run log-severity checks.
  4. Thread-local TensorCache with cached raw pointers. The 4 input tensors (in_mag, lpb_mag, estimated_block, in_lpb) are now allocated once and reused. ProcessingContext stays stack-local so the hot loop keeps its register allocation; only the tensor objects live in the thread-local. Eliminates ~16 C FFI calls per block × 7500 blocks per 30s audio ≈ 120k calls saved per pass.
  5. In-place IFFT into ctx.estimated_block. Removes one 512-f32 copy per block × 7500 blocks.
  6. #[inline(always)] on calculate_fft_magnitude (hot inner loop).

Safety

  • Send impl on TensorCache is scoped to single-threaded use; raw pointers point into ORT-allocated stable memory that outlives the TensorCache (pointer → tensor lifetime coupling is enforced in the struct).
  • All three snapshot tests in crates/aec/src/onnx/tests still pass — if any of these optimizations had perturbed the spectral output, they would fail.
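The pointer-to-tensor lifetime coupling can be illustrated with a plain heap buffer standing in for an ORT tensor (a hypothetical single-field `TensorCache`; the real struct caches four tensors):

```rust
/// The Vec's heap allocation is stable for the life of the struct (it is
/// never resized), so a raw pointer into it stays valid exactly as long
/// as the owning field does -- the coupling the PR relies on.
struct TensorCache {
    in_mag: Vec<f32>,     // owns the buffer (stand-in for an ORT tensor)
    in_mag_ptr: *mut f32, // cached pointer into `in_mag`; never outlives it
}

impl TensorCache {
    fn new(len: usize) -> Box<Self> {
        let mut cache = Box::new(TensorCache {
            in_mag: vec![0.0; len],
            in_mag_ptr: std::ptr::null_mut(),
        });
        cache.in_mag_ptr = cache.in_mag.as_mut_ptr();
        cache
    }
}

// Sound only because the cache is used from one thread at a time; the
// Send impl just lets the thread-local machinery own it.
unsafe impl Send for TensorCache {}

fn main() {
    let cache = TensorCache::new(512);
    unsafe { *cache.in_mag_ptr = 3.0 };
    assert_eq!(cache.in_mag[0], 3.0);
}
```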

How this was found

Experimental optimization loop via evo — tree-search over 113 experiments across 12 rounds on an M4 Pro. Round 1 showed every hand-written Rust micro-optimization regressed vs baseline; the orchestrator pivoted to the ONNX execution layer in round 2, which is where the real wins were hiding. The chosen experiment (exp_0097) is a composition of the best points on the frontier.

Test plan

cargo test -p aec --no-default-features --features onnx,128

All 3 snapshot tests pass locally on M4 Pro (macOS 15).

Notes

  • No changes to the public AEC API.
  • No changes to hypr_onnx — this keeps hypr_onnx's other callers unaffected.
  • Benchmark scaffolding (benchmark.py, time_aec.rs) intentionally omitted from this PR; it lives in the evo experiment branches.

Note

Medium Risk
Touches AEC’s ONNX runtime integration and introduces unsafe raw-pointer tensor writes plus new session threading/pool settings, which could cause subtle correctness or performance regressions under different runtimes/targets despite unchanged APIs.

Overview
Improves AEC ONNX execution performance by replacing hypr_onnx::load_model_from_bytes with explicit Session::builder() configuration (custom intra/inter threads, Level3 optimizations, independent thread pools, and fatal-only logging) for both models.

Refactors the streaming hot path to stop using ndarray allocations for inputs/streaming state: model states move to ORT Tensor<f32>, and a thread-local TensorCache reuses the four per-block input tensors. ProcessingContext now carries cached raw pointers/lengths and writes magnitudes/blocks directly into ORT tensor buffers before run_model_1/run_model_2, reducing per-block allocation/FFI overhead.


perf(aec): 2x streaming throughput via ONNX session tuning + tensor reuse

Baseline: 457k samples/sec → 920k samples/sec (+101%, ~57x real-time at 16kHz).

Stacked wins in crates/aec/src/onnx/mod.rs, all validated by the existing
snapshot tests (test_aec_hyprnote/doubletalk/theo — bit-identical audio output).

Key changes:
- Direct Session::builder() over hypr_onnx::load_model_from_bytes to unlock
  intra-op threading (was hardcoded to 1). Asymmetric config: session_1=2,
  session_2=4 intra-threads matches the asymmetric model sizes (1.9MB vs 5.0MB).
- with_independent_thread_pool on both sessions — eliminates scheduling
  contention between the two inferences that run per block.
- with_log_level(LogLevel::Fatal) — suppresses per-run log-severity checks.
- Thread-local TensorCache: 4 Tensor<f32> + cached raw pointers reused across
  _process_internal calls. Eliminates ~16 C FFI calls per block (tensor alloc
  + try_extract_tensor_mut) × 7500 blocks per 30s audio = ~120k calls saved
  per pass. ProcessingContext stays stack-local so the hot loop keeps its
  register allocation.
- In-place IFFT → ctx.estimated_block (removes a 512-f32 copy per block × 7500
  blocks per pass) and #[inline(always)] on calculate_fft_magnitude.

Discovered with evo (https://github.com/evo-hq/evo) — tree-search optimization
over 113 experiments across 12 rounds on an M4 Pro. All experiments preserved
the snapshot-test gate; the improvement is from better ORT scheduling and
eliminated allocations, not algorithmic changes.

Verified: cargo test -p aec --no-default-features --features onnx,128 passes.

netlify Bot commented Apr 16, 2026

Deploy Preview for hyprnote ready!

Name Link
🔨 Latest commit a7958b3
🔍 Latest deploy log https://app.netlify.com/projects/hyprnote/deploys/69e1049242d7230008d33b97
😎 Deploy Preview https://deploy-preview-5071--hyprnote.netlify.app


netlify Bot commented Apr 16, 2026

Deploy Preview for unsigned-char failed.

Name Link
🔨 Latest commit a7958b3
🔍 Latest deploy log https://app.netlify.com/projects/unsigned-char/deploys/69e1049292242a00084ca62d


netlify Bot commented Apr 16, 2026

Deploy Preview for char-cli-web canceled.

Name Link
🔨 Latest commit a7958b3
🔍 Latest deploy log https://app.netlify.com/projects/char-cli-web/deploys/69e1049246dcaf00097c2587


cursor Bot left a comment

Cursor Bugbot reviewed commit a7958b3 and found 1 potential issue.

    // Cell<*mut> has zero locking overhead. ProcessingContext remains stack-local.
    thread_local! {
        static TENSOR_CACHE: Cell<*mut TensorCache> = Cell::new(std::ptr::null_mut());
    }

Thread-local TensorCache leaks memory on thread exit

Low Severity

TENSOR_CACHE is a Cell<*mut TensorCache>. When a thread exits, the Cell is dropped, but raw pointers have no destructor — so the Box<TensorCache> stored via Box::into_raw is never freed, leaking the four ORT tensors per thread. Using Cell<Option<Box<TensorCache>>> would give identical zero-locking performance while properly dropping on thread exit.
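Bugbot's suggested shape can be sketched with the standard library alone; a stub `TensorCache` with a drop counter stands in for the struct that owns the four ORT tensors (all names here are illustrative):

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

static DROPS: AtomicUsize = AtomicUsize::new(0);

struct TensorCache; // stub: the real struct owns four ORT input tensors

impl Drop for TensorCache {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

thread_local! {
    // Same zero-locking access pattern as Cell<*mut TensorCache>, but the
    // Box destructor runs when the thread-local is torn down on thread exit.
    static TENSOR_CACHE: Cell<Option<Box<TensorCache>>> = Cell::new(None);
}

fn main() {
    std::thread::spawn(|| {
        TENSOR_CACHE.with(|c| c.set(Some(Box::new(TensorCache))));
        TENSOR_CACHE.with(|c| {
            let cache = c.take().expect("initialized above");
            // ... hot loop would write into the cached tensors here ...
            c.set(Some(cache)); // put it back for the next block
        });
    })
    .join()
    .unwrap();

    // The cache was dropped automatically when the worker thread exited.
    assert_eq!(DROPS.load(Ordering::SeqCst), 1);
}
```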

