Tried integrating libfork into BLIS (a BLAS library) as a multithreading backend to replace OpenMP: faster for matrix sizes >= 2048, slower for sizes <= 512

  ## Use case: replacing OpenMP in BLIS GEMM

  I integrated libfork as a threading backend in [BLIS](https://github.com/flame/blis)
  to evaluate replacing `#pragma omp parallel` with libfork for GEMM parallelism.

  Repo: https://github.com/Frandy/blis_libfork

  ### What I did

  Added a new BLIS threading backend (`BLIS_LIBFORK`) that uses `lf::busy_pool`
  (persistent) + recursive fork/join to launch N independent BLIS worker threads:

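  ```cpp
  // Current approach: recursive divide-and-conquer over worker indices [lo, hi)
  auto launch = [](auto self, int lo, int hi, ctx* ctxs) -> lf::task<> {
      // Base case: one index left, so run that BLIS worker directly.
      if (hi - lo == 1) { ctxs[lo].func(...); co_return; }
      int mid = lo + (hi - lo) / 2;
      co_await lf::fork(self)(lo, mid, ctxs);  // fork: left half may run in parallel with what follows
      co_await lf::call(self)(mid, hi, ctxs);  // call: right half runs as a plain child task
      co_await lf::join;                       // wait for the forked half
  };
  ```

  ### What worked well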

  - The persistent `lf::busy_pool` matches OpenMP's thread-pool model
  - Fork/join overhead is on the order of microseconds (~12 µs for 6 threads)
  - SGEMM at dim=2048: libfork is within ±1% of OpenMP, and 6% faster with 6 threads
  - `lf::sync_wait` + `lf::busy_pool` integrates cleanly with the C API via `extern "C"`

  ### What could be improved

  1. No flat bulk spawn API: the recursive fork/join creates about log2(N) levels
  of coroutine frames. For N=6, that is 3 levels and 5 fork+call pairs, and each fork
  dispatches to the work-stealing pool. A flat `spawn_n(fn, n)` or
  `parallel_invoke(fn0, fn1, ...)` would map more directly to the OpenMP pattern
  and reduce depth.

  2. `lf::for_each` usability: I tried using `lf::for_each` for this but hit
  template issues with coroutine lambdas in the C-to-C++ bridge context.
  The iterator-based overload requires a `random_access_iterator`, which is awkward
  when the source data is an integer range. A simple index-based overload such as
  `for_each(0, n, grain, fn)` would help.

  3. Barrier interaction with `lf::busy_pool`: at 6 threads on a 6-core machine,
  BLIS's atomic barrier (spin-wait) combined with the pool's work-stealing polling
  adds roughly 0.3 ms of extra cache-coherency overhead on small workloads. Not a bug,
  but worth noting for HPC use.
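
To make the API shapes requested in items 1 and 2 concrete, here is a minimal sketch of what they could look like. `spawn_n` and `for_each_index` are hypothetical names, not libfork API, and `std::thread` stands in for a persistent pool, so this shows only the intended semantics, not the performance. Exceptions thrown by `fn` are not handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical flat bulk spawn: run fn(0), ..., fn(n-1) concurrently and
// return once all of them have finished. One level deep, no recursion.
template <typename Fn>
void spawn_n(Fn fn, std::size_t n) {
    if (n == 0) return;
    std::vector<std::thread> workers;
    workers.reserve(n - 1);
    for (std::size_t i = 1; i < n; ++i) {
        workers.emplace_back([&fn, i] { fn(i); });
    }
    fn(0);  // the caller runs index 0 itself, like the OpenMP master thread
    for (auto& w : workers) w.join();
}

// Hypothetical index-based for_each: apply fn to every i in [lo, hi),
// handing out chunks of `grain` consecutive indices, one chunk per task.
template <typename Fn>
void for_each_index(std::size_t lo, std::size_t hi, std::size_t grain, Fn fn) {
    if (hi <= lo) return;
    if (grain == 0) grain = 1;
    std::size_t const chunks = (hi - lo + grain - 1) / grain;
    spawn_n([&](std::size_t c) {
        std::size_t const first = lo + c * grain;
        std::size_t const last = std::min(first + grain, hi);
        for (std::size_t i = first; i < last; ++i) fn(i);
    }, chunks);
}
```

With this shape the BLIS launch would reduce to a single call along the lines of `spawn_n([&](std::size_t tid) { /* run worker tid */ }, n)`, with no intermediate coroutine frames.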

  ### Performance data (SGEMM, i5-9400F 6-core)

  | dim  | OpenMP, 6 threads (GFLOPS) | libfork, 6 threads (GFLOPS) | libfork / OpenMP |
  |------|----------------------------|-----------------------------|------------------|
  | 512  | 546                        | 302                         | 55%              |
  | 1024 | 582                        | 517                         | 89%              |
  | 2048 | 541                        | 573                         | 106%             |

  The regression at small sizes comes from barrier cache-line bouncing, not from
  fork/join overhead. At larger sizes the gap closes: libfork reaches 89% of OpenMP
  at dim = 1024 and is 6% faster at dim = 2048.

  ### Question

  Are there any suggestions for improving performance, especially at small matrix sizes?
