Use case: replacing OpenMP in BLIS GEMM
I integrated libfork as a threading backend in BLIS
to evaluate replacing #pragma omp parallel with libfork for GEMM parallelism.
Repo: https://github.com/Frandy/blis_libfork
What I did
Added a new BLIS threading backend (BLIS_LIBFORK) that uses lf::busy_pool
(persistent) + recursive fork/join to launch N independent BLIS worker threads:
// Current approach: recursive divide-and-conquer
auto launch = [](auto self, int lo, int hi, ctx* ctxs) -> lf::task<> {
    if (hi - lo == 1) { ctxs[lo].func(...); co_return; }
    int mid = lo + (hi - lo) / 2;
    co_await lf::fork(self)(lo, mid, ctxs);
    co_await lf::call(self)(mid, hi, ctxs);
    co_await lf::join;
};
What worked well
- Persistent busy_pool matches OpenMP's thread pool model
- Fork/join overhead is microseconds (~12µs for 6 threads)
- SGEMM dim=2048: libfork matches OpenMP within ±1%, and at 6 threads it is 6% faster
- lf::sync_wait + busy_pool integrates cleanly with C API via extern "C"
What could be improved
1. No flat bulk spawn API: the recursive fork/join creates about log2(N) levels
of coroutine frames and N-1 fork+call pairs. For N=6, that's 3 levels and
5 fork+call pairs. Each fork dispatches to the work-stealing pool. A flat
spawn_n(fn, n) or parallel_invoke(fn0, fn1, ...) would map more directly to
the OpenMP pattern and reduce depth.
2. lf::for_each usability: I tried using lf::for_each for this but hit
template issues with coroutine lambdas in the C-to-C++ bridge context.
The iterator-based overload requires random_access_iterator, which is awkward
when the source data is an integer range. A simple index-based overload,
for_each(0, n, grain, fn), would help.
3. Barrier interaction with busy_pool — at 6 threads on a 6-core machine,
BLIS's atomic barrier (spin-wait) combined with busy_pool's work-stealing poll
creates ~0.3ms extra cache-coherency overhead on small workloads. Not a bug,
but worth noting for HPC use.
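The counts in item 1 can be checked with a small sketch that walks the same halving recursion as the launch lambda, without libfork. The names SplitStats and split_stats are hypothetical, introduced only for this illustration; the sketch counts tree shape, not actual scheduling cost.

```cpp
#include <algorithm>
#include <cassert>

// Shape of the recursion tree produced by halving [lo, hi)
// until every range has size 1 (one leaf per BLIS worker).
struct SplitStats {
    int pairs;  // number of fork+call pairs (internal nodes of the tree)
    int depth;  // levels of nested coroutine frames above the leaves
};

// Hypothetical helper: mirrors the split rule mid = lo + (hi - lo) / 2.
constexpr SplitStats split_stats(int lo, int hi) {
    assert(hi > lo);                      // range must be non-empty
    if (hi - lo == 1) return {0, 0};      // leaf: runs the worker, no fork
    int mid = lo + (hi - lo) / 2;
    SplitStats l = split_stats(lo, mid);  // the forked half
    SplitStats r = split_stats(mid, hi);  // the called half
    return {l.pairs + r.pairs + 1, std::max(l.depth, r.depth) + 1};
}

// For N = 6 workers: 5 fork+call pairs, 3 levels deep.
static_assert(split_stats(0, 6).pairs == 5);
static_assert(split_stats(0, 6).depth == 3);
```

A flat spawn would keep the N leaves but replace the N-1 internal nodes with a single parent frame, which is the reduction in depth item 1 asks for.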
Performance data (SGEMM, i5-9400F 6-core)
┌──────┬────────────┬────────────┬────────┐
│ dim │ OpenMP 6t │ libfork 6t │ lf/omp │
├──────┼────────────┼────────────┼────────┤
│ 512 │ 546 GFLOPS │ 302 GFLOPS │ 55% │
├──────┼────────────┼────────────┼────────┤
│ 1024 │ 582 GFLOPS │ 517 GFLOPS │ 89% │
├──────┼────────────┼────────────┼────────┤
│ 2048 │ 541 GFLOPS │ 573 GFLOPS │ 106% │
└──────┴────────────┴────────────┴────────┘
The regression at small sizes comes from barrier cache-line bouncing, not fork/join
overhead. At realistic HPC sizes (dim ≥ 1024), libfork is close to OpenMP:
89% at dim=1024 and 106% at dim=2048.
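To see how a fixed per-call cost shows up in the table, the GFLOPS figures can be converted back to wall time per GEMM. This is a sketch that assumes the benchmark counts 2·dim³ flops for a square SGEMM; gemm_ms and extra_ms are hypothetical helper names, not part of BLIS or libfork.

```cpp
#include <cassert>

// Wall time in milliseconds for one dim x dim x dim GEMM,
// assuming 2 * dim^3 floating-point operations per call.
constexpr double gemm_ms(int dim, double gflops) {
    assert(dim > 0 && gflops > 0.0);
    double flops = 2.0 * dim * dim * dim;
    return flops / (gflops * 1e9) * 1e3;
}

// Extra time per call for libfork relative to OpenMP.
// Positive means libfork is slower, negative means it is faster.
constexpr double extra_ms(int dim, double omp_gflops, double lf_gflops) {
    return gemm_ms(dim, lf_gflops) - gemm_ms(dim, omp_gflops);
}
```

With the table's numbers, extra_ms(512, 546, 302) is roughly 0.4 ms and extra_ms(1024, 582, 517) is roughly 0.46 ms, while extra_ms(2048, 541, 573) is negative. A near-constant offset at the two smaller sizes, on total run times of about 0.5 ms and 3.7 ms, fits a fixed per-call overhead of the same order as the ~0.3 ms barrier cost noted above, rather than a cost that scales with the problem.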
Question
Are there any suggestions for improving the performance?