
Integrated libfork into BLIS (BLAS library) as a multithreading backend to replace OpenMP: faster when matrix size >= 2048, but slower when matrix size <= 512 #113

Description

@Frandy

Use case: replacing OpenMP in BLIS GEMM

I integrated libfork as a threading backend in BLIS
to evaluate replacing #pragma omp parallel with libfork for GEMM parallelism.

Repo: https://github.com/Frandy/blis_libfork

What I did

Added a new BLIS threading backend (BLIS_LIBFORK) that uses lf::busy_pool
(persistent) + recursive fork/join to launch N independent BLIS worker threads:

// Current approach: recursive divide-and-conquer
auto launch = [](auto self, int lo, int hi, ctx* ctxs) -> lf::task<> {
    if (hi - lo == 1) { ctxs[lo].func(...); co_return; }
    int mid = lo + (hi - lo) / 2;
    co_await lf::fork(self)(lo, mid, ctxs);
    co_await lf::call(self)(mid, hi, ctxs);
    co_await lf::join;
};
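To make the shape of this recursion concrete, here is a minimal sketch in plain C++ (no libfork calls) that applies the same midpoint split to [lo, hi) and counts how many fork+call pairs and how many nesting levels result. The names TreeShape and tree_shape are hypothetical helpers for illustration only; they are not part of libfork or BLIS.

```cpp
#include <algorithm>

// Shape of the recursion tree produced by splitting [lo, hi) at the midpoint
// until each range holds exactly one worker index.
// Hypothetical helper for illustration; not part of libfork or BLIS.
struct TreeShape {
    int splits;  // internal nodes: each one corresponds to one fork+call pair
    int depth;   // number of nested levels above the deepest leaf
};

TreeShape tree_shape(int lo, int hi) {
    if (hi - lo == 1) {
        return {0, 0};  // leaf: one worker runs directly, no further split
    }
    int mid = lo + (hi - lo) / 2;            // same split as the launch lambda
    TreeShape left = tree_shape(lo, mid);    // the half that would be forked
    TreeShape right = tree_shape(mid, hi);   // the half that would be called
    return {left.splits + right.splits + 1,
            std::max(left.depth, right.depth) + 1};
}
```

For 6 workers, tree_shape(0, 6) reports 5 splits and a depth of 3, so every launch pays for 5 fork+call pairs even though only 6 leaf functions do real work.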

What worked well

- Persistent busy_pool matches OpenMP's thread pool model
- Fork/join overhead is on the order of microseconds (~12µs for 6 threads)
- SGEMM dim=2048: libfork matches OpenMP within ±1%, and at 6 threads it is 6% faster
- lf::sync_wait + busy_pool integrates cleanly with C API via extern "C"

What could be improved

1. No flat bulk spawn API: the recursive fork/join creates about log2(N) levels
of coroutine frames. For N=6, that's 3 levels and 5 fork+call pairs. Each fork
dispatches to the work-stealing pool. A flat spawn_n(fn, n) or
parallel_invoke(fn0, fn1, ...) would map more directly to the OpenMP pattern and
reduce depth.

2. lf::for_each usability — I tried using lf::for_each for this but hit
template issues with coroutine lambdas in the C-to-C++ bridge context.
The iterator-based overload requires random_access_iterator which is awkward
when the source data is an integer range. A simple index-based overload
for_each(0, n, grain, fn) would help.

3. Barrier interaction with busy_pool — at 6 threads on a 6-core machine,
BLIS's atomic barrier (spin-wait) combined with busy_pool's work-stealing poll
creates ~0.3ms extra cache-coherency overhead on small workloads. Not a bug,
but worth noting for HPC use.
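For readers unfamiliar with the pattern in point 3, below is a minimal sketch of a sense-reversing spin barrier. It is an assumption-laden illustration, not BLIS's actual barrier code: the SpinBarrier type and its members are hypothetical. It shows where the spin-wait happens and one common mitigation, which is to keep the arrival counter and the release flag on separate cache lines (64 bytes is assumed here). Whether this changes the ~0.3ms overhead in this setup is untested.

```cpp
#include <atomic>

// Hypothetical sense-reversing spin barrier; NOT BLIS's implementation.
// The counter and the flag are padded onto separate cache lines (64 bytes
// assumed) so arrivals do not invalidate the line that waiters are polling.
struct SpinBarrier {
    explicit SpinBarrier(int n) : n_threads(n) {}

    // Each thread owns a local_sense flag, initialised to false, and passes
    // it to every call.
    void arrive_and_wait(bool& local_sense) {
        local_sense = !local_sense;  // the value that signals "this round done"
        if (arrived.fetch_add(1, std::memory_order_acq_rel) == n_threads - 1) {
            // Last arriver: reset the counter, then release the waiters.
            arrived.store(0, std::memory_order_relaxed);
            sense.store(local_sense, std::memory_order_release);
        } else {
            // Spin-wait: every waiter polls the same flag. On a fully
            // subscribed machine this competes with any other busy-polling
            // threads for cache-coherency bandwidth.
            while (sense.load(std::memory_order_acquire) != local_sense) {
            }
        }
    }

    alignas(64) std::atomic<int> arrived{0};
    alignas(64) std::atomic<bool> sense{false};
    const int n_threads;
};
```

Each thread would keep its own bool (starting at false) and call arrive_and_wait once per synchronization point. Adding a pause or yield hint inside the spin loop is another common way to reduce coherency traffic, at the cost of slightly higher wake-up latency.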

Performance data (SGEMM, i5-9400F 6-core)

┌──────┬────────────┬────────────┬────────┐
│ dim  │ OpenMP 6t  │ libfork 6t │ lf/omp │
├──────┼────────────┼────────────┼────────┤
│ 512  │ 546 GFLOPS │ 302 GFLOPS │ 55%    │
├──────┼────────────┼────────────┼────────┤
│ 1024 │ 582 GFLOPS │ 517 GFLOPS │ 89%    │
├──────┼────────────┼────────────┼────────┤
│ 2048 │ 541 GFLOPS │ 573 GFLOPS │ 106%   │
└──────┴────────────┴────────────┴────────┘

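The regression at small sizes appears to come from barrier cache-line bouncing
rather than fork/join overhead (~12µs of fork/join versus ~0.3ms of barrier
overhead). At dim = 1024 libfork reaches 89% of OpenMP throughput, and at
dim = 2048 it is 6% faster.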
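To relate the throughput figures to wall time, here is a small sketch that converts GFLOPS into milliseconds per SGEMM call. It assumes the conventional count of 2·n³ floating-point operations for square n×n matrices; the function names gemm_ms and extra_ms are hypothetical.

```cpp
// Milliseconds for one square SGEMM call at a given throughput.
// Assumes the conventional 2*n^3 floating-point operation count.
double gemm_ms(double dim, double gflops) {
    double flops = 2.0 * dim * dim * dim;
    return flops / (gflops * 1e9) * 1e3;  // seconds -> milliseconds
}

// Extra wall time per call for libfork relative to OpenMP.
// Positive means libfork is slower, negative means it is faster.
double extra_ms(double dim, double omp_gflops, double lf_gflops) {
    return gemm_ms(dim, lf_gflops) - gemm_ms(dim, omp_gflops);
}
```

With the figures from the table, dim=512 works out to roughly 0.49 ms per call for OpenMP and 0.89 ms for libfork, a gap of about 0.40 ms. At dim=1024 the gap is about 0.46 ms. A dim=2048 call takes around 30 ms, so a fixed cost of that size would be only 1-2% of the total. The similar gap at 512 and 1024 is consistent with a roughly fixed per-call overhead, though the three data points here are not enough to confirm that.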

Question

Are there any suggestions for improving performance, especially at small matrix sizes?
