What aspect of the package is your request related to?
Functions like fsum cannot benefit from parallelization with OpenMP on macOS, where the default toolchain does not support OpenMP, meaning those functions will not be much faster than their base R counterparts.
Describe the solution you'd like
By combining multiple accumulators with SIMD instructions selected by the compiler, we can greatly reduce the runtime of functions like fsum on macOS when not checking for missing values. The new code is also shown to be slightly faster on a Linux machine where OpenMP is supported.
Below is a bare-bones implementation of a fast sum function, along with benchmarking results. The code is written against R's C API, but I put it in a .cpp file to make it easy to source for benchmarking, so it is missing the restrict keyword for pointers.
I'm not sure how many functions would benefit from this or how much effort it would be to implement for all of them, but it may be something to consider, and I could certainly help if you wanted to go that route.
#include <Rcpp/Lightest>
#include <cstring> // for memset
// I believe this would correspond to the number of registers that can
// handle floating point numbers. 12 worked best for the two test machines.
// Larger values will cause register spilling, which worsens performance.
// Recommend using a multiple of 4 for doubles to make use of SIMD
// instructions that process 128 contiguous bits.
#define UNROLL_LENGTH 12
// [[Rcpp::export]]
SEXP fast_sum(const SEXP x) {
  // Rf_xlength supports long vectors (length > INT_MAX)
  const R_xlen_t n = Rf_xlength(x);
  const double *px = REAL(x);
  // Array of accumulators
  double sum_arr[UNROLL_LENGTH];
  double *psum_arr = &sum_arr[0];
  memset(psum_arr, 0, sizeof(double) * UNROLL_LENGTH);
  R_xlen_t i = 0;
  const R_xlen_t end = n - UNROLL_LENGTH + 1;
  for (; i < end; i += UNROLL_LENGTH) {
    for (int k = 0; k < UNROLL_LENGTH; ++k) {
      psum_arr[k] += px[i + k];
    }
  }
  // Aggregate the accumulators
  double sum = 0.0;
  for (int k = 0; k < UNROLL_LENGTH; ++k) {
    sum += psum_arr[k];
  }
  // Handle any remaining values (at most UNROLL_LENGTH - 1 values)
  for (; i < n; ++i) {
    sum += px[i];
  }
  return Rf_ScalarReal(sum);
}
R code for benchmarking:
library(collapse)
set.seed(0L)
x <- rnorm(1e6L)
bench::mark(
  "base::sum" = sum(x),
  "collapse::fsum" = fsum(x, na.rm = FALSE),
  "Fast sum" = fast_sum(x),
  iterations = 1e3L
)
Benchmarking results on an Apple M4 Max chip. The new code is about 7.3x faster than fsum and 10x faster than base::sum.
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 base::sum 654.3µs 657µs 1492. 0B 0 1000 0 670.1ms
2 collapse::fsum 453.8µs 485.4µs 2040. 0B 0 1000 0 490.3ms
3 Fast sum 63.3µs 65.3µs 14855. 0B 0 1000 0 67.3ms
Benchmarking results on a laptop with an Intel Core i5-6300U CPU running Linux Mint. The new code is about 1.3x faster than fsum, though this may be because the machine has only 2 cores, so OpenMP cannot help much.
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 base::sum 1.18ms 1.23ms 775. 0B 0 1000 0 1.29s
2 collapse::fsum 775.79µs 793.61µs 1241. 0B 0 1000 0 805.99ms
3 Fast sum 582.36µs 605.61µs 1623. 2.49KB 0 1000 0 616.23ms
I'm not sure why the memory allocation is 2.49 KB in the last row. It seems to be picking up some garbage.
Additional context
I learned about these techniques from "Computer Systems: A Programmer's Perspective", 3rd edition, by Randal E. Bryant and David R. O'Hallaron. Specifically, chapter 5: Optimizing Program Performance.
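The multiple-accumulator idea from that chapter can also be illustrated in standalone C++, independent of R's C API. This is just a sketch with hypothetical function names: the single-accumulator loop is limited by the latency of the chained floating-point adds, while the unrolled version breaks that dependency chain.

```cpp
#include <cstddef>
#include <vector>

// Single accumulator: every addition depends on the previous one, so the
// loop runs at the latency of a floating-point add.
double sum_naive(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four independent accumulators: the additions within one iteration do not
// depend on each other, so the CPU (and the compiler's autovectorizer) can
// execute them in parallel, as in the UNROLL_LENGTH loop above.
double sum_unrolled(const std::vector<double>& v) {
    const std::size_t n = v.size();
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) s += v[i]; // remainder, at most 3 values
    return s;
}
```

Note that reassociating the additions can change the rounding of the result for general inputs, which is why compilers will not apply this transformation on their own without something like -ffast-math.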