
[FEAT] Improve speed of fsum when OpenMP support is unavailable (macOS) #824

@TylerSagendorf

Description


What aspect of the package is your request related to?

Functions like fsum cannot benefit from OpenMP parallelization on macOS because OpenMP is not supported there, so those functions are not much faster than their base R counterparts.

Describe the solution you'd like

By combining multiple accumulators with SIMD instructions selected by the compiler, we can greatly reduce the runtime of functions like fsum on macOS when not checking for missing values. The new code is also shown to be slightly faster on a Linux machine where OpenMP is supported.

Below is a bare-bones implementation of a fast sum function, along with benchmarking results. The code is written against R's C API, but I put it in a .cpp file to make it easy to source for benchmarking, so it is missing the restrict keyword on the pointers.

I'm not sure how many functions would benefit from this or how much effort it would be to implement for all of them, but it may be something to consider, and I could certainly help if you wanted to go that route.

#include <Rcpp/Lightest>
#include <cstring> // memset

// I believe this would correspond to the number of registers that can 
// handle floating point numbers. 12 worked best for the two test machines. 
// Larger values will cause register spilling, which worsens performance.
// Recommend using a multiple of 4 for doubles to make use of SIMD 
// instructions that process 128 contiguous bits.
#define UNROLL_LENGTH 12

// [[Rcpp::export]]
SEXP fast_sum(const SEXP x) {
  const int n = Rf_length(x);
  const double *px = REAL(x);

  // Array of accumulators
  double sum_arr[UNROLL_LENGTH];
  double *psum_arr = &sum_arr[0];
  memset(psum_arr, 0, sizeof(double) * UNROLL_LENGTH);

  int i = 0;
  int end = n - UNROLL_LENGTH + 1;

  for (; i < end; i += UNROLL_LENGTH) {
    for (int k = 0; k < UNROLL_LENGTH; ++k) {
      psum_arr[k] += px[i + k];
    }
  }

  // Aggregate the accumulators
  double sum = 0.0;

  for (int k = 0; k < UNROLL_LENGTH; ++k) {
    sum += psum_arr[k];
  }

  // Handle any remaining values (at most UNROLL_LENGTH - 1 values)
  for (; i < n; ++i) {
    sum += px[i];
  }

  return Rf_ScalarReal(sum);
}
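For reference, a pure-C version with the restrict qualifier (closer to what it might look like inside collapse's C sources; the function name and the guard for short inputs are my own) could be sketched as:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Unroll factor: same value as in the .cpp sketch above. */
#define UNROLL_LENGTH 12

/* Unrolled sum with restrict, telling the compiler the input
   cannot alias anything else, which helps auto-vectorization. */
double fast_sum_c(const double *restrict px, const size_t n) {
  double sum_arr[UNROLL_LENGTH];
  memset(sum_arr, 0, sizeof sum_arr);

  size_t i = 0;
  if (n >= UNROLL_LENGTH) {
    const size_t end = n - UNROLL_LENGTH + 1;
    for (; i < end; i += UNROLL_LENGTH)
      for (int k = 0; k < UNROLL_LENGTH; ++k)
        sum_arr[k] += px[i + k];
  }

  /* Aggregate the accumulators. */
  double sum = 0.0;
  for (int k = 0; k < UNROLL_LENGTH; ++k) sum += sum_arr[k];

  /* Tail: at most UNROLL_LENGTH - 1 remaining values. */
  for (; i < n; ++i) sum += px[i];
  return sum;
}
```

The extra n >= UNROLL_LENGTH check is needed here because the index is size_t; with size_t, n - UNROLL_LENGTH + 1 would wrap around for short inputs, whereas the int arithmetic in the .cpp version simply goes negative and skips the loop.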

R code for benchmarking:

library(collapse)

set.seed(0L)
x <- rnorm(1e6L)

bench::mark(
  "base::sum" = sum(x),
  "collapse::fsum" = fsum(x, na.rm = FALSE),
  "Fast sum" = fast_sum(x),
  iterations = 1e3L
)

Benchmarking results on an Apple M4 Max chip. The new code is about 7.3x faster than fsum and 10x faster than base::sum.

  expression         min    median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>     <bch:tm> <bch:tm>    <dbl>  <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 base::sum       654.3µs    657µs     1492.        0B        0  1000     0    670.1ms
2 collapse::fsum  453.8µs  485.4µs     2040.        0B        0  1000     0    490.3ms
3 Fast sum         63.3µs   65.3µs    14855.        0B        0  1000     0     67.3ms

Benchmarking results on a laptop with an Intel Core i5-6300U CPU running Linux Mint. The new code is about 1.3x faster than fsum, though this may be because there are only 2 cores, so OpenMP won't help much.

  expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 base::sum        1.18ms   1.23ms      775.        0B        0  1000     0      1.29s
2 collapse::fsum 775.79µs 793.61µs     1241.        0B        0  1000     0   805.99ms
3 Fast sum       582.36µs 605.61µs     1623.    2.49KB        0  1000     0   616.23ms

I'm not sure why the memory allocation is 2.49 KB in the last row. It seems to be picking up some garbage.

Additional context
I learned about these techniques from "Computer Systems: A Programmer's Perspective", 3rd edition, by Randal E. Bryant and David R. O'Hallaron. Specifically, chapter 5: Optimizing Program Performance.
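The core idea from that chapter is breaking the loop-carried dependency: with a single accumulator, each addition must wait for the previous one to finish, so the loop runs at the latency of one floating-point add per element; with independent accumulators, several adds can be in flight at once. A minimal two-accumulator illustration (my own sketch of the textbook's "2x1" vs. "2x2" unrolling):

```c
#include <assert.h>
#include <stddef.h>

/* Single accumulator: every += depends on the previous result,
   forming one long dependency chain. */
double sum_naive(const double *px, size_t n) {
  double s = 0.0;
  for (size_t i = 0; i < n; ++i) s += px[i];
  return s;
}

/* Two accumulators: the even and odd chains are independent,
   so two floating-point adds can execute in parallel. */
double sum_two_acc(const double *px, size_t n) {
  double s0 = 0.0, s1 = 0.0;
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    s0 += px[i];
    s1 += px[i + 1];
  }
  if (i < n) s0 += px[i]; /* odd-length tail */
  return s0 + s1;
}
```

Note that both variants can give slightly different results than a strictly sequential sum, since floating-point addition is not associative; that is also true of the UNROLL_LENGTH-accumulator version above.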

Labels: enhancement (New feature or request)