What aspect of the package is your request related to?
Functions like fsum cannot benefit from parallelization with OpenMP on macOS, where the default toolchain does not support OpenMP, meaning those functions will not be much faster than their base R counterparts.
Describe the solution you'd like
By combining multiple accumulators with SIMD instructions selected by the compiler, we can greatly reduce the runtime of functions like fsum on macOS when not checking for missing values. The new code is also shown to be slightly faster on a Linux machine where OpenMP is supported.
Below is a bare-bones implementation of a fast sum function, along with benchmarking results. The code is written against R's C API, but I put it in a .cpp file to make it easy to source for benchmarking, so it is missing the restrict keyword for pointers.
I'm not sure how many functions would benefit from this or how much effort it would be to implement for all of them, but it may be something to consider, and I could certainly help if you wanted to go that route.
#include <Rcpp/Lightest>
#include <cstring> // for memset
// I believe this would correspond to the number of registers that can
// handle floating point numbers. 12 worked best for the two test machines.
// Larger values will cause register spilling, which worsens performance.
// Recommend using a multiple of 4 for doubles to make use of SIMD
// instructions that process 128 contiguous bits.
#define UNROLL_LENGTH 12
// [[Rcpp::export]]
SEXP fast_sum(const SEXP x) {
  // Rf_xlength supports long vectors (length > INT_MAX)
  const R_xlen_t n = Rf_xlength(x);
  const double *px = REAL(x);
  // Array of accumulators
  double sum_arr[UNROLL_LENGTH];
  double *psum_arr = &sum_arr[0];
  memset(psum_arr, 0, sizeof(double) * UNROLL_LENGTH);
  R_xlen_t i = 0;
  const R_xlen_t end = n - UNROLL_LENGTH + 1;
  for (; i < end; i += UNROLL_LENGTH) {
    for (int k = 0; k < UNROLL_LENGTH; ++k) {
      psum_arr[k] += px[i + k];
    }
  }
  // Aggregate the accumulators
  double sum = 0.0;
  for (int k = 0; k < UNROLL_LENGTH; ++k) {
    sum += psum_arr[k];
  }
  // Handle any remaining values (at most UNROLL_LENGTH - 1 values)
  for (; i < n; ++i) {
    sum += px[i];
  }
  return Rf_ScalarReal(sum);
}
R code for benchmarking:
library(collapse)
set.seed(0L)
x <- rnorm(1e6L)
bench::mark(
  "base::sum" = sum(x),
  "collapse::fsum" = fsum(x, na.rm = FALSE),
  "Fast sum" = fast_sum(x),
  iterations = 1e3L
)
Benchmarking results on an Apple M4 Max chip. The new code is about 7.3x faster than fsum and 10x faster than base::sum.
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 base::sum 654.3µs 657µs 1492. 0B 0 1000 0 670.1ms
2 collapse::fsum 453.8µs 485.4µs 2040. 0B 0 1000 0 490.3ms
3 Fast sum 63.3µs 65.3µs 14855. 0B 0 1000 0 67.3ms
Benchmarking results on a laptop with an Intel Core i5-6300U CPU running Linux Mint. The new code is about 1.3x faster than fsum, though this may be because the machine has only 2 cores, so OpenMP cannot help much.
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 base::sum 1.18ms 1.23ms 775. 0B 0 1000 0 1.29s
2 collapse::fsum 775.79µs 793.61µs 1241. 0B 0 1000 0 805.99ms
3 Fast sum 582.36µs 605.61µs 1623. 2.49KB 0 1000 0 616.23ms
I'm not sure why the memory allocation is 2.49 KB in the last row. It seems to be picking up some garbage.
Additional context
I learned about these techniques from "Computer Systems: A Programmer's Perspective", 3rd edition, by Randal E. Bryant and David R. O'Hallaron. Specifically, chapter 5: Optimizing Program Performance.
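The multiple-accumulator idea from that chapter can also be illustrated in standalone C++, independent of R's C API. This is just a sketch with hypothetical function names: the single-accumulator loop is limited by the latency of the chained floating-point adds, while the unrolled version breaks that dependency chain.

```cpp
#include <cstddef>
#include <vector>

// Single accumulator: every addition depends on the previous one, so the
// loop runs at the latency of a floating-point add.
double sum_naive(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four independent accumulators: the additions within one iteration do not
// depend on each other, so the CPU (and the compiler's autovectorizer) can
// execute them in parallel, as in the UNROLL_LENGTH loop above.
double sum_unrolled(const std::vector<double>& v) {
    const std::size_t n = v.size();
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) s += v[i]; // remainder, at most 3 values
    return s;
}
```

Note that reassociating the additions can change the rounding of the result for general inputs, which is why compilers will not apply this transformation on their own without something like -ffast-math.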