Optimize horizontal_sample performance#2894

Merged
197g merged 16 commits into image-rs:main from Isotr0py:opt-resize
Apr 11, 2026

@Isotr0py
Contributor

Introduction

This PR aims to optimize the performance of horizontal_sample:

  1. Precompute the column-wise weights col_ws, so that ws does not have to be recomputed.
  2. With the change in 1, pixels can be written row by row, sequentially in contiguous memory, instead of column by column.
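
For readers who want to see the shape of the two points above, here is a minimal sketch, not the code from this PR. It assumes a single-channel f32 image and a triangle kernel; `ColumnWeights`, `precompute_weights`, and the simplified `horizontal_sample` signature are hypothetical names chosen for illustration. The weights for each output column are computed once, and the rows are then walked in the outer loop so that output is appended in memory order.

```rust
/// Precomputed filter taps for one output column (hypothetical helper type).
struct ColumnWeights {
    /// Index of the first source pixel this column reads.
    left: usize,
    /// Normalized weights for source pixels `left..left + weights.len()`.
    weights: Vec<f32>,
}

/// Triangle (bilinear) kernel with support 1.0.
fn triangle(x: f32) -> f32 {
    (1.0 - x.abs()).max(0.0)
}

/// Compute the weights of every output column exactly once.
fn precompute_weights(src_w: usize, dst_w: usize) -> Vec<ColumnWeights> {
    assert!(src_w > 0 && dst_w > 0, "widths must be non-zero");
    let ratio = src_w as f32 / dst_w as f32;
    let scale = ratio.max(1.0); // widen the kernel when downscaling
    let support = 1.0 * scale;
    (0..dst_w)
        .map(|out_x| {
            let center = (out_x as f32 + 0.5) * ratio;
            let left = ((center - support).floor().max(0.0) as usize).min(src_w - 1);
            let right = ((center + support).ceil() as usize).clamp(left + 1, src_w);
            let mut weights: Vec<f32> = (left..right)
                .map(|i| triangle((i as f32 + 0.5 - center) / scale))
                .collect();
            let sum: f32 = weights.iter().sum();
            if sum > 0.0 {
                for w in &mut weights {
                    *w /= sum;
                }
            }
            ColumnWeights { left, weights }
        })
        .collect()
}

/// Resample each row horizontally, writing the output row by row.
fn horizontal_sample(src: &[f32], src_w: usize, dst_w: usize) -> Vec<f32> {
    let cols = precompute_weights(src_w, dst_w);
    let height = src.len() / src_w;
    let mut dst = Vec::with_capacity(dst_w * height);
    for row in src.chunks_exact(src_w) {
        // Outer loop over rows: every push lands right after the previous one.
        for col in &cols {
            let taps = &row[col.left..col.left + col.weights.len()];
            let v: f32 = taps.iter().zip(&col.weights).map(|(p, w)| p * w).sum();
            dst.push(v);
        }
    }
    dst
}
```

The cost of this layout is the extra memory held by the precomputed weights, which grows with the number of output columns times the kernel width.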

Tests

cargo test imageops
Test logs
running 44 tests
test imageops::affine::test::test_flip_vertical ... ok
test imageops::affine::test::test_rotate180 ... ok
test imageops::affine::test::test_flip_vertical_in_place ... ok
test imageops::affine::test::test_flip_horizontal_in_place ... ok
test imageops::affine::test::test_flip_horizontal ... ok
test imageops::affine::test::test_rotate180_in_place ... ok
test imageops::affine::test::test_rotate270 ... ok
test imageops::affine::test::test_rotate90 ... ok
test imageops::colorops::test::test_brighten ... ok
test imageops::colorops::test::test_brighten_place ... ok
test imageops::colorops::test::test_grayscale ... ok
test imageops::colorops::test::test_invert ... ok
test imageops::colorops::test::test_dither ... ok
test imageops::resize::tests::smoke_test ... ok
test imageops::sample::tests::issue_2340_refl ... ok
test imageops::sample::tests::test_sample_bilinear_correctness ... ok
test imageops::sample::tests::test_sample_nearest_correctness ... ok
test imageops::sample::tests::issue_2340 ... ok
test imageops::tests::test_fast_blur_1_channels ... ok
test imageops::tests::test_fast_blur_empty ... ok
test imageops::tests::test_fast_blur_2_channels ... ok
test imageops::tests::test_image_coordinate_overflow ... ok
test imageops::tests::test_fast_large_sigma ... ok
test imageops::tests::test_image_in_image ... ok
test imageops::tests::test_image_outside_image_no_wrap_around ... ok
test imageops::tests::test_image_in_image_outside_of_bounds ... ok
test imageops::tests::test_blur_zero ... ok
test imageops::tests::test_fast_blur_3_channels ... ok
test imageops::sample::tests::resize_transparent_image ... ok
test imageops::tests::test_image_overlay_transparent ... ok
test imageops::tests::test_image_horizontal_gradient_limits ... ok
test imageops::tests::test_image_vertical_gradient_limits ... ok
test imageops::tests::test_overlay_bounds_ext ... ok
test imageops::tests::test_image_thumbnail ... ok
test imageops::tests::test_fast_blur_zero ... ok
test imageops::tests::test_fast_blur_negative ... ok
test images::dynimage::test::color_space_independent_imageops ... ok
test imageops::sample::tests::test_issue_186 ... ok
test imageops::sample::tests::test_sample_bilinear ... ok
test imageops::sample::tests::test_sample_nearest ... ok
test imageops::sample::tests::test_resize_same_size ... ok
test imageops::sample::tests::bug_1600 ... ok
test imageops::fast_blur::tests::test_box_blur ... ok
test imageops::tests::fast_blur_approximates_gaussian_blur_well ... ok

test result: ok. 44 passed; 0 failed; 0 ignored; 0 measured; 229 filtered out; finished in 0.49s

Benchmark

Compare against main:

cargo bench resize
Bench logs
resize 400x300 Nearest  time:   [3.7787 ms 3.7950 ms 3.8111 ms]
                        change: [-17.335% -16.050% -14.787%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  10 (10.00%) low mild
  1 (1.00%) high severe

resize 400x300 Triangle time:   [11.153 ms 11.194 ms 11.240 ms]
                        change: [-8.5832% -7.9978% -7.4291%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

resize 400x300 CatmullRom
                        time:   [18.376 ms 18.420 ms 18.472 ms]
                        change: [-4.5051% -4.2300% -3.9487%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

resize 400x300 Gaussian time:   [25.743 ms 25.798 ms 25.857 ms]
                        change: [-5.6979% -5.2078% -4.7450%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

resize 400x300 Lanczos3 time:   [25.751 ms 25.808 ms 25.871 ms]
                        change: [-4.2992% -3.9803% -3.6595%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Benchmarking large/resize 2000x2000 Nearest: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.1s or enable flat sampling.
large/resize 2000x2000 Nearest
                        time:   [108.46 ms 108.92 ms 109.49 ms]
                        change: [-22.370% -21.867% -21.350%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking large/resize 2000x2000 Triangle: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.1s or enable flat sampling.
large/resize 2000x2000 Triangle
                        time:   [128.29 ms 128.88 ms 129.70 ms]
                        change: [-21.516% -20.914% -20.371%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking large/resize 2000x2000 CatmullRom: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.3s or enable flat sampling.
large/resize 2000x2000 CatmullRom
                        time:   [148.07 ms 148.44 ms 149.07 ms]
                        change: [-33.440% -33.081% -32.730%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking large/resize 2000x2000 Gaussian: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.2s or enable flat sampling.
large/resize 2000x2000 Gaussian
                        time:   [165.25 ms 165.53 ms 165.78 ms]
                        change: [-19.554% -19.165% -18.748%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking large/resize 2000x2000 Lanczos3: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.3s or enable flat sampling.
large/resize 2000x2000 Lanczos3
                        time:   [165.72 ms 166.30 ms 167.03 ms]
                        change: [-19.450% -19.071% -18.703%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild

Signed-off-by: Isotr0py <2037008807@qq.com>
Comment thread src/imageops/sample.rs Outdated
Comment on lines +273 to +276
let max_ks = (2.0 * src_support).ceil() as usize + 2;
// Preallocated buffer for precomputed weights.
let mut col_ws: Vec<f32> =
    vec_try_with_capacity(col_count * max_ks).expect("capacity overflow in horizontal_sample");
Member

Some kernels can be very large. This will definitely blow through those limits. There may be a way to write it as a fast-path but the case where this space-time tradeoff is inapplicable must be considered.

Contributor Author

After some thought, I think we can limit the maximum memory usage by processing in batches. When the kernel size is very large, we can split each row into multiple batches and process them one at a time, which ensures the cache stays small enough to fit.
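
A rough sketch of the batching idea, under stated assumptions: the cap `MAX_CACHED_WEIGHTS` and the helpers `batch_columns` and `column_batches` are hypothetical and are not taken from this PR. The point is only that the number of output columns handled at once shrinks as `max_ks` grows, so the weight buffer never exceeds a fixed budget.

```rust
use std::ops::Range;

/// Hypothetical cap on the number of cached f32 weights; not the value used in the PR.
const MAX_CACHED_WEIGHTS: usize = 1 << 16;

/// Number of output columns whose weights fit in the cache at once.
/// Always at least 1, so a kernel larger than the cap still makes progress.
fn batch_columns(max_ks: usize) -> usize {
    (MAX_CACHED_WEIGHTS / max_ks.max(1)).max(1)
}

/// Split the output columns `0..dst_w` into consecutive batches.
fn column_batches(dst_w: usize, max_ks: usize) -> Vec<Range<usize>> {
    let step = batch_columns(max_ks);
    (0..dst_w)
        .step_by(step)
        .map(|start| start..(start + step).min(dst_w))
        .collect()
}
```

A caller would then loop over `column_batches(dst_w, max_ks)`, fill a reusable weight buffer for the columns of one batch, and resample all rows for those columns before moving on. Small kernels yield a single batch covering the whole row, so the fast path is unchanged in that case.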

Isotr0py added 3 commits April 4, 2026 12:26
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
@Isotr0py
Contributor Author

Isotr0py commented Apr 4, 2026

After removing the put_pixel boundary validation, almost all cases gain about 2% in performance:

resize 400x300 Nearest  time:   [3.6815 ms 3.7044 ms 3.7293 ms]
                        change: [-19.152% -18.104% -17.080%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

resize 400x300 Triangle time:   [11.324 ms 11.366 ms 11.413 ms]
                        change: [-6.6814% -6.1490% -5.5823%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

resize 400x300 CatmullRom
                        time:   [18.816 ms 18.886 ms 18.963 ms]
                        change: [-4.9400% -4.4199% -3.9013%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

resize 400x300 Gaussian time:   [26.186 ms 26.243 ms 26.302 ms]
                        change: [-4.8502% -4.4794% -4.1189%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

resize 400x300 Lanczos3 time:   [26.202 ms 26.263 ms 26.331 ms]
                        change: [-5.4348% -4.9670% -4.5354%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

Benchmarking large/resize 2000x2000 Nearest: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.0s or enable flat sampling.
large/resize 2000x2000 Nearest
                        time:   [105.42 ms 105.93 ms 106.80 ms]
                        change: [-28.052% -27.170% -26.260%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking large/resize 2000x2000 Triangle: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.9s or enable flat sampling.
large/resize 2000x2000 Triangle
                        time:   [124.92 ms 125.97 ms 127.42 ms]
                        change: [-27.343% -26.269% -25.198%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking large/resize 2000x2000 CatmullRom: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.1s or enable flat sampling.
large/resize 2000x2000 CatmullRom
                        time:   [145.90 ms 147.11 ms 148.25 ms]
                        change: [-38.001% -37.307% -36.646%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking large/resize 2000x2000 Gaussian: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.2s or enable flat sampling.
large/resize 2000x2000 Gaussian
                        time:   [163.14 ms 163.50 ms 163.94 ms]
                        change: [-22.724% -22.122% -21.519%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking large/resize 2000x2000 Lanczos3: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.1s or enable flat sampling.
large/resize 2000x2000 Lanczos3
                        time:   [163.99 ms 164.66 ms 165.50 ms]
                        change: [-23.341% -22.462% -21.539%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe
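
The conversation does not show how the put_pixel validation was removed. As an illustration only, one common safe way to avoid a per-pixel coordinate check is to iterate over the output buffer row by row through slices. The `Gray` type and both fill functions below are hypothetical stand-ins, not the `image` crate's `ImageBuffer` API.

```rust
/// Minimal stand-in for a grayscale image buffer (hypothetical type).
struct Gray {
    width: usize,
    height: usize,
    data: Vec<u8>,
}

impl Gray {
    fn new(width: usize, height: usize) -> Self {
        Gray { width, height, data: vec![0; width * height] }
    }

    /// Checked write: validates the coordinates on every call.
    fn put_pixel(&mut self, x: usize, y: usize, v: u8) {
        assert!(x < self.width && y < self.height, "pixel out of bounds");
        self.data[y * self.width + x] = v;
    }
}

/// Fill the image through the checked per-pixel API.
fn fill_checked(img: &mut Gray, f: impl Fn(usize, usize) -> u8) {
    for y in 0..img.height {
        for x in 0..img.width {
            img.put_pixel(x, y, f(x, y));
        }
    }
}

/// Fill the image by walking row slices; no per-pixel coordinate check is needed
/// because the iterators can only yield in-bounds elements.
fn fill_by_rows(img: &mut Gray, f: impl Fn(usize, usize) -> u8) {
    let width = img.width;
    if width == 0 {
        return; // chunks_exact_mut panics on a chunk size of 0
    }
    for (y, row) in img.data.chunks_exact_mut(width).enumerate() {
        for (x, px) in row.iter_mut().enumerate() {
            *px = f(x, y);
        }
    }
}
```

Both functions produce the same buffer; the second simply lets the slice iterators carry the bounds guarantee instead of re-checking `x` and `y` for every pixel.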

Member

@197g 197g left a comment

The batch approach looks promising.

Isotr0py and others added 3 commits April 5, 2026 22:06
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Aurelia Molzer <5550310+197g@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Isotr0py and others added 3 commits April 6, 2026 11:35
Co-authored-by: Aurelia Molzer <5550310+197g@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
@197g 197g merged commit f2a0cbd into image-rs:main Apr 11, 2026
31 checks passed
@Isotr0py Isotr0py deleted the opt-resize branch April 11, 2026 10:19