Adds OpenMP to qsort, should also improve test speed a bit by sterrettm2 · Pull Request #179 · numpy/x86-simd-sort

sterrettm2 · 2025-01-17T00:35:21Z

This adds OpenMP acceleration to quicksort, in addition to making some changes to the testing code. On my current testing system, this gives up to a 3x speedup, though I think more may be achieved on a stronger system.

Benchmarks:

10m

Comparing simdsort/random_10m (from /home/sterrettm/simd-sort/x86-simd-sort/.bench/main/builddir/benchexe) to simdsort/random_10m (from /home/sterrettm/simd-sort/x86-simd-sort/.bench/qsort-openmp/builddir/benchexe)
Benchmark                                                                       Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------------ 50815267
[simdsort/random_10m vs. simdsort/random_10m]/uint64_t_mean                  -0.6462         -0.6461     161374745      57099765     161274843      57081772
[simdsort/random_10m vs. simdsort/random_10m]/int64_t_mean                   -0.6722         -0.6720     161381007      52894903     161242920      52883775
[simdsort/random_10m vs. simdsort/random_10m]/uint32_t_mean                  -0.6683         -0.6684      76346854      25321800      76335248      25316127
[simdsort/random_10m vs. simdsort/random_10m]/int32_t_mean                   -0.6461         -0.6462      74568936      26387521      74555377      26374643
[simdsort/random_10m vs. simdsort/random_10m]/uint16_t_mean                  +0.0286         +0.0286     306422934     315196752     306379370     315156800
[simdsort/random_10m vs. simdsort/random_10m]/int16_t_mean                   +0.0159         +0.0159     308471111     313372843     308432095     313323090
[simdsort/random_10m vs. simdsort/random_10m]/float_mean                     -0.6593         -0.6593      75784101      25821957      75777886      25815941
[simdsort/random_10m vs. simdsort/random_10m]/double_mean                    -0.6613         -0.6613     140408876      47560519     140340074      47530458
OVERALL_GEOMEAN                                                              -0.5520         -0.5520             0             0             0             0

1m

Comparing simdsort/random_1m (from /home/sterrettm/simd-sort/x86-simd-sort/.bench/main/builddir/benchexe) to simdsort/random_1m (from /home/sterrettm/simd-sort/x86-simd-sort/.bench/qsort-openmp/builddir/benchexe)
Benchmark                                                                     Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_1m vs. simdsort/random_1m]/uint64_t_mean                  -0.5563         -0.5564      13892838       6163743      13893768       6162738
[simdsort/random_1m vs. simdsort/random_1m]/int64_t_mean                   -0.5439         -0.5441      13943919       6359981      13944683       6358016
[simdsort/random_1m vs. simdsort/random_1m]/uint32_t_mean                  -0.4661         -0.4662       6244794       3334286       6244370       3333435
[simdsort/random_1m vs. simdsort/random_1m]/int32_t_mean                   -0.3862         -0.3861       6172364       3788625       6173740       3789904
[simdsort/random_1m vs. simdsort/random_1m]/uint16_t_mean                  -0.0138         -0.0138      31545637      31108834      31548000      31111137
[simdsort/random_1m vs. simdsort/random_1m]/int16_t_mean                   -0.0076         -0.0076      30964472      30728775      30967064      30730877467       3261958
[simdsort/random_1m vs. simdsort/random_1m]/float_mean                     -0.4568         -0.4566       6418410       3486598       6419908       3488443
[simdsort/random_1m vs. simdsort/random_1m]/double_mean                    -0.5486         -0.5488      11469483       5177166      11469799       5174866
OVERALL_GEOMEAN                                                            -0.4088         -0.4089             0             0             0             0

And to show that there does not seem to be a regression with small sizes due to the SIMD logic:

128

Benchmark                                                                       Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_128 vs. simdsort/random_128]/uint64_t_mean                  -0.0442         -0.0512           866           828           878           833
[simdsort/random_128 vs. simdsort/random_128]/int64_t_mean                   -0.0006         -0.0028           821           821           828           826
[simdsort/random_128 vs. simdsort/random_128]/uint32_t_mean                  +0.0152         +0.0150           583           591           587           596
[simdsort/random_128 vs. simdsort/random_128]/int32_t_mean                   +0.0210         +0.0150           582           594           588           597
[simdsort/random_128 vs. simdsort/random_128]/uint16_t_mean                  -0.0034         +0.0008          1933          1926          1939          1941
[simdsort/random_128 vs. simdsort/random_128]/int16_t_mean                   +0.0061         +0.0104          1889          1901          1896          1916
[simdsort/random_128 vs. simdsort/random_128]/float_mean                     +0.0012         +0.0022           599           600           605           606
[simdsort/random_128 vs. simdsort/random_128]/double_mean                    +0.0079         +0.0086           707           712           711           717
OVERALL_GEOMEAN                                                              +0.0003         -0.0003             0             0             0             0

In addition, this adds larger tests for quicksort to test the OpenMP logic. Out of concern for the runtime of the test suite, I modified both these new tests and the older kv tests to only use the larger sizes for the main sort, the only one that uses OpenMP and thus needs them. I believe this should result in a noticeable net reduction in the runtime of the test suite.

r-devulap · 2025-03-25T21:47:49Z

[simdsort/random_10m vs. simdsort/random_10m]/uint16_t_mean                  +0.0286         +0.0286     306422934     315196752     306379370     315156800
[simdsort/random_10m vs. simdsort/random_10m]/int16_t_mean                   +0.0159         +0.0159     308471111     313372843     308432095     313323090

I assume this was benchmarked on a SKX?

r-devulap

LGTM. I noticed you use the same thresholds mostly from the key-value sort. Did you get a chance to experiment with it?

sterrettm2 · 2025-03-26T17:33:01Z

Oops, that int16 slowness seems to be a real issue. Let me fix that quickly

sterrettm2 · 2025-03-26T18:58:40Z

Okay, this should be fixed now. Here are some benchmarks from my SPR machine:
EDIT: Added the 16-bit swizzle patch

SPR Benchmarks

Benchmark                                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_10m/ vs. simdsort/random_10m/]uint64_t                -0.6771         -0.6771      76630598      24743835      76628741      24743600
[simdsort/random_10m/ vs. simdsort/random_10m/]int64_t                 -0.6819         -0.6819      77024679      24503926      77018065      24497835
[simdsort/random_10m/ vs. simdsort/random_10m/]uint32_t                -0.7413         -0.7414      31002898       8020485      30998289       8016108
[simdsort/random_10m/ vs. simdsort/random_10m/]int32_t                 -0.7382         -0.7382      30928995       8098572      30928261       8097811
[simdsort/random_10m/ vs. simdsort/random_10m/]uint16_t                -0.9503         -0.9503     113969113       5663682     113952771       5663444
[simdsort/random_10m/ vs. simdsort/random_10m/]int16_t                 -0.9505         -0.9505     114573142       5673764     114556807       5673571
[simdsort/random_10m/ vs. simdsort/random_10m/]float                   -0.7369         -0.7370      30276061       7965491      30275684       7963746
[simdsort/random_10m/ vs. simdsort/random_10m/]double                  -0.6361         -0.6360      63558447      23129980      63549471      23129564
[simdsort/random_10m/ vs. simdsort/random_10m/]_Float16                -0.9435         -0.9435      77698793       4391234      77689627       4390512
OVERALL_GEOMEAN                                                        -0.8346         -0.8346             0             0             0             0

Pulls in 2 major changes: (1) Fixes a performance regression on 16-bit dtype sorting (see numpy/x86-simd-sort#190) (2) Adds openmp support for quicksort which speeds up sorting arrays > 100,000 by up to 3x. See: numpy/x86-simd-sort#179

sterrettm2 requested a review from r-devulap January 29, 2025 19:04

r-devulap mentioned this pull request Feb 6, 2025

Add OpenMP based parallelization also to qsort #175

Closed

sterrettm2 force-pushed the qsort-openmp branch from 9500eb4 to 8ddc73a Compare February 24, 2025 20:13

r-devulap previously approved these changes Mar 25, 2025

View reviewed changes

sterrettm2 added 6 commits March 26, 2025 11:35

Add OpenMP acceleration for normal quicksort

e29d303

Speed up tests, in particular when OpenMP is disabled

dbbf2de

Change threshold for OpenMP/tasks from 10k to 100k

d77d0e3

Fix that weird formatting issue

e30d8b6

Fix missing openmp flags

1c5c19b

Use unrolled partition for _Float16

dba705f

sterrettm2 dismissed r-devulap’s stale review via dba705f March 26, 2025 18:57

sterrettm2 force-pushed the qsort-openmp branch from 8ddc73a to dba705f Compare March 26, 2025 18:57

Change 16-bit swizzle from vector to C arrays

e01e79f

r-devulap approved these changes Mar 28, 2025

View reviewed changes

r-devulap merged commit b12fe4d into numpy:main Mar 28, 2025
11 checks passed

r-devulap mentioned this pull request Apr 1, 2025

ENH: Use openmp on x86-simd-sort to speed up np.sort and np.argsort numpy/numpy#28619

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adds OpenMP to qsort, should also improve test speed a bit#179

Adds OpenMP to qsort, should also improve test speed a bit#179
r-devulap merged 7 commits intonumpy:mainfrom
sterrettm2:qsort-openmp

sterrettm2 commented Jan 17, 2025

Uh oh!

r-devulap commented Mar 25, 2025

Uh oh!

r-devulap left a comment

Uh oh!

sterrettm2 commented Mar 26, 2025

Uh oh!

sterrettm2 commented Mar 26, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

sterrettm2 commented Jan 17, 2025

Uh oh!

r-devulap commented Mar 25, 2025

Uh oh!

r-devulap left a comment

Choose a reason for hiding this comment

Uh oh!

sterrettm2 commented Mar 26, 2025

Uh oh!

sterrettm2 commented Mar 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sterrettm2 commented Mar 26, 2025 •

edited

Loading