@Maegereg (Contributor) commented Nov 10, 2025

Covers addition, multiplication, and equality (contains, for the arb and acb types) for acb, arb, fmpq, and fmpz.

The primary goal is to use these to measure the performance effect of using the stable API (#338), but they could be useful for other things in the future.

I'm particularly looking for feedback on whether this should include additional types or operations.

@oscarbenjamin (Collaborator)

Is there some package that could be used for benchmarking here?

Ideally what you want is to be able to compare two different versions to see possible statistically significant differences.

@oscarbenjamin (Collaborator)

The failed CI job is possibly due to the Cython constraint and might be fixed after gh-350.

@Maegereg (Contributor, Author)

> Is there some package that could be used for benchmarking here?

I was initially assuming that we'd want to follow the philosophy of the tests and keep things pretty minimal. But I've done a bit of research now, and it looks like pyperf could be useful here - it has good support for running a suite of benchmarks, and comparing multiple runs (which would allow us to get comparisons between different builds of the library). We'd still need either some manual effort to set up the different builds in different environments, or some scripting on top of pyperf to automate that a little (I was planning to do that anyway in the world where we aren't using pyperf).

If that sounds reasonable to you, I can rewrite these benchmarks to use pyperf. I plan to leave the scaffolding for handling multiple builds to a future PR, so that right now we can focus on whether these are the right things to measure.
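For concreteness, a minimal pyperf benchmark in the style I have in mind might look like this (the function and benchmark names are illustrative, not the actual file contents):

```python
import pyperf
import flint

def bench_fmpz_add(a, b):
    # pyperf calls this repeatedly in separate worker processes and
    # aggregates the timings; bench_func adds a small per-call overhead,
    # which is acceptable for a sketch like this.
    return a + b

runner = pyperf.Runner()
runner.bench_func("fmpz addition", bench_fmpz_add, flint.fmpz(2), flint.fmpz(3))
```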

@Maegereg (Contributor, Author)

I went ahead and wrote up a version that uses pyperf.

@oscarbenjamin (Collaborator)

Sorry, this dropped off my radar.

@oscarbenjamin (Collaborator)

It has taken me a while to figure out how to actually run the benchmarks in my dev setup, but it is:

```
spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH
```

This is because I am using environment variables to make libflint.so available to the runtime linker, but pyperf by default drops environment variables when launching the subprocesses that actually run the benchmarks.

When I run the benchmarks I see these warnings in the output for each case:

```
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (47.8 ns) is 12% of the mean (389 ns)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

acb addition: Mean +- std dev: 389 ns +- 48 ns
.....................
WARNING: the benchmark result may be unstable
* Not enough samples to get a stable result (95% certainly of less than 1% variation)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.
...
```

Is there a way to write the benchmarking code differently so that the results are considered to be more reliable?
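(Maybe just asking pyperf for more samples would help; this is an untested guess at the relevant Runner options:)

```python
import pyperf

# Untested guess: more worker processes and more values per process might
# reduce the reported jitter. processes, values and warmups are real
# pyperf.Runner options, but the numbers here are arbitrary.
runner = pyperf.Runner(processes=40, values=5, warmups=2)
```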

The warnings can be suppressed with --quiet, so for now I'll use that, and we have:

```
$ meson setup build --reconfigure -Dbuildtype=release
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 173 ns +- 2 ns
acb contains: Mean +- std dev: 566 ns +- 6 ns
acb multiplication: Mean +- std dev: 165 ns +- 21 ns
arb addition: Mean +- std dev: 138 ns +- 2 ns
arb contains: Mean +- std dev: 1.06 us +- 0.01 us
arb multiplication: Mean +- std dev: 133 ns +- 1 ns
fmpq addition: Mean +- std dev: 184 ns +- 28 ns
fmpq equality: Mean +- std dev: 342 ns +- 6 ns
fmpq multiplication: Mean +- std dev: 208 ns +- 4 ns
fmpz addition: Mean +- std dev: 92.7 ns +- 0.9 ns
fmpz equality: Mean +- std dev: 93.1 ns +- 1.2 ns
fmpz multiplication: Mean +- std dev: 97.4 ns +- 6.0 ns
```

Then this is using the stable ABI v3.12:

```
$ meson setup build --reconfigure -Dbuildtype=release -Dpython.allow_limited_api=true -Dlimited_api_version=3.12
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 236 ns +- 42 ns
acb contains: Mean +- std dev: 573 ns +- 17 ns
acb multiplication: Mean +- std dev: 197 ns +- 11 ns
arb addition: Mean +- std dev: 171 ns +- 25 ns
arb contains: Mean +- std dev: 1.05 us +- 0.01 us
arb multiplication: Mean +- std dev: 159 ns +- 14 ns
fmpq addition: Mean +- std dev: 231 ns +- 16 ns
fmpq equality: Mean +- std dev: 464 ns +- 4 ns
fmpq multiplication: Mean +- std dev: 265 ns +- 12 ns
fmpz addition: Mean +- std dev: 130 ns +- 9 ns
fmpz equality: Mean +- std dev: 99.8 ns +- 7.4 ns
fmpz multiplication: Mean +- std dev: 141 ns +- 13 ns
```

(Side note: I needed to do `rm -r build-install/` when switching to the stable ABI, because otherwise you end up with both kinds of extension modules and CPython prefers the non-stable-ABI ones at import time.)

This is the stable ABI v3.9:

```
$ meson setup build --reconfigure -Dbuildtype=release -Dpython.allow_limited_api=true -Dlimited_api_version=3.9
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 195 ns +- 8 ns
acb contains: Mean +- std dev: 545 ns +- 10 ns
acb multiplication: Mean +- std dev: 182 ns +- 16 ns
arb addition: Mean +- std dev: 165 ns +- 10 ns
arb contains: Mean +- std dev: 1.05 us +- 0.03 us
arb multiplication: Mean +- std dev: 152 ns +- 7 ns
fmpq addition: Mean +- std dev: 206 ns +- 10 ns
fmpq equality: Mean +- std dev: 451 ns +- 70 ns
fmpq multiplication: Mean +- std dev: 247 ns +- 12 ns
fmpz addition: Mean +- std dev: 120 ns +- 10 ns
fmpz equality: Mean +- std dev: 94.1 ns +- 3.2 ns
fmpz multiplication: Mean +- std dev: 117 ns +- 7 ns
```

Those timings are all quite noisy. I haven't done a systematic analysis of statistical significance, but it does look like the stable ABI gives an average slowdown for these micro-operations, maybe about 20% overall. I don't see a clear difference between the 3.9 and 3.12 versions of the stable ABI (the Cython docs say that using 3.12 can make some things faster). With something bigger like an arb_mat the overhead would probably be less noticeable, but for something like fmpz(2)+fmpz(3) it is significant.

Further investigation is needed, especially rerunning the timings on a different computer, because this is an old, not very powerful machine. Assuming there really is about a 20% slowdown, I think it means that in general we don't want to use the stable ABI for all of the wheels uploaded to PyPI. We could, however, do something hybrid, like using the stable ABI only for less common platforms or for older Python versions.
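For a more systematic comparison, pyperf can write each run to JSON and then compare them with a significance test (the file names below are just examples):

```
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH -o normal.json
$ # rebuild with the stable ABI, then:
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH -o stable312.json
$ python -m pyperf compare_to normal.json stable312.json --table
```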

CC @da-woods who may be interested to know about the Cython+stable-ABI timings.

@da-woods

> CC @da-woods who may be interested to know about the Cython+stable-ABI timings.

Thanks - the 20% numbers look broadly similar to what we've measured for Cython itself (although we see a little more version dependence).

It doesn't look like any of your benchmarks is a particular outlier, so I don't think there's anything specific that's performing badly here.

Side note: I'm a little worried that the future Python 3.15 free-threading-compatible Stable ABI will be more expensive and too much of a performance loss for most people. But that's obviously a future problem.

> We could however do something hybrid like using the stable ABI for less common platforms

That's what we've done (although mostly as an "eat your own dog food" type thing rather than because it's really necessary).

@oscarbenjamin (Collaborator)

> the 20% numbers look broadly similar to what we've measured for Cython itself (although we see a little more version dependence).

The main difference here is that Cython itself is all written in Python and compiled by itself, whereas here we are just wrapping a C library. The Cython code is only a bridge from Python into C and ultimately just calls a C function, e.g. this is fmpz.__add__:

```cython
def __add__(s, t):
    cdef fmpz_struct tval[1]
    cdef int ttype = FMPZ_UNKNOWN
    u = NotImplemented
    ttype = fmpz_set_any_ref(tval, t)
    if ttype != FMPZ_UNKNOWN:
        u = fmpz.__new__(fmpz)
        fmpz_add((<fmpz>u).val, (<fmpz>s).val, tval)
    if ttype == FMPZ_TMP:
        fmpz_clear(tval)
    return u
```

So we're just measuring all of the overhead of checking the type of the input and allocating memory for the output before we get to calling the fmpz_add C function. It isn't clear to me exactly what the limited API does that would make this particular method 40% slower, as the timings indicate.

In gh-35 I added methods like __radd__ because I thought that Cython 3 needed them, but in retrospect I think this may be adding overhead compared to using c_api_binop_methods=True:

https://cython.readthedocs.io/en/latest/src/userguide/special_methods.html#arithmetic-methods
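For reference, that directive is set at module level; a sketch of what switching back would look like (I haven't benchmarked this):

```cython
# cython: c_api_binop_methods=True
# With this directive, __add__ behaves like a C API slot: it is called for
# both the forward and reflected cases (self may be either operand), so
# Cython does not generate or need a separate __radd__.
```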

I assume that the __radd__ overhead is at least partly responsible for this being slower than gmpy2, which uses C directly rather than Cython (I think int is faster because it caches small ints):

```
In [4]: ai, bi = 2, 3

In [5]: ag, bg = gmpy2.mpz(2), gmpy2.mpz(3)

In [6]: af, bf = flint.fmpz(2), flint.fmpz(3)

In [7]: %timeit ai+bi
35.2 ns ± 0.398 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [8]: %timeit ag+bg
85.9 ns ± 0.209 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [9]: %timeit af+bf
142 ns ± 0.363 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```
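On the small-int caching point: CPython preallocates the integers -5 through 256, so 2 + 3 never allocates a new object, e.g.:

```python
a = 2 + 3
b = 5
print(a is b)    # True on CPython: both refer to the cached int 5

x = int("1000001")
y = int("1000001")
print(x is y)    # False: larger ints are allocated fresh each time
```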

> I'm a little worried that the future Python 3.15 free-threading-compatible Stable ABI will be more expensive and too much of a performance loss for most people.

I think that a lot of Python libraries that wrap a C library tend to wrap larger operations, like multiplying large arrays, but in python-flint's case some of the important objects are really small, so we're really just trying to call a tiny C function and the runtime is dominated by the overhead around that C call. I suspect that most other libraries are either calling something more expensive in C or doing an operation that is unlikely to be repeated millions of times in a loop. In that case the overheads we are talking about here don't matter.

@da-woods

My first guess would be that the difference is in the construction/destruction of the extension types (e.g. u = fmpz.__new__(fmpz)). I'd need to have a proper look to be certain though. I'll try to have a look in the next few days and see if there's anything obvious.
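A quick way to test that might be to time just the allocation under both builds, something like (untested):

```python
import timeit
from flint import fmpz

# Time tp_new on its own; if allocation is the culprit, the normal and
# stable-ABI builds should differ here by roughly the same absolute amount
# as in the arithmetic benchmarks.
n = 10**6
t = timeit.timeit("fmpz.__new__(fmpz)", globals={"fmpz": fmpz}, number=n)
print(f"{t / n * 1e9:.1f} ns per allocation")
```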

@oscarbenjamin (Collaborator)

I tried rerunning the timings just to be a bit more sure. I was more disciplined about not using the computer while the benchmarks were running, and the standard deviations are smaller this time.

This is the normal build:

```
$ meson setup build --reconfigure -Dbuildtype=release
$ spin test
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 174 ns +- 6 ns
acb contains: Mean +- std dev: 566 ns +- 8 ns
acb multiplication: Mean +- std dev: 161 ns +- 5 ns
arb addition: Mean +- std dev: 141 ns +- 6 ns
arb contains: Mean +- std dev: 1.08 us +- 0.07 us
arb multiplication: Mean +- std dev: 133 ns +- 4 ns
fmpq addition: Mean +- std dev: 174 ns +- 2 ns
fmpq equality: Mean +- std dev: 342 ns +- 5 ns
fmpq multiplication: Mean +- std dev: 207 ns +- 3 ns
fmpz addition: Mean +- std dev: 94.0 ns +- 4.9 ns
fmpz equality: Mean +- std dev: 93.3 ns +- 1.3 ns
fmpz multiplication: Mean +- std dev: 96.6 ns +- 2.3 ns
```

This is stable ABI v3.12:

```
$ meson setup build --reconfigure -Dbuildtype=release -Dpython.allow_limited_api=true -Dlimited_api_version=3.12
$ spin test
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 238 ns +- 37 ns
acb contains: Mean +- std dev: 563 ns +- 3 ns
acb multiplication: Mean +- std dev: 215 ns +- 30 ns
arb addition: Mean +- std dev: 169 ns +- 25 ns
arb contains: Mean +- std dev: 1.05 us +- 0.01 us
arb multiplication: Mean +- std dev: 156 ns +- 3 ns
fmpq addition: Mean +- std dev: 241 ns +- 25 ns
fmpq equality: Mean +- std dev: 467 ns +- 13 ns
fmpq multiplication: Mean +- std dev: 270 ns +- 12 ns
fmpz addition: Mean +- std dev: 125 ns +- 1 ns
fmpz equality: Mean +- std dev: 95.2 ns +- 0.7 ns
fmpz multiplication: Mean +- std dev: 130 ns +- 10 ns
```

> My first guess would be that the difference is in the construction/destruction of the extension types (e.g. u = fmpz.__new__(fmpz))

That seems right to me. With these new timings it is a bit clearer which things are slower and which are unaffected. Methods like contains and equality, which just return True or False, are mostly unchanged. Methods that create new extension objects (addition and multiplication) are consistently slower with the stable ABI.

The outlier is the "fmpq equality" benchmark, which seems to be consistently slower in the stable ABI even though it only returns a bool. The code for that one is here:

```cython
def __richcmp__(s, t, int op):
    cdef bint res
    s = any_as_fmpq(s)
    if s is NotImplemented:
        return s
    t = any_as_fmpq(t)
    if t is NotImplemented:
        return t
    if op == 2 or op == 3:
        res = fmpq_equal((<fmpq>s).val, (<fmpq>t).val)
        if op == 3:
            res = not res
        return res
```

My guess is that any_as_fmpq slows down for some reason:
```cython
cdef int fmpq_set_any_ref(fmpq_t x, obj):
    cdef int status
    fmpq_init(x)
    if typecheck(obj, fmpq):
        x[0] = (<fmpq>obj).val[0]
        return FMPZ_REF
    if typecheck(obj, fmpz):
        fmpz_set(fmpq_numref(x), (<fmpz>obj).val)
        fmpz_one(fmpq_denref(x))
        return FMPZ_TMP
    status = fmpz_set_any_ref(fmpq_numref(x), obj)
    if status != FMPZ_UNKNOWN:
        fmpz_one(fmpq_denref(x))
        return FMPZ_TMP
    fmpq_clear(x)
    return FMPZ_UNKNOWN

cdef any_as_fmpq(obj):
    cdef fmpq_t x
    cdef int status
    cdef fmpq q
    status = fmpq_set_any_ref(x, obj)
    if status == FMPZ_REF:
        q = fmpq.__new__(fmpq)
        fmpq_set(q.val, x)
        return q
    elif status == FMPZ_TMP:
        q = fmpq.__new__(fmpq)
        fmpq_clear(q.val)
        q.val[0] = x[0]
        return q
    else:
        return NotImplemented
```

Note that in the benchmarks both operands are of the same type so we should be taking all the fast paths.

But wait, there is a bug! It is calling fmpq.__new__(fmpq) even when the input was already an fmpq (status == FMPZ_REF).

We can fix that:

```diff
diff --git a/src/flint/types/fmpq.pyx b/src/flint/types/fmpq.pyx
index ef4fdb5..5cf06a3 100644
--- a/src/flint/types/fmpq.pyx
+++ b/src/flint/types/fmpq.pyx
@@ -19,7 +19,6 @@ cdef int fmpq_set_any_ref(fmpq_t x, obj):
     cdef int status
     fmpq_init(x)
     if typecheck(obj, fmpq):
-        x[0] = (<fmpq>obj).val[0]
         return FMPZ_REF
     if typecheck(obj, fmpz):
         fmpz_set(fmpq_numref(x), (<fmpz>obj).val)
@@ -38,9 +37,7 @@ cdef any_as_fmpq(obj):
     cdef fmpq q
     status = fmpq_set_any_ref(x, obj)
     if status == FMPZ_REF:
-        q = fmpq.__new__(fmpq)
-        fmpq_set(q.val, x)
-        return q
+        return obj
     elif status == FMPZ_TMP:
         q = fmpq.__new__(fmpq)
         fmpq_clear(q.val)
```

All tests pass, and here are the new stable ABI v3.12 timings:

```
$ meson setup build --reconfigure -Dbuildtype=release -Dpython.allow_limited_api=true -Dlimited_api_version=3.12
$ spin test
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 253 ns +- 48 ns
acb contains: Mean +- std dev: 565 ns +- 6 ns
acb multiplication: Mean +- std dev: 205 ns +- 21 ns
arb addition: Mean +- std dev: 169 ns +- 24 ns
arb contains: Mean +- std dev: 1.05 us +- 0.01 us
arb multiplication: Mean +- std dev: 155 ns +- 2 ns
fmpq addition: Mean +- std dev: 158 ns +- 18 ns
fmpq equality: Mean +- std dev: 153 ns +- 1 ns
fmpq multiplication: Mean +- std dev: 181 ns +- 1 ns
fmpz addition: Mean +- std dev: 126 ns +- 1 ns
fmpz equality: Mean +- std dev: 97.0 ns +- 4.0 ns
fmpz multiplication: Mean +- std dev: 135 ns +- 8 ns
```

Now fmpq equality is much faster, to the point that it doesn't even make sense to compare it with the previous non-stable-ABI timings. That is possibly not the correct fix, but something like it could be used.

In any case, all of the observed timings are now consistent with the hypothesis that the slowdown we observe is just T.__new__(T) being slower, and that if that were somehow solved we might not see any slowdown at all in these benchmarks.
