
Conversation

@asglover
Collaborator

@asglover asglover commented Nov 2, 2025

Added checks for nvcc and hipcc before attempting to compile the torch extensions. I added the lru_cache so that we're not regularly calling out to "which" in case that is an expensive call. Please let me know your thoughts.

Member

@vbharadwaj-bk vbharadwaj-bk left a comment


So this strategy introduces a subtle issue: it is possible for nvcc to not be in PATH, but for CUDA_HOME to be specified, in which case we should be able to compile the extension but shutil.which may fail (canonically, setting CUDA_HOME is acceptable without keeping NVCC on PATH).

Rather than check each possible failure mode (there could be others we're not thinking of), perhaps opt for the following strategy:

  1. Create two variables in https://github.com/PASSIONLab/OpenEquivariance/blob/main/openequivariance/__init__.py, BUILT_EXTENSION and BUILT_EXTENSION_ERROR, default values True and None, respectively.
  2. Rather than raise an ImportError on the line `raise ImportError(f"Unable to load OpenEquivariance extension library:\n{e}")`, place your stub definitions of the major classes from the other file and set the variables above appropriately.
  3. For the import test, check that BUILT_EXTENSION is True and BUILT_EXTENSION_ERROR is None; otherwise raise BUILT_EXTENSION_ERROR.
    This keeps the behavior the same (NVCC not required to be on PATH, but CUDA_HOME can be set), while checking for a range of failure modes.
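The three steps above might look roughly like this (a minimal sketch; only BUILT_EXTENSION and BUILT_EXTENSION_ERROR come from the proposal, while the class name and helper function are illustrative):

```python
# Sketch of openequivariance/__init__.py. Only BUILT_EXTENSION and
# BUILT_EXTENSION_ERROR are the names proposed above; TensorProduct and
# extension_import_test are illustrative stand-ins.
BUILT_EXTENSION = True
BUILT_EXTENSION_ERROR = None

try:
    from openequivariance.extlib import TensorProduct  # compiled classes
except Exception as e:  # any failure mode: nvcc missing, no CUDA_HOME, ...
    BUILT_EXTENSION = False
    BUILT_EXTENSION_ERROR = ImportError(
        f"Unable to load OpenEquivariance extension library:\n{e}"
    )

    class TensorProduct:  # stub: raises the captured error on use
        def __init__(self, *args, **kwargs):
            raise BUILT_EXTENSION_ERROR


def extension_import_test():
    # Step 3: raise the captured error when the extension failed to build.
    if not (BUILT_EXTENSION and BUILT_EXTENSION_ERROR is None):
        raise BUILT_EXTENSION_ERROR
```

Any exception raised during the import, not just a known missing-compiler case, flips the flags and installs the stubs.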

return torch.version.cuda and _nvcc_present()


def _nvcc_present():
Member


Don't need this to be a function if it's only called in one place; ditto with _hipcc_present().

Collaborator Author


Ah, I was wondering and did some thinking about CUDA_HOME vs. NVCC... I should have used CUDA_HOME because I think that's what torch checks for. I thought I was being careful. I'll change that.

place your stub definitions of the major classes from the other file and set the variables above appropriately.

So right now, I define the stubs in the extlib init if the compilation doesn't succeed, but you're saying I should move all those definitions to the top-level init, so that they will always get defined even if an exception is thrown somewhere in the compilation process? Am I understanding your intention?

Collaborator Author


Thanks for the detailed review on this, I appreciate it

Member


Correct, move the stubs to the top level so that any arbitrary error results in the stubs being used, and just define some variables appropriately to log whether the extension was built and any related variables.

Member


This way we handle all failure modes instead of trying to identify every single one manually.

Collaborator Author


If I just move the stubs up to the top level, then they'll be defined in different places depending on whether compilation is successful or not, which will make them tough to import. So I could bring all those items from the generic module up to the top-level namespace, but it does feel like those should be hidden implementation details, not exposed at the top-level namespace.

Maybe I could wrap more of the extlib compilation process in a try/except block and still handle errors in the extlib init file. It would still provide general coverage for unexpected errors, but would not change where the generic_module members are located, and none of the imports elsewhere in the package would change. And by putting everything non-trivial in a try block, extlib should ideally always succeed at being imported, and the block where the members of the generic module are defined with actual or stub implementations should always be executed.

Let me know your thoughts; sorry for making this simple feature into an involved design process.

Member


I'm fine with keeping the stubs where they are if you remove the existing case-by-case logic (if torch.version.cuda, torch.version.hip, etc.) and wrap the whole extlib compilation process in a try/except.

You should still define a couple of variables to flag whether the extension was built and to hold any error message (see LINKED_LIBPYTHON_ERROR as a model), and these variables should propagate to the top level.
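A minimal sketch of that arrangement (the flag names follow the TORCH_COMPILE_ERROR naming adopted later in this thread; _compile_torch_extension here is a stand-in for the real JIT-compile call):

```python
# Sketch of the error capture in the extlib init. _compile_torch_extension
# is a hypothetical stand-in that simulates a build failure.
BUILT_EXTENSION = True
TORCH_COMPILE_ERROR = None


def _compile_torch_extension():
    # Stand-in: raise to simulate any failure mode (missing nvcc,
    # missing CUDA_HOME, a compiler error, ...).
    raise RuntimeError("no GPU toolchain found")


try:
    _compile_torch_extension()
except Exception as e:
    # One try/except covers every failure mode, instead of checking
    # for nvcc, hipcc, CUDA_HOME, and so on case by case.
    BUILT_EXTENSION = False
    TORCH_COMPILE_ERROR = e
```

Everything downstream can then branch on the flags rather than on the specific cause of failure.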

@asglover asglover marked this pull request as draft November 7, 2025 20:56
…ric_module. remove constants for _compile_torch_extension() as it's referenced only once. torch_compile_exception -> TORCH_COMPILE_ERROR for consistency.
@asglover asglover marked this pull request as ready for review November 7, 2025 22:13
@asglover
Collaborator Author

asglover commented Nov 7, 2025

I made the changes you requested and did some other little tidying up. Please let me know what you think.

@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 7, 2025
@asglover asglover added the ci-ready Triggers CI checks for a pull request label Nov 7, 2025
@asglover asglover self-assigned this Nov 7, 2025
Member

@vbharadwaj-bk vbharadwaj-bk left a comment


Great, thank you, everything looks good. But on closer inspection I can't see any obvious reason for the lru_cache decorator: this isn't a frequently called function, and the cache is not persistent across runs.

Also, I'm assuming you tested this, including fakifying the result for the CI (no need to give me proof of that, but I'm reminding you since this is a pretty core file in the package). We should be able to merge this on the next round.

postprocess_kernel = lambda kernel: kernel # noqa : E731


@lru_cache(maxsize=1)
Member


Hmm - do we need these LRU_cache decorators? These functions are only called once (when extlib is loaded) upon library import.

And LRU_cache is not persistent across multiple programs, only within the same program.
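To illustrate the point: functools.lru_cache memoizes only within a single process, so a sketch like the one below saves repeated shutil.which lookups during one run but starts cold in the next (the CALLS counter is added here just to show the cache working):

```python
import shutil
from functools import lru_cache

CALLS = 0  # instrumentation for this sketch only


@lru_cache(maxsize=1)
def _nvcc_present() -> bool:
    # shutil.which searches PATH; the cache avoids repeated lookups
    # within this process only. Nothing survives interpreter exit.
    global CALLS
    CALLS += 1
    return shutil.which("nvcc") is not None


first = _nvcc_present()
second = _nvcc_present()  # served from the cache; CALLS stays at 1
```

If the function is called exactly once per import, the decorator buys nothing, which is the reviewer's point.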

Collaborator Author

@asglover asglover Nov 8, 2025


That's a fair point. I was calling the function more in an earlier version, but it's only called once now; I'll remove that.

Collaborator Author


I did a local test where I messed up the source for the module, and it failed the test. I haven't done it for the CI again because I was conserving the minutes, but I can run it one more time.

@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 8, 2025
@asglover
Collaborator Author

asglover commented Nov 8, 2025

Is there some mechanism which caches module builds if nothing changes? I thought I remembered there was a feature like that at one point in time. Does this PyTorch extension fully rebuild every Python session even if nothing has changed?

@vbharadwaj-bk
Member

Approved, feel free to merge.

There is a caching mechanism, but it is a PyTorch utility, not Python-standard. It operates by taking a hash of the compile parameters, source files, and the extension name, and checking whether an extension library with the same hash exists in a user-specific cache directory, typically .cache/torch_extensions. If so, no recompile triggers.
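The idea can be approximated in a few stdlib lines (a simplified sketch of the hash-and-stamp scheme, not PyTorch's actual implementation; all names here are hypothetical):

```python
import hashlib
import os


def build_signature(name, sources, flags):
    # Hash the extension name, compile flags, and source contents,
    # mirroring how a rebuild can be skipped when nothing changed.
    h = hashlib.sha256()
    h.update(name.encode())
    for flag in flags:
        h.update(flag.encode())
    for src in sources:
        with open(src, "rb") as f:
            h.update(f.read())
    return h.hexdigest()


def needs_rebuild(cache_dir, name, sources, flags):
    stamp = os.path.join(cache_dir, name + ".stamp")
    sig = build_signature(name, sources, flags)
    if os.path.exists(stamp):
        with open(stamp) as f:
            if f.read() == sig:
                return False  # same hash cached: skip the recompile
    with open(stamp, "w") as f:
        f.write(sig)
    return True
```

Touching any source file or changing a flag alters the hash, which is what forces the rebuild.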

@asglover
Collaborator Author

asglover commented Nov 8, 2025

Great point on doing the fakification with CI. This change now makes test_import pass, and I didn't realize that the CI only explicitly runs test/import_test -k test_import. I need to add test_extension_built to the CI to make it test the extension build.

@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 8, 2025
@asglover asglover marked this pull request as draft November 8, 2025 23:13
@vbharadwaj-bk
Member

Hmm - this is now a draft, any issue going on with this?

@vbharadwaj-bk vbharadwaj-bk self-requested a review November 9, 2025 19:14
@asglover
Collaborator Author

asglover commented Nov 9, 2025

Yes, there is an issue: what I expected to pass the CI did not. It reported that there was no BUILT_EXTENSION_ERROR, but also that BUILT_EXTENSION was false. I think this is only possible if it doesn't attempt to perform the extension build. I'm trying to understand why that might happen. I'm adding an error message for when compilation is not attempted, to confirm this is indeed what's happening; then I'll go from there.

@asglover asglover removed the ci-ready Triggers CI checks for a pull request label Nov 9, 2025
@asglover asglover added the ci-ready Triggers CI checks for a pull request label Nov 9, 2025
@asglover
Collaborator Author

asglover commented Nov 9, 2025

So, to confirm: the build is not being attempted on the CI. But the build should be attempted so long as either

torch.version.cuda and ("CUDA_HOME" in os.environ)
or
torch.version.hip and ("HIP_HOME" in os.environ)

And it looks like:
we're downloading the CUDA version of torch in requirements_cuda_ci.txt
and
we're apt installing nvidia-cuda-toolkit

so I would expect "CUDA_HOME" to be set and torch.version.cuda to be true.

I don't know which is the problem. I was thinking about adding some tests to pytest, with some flag, so that we could force a test to make sure that the CUDA path is actually being taken when running our CUDA CI, and that the HIP path is actually being taken when running on HIP CI. Maybe that's overdoing it.

I'm surprised by _compile_torch_cuda_extension() not returning true.

@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 10, 2025
@vbharadwaj-bk
Member

Your issue here is that CUDA_HOME may not be specified (but nvcc is probably on PATH); my comment earlier alluded to the inverse situation, which the first version did not catch.

I would revert to 91c34d7 and replace the line "elif _build_cuda_extension() or _build_hip_extension()" with "elif torch.version.cuda or torch.version.hip" or similar. I've confirmed that this works. And we can perhaps revert all the other changes since that commit.
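Side by side, the failing guard and the suggested replacement might look like this (a sketch; the function names approximate the ones quoted in the thread, and the versions and environment are passed in explicitly to keep the sketch self-contained):

```python
def should_build_old(cuda_version, hip_version, env):
    # Original guard: requires the toolkit home to be exported, so it
    # returns False on a CI image where nvcc is on PATH but CUDA_HOME
    # is unset.
    return bool(cuda_version and "CUDA_HOME" in env) or bool(
        hip_version and "HIP_HOME" in env
    )


def should_build_new(cuda_version, hip_version, env):
    # Suggested fix: trust the runtime torch reports and let the
    # try/except around compilation absorb any actual build failure.
    return bool(cuda_version) or bool(hip_version)


# CI-like situation: CUDA torch wheel installed, CUDA_HOME not exported.
ci_env = {}
old = should_build_old("12.1", None, ci_env)  # build skipped
new = should_build_new("12.1", None, ci_env)  # build attempted
```

With build errors now caught and recorded, the looser check is safe: a machine without the toolkit simply falls through to the stubs.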

@asglover
Collaborator Author

Yep, I came to the same understanding. I originally added the "CUDA_HOME" checks to try to prevent the module from building the extension on machines without the toolkit, but now, with the design that catches build errors, it's no problem. Thank you, and sorry for having this drag on.

@asglover asglover force-pushed the more-import-error-hardening branch from 081d63c to d9f3e03 Compare November 10, 2025 04:00
@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 10, 2025
@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 10, 2025
@asglover
Collaborator Author

I added the torch extension build as its own test, just for future-proofing. Ready for final review.

@vbharadwaj-bk vbharadwaj-bk marked this pull request as ready for review November 10, 2025 04:38
@vbharadwaj-bk vbharadwaj-bk merged commit bcb7fdb into main Nov 10, 2025
2 checks passed
@asglover asglover deleted the more-import-error-hardening branch November 10, 2025 04:45