
Conversation

@asglover
Collaborator

@asglover asglover commented Nov 2, 2025

Added checks for nvcc and hipcc before attempting to compile the torch extensions. I added the lru_cache so that we're not regularly calling out to "which" in case that is an expensive call. Please let me know your thoughts.

Member

@vbharadwaj-bk vbharadwaj-bk left a comment


So this strategy introduces a subtle issue: it is possible for nvcc to not be in PATH, but for CUDA_HOME to be specified, in which case we should be able to compile the extension but shutil.which may fail (canonically, setting CUDA_HOME is acceptable without keeping NVCC on PATH).

Rather than check each possible failure mode (there could be others we're not thinking of), perhaps opt for the following strategy:

  1. Create two variables in https://github.com/PASSIONLab/OpenEquivariance/blob/main/openequivariance/__init__.py, BUILT_EXTENSION and BUILT_EXTENSION_ERROR, default values True and None, respectively.
  2. Rather than raise an ImportError on the line `raise ImportError(f"Unable to load OpenEquivariance extension library:\n{e}")`, place your stub definitions of the major classes from the other file and set the variables above appropriately.
  3. For the import test, check that BUILT_EXTENSION is True and BUILT_EXTENSION_ERROR is None; otherwise raise BUILT_EXTENSION_ERROR.
    This keeps the behavior the same (NVCC not required to be on PATH, but CUDA_HOME can be set), while checking for a range of failure modes.
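The three steps above might look roughly like this (a minimal sketch; only BUILT_EXTENSION and BUILT_EXTENSION_ERROR come from the proposal, while the class name and helper function are illustrative):

```python
# Sketch of openequivariance/__init__.py. Only BUILT_EXTENSION and
# BUILT_EXTENSION_ERROR are the names proposed above; TensorProduct and
# extension_import_test are illustrative stand-ins.
BUILT_EXTENSION = True
BUILT_EXTENSION_ERROR = None

try:
    from openequivariance.extlib import TensorProduct  # compiled classes
except Exception as e:  # any failure mode: nvcc missing, no CUDA_HOME, ...
    BUILT_EXTENSION = False
    BUILT_EXTENSION_ERROR = ImportError(
        f"Unable to load OpenEquivariance extension library:\n{e}"
    )

    class TensorProduct:  # stub: raises the captured error on use
        def __init__(self, *args, **kwargs):
            raise BUILT_EXTENSION_ERROR


def extension_import_test():
    # Step 3: raise the captured error when the extension failed to build.
    if not (BUILT_EXTENSION and BUILT_EXTENSION_ERROR is None):
        raise BUILT_EXTENSION_ERROR
```

Any exception raised during the import, not just a known missing-compiler case, flips the flags and installs the stubs.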

return torch.version.cuda and _nvcc_present()


def _nvcc_present():
Member


Don't need this to be a function if it's only called in one place; ditto with _hipcc_present().

Collaborator Author


Ah, I was wondering and did some thinking about CUDA_HOME vs. NVCC... I should have used CUDA_HOME because I think that's what torch checks for. I thought I was being careful. I'll change that.

place your stub definitions of the major classes from the other file and set the variables above appropriately.

So right now, I define the stubs in the extlib init if the compilation doesn't succeed, but you're saying I should move all those definitions to the top-level init, so that they will always get defined even if an exception is thrown somewhere in the compilation process? Am I understanding your intention?

Collaborator Author


Thanks for the detailed review on this, I appreciate it

Member


Correct, move the stubs to the top level so that any arbitrary error results in the stubs being used, and just define some variables appropriately to log whether the extension was built and any related variables.

Member


This way we handle all failure modes instead of trying to identify every single one manually.

Collaborator Author


If I just move the stubs up to the top level, then they'll be defined in different places depending on whether compilation is successful or not, which will make them tough to import. So I could bring all those items from the generic module up to the top-level namespace, but it does feel like those should be hidden implementation details, not exposed at the top-level namespace.

Maybe I could wrap more of the extlib compilation process in a try/except block and still handle errors in the extlib init file. It would still provide general coverage for unexpected errors, but would not change where the generic_module members are located, and none of the imports elsewhere in the package would change. And by putting everything non-trivial in a try block, extlib should ideally always succeed at being imported, and the block where the members of the generic module are defined with actual or stub implementations should always be executed.

Let me know your thoughts; sorry for making this simple feature into an involved design process.

Member


I'm fine with keeping the stubs where they are if you remove the existing case-by-case logic (if torch.version.cuda, torch.version.hip, etc.) and wrap the whole extlib compilation process in a try/except.

You should still define a couple of variables to flag whether the extension was built and to hold any error message (see LINKED_LIBPYTHON_ERROR as a model), and these variables should propagate to the top level.
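A minimal sketch of that arrangement (the flag names follow the TORCH_COMPILE_ERROR naming adopted later in this thread; _compile_torch_extension here is a stand-in for the real JIT-compile call):

```python
# Sketch of the error capture in the extlib init. _compile_torch_extension
# is a hypothetical stand-in that simulates a build failure.
BUILT_EXTENSION = True
TORCH_COMPILE_ERROR = None


def _compile_torch_extension():
    # Stand-in: raise to simulate any failure mode (missing nvcc,
    # missing CUDA_HOME, a compiler error, ...).
    raise RuntimeError("no GPU toolchain found")


try:
    _compile_torch_extension()
except Exception as e:
    # One try/except covers every failure mode, instead of checking
    # for nvcc, hipcc, CUDA_HOME, and so on case by case.
    BUILT_EXTENSION = False
    TORCH_COMPILE_ERROR = e
```

Everything downstream can then branch on the flags rather than on the specific cause of failure.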

@asglover asglover marked this pull request as draft November 7, 2025 20:56
…ric_module. remove constants for _compile_torch_extension() as it's referenced only once. torch_compile_exception -> TORCH_COMPILE_ERROR for consistency.
@asglover asglover marked this pull request as ready for review November 7, 2025 22:13
@asglover
Collaborator Author

asglover commented Nov 7, 2025

I made the changes you requested and did some other little tidying up. Please let me know what you think.

@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 7, 2025
@asglover asglover added the ci-ready Triggers CI checks for a pull request label Nov 7, 2025
@asglover asglover self-assigned this Nov 7, 2025
Member

@vbharadwaj-bk vbharadwaj-bk left a comment


Great, thank you, everything looks good. But on closer inspection I can't see any obvious reason for the lru_cache decorator: this isn't a frequently called function, and the cache is not persistent across runs.

Also, I'm assuming you tested this, including fakifying the result for the CI (no need to give me proof of that, but I'm reminding you since this is a pretty core file in the package). We should be able to merge this on the next round.

postprocess_kernel = lambda kernel: kernel # noqa : E731


@lru_cache(maxsize=1)
Member


Hmm - do we need these LRU_cache decorators? These functions are only called once (when extlib is loaded) upon library import.

And LRU_cache is not persistent across multiple programs, only within the same program.
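To illustrate the point: functools.lru_cache memoizes only within a single process, so a sketch like the one below saves repeated shutil.which lookups during one run but starts cold in the next (the CALLS counter is added here just to show the cache working):

```python
import shutil
from functools import lru_cache

CALLS = 0  # instrumentation for this sketch only


@lru_cache(maxsize=1)
def _nvcc_present() -> bool:
    # shutil.which searches PATH; the cache avoids repeated lookups
    # within this process only. Nothing survives interpreter exit.
    global CALLS
    CALLS += 1
    return shutil.which("nvcc") is not None


first = _nvcc_present()
second = _nvcc_present()  # served from the cache; CALLS stays at 1
```

If the function is called exactly once per import, the decorator buys nothing, which is the reviewer's point.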

Collaborator Author

@asglover asglover Nov 8, 2025


That's a fair point. I was calling the function more in an earlier version, but it's only called once now; I'll remove that.

Collaborator Author


I did a local test where I messed up the source for the module, and it failed the test. I haven't done it for the CI again because I was conserving the minutes, but I can run it one more time.

@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 8, 2025
@asglover
Collaborator Author

asglover commented Nov 8, 2025

Is there some mechanism which caches module builds if nothing changes? I thought I remembered there was a feature like that at one point in time. Does this PyTorch extension fully rebuild every Python session even if nothing has changed?

@vbharadwaj-bk
Member

Approved, feel free to merge.

There is a caching mechanism, but it is a PyTorch utility, not Python-standard. It operates by taking a hash of the compile parameters, source files, and the extension name, and checking whether an extension library with the same hash exists in a user-specific cache directory, typically .cache/torch_extensions. If so, no recompile triggers.
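The idea can be approximated in a few stdlib lines (a simplified sketch of the hash-and-stamp scheme, not PyTorch's actual implementation; all names here are hypothetical):

```python
import hashlib
import os


def build_signature(name, sources, flags):
    # Hash the extension name, compile flags, and source contents,
    # mirroring how a rebuild can be skipped when nothing changed.
    h = hashlib.sha256()
    h.update(name.encode())
    for flag in flags:
        h.update(flag.encode())
    for src in sources:
        with open(src, "rb") as f:
            h.update(f.read())
    return h.hexdigest()


def needs_rebuild(cache_dir, name, sources, flags):
    stamp = os.path.join(cache_dir, name + ".stamp")
    sig = build_signature(name, sources, flags)
    if os.path.exists(stamp):
        with open(stamp) as f:
            if f.read() == sig:
                return False  # same hash cached: skip the recompile
    with open(stamp, "w") as f:
        f.write(sig)
    return True
```

Touching any source file or changing a flag alters the hash, which is what forces the rebuild.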

@asglover
Collaborator Author

asglover commented Nov 8, 2025

Great point on doing the fakification with CI. This change now makes test_import pass, and I didn't realize that the CI only explicitly runs test/import_test -k test_import. I need to add test_extension_built to the CI to make it test the extension build.

@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 8, 2025
@asglover asglover marked this pull request as draft November 8, 2025 23:13
@vbharadwaj-bk
Member

Hmm - this is now a draft, any issue going on with this?

@vbharadwaj-bk vbharadwaj-bk self-requested a review November 9, 2025 19:14
@asglover
Collaborator Author

asglover commented Nov 9, 2025

Yes, there is an issue: what I expected to pass the CI did not. It reported that there was no BUILT_EXTENSION_ERROR, but also that BUILT_EXTENSION was false. I think this is only possible if it doesn't attempt to perform the extension build. I'm trying to understand why that might happen. I'm adding an error message for when compilation is not attempted, to confirm this is indeed what's happening; then I'll go from there.

@asglover asglover removed the ci-ready Triggers CI checks for a pull request label Nov 9, 2025
@asglover asglover added the ci-ready Triggers CI checks for a pull request label Nov 9, 2025
@asglover
Collaborator Author

asglover commented Nov 9, 2025

So, to confirm: the build is not being attempted on the CI. But the build should be attempted so long as either

torch.version.cuda and ("CUDA_HOME" in os.environ)
or
torch.version.hip and ("HIP_HOME" in os.environ)

And it looks like:
we're downloading the CUDA version of torch in requirements_cuda_ci.txt
and
we're apt installing nvidia-cuda-toolkit

so I would expect "CUDA_HOME" to be set and torch.version.cuda to be true.

I don't know which is the problem. I was thinking about adding some tests to pytest, with some flag, so that we could force a test to make sure that the CUDA path is actually being taken when running our CUDA CI, and that the HIP path is actually being taken when running on HIP CI. Maybe that's overdoing it.

I'm surprised by _compile_torch_cuda_extension() not returning true.

@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 10, 2025
@vbharadwaj-bk
Member

Your issue here is that CUDA_HOME may not be specified (but nvcc is probably on PATH); my comment earlier alluded to the inverse situation, which the first version did not catch.

I would revert to 91c34d7 and replace the line "elif _build_cuda_extension() or _build_hip_extension()" with "elif torch.version.cuda or torch.version.hip" or similar. I've confirmed that this works. And we can perhaps revert all the other changes since that commit.
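Side by side, the failing guard and the suggested replacement might look like this (a sketch; the function names approximate the ones quoted in the thread, and the versions and environment are passed in explicitly to keep the sketch self-contained):

```python
def should_build_old(cuda_version, hip_version, env):
    # Original guard: requires the toolkit home to be exported, so it
    # returns False on a CI image where nvcc is on PATH but CUDA_HOME
    # is unset.
    return bool(cuda_version and "CUDA_HOME" in env) or bool(
        hip_version and "HIP_HOME" in env
    )


def should_build_new(cuda_version, hip_version, env):
    # Suggested fix: trust the runtime torch reports and let the
    # try/except around compilation absorb any actual build failure.
    return bool(cuda_version) or bool(hip_version)


# CI-like situation: CUDA torch wheel installed, CUDA_HOME not exported.
ci_env = {}
old = should_build_old("12.1", None, ci_env)  # build skipped
new = should_build_new("12.1", None, ci_env)  # build attempted
```

With build errors now caught and recorded, the looser check is safe: a machine without the toolkit simply falls through to the stubs.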

@asglover
Collaborator Author

Yep, I came to the same understanding. I originally added the "CUDA_HOME" checks to try to prevent the module from building the extension on machines without the toolkit, but now, with the design that catches build errors, it's no problem. Thank you, and sorry for having this drag on.

@asglover asglover force-pushed the more-import-error-hardening branch from 081d63c to d9f3e03 Compare November 10, 2025 04:00
@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 10, 2025
@asglover asglover added ci-ready Triggers CI checks for a pull request and removed ci-ready Triggers CI checks for a pull request labels Nov 10, 2025
@asglover
Collaborator Author

I added the torch extension build as its own test, just for future-proofing. Ready for final review.

@vbharadwaj-bk vbharadwaj-bk marked this pull request as ready for review November 10, 2025 04:38
@vbharadwaj-bk vbharadwaj-bk merged commit bcb7fdb into main Nov 10, 2025
2 checks passed
@asglover asglover deleted the more-import-error-hardening branch November 10, 2025 04:45