@hanwen-cluster
Contributor

Problem

It should be possible to create ParallelCluster clusters in a network without Internet access. However, cluster creation fails when all of the following are true:

  1. The OS is RHEL or Rocky
  2. The head node and/or login nodes use x86 GPU instances
  3. DCV is enabled

The failure can be seen in the chef-client log:

      ================================================================================
      Error executing action `install` on resource 'dnf_package[/opt/parallelcluster/sources/nice-dcv-2024.0-19030-el9-x86_64/nice-dcv-gl-2024.0.1096-1.el9.x86_64.rpm]'
      ================================================================================

      RuntimeError
      ------------
      dnf-helper.py had stderr/stdout output:

      Errors during downloading metadata for repository 'epel':
        - Curl error (28): Timeout was reached for https://mirrors.fedoraproject.org/mirrorlist?repo=epel-9&arch=x86_64 [Failed to connect to mirrors.fedoraproject.org port 443: Connection timed out]
      Error: Failed to download metadata for repo 'epel': Cannot prepare internal mirrorlist: Curl error (28): Timeout was reached for https://mirrors.fedoraproject.org/mirrorlist?repo=epel-9&arch=x86_64 [Failed to connect to mirrors.fedoraproject.org port 443: Connection timed out]
      Errors during downloading metadata for repository 'rhel-9-appstream-rhui-rpms':
        - Curl error (28): Timeout was reached for https://rhui.us-east-1.aws.ce.redhat.com/pulp/mirror/content/dist/rhel9/rhui/9/x86_64/appstream/os [Failed to connect to rhui.us-east-1.aws.ce.redhat.com port 443: Connection timed out]
      Error: Failed to download metadata for repo 'rhel-9-appstream-rhui-rpms': Cannot prepare internal mirrorlist: Curl error (28): Timeout was reached for https://rhui.us-east-1.aws.ce.redhat.com/pulp/mirror/content/dist/rhel9/rhui/9/x86_64/appstream/os [Failed to connect to rhui.us-east-1.aws.ce.redhat.com port 443: Connection timed out]
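To confirm that a failed cluster creation matches this problem, the repositories that dnf could not reach can be pulled out of the log. The following is a minimal sketch, not part of this commit; the function name `failed_repos` is hypothetical, and the log path in the usage note is an assumption about where chef-client writes its log on the node.

```shell
#!/usr/bin/env bash
# Sketch: list the repositories whose metadata download failed.
# failed_repos is a hypothetical helper, not something this commit adds.
failed_repos() {
  local log_file="$1"
  # Match fragments like: Failed to download metadata for repo 'epel'
  # then keep only the repo name and drop duplicates.
  grep -o "Failed to download metadata for repo '[^']*'" "$log_file" \
    | sed "s/.*repo '\([^']*\)'/\1/" \
    | sort -u
}
```

For example, `failed_repos /var/log/chef-client.log` (path assumed) would print one repository name per line. For the log excerpt above, those names would be `epel` and `rhel-9-appstream-rhui-rpms`.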

Workaround

This commit adds a script that downloads any missing transitive dependencies of DCV GL. It also modifies the cookbook to install those dependencies and to pass --disablerepo=* so that yum/dnf does not contact the Internet for repository metadata.
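The two halves of the workaround can be sketched as follows. This is an illustration under stated assumptions, not the script or cookbook change from this commit: the function names and the dependency directory are hypothetical, and using dnf's `--downloadonly` and `--downloaddir` options for the online step is one possible way to collect the dependencies. Only `--disablerepo=*` comes from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the script added by this commit.
DNF="${DNF:-dnf}"   # overridable, e.g. DNF=echo prints the commands instead of running them

# Online step (run while the instance can still reach the repositories):
# resolve the RPM's dependencies and save the missing ones to deps_dir
# without installing anything.
download_missing_deps() {
  local rpm="$1" deps_dir="$2"
  mkdir -p "$deps_dir" || return
  "$DNF" install -y --downloadonly --downloaddir="$deps_dir" "$rpm"
}

# Offline step: install the RPM together with the saved dependencies, with
# all repositories disabled so dnf does not try to fetch repo metadata.
install_offline() {
  local rpm="$1" deps_dir="$2"
  local deps=()
  shopt -s nullglob            # an empty directory yields an empty list
  deps=("$deps_dir"/*.rpm)
  shopt -u nullglob
  "$DNF" install -y --disablerepo='*' "$rpm" "${deps[@]}"
}
```

Keeping the download and the install as separate steps mirrors the workflow in this PR: the dependencies are baked into the AMI while the instance has Internet access, and the install later runs inside the isolated subnet.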

How to use the script:

  1. Launch an instance from an official ParallelCluster RHEL/Rocky AMI
  2. On the instance, run the script as root (e.g. ./fix_dcv_gl_offline_installation.gl)
  3. Create an image from the instance
  4. Use the created image as the CustomAmi when creating clusters
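For step 4, the cluster configuration only needs its Image section to point at the new AMI. The helper below is a small sketch that prints that section; the function name `image_section` is hypothetical and the AMI ID in the usage note is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: print the Image section of a cluster configuration that uses the
# patched AMI. image_section is a hypothetical helper.
image_section() {
  local os="$1" ami_id="$2"
  printf 'Image:\n  Os: %s\n  CustomAmi: %s\n' "$os" "$ami_id"
}
```

Calling `image_section rhel9 ami-0123456789abcdef0` prints a fragment that can be merged into the cluster configuration file. The AMI itself (step 3) can be created from the console or, for example, with `aws ec2 create-image --instance-id <instance-id> --name <image-name>`.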

Testing

The following test passes when the AMI produced by steps 1-3 is used as the CustomAmi:

test-suites:
  networking:
    test_cluster_networking.py::test_cluster_in_no_internet_subnet:
      dimensions:
        - regions: ["us-east-1"]
          instances: ["g5.xlarge"]
          oss: ["rhel9"]
          schedulers: ["slurm"]

Note

This commit should only be merged into integ-tests-3.14.0. A long-term fix for other branches will follow.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners December 18, 2025 19:37
@gmarciani
Contributor

Launch an instance with official ParallelCluster RHEL/Rocky AMI
On the instance, run the script as root (e.g. ./fix_dcv_gl_offline_installation.gl)
Create an image from the instance

What about users who want to use their own custom AMI instead of the official pcluster AMI?
A more general and pcluster-native approach would be to vend this patch as a custom component to be specified in https://docs.aws.amazon.com/parallelcluster/latest/ug/Build-v3.html#yaml-build-image-Build-Components-Value.

Why not follow this approach?

@hanwen-cluster hanwen-cluster force-pushed the integ-tests-3.14.0dec18 branch from 227f70f to d6b49a2 Compare December 19, 2025 17:57
@hanwen-cluster
Contributor Author

Launch an instance with official ParallelCluster RHEL/Rocky AMI
On the instance, run the script as root (e.g. ./fix_dcv_gl_offline_installation.gl)
Create an image from the instance

What about users who want to use their own custom AMI instead of the official pcluster AMI? A more general and pcluster-native approach would be to vend this patch as a custom component to be specified in https://docs.aws.amazon.com/parallelcluster/latest/ug/Build-v3.html#yaml-build-image-Build-Components-Value.

Why not follow this approach?

I agree that the current approach is not comprehensive, but it is enough to unblock customers using the official AMI. We will make a long-term improvement in the next release.

@hanwen-cluster hanwen-cluster added the skip-changelog-update Disables the check that enforces changelog updates in PRs label Dec 19, 2025
@hanwen-cluster hanwen-cluster enabled auto-merge (rebase) December 19, 2025 19:13
@hanwen-cluster hanwen-cluster merged commit d623959 into aws:integ-tests-3.14.0 Dec 19, 2025
23 of 24 checks passed