
Test VM with MIG VGPU #4372

Open
akri3i wants to merge 4 commits into RedHatQE:main from akri3i:test-mig-vgpu

Conversation

@akri3i
Contributor

@akri3i akri3i commented Apr 8, 2026

Short description:

Add tests for VM with MIG VGPU

More details:
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for reviewer:

RedHatQE/openshift-python-wrapper#2692: wrapper PR for the cluster policy resource

jira-ticket:

Summary by CodeRabbit

  • New Features

    • Added support for NVIDIA A30-1-6C GPU device
  • Tests

    • Comprehensive MIG vGPU test suite for RHEL VMs, including accessibility and resource allocation checks
    • New GPU-focused test helpers and fixtures to orchestrate MIG vGPU flows
  • Chores

    • Simplified GPU node validation fixtures and consolidated a common test helper into shared utilities

@akri3i
Contributor Author

akri3i commented Apr 8, 2026

/wip

@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough


Consolidated get_resource_by_name into tests/utils.py and updated imports; removed the duplicate from tests/install_upgrade_operators/utils.py. Added vGPU/MIG test infrastructure: new GPU mapping, assertion helper, many MIG vGPU fixtures, and new RHEL vGPU test cases.

Changes

Cohort / File(s) / Summary:

  • Resource Helper Consolidation (tests/install_upgrade_operators/conftest.py, tests/install_upgrade_operators/crypto_policy/utils.py, tests/install_upgrade_operators/feature_gates/test_default_featuregates.py, tests/install_upgrade_operators/utils.py, tests/utils.py): Moved the get_resource_by_name() implementation from tests/install_upgrade_operators/utils.py to tests/utils.py; updated import sites to use the centralized helper and removed the original exported function.
  • vGPU PCI Mapping (tests/virt/node/gpu/constants.py): Added GPU PCI device mapping key 10de:20b7 (NVIDIA A30-1-6C) to GPU_CARDS_MAP with vGPU/MDEV/grid metadata.
  • vGPU Assertion Helper (tests/virt/node/gpu/utils.py): Added assert_mdev_bus_exists_on_nodes(workers_utility_pods, nodes), which uses ExecCommandOnPod and TimeoutSampler (with TIMEOUT_1MIN) to confirm mdev_bus presence on nodes and fails the test with an aggregated message for any missing nodes.
  • Fixture Refactor, mdev_bus (tests/virt/conftest.py): Replaced local per-node timeout/polling logic in non_existent_mdev_bus_nodes with a call to assert_mdev_bus_exists_on_nodes(...), delegating checking and failure handling to the new helper.
  • MIG vGPU Fixtures (tests/virt/node/gpu/vgpu/conftest.py): Added numerous class-scoped fixtures to enable MIG vGPU flows: cluster policy and daemonset patching, filtering eligible nodes (excluding A2), selecting a supported MIG GPU device, applying HCO permittedHostDevices, labeling nodes, creating parametrized VMs (including a second RHEL VM), waiting for daemonset readiness, clearing labels on teardown, and asserting mdev_bus on labeled nodes.
  • MIG vGPU Tests, RHEL VMs (tests/virt/node/gpu/vgpu/test_rhel_vm_with_mig_vgpu.py): Added TestMIGVGPURHELGPUSpec with parametrized RHEL VM tests: verifies permitted hostdevices visibility, vGPU device accessibility inside a VM, and simultaneous access from two VMs using the same MIG GPU.
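The aggregation behavior described for assert_mdev_bus_exists_on_nodes can be sketched in isolation. This is a simplified stand-in: the node names and the check_mdev_bus callable are hypothetical, and the real helper polls via ExecCommandOnPod with TimeoutSampler rather than a plain callable:

```python
def assert_mdev_bus_exists_on_nodes(nodes, check_mdev_bus):
    """Fail with one aggregated message listing every node missing mdev_bus.

    Args:
        nodes: iterable of node names.
        check_mdev_bus: callable returning True when /sys/class/mdev_bus
            exists on the node (stand-in for the real pod-exec polling).
    """
    missing = [node for node in nodes if not check_mdev_bus(node)]
    if missing:
        raise AssertionError(f"mdev_bus not found on nodes: {', '.join(missing)}")


# Fake checker: pretend only worker-2 lacks the mdev_bus directory.
try:
    assert_mdev_bus_exists_on_nodes(
        nodes=["worker-1", "worker-2"],
        check_mdev_bus=lambda node: node != "worker-2",
    )
except AssertionError as exc:
    print(exc)  # mdev_bus not found on nodes: worker-2
```

Aggregating all failures into one message, rather than failing on the first bad node, makes it immediately clear which subset of a multi-node cluster is misconfigured.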

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 47.37%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check (⚠️ Warning): The PR description is largely incomplete, with most required template sections empty or containing only placeholders. Resolution: fill in 'More details', 'What this PR does / why we need it', 'Which issue(s) this PR fixes', and provide a valid jira-ticket URL (or write 'NONE' if not tracked).

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title "Test VM with MIG VGPU" directly reflects the primary change: adding comprehensive test fixtures and test cases for VMs with MIG vGPU support across multiple test modules.


@openshift-virtualization-qe-bot-4

Report bugs in Issues

Welcome! 🎉

This pull request will be automatically processed with the following features:

🔄 Automatic Actions

  • Reviewer Assignment: Reviewers are automatically assigned based on the OWNERS file in the repository root
  • Size Labeling: PR size labels (XS, S, M, L, XL, XXL) are automatically applied based on changes
  • Issue Creation: A tracking issue is created for this PR and will be closed when the PR is merged or closed
  • Branch Labeling: Branch-specific labels are applied to track the target branch
  • Auto-verification: Auto-verified users have their PRs automatically marked as verified
  • Labels: Enabled categories: branch, can-be-merged, cherry-pick, has-conflicts, hold, needs-rebase, size, verified, wip

📋 Available Commands

PR Status Management

  • /wip - Mark PR as work in progress (adds WIP: prefix to title)
  • /wip cancel - Remove work in progress status
  • /hold - Block PR merging (approvers only)
  • /hold cancel - Unblock PR merging
  • /verified - Mark PR as verified
  • /verified cancel - Remove verification status
  • /reprocess - Trigger complete PR workflow reprocessing (useful if webhook failed or configuration changed)
  • /regenerate-welcome - Regenerate this welcome message

Review & Approval

  • /lgtm - Approve changes (looks good to me)
  • /approve - Approve PR (approvers only)
  • /assign-reviewers - Assign reviewers based on OWNERS file
  • /assign-reviewer @username - Assign specific reviewer
  • /check-can-merge - Check if PR meets merge requirements

Testing & Validation

  • /retest tox - Run Python test suite with tox
  • /retest build-container - Rebuild and test container image
  • /retest verify-bugs-are-open - Run the verify-bugs-are-open check
  • /retest all - Run all available tests

Container Operations

  • /build-and-push-container - Build and push container image (tagged with PR number)
    • Supports additional build arguments: /build-and-push-container --build-arg KEY=value

Cherry-pick Operations

  • /cherry-pick <branch> - Schedule cherry-pick to target branch when PR is merged
    • Multiple branches: /cherry-pick branch1 branch2 branch3

Label Management

  • /<label-name> - Add a label to the PR
  • /<label-name> cancel - Remove a label from the PR

✅ Merge Requirements

This PR will be automatically approved when the following conditions are met:

  1. Approval: /approve from at least one approver
  2. LGTM Count: Minimum 2 /lgtm from reviewers
  3. Status Checks: All required status checks must pass
  4. No Blockers: No wip, hold, has-conflicts labels and PR must be mergeable (no conflicts)
  5. Verified: PR must be marked as verified

📊 Review Process

Approvers and Reviewers

Approvers:

  • dshchedr
  • myakove
  • rnetser
  • vsibirsk

Reviewers:

  • OhadRevah
  • RoniKishner
  • SamAlber
  • SiboWang1997
  • akri3i
  • albarker-rh
  • dshchedr
  • hmeir
  • jerry7z
  • kbidarkar
  • rlobillo
  • rnetser
  • vsibirsk
Available Labels

  • hold
  • verified
  • wip
  • lgtm
  • approve

AI Features

  • Cherry-Pick Conflict Resolution: Enabled (claude/claude-opus-4-6)

💡 Tips

  • WIP Status: Use /wip when your PR is not ready for review
  • Verification: The verified label is removed on new commits unless the push is detected as a clean rebase
  • Cherry-picking: Cherry-pick labels are processed when the PR is merged
  • Container Builds: Container images are automatically tagged with the PR number
  • Permission Levels: Some commands require approver permissions
  • Auto-verified Users: Certain users have automatic verification and merge privileges

For more information, please refer to the project documentation or contact the maintainers.

@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/26706

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/utils.py`:
- Around line 694-704: Add a Google-style docstring to the public helper
function get_resource_by_name that documents its behavior, parameters, return
value, and exceptions: describe that it performs a cluster lookup for a resource
of type Resource using admin_client (DynamicClient), list args (resource_kind:
Resource class, name: str, admin_client: DynamicClient, namespace: str | None),
state the return type (Resource instance) and side effect (may perform API call
to the cluster), and explicitly document Raises ResourceNotFoundError when the
named resource does not exist; also mention that kwargs['client'] is set to
admin_client and that namespace is optional.
- Around line 695-701: The parameter annotation for resource_kind is incorrect:
change its type from Resource to type[Resource] so the helper can call it as a
constructor (resource = resource_kind(**kwargs)); then add a Google-style
docstring to the helper function that describes its purpose (creating/returning
a Resource via cluster API), lists parameters, return type, and documents
exceptions the function may raise (propagate client/API exceptions when cluster
interactions fail) per coding guidelines. Ensure the docstring is public, placed
immediately under the function signature, and mentions that callers should
handle or expect client/API exceptions.

In `@tests/virt/node/gpu/vgpu/conftest.py`:
- Around line 185-200: The fixture logic currently assumes any non-"A2" GPU is
MIG-capable and can misidentify devices; update nodes_with_supported_mig_gpus
and supported_mig_gpu_device to use a single explicit MIG allow-list (e.g., a
list of supported MIG-capable model keys from GPU_CARDS_MAP) and validate each
node by checking actual MIG capability (via get_nodes_gpu_info) against that
allow-list, derive VGPU_CONFIG_LABEL from the selected GPU_CARDS_MAP entry, and
fail fast with a clear error if no nodes match; also mirror this same allow-list
in tests/virt/conftest.py::nodes_with_supported_gpus and the autouse
virt_special_infra_sanity path so both discovery and session sanity use the same
explicit supported set and avoid mixed-cluster misconfiguration.

In `@tests/virt/node/gpu/vgpu/test_rhel_vm_with_mig_vgpu.py`:
- Around line 23-28: The pytest module-level marker list `pytestmark` needs the
`pytest.mark.tier3` marker added so these hardware/platform-specific MIG vGPU
cases are only collected in the correct lane; update the `pytestmark` array
(which currently contains pytest.mark.post_upgrade, pytest.mark.special_infra,
pytest.mark.gpu, pytest.mark.usefixtures("non_existent_mdev_bus_mig_nodes")) to
include pytest.mark.tier3 alongside the others.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57422f30-11f7-4b05-8494-050a8425c80b

📥 Commits

Reviewing files that changed from the base of the PR and between 50470d6 and 203430e.

📒 Files selected for processing (10)
  • tests/install_upgrade_operators/conftest.py
  • tests/install_upgrade_operators/crypto_policy/utils.py
  • tests/install_upgrade_operators/feature_gates/test_default_featuregates.py
  • tests/install_upgrade_operators/utils.py
  • tests/utils.py
  • tests/virt/conftest.py
  • tests/virt/node/gpu/constants.py
  • tests/virt/node/gpu/utils.py
  • tests/virt/node/gpu/vgpu/conftest.py
  • tests/virt/node/gpu/vgpu/test_rhel_vm_with_mig_vgpu.py
💤 Files with no reviewable changes (1)
  • tests/install_upgrade_operators/utils.py

Comment thread tests/utils.py
Comment on lines +694 to +704
def get_resource_by_name(
    resource_kind: Resource, name: str, admin_client: DynamicClient, namespace: str | None = None
) -> Resource:
    kwargs = {"name": name}
    if namespace:
        kwargs["namespace"] = namespace
    kwargs["client"] = admin_client
    resource = resource_kind(**kwargs)
    if resource.exists:
        return resource
    raise ResourceNotFoundError(f"{resource_kind} {name} not found.")
Contributor


⚠️ Potential issue | 🟡 Minor

MEDIUM: Add a Google-style docstring for this new shared helper.

This is a public helper with non-obvious side effects (cluster lookup + ResourceNotFoundError path); document args/returns/raises explicitly.

Proposed fix
 def get_resource_by_name(
     resource_kind: type[Resource], name: str, admin_client: DynamicClient, namespace: str | None = None
 ) -> Resource:
+    """Return an existing cluster resource by kind and name.
+
+    Args:
+        resource_kind (type[Resource]): Resource class to instantiate.
+        name (str): Resource name.
+        admin_client (DynamicClient): Cluster client.
+        namespace (str | None): Namespace for namespaced resources.
+
+    Returns:
+        Resource: The existing resource object.
+
+    Raises:
+        ResourceNotFoundError: If the resource does not exist.
+    """
     kwargs = {"name": name}

As per coding guidelines "Google-format docstrings REQUIRED for all public functions with non-obvious return values OR side effects".

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 704-704: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils.py` around lines 694 - 704, Add a Google-style docstring to the
public helper function get_resource_by_name that documents its behavior,
parameters, return value, and exceptions: describe that it performs a cluster
lookup for a resource of type Resource using admin_client (DynamicClient), list
args (resource_kind: Resource class, name: str, admin_client: DynamicClient,
namespace: str | None), state the return type (Resource instance) and side
effect (may perform API call to the cluster), and explicitly document Raises
ResourceNotFoundError when the named resource does not exist; also mention that
kwargs['client'] is set to admin_client and that namespace is optional.

Comment thread tests/utils.py
Comment on lines +695 to +701
    resource_kind: Resource, name: str, admin_client: DynamicClient, namespace: str | None = None
) -> Resource:
    kwargs = {"name": name}
    if namespace:
        kwargs["namespace"] = namespace
    kwargs["client"] = admin_client
    resource = resource_kind(**kwargs)
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the helper signature and constructor-style usage across changed files.
rg -n "def get_resource_by_name|resource_kind\s*:" tests/utils.py tests/install_upgrade_operators/crypto_policy/utils.py
rg -n "get_resource_by_name\(" tests/utils.py tests/install_upgrade_operators

Repository: RedHatQE/openshift-virtualization-tests

Length of output: 675


🏁 Script executed:

#!/bin/bash
# Read the full function implementation including docstring
sed -n '694,710p' tests/utils.py

Repository: RedHatQE/openshift-virtualization-tests

Length of output: 500


🏁 Script executed:

#!/bin/bash
# Check one actual call site to see what's being passed as resource_kind
sed -n '55,60p' tests/install_upgrade_operators/crypto_policy/utils.py

Repository: RedHatQE/openshift-virtualization-tests

Length of output: 294


🏁 Script executed:

#!/bin/bash
# Check what type Resource is and if it's callable
rg -A5 "^class Resource" tests/utils.py | head -15

Repository: RedHatQE/openshift-virtualization-tests

Length of output: 65


Fix type annotation and add required docstring for new helper.

The resource_kind parameter is called as a constructor at line 701 (resource_kind(**kwargs)), so it must be typed as type[Resource], not Resource. Current annotation breaks strict type checking and masks call-site errors.

Additionally, this public helper performs cluster API interactions and raises exceptions—it requires a Google-format docstring explaining the behavior and exception handling per coding guidelines.

Proposed fixes
 def get_resource_by_name(
-    resource_kind: Resource, name: str, admin_client: DynamicClient, namespace: str | None = None
+    resource_kind: type[Resource], name: str, admin_client: DynamicClient, namespace: str | None = None
 ) -> Resource:
+    """Get a resource by name from the cluster.
+
+    Args:
+        resource_kind: The resource class to instantiate.
+        name: Resource name.
+        admin_client: Authenticated cluster client.
+        namespace: Optional namespace; if omitted, cluster-scoped resource is used.
+
+    Returns:
+        Instantiated and verified resource.
+
+    Raises:
+        ResourceNotFoundError: If resource does not exist on cluster.
+    """
     kwargs = {"name": name}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils.py` around lines 695 - 701, The parameter annotation for
resource_kind is incorrect: change its type from Resource to type[Resource] so
the helper can call it as a constructor (resource = resource_kind(**kwargs));
then add a Google-style docstring to the helper function that describes its
purpose (creating/returning a Resource via cluster API), lists parameters,
return type, and documents exceptions the function may raise (propagate
client/API exceptions when cluster interactions fail) per coding guidelines.
Ensure the docstring is public, placed immediately under the function signature,
and mentions that callers should handle or expect client/API exceptions.

Comment on lines +185 to +200
def nodes_with_supported_mig_gpus(gpu_nodes, workers_utility_pods):
    gpu_nodes_copy = gpu_nodes.copy()
    for node in gpu_nodes:
        if "A2" in get_nodes_gpu_info(util_pods=workers_utility_pods, node=node):
            gpu_nodes_copy.remove(node)
    return gpu_nodes_copy


@pytest.fixture(scope="class")
def supported_mig_gpu_device(workers_utility_pods, nodes_with_supported_mig_gpus):
    gpu_info = get_nodes_gpu_info(util_pods=workers_utility_pods, node=nodes_with_supported_mig_gpus[0])
    for gpu_id in GPU_CARDS_MAP:
        if gpu_id in gpu_info:
            return GPU_CARDS_MAP[gpu_id]

    raise UnsupportedGPUDeviceError("GPU device ID not in current GPU_CARDS_MAP!")
Contributor


⚠️ Potential issue | 🔴 Critical

CRITICAL: Positively identify MIG-capable GPUs before labeling nodes.

Lines 185-200 currently treat every non-A2 card as MIG-capable, so a T4-first cluster can resolve supported_mig_gpu_device to the T4 entry from GPU_CARDS_MAP. Lines 215-220 then hardcode A30-1-6C anyway, so this fixture tree can misconfigure non-A30 clusters instead of failing fast; A2-only clusters also fall through to nodes_with_supported_mig_gpus[0]. Separately, the autouse tests/virt/conftest.py::virt_special_infra_sanity path still builds support from nodes_with_supported_gpus, which currently drops A30 nodes, so an A30-only MIG environment can be rejected before these fixtures run. Please drive both paths from the same explicit MIG allow-list, and derive VGPU_CONFIG_LABEL from the selected mapping.

Possible fix
+SUPPORTED_MIG_GPU_IDS = ("10de:20b7",)
+
 @pytest.fixture(scope="class")
 def nodes_with_supported_mig_gpus(gpu_nodes, workers_utility_pods):
-    gpu_nodes_copy = gpu_nodes.copy()
-    for node in gpu_nodes:
-        if "A2" in get_nodes_gpu_info(util_pods=workers_utility_pods, node=node):
-            gpu_nodes_copy.remove(node)
-    return gpu_nodes_copy
+    mig_nodes = []
+    for node in gpu_nodes:
+        gpu_info = get_nodes_gpu_info(util_pods=workers_utility_pods, node=node)
+        if any(gpu_id in gpu_info for gpu_id in SUPPORTED_MIG_GPU_IDS):
+            mig_nodes.append(node)
+    return mig_nodes
 
 
 @pytest.fixture(scope="class")
 def supported_mig_gpu_device(workers_utility_pods, nodes_with_supported_mig_gpus):
+    if not nodes_with_supported_mig_gpus:
+        raise UnsupportedGPUDeviceError("No MIG-capable GPU nodes found")
     gpu_info = get_nodes_gpu_info(util_pods=workers_utility_pods, node=nodes_with_supported_mig_gpus[0])
-    for gpu_id in GPU_CARDS_MAP:
+    for gpu_id in SUPPORTED_MIG_GPU_IDS:
         if gpu_id in gpu_info:
             return GPU_CARDS_MAP[gpu_id]
@@
 @pytest.fixture(scope="class")
 def mig_gpu_nodes_labeled_with_vgpu_config(
     nodes_with_supported_mig_gpus,
+    supported_mig_gpu_device,
 ):
     label_gen = label_nodes(
         nodes=nodes_with_supported_mig_gpus,
-        labels={"nvidia.com/vgpu.config": "A30-1-6C"},
+        labels={VGPU_CONFIG_LABEL: supported_mig_gpu_device[MDEV_NAME_STR].split()[-1]},
     )

Mirror the same allow-list in tests/virt/conftest.py::nodes_with_supported_gpus / virt_special_infra_sanity, or the session sanity will still reject the target A30 environment. Based on learnings, mixed GPU clusters are explicitly not supported in this suite; all GPU nodes are assumed to share the same model.

Also applies to: 215-220

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 200-200: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/virt/node/gpu/vgpu/conftest.py` around lines 185 - 200, The fixture
logic currently assumes any non-"A2" GPU is MIG-capable and can misidentify
devices; update nodes_with_supported_mig_gpus and supported_mig_gpu_device to
use a single explicit MIG allow-list (e.g., a list of supported MIG-capable
model keys from GPU_CARDS_MAP) and validate each node by checking actual MIG
capability (via get_nodes_gpu_info) against that allow-list, derive
VGPU_CONFIG_LABEL from the selected GPU_CARDS_MAP entry, and fail fast with a
clear error if no nodes match; also mirror this same allow-list in
tests/virt/conftest.py::nodes_with_supported_gpus and the autouse
virt_special_infra_sanity path so both discovery and session sanity use the same
explicit supported set and avoid mixed-cluster misconfiguration.
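The allow-list selection this review suggests can be sketched with plain data. The A30 device ID 10de:20b7 comes from this PR's GPU_CARDS_MAP addition; the T4 ID and the lspci-style strings below are illustrative stand-ins for get_nodes_gpu_info output:

```python
# Explicit allow-list of MIG-capable PCI device IDs (only the A30 entry
# from this PR is shown; extend deliberately rather than by exclusion).
SUPPORTED_MIG_GPU_IDS = ("10de:20b7",)


def filter_mig_capable_nodes(gpu_info_by_node):
    """Keep only nodes whose GPU info matches the MIG allow-list.

    Args:
        gpu_info_by_node: mapping of node name to lspci-style GPU info
            string (stand-in for per-node get_nodes_gpu_info output).
    """
    return [
        node
        for node, gpu_info in gpu_info_by_node.items()
        if any(gpu_id in gpu_info for gpu_id in SUPPORTED_MIG_GPU_IDS)
    ]


nodes = {
    "worker-a": "10de:20b7 NVIDIA A30",
    "worker-b": "10de:1eb8 NVIDIA T4",  # non-MIG card: excluded, not mislabeled
}
print(filter_mig_capable_nodes(nodes))  # ['worker-a']
```

Positively identifying supported devices fails fast on unexpected hardware, whereas the original "everything except A2" filter silently accepts any unknown card.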

Comment on lines +23 to +28
pytestmark = [
    pytest.mark.post_upgrade,
    pytest.mark.special_infra,
    pytest.mark.gpu,
    pytest.mark.usefixtures("non_existent_mdev_bus_mig_nodes"),
]
Contributor


⚠️ Potential issue | 🟠 Major

HIGH: Add the tier3 marker to this MIG vGPU module.

These cases are hardware- and platform-specific, and they rely on specialized MIG/vGPU infrastructure. Without pytest.mark.tier3, they can be collected in lanes that do not guarantee the required environment.

Suggested change
 pytestmark = [
     pytest.mark.post_upgrade,
     pytest.mark.special_infra,
     pytest.mark.gpu,
+    pytest.mark.tier3,
     pytest.mark.usefixtures("non_existent_mdev_bus_mig_nodes"),
 ]

As per coding guidelines "All new tests MUST have markers - check pytest.ini for available markers, NEVER commit unmarked tests; tier3 marker when warranted for complex/hardware/platform-specific/time-consuming tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/virt/node/gpu/vgpu/test_rhel_vm_with_mig_vgpu.py` around lines 23 - 28,
The pytest module-level marker list `pytestmark` needs the `pytest.mark.tier3`
marker added so these hardware/platform-specific MIG vGPU cases are only
collected in the correct lane; update the `pytestmark` array (which currently
contains pytest.mark.post_upgrade, pytest.mark.special_infra, pytest.mark.gpu,
pytest.mark.usefixtures("non_existent_mdev_bus_mig_nodes")) to include
pytest.mark.tier3 alongside the others.

@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/26874

@akri3i
Contributor Author

akri3i commented Apr 14, 2026

/wip cancel

@openshift-virtualization-qe-bot-2 changed the title from "WIP: Test VM with MIG VGPU" to "Test VM with MIG VGPU" on Apr 14, 2026
@openshift-virtualization-qe-bot-4

Clean rebase detected — no code changes compared to previous head (c97b1a6).

@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/26883

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
tests/utils.py (1)

694-704: ⚠️ Potential issue | 🟠 Major

HIGH: Fix constructor typing and add docstring for this shared lookup helper.

Line 695 should use type[Resource] (not Resource) because Line 701 instantiates it as a class. Current typing weakens static checks and can hide bad call sites. Also, this public helper performs cluster lookup and raises ResourceNotFoundError, so it should include a Google-style docstring.

Proposed fix
 def get_resource_by_name(
-    resource_kind: Resource, name: str, admin_client: DynamicClient, namespace: str | None = None
+    resource_kind: type[Resource], name: str, admin_client: DynamicClient, namespace: str | None = None
 ) -> Resource:
+    """Get an existing resource by kind and name.
+
+    Args:
+        resource_kind: Resource class to instantiate.
+        name: Resource name.
+        admin_client: Cluster dynamic client.
+        namespace: Optional namespace for namespaced resources.
+
+    Returns:
+        The existing resource object.
+
+    Raises:
+        ResourceNotFoundError: If the resource does not exist.
+    """
     kwargs = {"name": name}
#!/bin/bash
# Verify helper signature and constructor usage at definition + call sites.
sed -n '694,706p' tests/utils.py
rg -n -C2 'get_resource_by_name\(' tests/utils.py tests/install_upgrade_operators

Expected result: definition shows resource_kind: type[Resource], and call sites pass resource classes (e.g., HyperConverged, KubeVirt) to resource_kind.

As per coding guidelines "Google-format docstrings REQUIRED for all public functions with non-obvious return values OR side effects".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils.py` around lines 694 - 704, Update the signature of the helper
get_resource_by_name to accept a constructor type by changing the parameter
typing from Resource to type[Resource] (so callers pass classes like
HyperConverged or KubeVirt and the function can instantiate them), and add a
Google-style docstring that documents the function purpose (cluster lookup by
name, optional namespace), parameters (resource_kind, name, admin_client,
namespace), return type (Resource), and raised exceptions (ResourceNotFoundError
when the resource does not exist); reference the function name
get_resource_by_name, the Resource type, admin_client/DynamicClient, and
ResourceNotFoundError when editing.
tests/virt/node/gpu/vgpu/conftest.py (1)

171-187: ⚠️ Potential issue | 🔴 Critical

CRITICAL: MIG selection still accepts non-MIG cards and then hardcodes A30 anyway.

nodes_with_supported_mig_gpus() only removes "A2", so a T4-first cluster still passes this filter. supported_mig_gpu_device() then walks the full GPU_CARDS_MAP, which can return the T4 entry, while mig_gpu_nodes_labeled_with_vgpu_config() unconditionally stamps A30-1-6C. That misconfigures the cluster instead of failing fast. Please use one explicit MIG allow-list, guard the empty-node case, and derive VGPU_CONFIG_LABEL from the selected mapping.

Possible fix
+SUPPORTED_MIG_GPU_IDS = ("10de:20b7",)
+
 @pytest.fixture(scope="class")
 def nodes_with_supported_mig_gpus(gpu_nodes, workers_utility_pods):
-    gpu_nodes_copy = gpu_nodes.copy()
-    for node in gpu_nodes:
-        if "A2" in get_nodes_gpu_info(util_pods=workers_utility_pods, node=node):
-            gpu_nodes_copy.remove(node)
-    return gpu_nodes_copy
+    mig_nodes = []
+    for node in gpu_nodes:
+        gpu_info = get_nodes_gpu_info(util_pods=workers_utility_pods, node=node)
+        if any(gpu_id in gpu_info for gpu_id in SUPPORTED_MIG_GPU_IDS):
+            mig_nodes.append(node)
+    return mig_nodes
 
 
 @pytest.fixture(scope="class")
 def supported_mig_gpu_device(workers_utility_pods, nodes_with_supported_mig_gpus):
+    if not nodes_with_supported_mig_gpus:
+        raise UnsupportedGPUDeviceError("No MIG-capable GPU nodes found")
     gpu_info = get_nodes_gpu_info(util_pods=workers_utility_pods, node=nodes_with_supported_mig_gpus[0])
-    for gpu_id in GPU_CARDS_MAP:
+    for gpu_id in SUPPORTED_MIG_GPU_IDS:
         if gpu_id in gpu_info:
             return GPU_CARDS_MAP[gpu_id]
 
     raise UnsupportedGPUDeviceError("GPU device ID not in current GPU_CARDS_MAP!")
@@
 def mig_gpu_nodes_labeled_with_vgpu_config(
     nodes_with_supported_mig_gpus,
     mig_gpu_nodes_labeled_with_vm_vgpu,
     nvidia_sandbox_validator_ds,
     gpu_nodes,
+    supported_mig_gpu_device,
 ):
     label_gen = label_nodes(
         nodes=nodes_with_supported_mig_gpus,
-        labels={"nvidia.com/vgpu.config": "A30-1-6C"},
+        labels={VGPU_CONFIG_LABEL: supported_mig_gpu_device[MDEV_NAME_STR].split()[-1]},
     )

Based on learnings: "heterogeneous GPU models are explicitly not supported in this test suite; all GPU nodes are assumed to have the same device model."

Also applies to: 202-210

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/virt/node/gpu/vgpu/conftest.py` around lines 171 - 187,
nodes_with_supported_mig_gpus and supported_mig_gpu_device currently allow
non-MIG cards and hardcode an A30 vgpu label; update them to use an explicit MIG
allow-list (e.g., ALLOWED_MIG_MODELS) and filter gpu_nodes by membership in that
allow-list inside nodes_with_supported_mig_gpus, then in
supported_mig_gpu_device only search GPU_CARDS_MAP entries for models in the
same allow-list and raise a clear error if nodes_with_supported_mig_gpus is
empty; finally remove the hardcoded "A30-1-6C" and instead derive
VGPU_CONFIG_LABEL (used by mig_gpu_nodes_labeled_with_vgpu_config) from the
selected GPU_CARDS_MAP mapping so the label matches the chosen device.
tests/virt/node/gpu/vgpu/test_rhel_vm_with_mig_vgpu.py (1)

23-28: ⚠️ Potential issue | 🟠 Major

HIGH: add the tier3 marker to this MIG vGPU module.

These cases are hardware-specific and depend on specialized MIG/vGPU infrastructure. Without pytest.mark.tier3, they can be collected in lanes that do not guarantee the required environment.

Possible fix
 pytestmark = [
     pytest.mark.post_upgrade,
     pytest.mark.special_infra,
     pytest.mark.gpu,
+    pytest.mark.tier3,
     pytest.mark.usefixtures("non_existent_mdev_bus_mig_nodes"),
 ]

As per coding guidelines: "All new tests MUST have markers - check pytest.ini for available markers, NEVER commit unmarked tests; tier3 marker when warranted for complex/hardware/platform-specific/time-consuming tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/virt/node/gpu/vgpu/test_rhel_vm_with_mig_vgpu.py` around lines 23 - 28,
The module-level pytestmark list is missing the required tier3 marker; update
the pytestmark variable in this file (the pytestmark list near the top of
test_rhel_vm_with_mig_vgpu.py) to include pytest.mark.tier3 alongside the
existing markers (post_upgrade, special_infra, gpu,
usefixtures("non_existent_mdev_bus_mig_nodes")) so the entire MIG vGPU module is
marked tier3 for collection; ensure you import or reference pytest.mark.tier3
consistently with the existing marker style.
tests/virt/conftest.py (1)

184-191: ⚠️ Potential issue | 🔴 Critical

CRITICAL: shared GPU discovery still filters out the only MIG-capable card.

This fixture removes every node whose GPU info contains A30, but the new MIG flow depends on A30-backed nodes. Because virt_special_infra_sanity() uses nodes_with_supported_gpus for pytest.mark.gpu, an A30-only MIG cluster will fail session sanity before the MIG fixtures in tests/virt/node/gpu/vgpu/conftest.py ever run. Please stop hard-excluding A30 here, or split MIG sanity onto a separate supported-node path.
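The suggested capability-aware split can be sketched in plain Python as follows. `MIG_CAPABLE_MODELS` and the node-to-GPU-info mapping shape are illustrative assumptions for this sketch, not the repo's real helpers:

```python
# Hypothetical sketch: partition GPU nodes by MIG capability instead of
# hard-excluding "A30". The allow-list contents are an assumption.
MIG_CAPABLE_MODELS = {"A30", "A100", "H100"}


def split_gpu_nodes(node_gpu_info):
    """Return (mig_capable_nodes, passthrough_only_nodes) from a
    node-name -> GPU-info-string mapping."""
    mig_nodes, other_nodes = [], []
    for node, gpu_info in node_gpu_info.items():
        if any(model in gpu_info for model in MIG_CAPABLE_MODELS):
            mig_nodes.append(node)
        else:
            other_nodes.append(node)
    return mig_nodes, other_nodes


mig, other = split_gpu_nodes({"worker-1": "NVIDIA A30", "worker-2": "Tesla T4"})
# mig -> ["worker-1"], other -> ["worker-2"]
```

With a split like this, `virt_special_infra_sanity` could pass either partition to the appropriate test path instead of dropping A30 hosts outright.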

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/virt/conftest.py` around lines 184 - 191, The nodes_with_supported_gpus
fixture currently removes any node whose get_nodes_gpu_info contains "A30",
which wrongly excludes A30-only MIG-capable hosts and causes
virt_special_infra_sanity (pytest.mark.gpu) to fail before MIG fixtures run;
update nodes_with_supported_gpus so it no longer hard-excludes "A30"—either
remove the "A30" check entirely or replace it with a capability-aware check that
calls the MIG-specific detector (e.g., is_mig_capable or reuse logic from
tests/virt/node/gpu/vgpu/conftest.py) and only exclude nodes that truly lack
required drivers/features for the non-MIG flow, or alternatively split the
sanity path so virt_special_infra_sanity uses a different fixture that filters
only non-MIG-incompatible nodes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/virt/node/gpu/vgpu/conftest.py`:
- Around line 242-251: The fixture non_existent_mdev_bus_mig_nodes currently
calls assert_mdev_bus_exists_on_nodes before the vGPU config label is
guaranteed; update the fixture to depend on the vGPU-config readiness fixture
(e.g., add mig_gpu_nodes_labeled_with_vgpu_config or the shared
vgpu_ready_nodes) so that the nvidia.com/vgpu.config label is applied before
calling assert_mdev_bus_exists_on_nodes; alternatively create a single MIG-ready
fixture that combines mig_gpu_nodes_labeled_with_vm_vgpu and
mig_gpu_nodes_labeled_with_vgpu_config and call assert_mdev_bus_exists_on_nodes
with that ready fixture to ensure ordering.
- Around line 113-123: The fixture update_daemon_set_to_enable_mig_vgpu
currently yields inside a loop over get_daemonsets(), which can produce multiple
yields and breaks pytest; instead collect all patch entries first (e.g. build a
patches dict mapping each matching ds to its patch_data by iterating through
all_daemonsets and creating container_patch/patch_data for
ds.name.startswith("nvidia-vgpu-manager-daemonset")), then open a single
ResourceEditor(patches=patches) context and yield exactly once; reference the
functions/classes get_daemonsets, ResourceEditor and the variable names
container_patch and patch_data to locate and apply the change.

---

Duplicate comments:
In `@tests/utils.py`:
- Around line 694-704: Update the signature of the helper get_resource_by_name
to accept a constructor type by changing the parameter typing from Resource to
type[Resource] (so callers pass classes like HyperConverged or KubeVirt and the
function can instantiate them), and add a Google-style docstring that documents
the function purpose (cluster lookup by name, optional namespace), parameters
(resource_kind, name, admin_client, namespace), return type (Resource), and
raised exceptions (ResourceNotFoundError when the resource does not exist);
reference the function name get_resource_by_name, the Resource type,
admin_client/DynamicClient, and ResourceNotFoundError when editing.
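A self-contained sketch of the suggested signature and docstring. The `Resource` class and client here are stub stand-ins for the real `ocp_resources` types, so the shape of the `type[Resource]` parameter can be shown without cluster access:

```python
# Illustrative stubs standing in for ocp_resources.Resource and DynamicClient.
class ResourceNotFoundError(Exception):
    pass


class Resource:
    def __init__(self, name, client=None, namespace=None):
        self.name = name
        self.namespace = namespace

    def exists(self):  # stub; the real class queries the cluster
        return True


def get_resource_by_name(resource_kind, name, admin_client=None, namespace=None):
    """Look up a cluster resource by name.

    Args:
        resource_kind (type[Resource]): Resource class to instantiate,
            e.g. HyperConverged or KubeVirt.
        name (str): Name of the resource.
        admin_client (DynamicClient): Client used for the lookup.
        namespace (str, optional): Namespace for namespaced resources.

    Returns:
        Resource: The matching resource instance.

    Raises:
        ResourceNotFoundError: If the resource does not exist.
    """
    resource = resource_kind(name=name, client=admin_client, namespace=namespace)
    if resource.exists():
        return resource
    raise ResourceNotFoundError(f"{resource_kind.__name__} {name} not found")
```

Passing the class itself (`type[Resource]`) rather than an instance lets callers reuse one helper for every resource kind.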

In `@tests/virt/conftest.py`:
- Around line 184-191: The nodes_with_supported_gpus fixture currently removes
any node whose get_nodes_gpu_info contains "A30", which wrongly excludes
A30-only MIG-capable hosts and causes virt_special_infra_sanity
(pytest.mark.gpu) to fail before MIG fixtures run; update
nodes_with_supported_gpus so it no longer hard-excludes "A30"—either remove the
"A30" check entirely or replace it with a capability-aware check that calls the
MIG-specific detector (e.g., is_mig_capable or reuse logic from
tests/virt/node/gpu/vgpu/conftest.py) and only exclude nodes that truly lack
required drivers/features for the non-MIG flow, or alternatively split the
sanity path so virt_special_infra_sanity uses a different fixture that filters
only non-MIG-incompatible nodes.

In `@tests/virt/node/gpu/vgpu/conftest.py`:
- Around line 171-187: nodes_with_supported_mig_gpus and
supported_mig_gpu_device currently allow non-MIG cards and hardcode an A30 vgpu
label; update them to use an explicit MIG allow-list (e.g., ALLOWED_MIG_MODELS)
and filter gpu_nodes by membership in that allow-list inside
nodes_with_supported_mig_gpus, then in supported_mig_gpu_device only search
GPU_CARDS_MAP entries for models in the same allow-list and raise a clear error
if nodes_with_supported_mig_gpus is empty; finally remove the hardcoded
"A30-1-6C" and instead derive VGPU_CONFIG_LABEL (used by
mig_gpu_nodes_labeled_with_vgpu_config) from the selected GPU_CARDS_MAP mapping
so the label matches the chosen device.

In `@tests/virt/node/gpu/vgpu/test_rhel_vm_with_mig_vgpu.py`:
- Around line 23-28: The module-level pytestmark list is missing the required
tier3 marker; update the pytestmark variable in this file (the pytestmark list
near the top of test_rhel_vm_with_mig_vgpu.py) to include pytest.mark.tier3
alongside the existing markers (post_upgrade, special_infra, gpu,
usefixtures("non_existent_mdev_bus_mig_nodes")) so the entire MIG vGPU module is
marked tier3 for collection; ensure you import or reference pytest.mark.tier3
consistently with the existing marker style.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9d464110-083a-4162-bdc9-6c389f3a1d5b

📥 Commits

Reviewing files that changed from the base of the PR and between 203430e and 33e35a1.

📒 Files selected for processing (10)
  • tests/install_upgrade_operators/conftest.py
  • tests/install_upgrade_operators/crypto_policy/utils.py
  • tests/install_upgrade_operators/feature_gates/test_default_featuregates.py
  • tests/install_upgrade_operators/utils.py
  • tests/utils.py
  • tests/virt/conftest.py
  • tests/virt/node/gpu/constants.py
  • tests/virt/node/gpu/utils.py
  • tests/virt/node/gpu/vgpu/conftest.py
  • tests/virt/node/gpu/vgpu/test_rhel_vm_with_mig_vgpu.py
💤 Files with no reviewable changes (1)
  • tests/install_upgrade_operators/utils.py

Comment on lines +113 to +123
def update_daemon_set_to_enable_mig_vgpu(update_cluster_policy_to_enable_mig_vgpu, admin_client):
    all_daemonsets = get_daemonsets(admin_client=admin_client, namespace="nvidia-gpu-operator")
    for ds in all_daemonsets:
        if ds.name.startswith("nvidia-vgpu-manager-daemonset"):
            container = ds.instance.spec.template.spec.containers[0]

            container_patch = dict(container.items())
            container_patch["imagePullPolicy"] = "Always"
            patch_data = {"spec": {"template": {"spec": {"containers": [container_patch]}}}}
            with ResourceEditor(patches={ds: patch_data}):
                yield

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import ast
from pathlib import Path

path = Path("tests/virt/node/gpu/vgpu/conftest.py")
tree = ast.parse(path.read_text())

parents = {}
for parent in ast.walk(tree):
    for child in ast.iter_child_nodes(parent):
        parents[child] = parent

for node in tree.body:
    if isinstance(node, ast.FunctionDef) and node.name == "update_daemon_set_to_enable_mig_vgpu":
        for yield_node in [child for child in ast.walk(node) if isinstance(child, ast.Yield)]:
            current = parents[yield_node]
            yield_inside_loop = False
            while current is not node:
                if isinstance(current, (ast.For, ast.AsyncFor, ast.While)):
                    yield_inside_loop = True
                    break
                current = parents[current]
            print({"function": node.name, "yield_inside_loop": yield_inside_loop})
PY

Repository: RedHatQE/openshift-virtualization-tests

Length of output: 162


HIGH: This fixture can yield multiple times, breaking pytest.

yield is nested inside the daemonset loop. When get_daemonsets() returns more than one nvidia-vgpu-manager-daemonset*, pytest will resume the generator at a second yield during teardown, causing: fixture function has more than one 'yield'. Fixtures must have exactly one yield at the top level. Accumulate all patches in a dict before entering the ResourceEditor context, then yield once.
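The failure mode can be reproduced with plain generators, independent of pytest. The daemonset names below are stand-ins:

```python
# A generator-based fixture with `yield` inside the loop pauses once per
# matching daemonset. pytest's teardown resumes the generator expecting
# StopIteration, but hits the second yield instead.
def fixture_with_yield_in_loop(daemonsets):
    for ds in daemonsets:
        if ds.startswith("nvidia-vgpu-manager-daemonset"):
            yield  # one yield PER match


gen = fixture_with_yield_in_loop(
    ["nvidia-vgpu-manager-daemonset-a", "nvidia-vgpu-manager-daemonset-b"]
)
next(gen)  # setup: consumes the first yield
try:
    next(gen)  # teardown: should raise StopIteration, gets a second yield
    teardown_clean = False
except StopIteration:
    teardown_clean = True
# teardown_clean is False here, which is what pytest reports as
# "fixture function has more than one 'yield'"
```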

Suggested fix
 @pytest.fixture(scope="class")
 def update_daemon_set_to_enable_mig_vgpu(update_cluster_policy_to_enable_mig_vgpu, admin_client):
-    all_daemonsets = get_daemonsets(admin_client=admin_client, namespace="nvidia-gpu-operator")
-    for ds in all_daemonsets:
-        if ds.name.startswith("nvidia-vgpu-manager-daemonset"):
-            container = ds.instance.spec.template.spec.containers[0]
-
-            container_patch = dict(container.items())
-            container_patch["imagePullPolicy"] = "Always"
-            patch_data = {"spec": {"template": {"spec": {"containers": [container_patch]}}}}
-            with ResourceEditor(patches={ds: patch_data}):
-                yield
+    patches = {}
+    for daemonset in get_daemonsets(admin_client=admin_client, namespace="nvidia-gpu-operator"):
+        if daemonset.name.startswith("nvidia-vgpu-manager-daemonset"):
+            container_patch = dict(daemonset.instance.spec.template.spec.containers[0].items())
+            container_patch["imagePullPolicy"] = "Always"
+            patches[daemonset] = {"spec": {"template": {"spec": {"containers": [container_patch]}}}}
+
+    with ResourceEditor(patches=patches):
+        yield
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 113-113: Unused function argument: update_cluster_policy_to_enable_mig_vgpu

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/virt/node/gpu/vgpu/conftest.py` around lines 113 - 123, The fixture
update_daemon_set_to_enable_mig_vgpu currently yields inside a loop over
get_daemonsets(), which can produce multiple yields and breaks pytest; instead
collect all patch entries first (e.g. build a patches dict mapping each matching
ds to its patch_data by iterating through all_daemonsets and creating
container_patch/patch_data for
ds.name.startswith("nvidia-vgpu-manager-daemonset")), then open a single
ResourceEditor(patches=patches) context and yield exactly once; reference the
functions/classes get_daemonsets, ResourceEditor and the variable names
container_patch and patch_data to locate and apply the change.

Comment on lines +242 to +251
@pytest.fixture(scope="class")
def non_existent_mdev_bus_mig_nodes(
    workers_utility_pods,
    mig_gpu_nodes_labeled_with_vm_vgpu,
    nodes_with_supported_mig_gpus,
):
    """
    Check if the mdev_bus needed for vGPU is available.
    """
    assert_mdev_bus_exists_on_nodes(workers_utility_pods=workers_utility_pods, nodes=nodes_with_supported_mig_gpus)

⚠️ Potential issue | 🟠 Major

HIGH: mdev_bus is asserted before the MIG profile label is guaranteed to exist.

This fixture depends on mig_gpu_nodes_labeled_with_vm_vgpu, but not on mig_gpu_nodes_labeled_with_vgpu_config. That means assert_mdev_bus_exists_on_nodes() can run before nvidia.com/vgpu.config is applied, unlike the shared path in tests/virt/conftest.py, which waits on vgpu_ready_nodes first. Please depend on the vGPU-config fixture here, or introduce a single MIG “ready” fixture and assert against that.
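One way to enforce the ordering is a single "MIG ready" step that applies both labels before the assertion runs. The label keys and helper names below are illustrative stand-ins for the real fixtures, shown as plain functions so the sequencing is explicit:

```python
# Sketch of the ordering fix: both labels are applied before the mdev_bus
# check. node_labels simulates node state; in the suite these steps are
# pytest fixtures chained via dependencies.
node_labels = {}


def label_vm_vgpu():
    node_labels["nvidia.com/vm.vgpu"] = "true"  # illustrative key


def label_vgpu_config(profile):
    node_labels["nvidia.com/vgpu.config"] = profile


def assert_mdev_bus_ready():
    # stand-in for assert_mdev_bus_exists_on_nodes: only meaningful once
    # the vgpu.config label has been applied
    assert "nvidia.com/vgpu.config" in node_labels, "vgpu.config label missing"


def mig_ready(profile="A30-1-6C"):
    label_vm_vgpu()
    label_vgpu_config(profile)
    assert_mdev_bus_ready()
```

In fixture terms, `non_existent_mdev_bus_mig_nodes` would declare `mig_gpu_nodes_labeled_with_vgpu_config` as a parameter so pytest guarantees the label step runs first.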

🧰 Tools
🪛 Ruff (0.15.10)

[warning] 245-245: Unused function argument: mig_gpu_nodes_labeled_with_vm_vgpu

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/virt/node/gpu/vgpu/conftest.py` around lines 242 - 251, The fixture
non_existent_mdev_bus_mig_nodes currently calls assert_mdev_bus_exists_on_nodes
before the vGPU config label is guaranteed; update the fixture to depend on the
vGPU-config readiness fixture (e.g., add mig_gpu_nodes_labeled_with_vgpu_config
or the shared vgpu_ready_nodes) so that the nvidia.com/vgpu.config label is
applied before calling assert_mdev_bus_exists_on_nodes; alternatively create a
single MIG-ready fixture that combines mig_gpu_nodes_labeled_with_vm_vgpu and
mig_gpu_nodes_labeled_with_vgpu_config and call assert_mdev_bus_exists_on_nodes
with that ready fixture to ensure ordering.

@openshift-virtualization-qe-bot-5

Clean rebase detected — no code changes compared to previous head (33e35a1).
The following labels were preserved: changes-requested-coderabbitai[bot].

@openshift-virtualization-qe-bot-5

jerry7z can not be added as reviewer. Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the RedHatQE/openshift-virtualization-tests repository.: 422 {"message": "Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the RedHatQE/openshift-virtualization-tests repository.", "documentation_url": "https://docs.github.com/rest/pulls/review-requests#request-reviewers-for-a-pull-request", "status": "422"}

@openshift-virtualization-qe-bot-5

SiboWang1997 can not be added as reviewer. Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the RedHatQE/openshift-virtualization-tests repository.: 422 {"message": "Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the RedHatQE/openshift-virtualization-tests repository.", "documentation_url": "https://docs.github.com/rest/pulls/review-requests#request-reviewers-for-a-pull-request", "status": "422"}

@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/26916



6 participants