
VEP #144: Adjust virt-launcher memory lock requirements#145

Merged
kubevirt-bot merged 1 commit into kubevirt:main from
bgartzi:network_binding_plugin-memlock_rlimits
Dec 21, 2025

Conversation

@bgartzi
Contributor

@bgartzi bgartzi commented Nov 28, 2025

VEP Metadata

Tracking issue: #144
SIG label: /sig compute

What this PR does

This enhancement introduces a new VirtualMachine annotation to specify
memory lock rlimit configuration requirements. The annotation can be
updated by external mechanisms, such as mutating admission webhooks,
based on the VMI spec and specific conditions. The value of the
annotation is read by the virt-handler to configure the memlock rlimits
of the virt-launcher and qemu processes appropriately.

Special notes for your reviewer

@kubevirt-bot kubevirt-bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Nov 28, 2025
@frenzyfriday
Contributor

/cc @nirdothan

@bgartzi bgartzi force-pushed the network_binding_plugin-memlock_rlimits branch from 14c932c to 1631058 on December 2, 2025 09:19
@bgartzi
Contributor Author

bgartzi commented Dec 2, 2025

Pointed out the scalability concerns discussed yesterday during the community sig-network call.

I also added another possible alternative that could address the issue.

Member

@nirdothan nirdothan left a comment


Thank you @bgartzi! It's a beautiful VEP.
Please see my comments below.

configuration is static, meaning that it does not scale based on
guest memory sizes. It increases pods' memory overhead, which impacts
on schedulable VMIs. And finally, it does not trigger memory lock
rlimit configuration.
Member


Let me get this right:

  1. You need a dynamic overhead that is a function of the guest memory, while ComputeResourceOverhead is a fixed size.
  2. The current RLIMIT calculation takes vmi.Spec.Domain.Memory.Guest or vmi.Spec.Domain.Resources.Requests.Memory but ignores ComputeResourceOverhead, even though container.requests.resources does take it into account.

Regarding 1: You might as well add 0 overhead to the plugin, but add enough guest memory in the VMI. Yes the calculation may be complex, but that can be assisted by a simple webhook that knows the algorithm.

Regarding 2: I assume that it was done this way in order to support passt, which has pod-level memory overhead but does not affect the guest in any way. This could have been a wrong decision that is only viable for passt.

Regarding:

which impacts on schedulable VMIs

It has to impact schedulable VMIs. You cannot lock memory without specifying the request to the scheduler.

Contributor Author


  1. Yes.
  2. Yes, it's ignoring ComputeResourceOverhead. The reason is that, IIUC, ComputeResourceOverhead is about the overhead that the network binding introduces, but has nothing to do with the guest memory size, which is what the devices will try to lock.

Regarding 1: You might as well add 0 overhead to the plugin, but add enough guest memory in the VMI.

Sorry, I think I didn't quite get this, could you rephrase it?

Yes the calculation may be complex, but that can be assisted by a simple webhook that knows the algorithm.

Another of the alternatives presented in this VEP takes an implementation based on admission webhooks into account. I agree that outsourcing the calculation function to webhooks could make the solution more general. The problem is that even if the webhook knows the algorithm, we need somewhere else to put that information, be it an annotation, a field in the VMI spec... whatever, and act accordingly later.

From the implementation point of view, the key aspect is that AdjustResources should look into that annotation, field, or even call a calculation function, to know whether a network binding plugin, or anything else, could need a non-default memory lock limit to function properly. If the limit is not adjusted properly, libvirt will try to do it later, and this will lead into an error as it lacks permissions.

You cannot lock memory without specifying the request to the scheduler.

But this does not lock memory, it "just" updates the limits of memory that virt-launcher processes could lock.
I might be missing something here as well, but setProcessMemoryLockRLimit, which is the one adjusting memory lock limits, does not let the scheduler know about the update.

Member


Thanks. I'm beginning to understand the difference now.
Bottom line: KubeVirt is not sufficiently flexible to let the user define lock limits, separate from guest memory.
If that is the case, it is probably not a network specific requirement. I suppose that it needs to be solved across the board.

guest memory size, number of network interfaces of each kind and
network bindings' needs.
- Adjust memory lock limits without impacting VMI scheduling capacity
(i.e. increasing pod memory resource limits).
Member


I fail to understand how you can lock memory without affecting scheduling capacity.

Contributor Author


Because it is not about setting a limit on how much memory the guest and all related infra are going to consume. It is about setting a limit on how much of that (already consumed) memory can be locked, in other words, prevented from being paged out of physical RAM.

- $A_i$ is 1 if `AdjustMemoryLockLimits` is true for the network
binding plugin of interface $i$ and 0 otherwise.
- $O_{fixed}$ is a fixed offset set to $1024*1024$ kB, to mimic
[libvirt's expectations][libvirt-vdpa-memlock-ratio].
Member


to mimic libvirt's expectations

IMHO, instead of having to guess libvirt's expectations, we should be able to obtain them by means of a libvirt query API.

Contributor Author


I know it's no excuse, but such an interface does not currently exist, and libvirt itself is doing its best to guess qemu's needs. That's pretty much why it wasn't taken into consideration in the first place.

However, what about the other alternatives, such as the one relying on admission webhooks? The idea was to outsource calculation to webhooks, which wouldn't be able to reach libvirt's query APIs IIUC.

Member


The idea was to outsource calculation to webhooks, which wouldn't be able to reach libvirt's query APIs IIUC.

That is correct. I think that the suggestion to use a webhook came from people who understood that updating existing API values with a sophisticated calculation could meet the requirements. However, that is not the case.

@nirdothan
Member

/cc @EdDev @orelmisan

Contributor Author

@bgartzi bgartzi left a comment


Thanks @nirdothan for this first review round. I'm leaving some comments on the discussion threads.


@nirdothan
Member

nirdothan commented Dec 3, 2025

But this does not lock memory, it "just" updates the limits of memory that virt-launcher processes could lock.
I might be missing something here as well, but setProcessMemoryLockRLimit, which is the one adjusting memory lock limits, does not let the scheduler know about the update.

Thanks. I'm beginning to understand the difference now.
Bottom line: KubeVirt is not sufficiently flexible to let the user define lock limits, separate from guest memory.
If that is the case, it is probably not a network specific requirement. I suppose that it needs to be solved across the board.

Comment moved to a reply above.

@bgartzi bgartzi force-pushed the network_binding_plugin-memlock_rlimits branch from 1631058 to 92a9888 on December 10, 2025 12:50
@bgartzi bgartzi changed the title VEP#144: Network binding plugin memory lock requirements VEP #144: Adjust virt-launcher memory lock requirements Dec 10, 2025
@bgartzi
Contributor Author

bgartzi commented Dec 10, 2025

/sig compute

@nirdothan
Member

nirdothan commented Dec 10, 2025

/cc

@vladikr
Member

vladikr commented Dec 11, 2025

@bgartzi Thank you for this VEP!

As we discussed during the last VEP call, I agree with you that, in general, workload/VM owners should have control over the allocated memory for their workload.

@EdDev and I were chatting offline to discuss the API.

What I would suggest is an API, perhaps spec.domain.memory.reservedOverhead, with an optional requiresLock: true subfield.
What I like about this API is that it's explicit, and in the future plugins/webhooks could fill these in on the user's behalf.
This approach provides a higher-level abstraction, treating memlock as a side effect of reserving overhead for specific needs (like vDPA or plugin requirements), while allowing explicit control where necessary. It integrates well with existing overhead calculations in KubeVirt, such as GetMemoryOverhead, and could be extended for dynamic plugin scenarios without overcomplicating the core logic.

What do you think?
@iholder101 @jean-edouard

Member

@nirdothan nirdothan left a comment


Thank you @bgartzi. Looks good. See comments below.

This solution introduces a new optional annotation to `VirtualMachine`s,
`kubevirt.io/memlock-rlimit`. The annotation will be used to specify the
memlock rlimit needs of a VM. The value of that annotation key will
contain a [`Quantity` object][k8s-resource-quantity] from
Member


nit: I think it has to be a string that will be marshaled to a quantity object

There already are certain cases in which the `virt-handler` adjusts
memlock rlimits. When this new mechanism is introduced, `virt-handler`
will track the annotation, and if any, it will add it to the result of
the calculation existing for VFIO, SEV and RT VMs.
Member


We need to decide whether it is added to or overrides the existing calculation. If added, it may lead to situations where the entity that sets the annotation value needs to be aware of the KubeVirt-internal calculation, e.g. for SR-IOV it should take into consideration that we already add 1G per device internally.

Contributor Author


The tricky part is that if it overrides completely, without taking into account possible existing values, we would be at the mercy of the arbitrary order in which mutating webhooks are applied, and only the last one would impact the locked memory. That last one would only know about itself, but not about the rest.

I think that the existing logic should take the value present in the annotation and add it to the internal calculations. But yes, it gets tricky with the 1G assigned per SR-IOV device due to locking the MMIO region. What if an SR-IOV device and a vdpa/whatever device get attached to a VM? I agree with you that we would be accounting for the MMIO region twice.

I think that considering the approach proposed by @vladikr could help mitigate this issue (using a boolean flag to mark whether memory needs to be locked or not).

Comment thread veps/sig-compute/144-memory-lock-rlimits/memory-lock-rlimits.md
Note that mutating admission webhooks do introduce certain latency
during pod admission. This means that in case the number of devices and
conditions that require special memlock rlimits is large enough, this
could introduce a considerable latency to `VirtualMachine` admission.
Member


Unless we start implementing common use cases in the core, and not necessarily as webhooks.

Contributor Author


In my opinion this mechanism's main selling point is keeping that logic out of the core for new use-cases that were not already modelled in the current KubeVirt source code.

containing a proper value was added.
- VM migration: Memory lock rlimits are configured properly in the
destination `virt-launcher` and qemu process when a VMIs with a
`kubevirt.io/memlock-rlimit` annotation are migrated.
Member


  • Tests that validate the process rlimit cannot be e2e tests. We no longer use e2e tests to check underlying technicalities unless they can be asserted inside the guest (I assume that they can't?). We can, however, test the value that we set in unit tests.
  • An e2e test would verify that the VM starts up successfully or fails under different scenarios.
    I suggest testing scenarios of memory hotplug along with devices that require a higher rlimit.
    See https://github.com/kubevirt/kubevirt/blob/v1.7.0/tests/network/sriov.go#L320 as an example.

Contributor Author


I updated this part in the last version, let me know if it matches your expectations.

@bgartzi bgartzi force-pushed the network_binding_plugin-memlock_rlimits branch from 92a9888 to 3e4e55d on December 12, 2025 09:36
Contributor Author

@bgartzi bgartzi left a comment


Thanks @nirdothan, I'm leaving some comments and updates


@bgartzi
Contributor Author

bgartzi commented Dec 12, 2025

@vladikr, thanks for taking a look and participating in the discussions

I agree with both of you that the API would look much better the way you propose it, especially by putting the new field inside the Memory spec.

However, I have a concern about using a boolean flag: it makes things more explicit and easier to understand, but it would imply a limitation when attaching 2 or more devices that need to lock memory to a VM. That's a known issue: even if two devices lock the same memory pages in physical RAM, the amount of locked memory is counted once per lock. If we would like to support this scenario, we would need an alternative, such as a count of the times memory would be locked. Would that work, or would you rather still go for the boolean flag?

@vladikr
Member

vladikr commented Dec 12, 2025

@vladikr, thanks for taking a look and participating in the discussions

I agree with both of you that the API would look much better the way you propose it. Specially by putting the new field inside of the Memory spec.

However I have a concern about using a boolean flag: It makes it way more explicit and easier to understand, but it would imply a limitation: attaching 2 or more devices that need locking memory to a VM. That's a known issue: even if two devices lock the same memory pages in physical RAM, the amount of locked memory is counted once per each lock. If we would like to support this scenario, we would need an alternative, such as a count of times memory would be locked or something like that. Would that work, or would you rather still go for the boolean flag?

@bgartzi Thanks for the feedback.

I think there is some misunderstanding...

The requiresLock: true is only meant to trigger the memory locking mechanism, that is, the prlimit adjustment in virt-handler via AdjustResources or AdjustQemuProcessMemoryLimits, enabling locking when needed.
It doesn't set any limitation on the number of devices or locks. The actual memlock limit would be calculated based on the total memory needs: base guest memory + calculated overhead (from GetMemoryOverhead) + the user-specified reservedOverhead (the new field).

For multiple devices, e.g. several VFIO passthrough devices, each might require additional non-overlapping locked memory. However, since the kernel's mlock enforcement applies to the total unique locked bytes in the process's address space, we don't need to track a "count of locks" explicitly here. Instead, users or plugins/webhooks can simply set reservedOverhead to a required value and KubeVirt will factor that into the overall rlimit.

In the current KubeVirt implementation, we already use a fixed 1GiB overhead for any VFIO assignments, regardless of device count, but this new API would allow more control, which is perfect for dynamic plugin scenarios like in this VEP.

If there's a specific use case where we will need a different behavior, we can iterate on it, but based on my understanding, this should cover multi-device attachments effectively, and keep the API hypervisor agnostic (for the future).

Does this address your concern, or am I missing something?

@bgartzi
Contributor Author

bgartzi commented Dec 13, 2025

@vladikr thank you. That's way simpler than what I was proposing and addresses my concerns, you're right. Sometimes you can't see it clearly when the solution is right in front of you...

I will rephrase the document and send a new version for review beginning next week.

@bgartzi
Contributor Author

bgartzi commented Dec 15, 2025

Hello again @vladikr, I've been thinking a bit more about this topic. After that, I don't clearly see how a boolean flag could help all possible use-cases.

I think that a boolean flag could cover a scenario in which only n=1 device requiring the VM memory to be locked is attached. If n>1, we wouldn't be safe unless we ask the user to do what I would consider a workaround: adding the expected VM memory as many times as needed to reservedOverhead.

Let's think of the following example: We configure requiresLock: true and reservedOverhead: 1Gi in the VM spec. This will work for when n_vdpa = 1. However, under the same configuration, when n_vdpa > 1, libvirt will try to re-configure the memlock rlimits but fail due to lack of permissions if I understand correctly.

This is because libvirt calculates the expected rlimit considering a known locked memory accounting issue [1], which in simple terms does $mem_{rlimit} = n_{lockdevs} * Mem_{vm} + Mem_{mmio} $, where $n_{lockdevs}$ is the number of devices requiring locking the VM memory, $Mem_{vm}$ is the VM memory, and $Mem_{mmio}$ is the MMIO memory region.

With a VM-wide boolean flag, we don't know how many of the attached devices do require locking memory, so we don't know which the value of $n_{lockdevs}$ is. In that case, I don't see a clear way to configure an environment in which libvirt/qemu can run safely without adding $(n_{lockdevs} - 1) * Mem_{vm}$ to reservedOverhead.

I'm preparing an experiment to confirm this.

As a side-note, in case of SR-IOV, I don't hit this issue when $n>1$ because I assume that a viommu is not being configured for the VM.

Edit: I can confirm the behavior for n>1

[1] https://gitlab.com/libvirt/libvirt/-/blob/v11.9.0/src/qemu/qemu_domain.c#L8363-8371

@vladikr
Member

vladikr commented Dec 15, 2025

Hello again @vladikr, I've been thinking a bit more about this topic. After that, I don't clearly see how a boolean flag could help all possible use-cases.

I think that a boolean flag could cover a scenario in which only n=1 device requiring to lock VM memory is attached. If n>1, we wouldn't be safe unless we ask the user to do what I would consider a workaround, by adding the expected VM memory as many times as needed to reservedOverhead.

Let's think of the following example: We configure requiresLock: true and reservedOverhead: 1Gi in the VM spec. This will work for when n_vdpa = 1. However, under the same configuration, when n_vdpa > 1, libvirt will try to re-configure the memlock rlimits but fail due to lack of permissions if I understand correctly.

This is because libvirt calculates the expected rlimit considering a known locked memory accounting issue [1], which in simple terms does $mem_{rlimit} = n_{lockdevs} * Mem_{vm} + Mem_{mmio}$, where $n_{lockdevs}$ is the number of devices requiring locking the VM memory, $Mem_{vm}$ is the VM memory, and $Mem_{mmio}$ is the MMIO memory region.

With a VM-wide boolean flag, we don't know how many of the attached devices do require locking memory, so we don't know which the value of $n_{lockdevs}$ is. In that case, I don't see a clear way to configure an environment in which libvirt/qemu can run safely without adding $(n_{lockdevs} - 1) * Mem_{vm}$ to reservedOverhead.

I'm preparing an experiment to confirm this.

As a side-note, in case of SR-IOV, I don't hit this issue when n > 1 because I assume that a viommu is not being configured for the VM.

Edit: I can confirm the behavior for n>1

[1] https://gitlab.com/libvirt/libvirt/-/blob/v11.9.0/src/qemu/qemu_domain.c#L8363-8371

Hey @bgartzi Thanks again for diving deeper and for running the experiment.

I think I understand your concern.
I appreciate the detailed explanation and the link to the libvirt code.

You're absolutely right about the locked memory accounting issue in libvirt/QEMU.
For context, libvirt calculates the required RLIMIT_MEMLOCK roughly as n_lockdevs * guest_memory + mem_mmio + other_overhead, where n_lockdevs is the number of devices requiring memory locking.
If I'm not mistaken, this comes from QEMU, where memory is effectively overcounted for each device, resulting in a higher required limit; because of that, libvirt tries to set a high rlimit.

In KubeVirt, since the virt-launcher pod (and all the processes in it) is unprivileged, libvirt can't increase the rlimit itself. It is the virt-handler that sets the limit; if libvirt attempts to set a higher value, it will fail.
Currently, KubeVirt's GetMemoryOverhead and rlimit adjustments (in AdjustResources/AdjustQemuProcessMemoryLimits) use a fixed 1GiB overhead for any VFIO assignment, which works for n=1 (and, depending on the device, can work for n>1) but falls short for n>1 due to the overcounting you saw in your experiment.
Until now, KubeVirt users could adjust the overhead (the additional memory that virt-handler will use for the rlimit) by manually setting spec.domain.resources.requests.memory higher than spec.domain.memory.guest.

The proposed spec.domain.memory.reservedOverhead with requiresLock: true directly addresses this without needing a per-lock count:

  • The bool triggers the overall memlock enablement and prlimit adjustment in virt-handler.
  • The reservedOverhead lets users (or mutating webhooks/plugins) explicitly add the extra amount needed, which gets factored into both the pod resource requests/limits and the final rlimit calculation.
    For n>1, you can set it to something like (n-1) * [guest memory] + n * Mem_mmio (or whatever formula fits your vDPA case), that is, to make sure the rlimit is high enough to equal or exceed libvirt's internal calculation and avoid the failure.

I don't see it as a "workaround". It's the intended flexibility for cases where core KubeVirt can't automatically detect n_lockdevs. That is also extensible for plugins to add devices at runtime, as in this VEP. For built-in features like SR-IOV or GPUs, we could enhance KubeVirt to count devices from the VMI spec and automatically apply the per device overhead in GetMemoryOverhead to reduce the need for manual adjustments. But for dynamic plugins, the API allows them to compute and set the overhead.

Does this make sense, or is there a specific aspect of the vDPA plugin where even this wouldn't be enough?

@bgartzi
Contributor Author

bgartzi commented Dec 16, 2025

Hi @vladikr, thanks a lot again, let me eat my own words and excuse myself.

This would indeed work as you explain, for vDPA and other use-cases. I was concerned about exposing reservedOverhead to the user and making them do some maths themselves; that's why I considered it a workaround that users would have to do at first. It's true we can hide that behind a mutating admission webhook and not bother users with it at all, so yeah, excuse me.

I'm rephrasing this VEP by today according to the API you proposed.

Now my question is, what if reservedOverhead has a value but requiresLock: false? We should mark that as an invalid configuration, right? Or are we expected to treat reservedOverhead as additionalGuestMemoryOverheadRatio or binding.computeResourceOverhead? I mean, would the value in reservedOverhead end up limiting the amount of schedulable VMs per node, or is it a parameter that just indicates how much memory to lock?

@bgartzi bgartzi force-pushed the network_binding_plugin-memlock_rlimits branch from 3e4e55d to 015c554 on December 16, 2025 11:58
@bgartzi
Contributor Author

bgartzi commented Dec 16, 2025

I just pushed a new version of the VEP that matches the design that is currently being discussed. That is, it proposes adding a new optional field named spec.domain.memory.lockRLimits which contains two optional sub-fields:

  • requiresLock: a boolean flag.
  • reservedOverhead: a k8s Quantity object.

@EdDev
Member

EdDev commented Dec 18, 2025

Now my question is, what if reservedOverhead has a value but requiresLock: false? We should mark that as an invalid configuration, right? Or are we expected to treat reservedOverhead as additionalGuestMemoryOverheadRatio or binding.computeResourceOverhead? I mean, would the value in reservedOverhead end up limiting the amount of schedulable VMs per node, or is it a parameter that just indicates how much memory to lock?

The new section should replace vmi.Spec.Domain.Resources.Requests.Memory by being explicit about the additional memory to use beyond the amounts KubeVirt is aware of.
In this regard, this new field should be added behind a protecting FG in the alpha stage. The beta stage should also include declaring the previous options to extend the memory (from both the VMI spec and the plugin one) as deprecated.

I just pushed a new version of the VEP that matches the design that is currently being discussed. That is, it proposes adding a new optional field named spec.domain.memory.lockRLimits which contains two optional sub-fields:

* `requiresLock`: a boolean flag.

* `reservedOverhead`: a k8s `Quantity` object.

The suggestion, as I understood it from @vladikr is this:

spec:
  domain:
    memory:
      reservedOverhead:
        value: <Quantity>
        requiresLock: "true"

Make sure you do not use a Go boolean type but an enum/string; that fits Kubernetes API best practices.
I used the name value above for the memory specification, but if there is some other better name for it, feel free to use that.

Comment thread veps/sig-compute/144-memory-lock-rlimits/memory-lock-rlimits.md Outdated

### Alpha

### Beta
Member


This may have been already discussed, but please explain why you think we can go with Beta directly.
I think this API needs experimentation before we can be sure it can go GA. I would recommend starting with Alpha, then moving to Beta without any special requirements, except perhaps the notification that the new field will replace the previous extension knob in both the VMI spec and the network plugin parts.

In all cases, unit and e2e tests are a must from day 1, none are to be dragged to the GA stage.

You also need to bring back the FG section, as this now needs to be protected by it.

Contributor Author


I moved it to Alpha in a version I pushed just before receiving these comments.

I'm now pushing a newer version that also considers e2e testing in Alpha, as well as documentation.
I'm also adding some comments about notifying that other memory extension mechanisms are being deprecated in favour of this in beta, and how they would be rolled out when this feature goes GA.

@bgartzi bgartzi force-pushed the network_binding_plugin-memlock_rlimits branch from 015c554 to 3f4a56b Compare December 18, 2025 08:17
Signed-off-by: Beñat Gartzia Arruabarrena <bgartzia@redhat.com>
@bgartzi
Contributor Author

bgartzi commented Dec 18, 2025

@EdDev,

Now my question is, what if reservedOverhead has a value but requiresLock: false? We should mark that as an invalid configuration, right? Or are we expected to treat reservedOverhead as additionalGuestMemoryOverheadRatio or binding.computeResourceOverhead? I mean, would the value in reservedOverhead end up limiting the amount of schedulable VMs per node, or is it a parameter that just indicates how much memory to lock?

The new section should replace vmi.Spec.Domain.Resources.Requests.Memory by being explicit about the additional memory to use beyond the amounts KubeVirt is aware of.

Okay, now I understand better. Telling whether memory could be locked or not, and telling whether some overhead is expected or not. This approach handles well cases in which qemu/devices could lock the VM memory once plus some extra overhead memory. That overhead memory will be taken into account in the pod resource limits and also in the virt-launcher's environment memlock RLimits.

However, I don't see how this could scale to cases in which 2 (or more) devices might lock the same VM memory. VM memory will be advertised as locked once for each device that is locking it. That does not mean that we are using twice as much physical RAM as initially intended, but that the same physical memory region has been counted as locked twice.

Not sure if I'm explaining my concern clearly enough; I hope a practical example helps. Let's imagine an 8G VM that has 2 devices that will lock VM memory. Those devices need a 1G memory overhead (which will also be locked). In that case, we need 8G + 1G -> 9G of physical memory to run safely. However, each device will advertise the VM's memory as locked once, so the memlock RLimits that we need to run libvirt/qemu safely are 2 * 8G + 1G -> 17G: almost double. If we want to use reservedOverhead to indicate both memory limits and memlock RLimits through the same field, we would end up telling Kubernetes that we need 17G to run this VM, while we just needed 9G.

I guess that it's okay for alpha, but in those cases the amount of VMs that we think can be scheduled on a node wouldn't be on par with the actual amount of VMs that can really be scheduled. That's one of the reasons why I was advocating for modelling requiresLock as an int/counter in earlier drafts.
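To make the arithmetic above concrete, a minimal sketch (the helper names are hypothetical, not KubeVirt code):

```python
GIB = 1 << 30  # bytes in one GiB


def physical_memory_needed(guest_mem: int, device_overhead: int) -> int:
    """The guest's pages exist once in RAM, plus the devices' extra
    overhead, regardless of how many devices lock them."""
    return guest_mem + device_overhead


def libvirt_memlock_rlimit(guest_mem: int, n_lockdevs: int,
                           device_overhead: int) -> int:
    """libvirt currently accounts the full guest memory once per
    locking device, so the rlimit grows with the device count."""
    return n_lockdevs * guest_mem + device_overhead


# The 8G VM with 2 locking devices and 1G of overhead from the example:
print(physical_memory_needed(8 * GIB, 1 * GIB) // GIB)     # -> 9
print(libvirt_memlock_rlimit(8 * GIB, 2, 1 * GIB) // GIB)  # -> 17
```

Driving the pod memory request from the same reservedOverhead value would inflate it to the 17G figure even though only 9G of physical memory is ever used.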

In this regard, this new field should be added behind a protecting FG in the alpha stage. The beta stage should also include declaring the previous options to extend the memory (from both the VMI spec and the plugin one) as deprecated.

Fixing that in the version I'm pushing now.

I just pushed a new version of the VEP that matches the design that is currently being discussed. That is, it proposes adding a new optional field named spec.domain.memory.lockRLimits which contains two optional sub-fields:

  • requiresLock: a boolean flag.

  • reservedOverhead: a k8s Quantity object.

The suggestion, as I understood it from @vladikr is this:

spec:
  domain:
    memory:
      reservedOverhead:
        value: <Quantity>
        requiresLock: "true"

Make sure you do not use a Go boolean type but an enum/string; that fits Kubernetes API best practices.
I used the name value above for the memory specification, but if there is some other better name for it, feel free to use that.

I'm updating field names in a newer version I'm pushing right now.

@bgartzi bgartzi force-pushed the network_binding_plugin-memlock_rlimits branch from 3f4a56b to 4b689a8 Compare December 18, 2025 11:38
@vladikr
Member

vladikr commented Dec 18, 2025

@EdDev,

Now my question is, what if reservedOverhead has a value but requiresLock: false? We should mark that as an invalid configuration, right? Or are we expected to treat reservedOverhead as additionalGuestMemoryOverheadRatio or binding.computeResourceOverhead? I mean, would the value in reservedOverhead end up limiting the amount of schedulable VMs per node, or is it a parameter that just indicates how much memory to lock?

The new section should replace vmi.Spec.Domain.Resources.Requests.Memory by being explicit about the additional memory to use beyond the amounts KubeVirt is aware of.

Okay, now I understand better. Telling whether memory could be locked or not, and telling whether some overhead is expected or not. This approach handles well cases in which qemu/devices could lock the VM memory once plus some extra overhead memory. That overhead memory will be taken into account in the pod resource limits and also in the virt-launcher's environment memlock RLimits.

However, I don't see how this could scale to cases in which 2 (or more) devices might lock the same VM memory. VM memory will be advertised as locked once for each device that is locking it. That does not mean that we are using twice as much physical RAM as initially intended, but that the same physical memory region has been counted as locked twice.

Not sure if I'm explaining my concern clearly enough; I hope a practical example helps. Let's imagine an 8G VM that has 2 devices that will lock VM memory. Those devices need a 1G memory overhead (which will also be locked). In that case, we need 8G + 1G -> 9G of physical memory to run safely. However, each device will advertise the VM's memory as locked once, so the memlock RLimits that we need to run libvirt/qemu safely are 2 * 8G + 1G -> 17G: almost double. If we want to use reservedOverhead to indicate both memory limits and memlock RLimits through the same field, we would end up telling Kubernetes that we need 17G to run this VM, while we just needed 9G.

@bgartzi I completely understand your concern, but the issue comes from libvirt's overaccounting.
I don't think KubeVirt needs to craft its API to work around bugs in libvirt/qemu.
Exposing something like lockCount risks tying the API to a transient bug... I would prefer not to expose the device counts or locking multiplicities.

I guess that it's okay for alpha, but in those cases the amount of VMs that we think can be scheduled on a node wouldn't be on par with the actual amount of VMs that can really be scheduled. That's one of the reasons why I was advocating for modelling requiresLock as an int/counter in earlier drafts.

In this regard, this new field should be added behind a protecting FG in the alpha stage. The beta stage should also include declaring the previous options to extend the memory (from both the VMI spec and the plugin one) as deprecated.

Fixing that in the version I'm pushing now.

I just pushed a new version of the VEP that matches the design that is currently being discussed. That is, it proposes adding a new optional field named spec.domain.memory.lockRLimits which contains two optional sub-fields:

  • requiresLock: a boolean flag.
  • reservedOverhead: a k8s Quantity object.

The suggestion, as I understood it from @vladikr is this:
spec:
  domain:
    memory:
      reservedOverhead:
        value: <Quantity>
        requiresLock: "true"
Make sure you do not use a Go boolean type but an enum/string; that fits Kubernetes API best practices.
I used the name value above for the memory specification, but if there is some other better name for it, feel free to use that.

I'm updating field names in a newer version I'm pushing right now.

@iholder101
Contributor

@bgartzi I completely understand your concern, but the issue comes from libvirt's overaccounting. I don't think KubeVirt needs to craft its API to work around bugs in libvirt/qemu. Exposing something like lockCount risks tying the API to a transient bug... I would prefer not to expose the device counts or locking multiplicities.

Thanks @vladikr for the reference.

In a code comment, they even write:

    /* At present, the full memory needs to be locked for each VFIO / VDPA
     * NVMe device. For VFIO devices, this only applies when there is a
     * vIOMMU present. Yes, this may result in a memory limit that is
     * greater than the host physical memory, which is not ideal. The
     * long-term solution is a new userspace iommu interface (iommufd)
     * which should eliminate this duplicate memory accounting. But for now
     * this is the only way to enable configurations with e.g. multiple
     * VDPA/NVMe devices.

So there is an intention to fix this bug moving forward.
Therefore I definitely agree we should not aim to fix this as part of Kubevirt's API.

@vladikr
Member

vladikr commented Dec 18, 2025

/approve
From my side, this PR looks good. Let's start with the proposed API and see if we need adjustments later.
Thank you @bgartzi !
I would like to wait for @EdDev to have another look.

@kubevirt-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vladikr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 18, 2025
Member

@EdDev EdDev left a comment


This looks great, thank you very much!

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Dec 21, 2025
@kubevirt-bot kubevirt-bot merged commit 6aa0acb into kubevirt:main Dec 21, 2025
2 checks passed
@bgartzi
Contributor Author

bgartzi commented Dec 22, 2025

Thanks a lot everyone for the patience, reviews and help!

@enp0s3

enp0s3 commented Apr 19, 2026

@bgartzi @vladikr Hey folks. There is something that I can't understand: in the case of multiple webhooks, how can we guarantee that one webhook will respect the values set by the previous one?

@iholder101
Contributor

@bgartzi @vladikr Hey folks. There is something that I can't understand: in the case of multiple webhooks, how can we guarantee that one webhook will respect the values set by the previous one?

Hey @enp0s3!

@bgartzi recently raised the same concern here #263 (comment) which was followed up by a few comments. Do you mind adding your thoughts over there?


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. sig/compute size/L
