VEP #144: Adjust virt-launcher memory lock requirements #145
Conversation
/cc @nirdothan
Force-pushed 14c932c to 1631058
Pointed out the scalability concerns discussed yesterday during the community sig-network call. I also added another possible alternative that could address the issue.
| configuration is static, meaning that it does not scale based on
| guest memory sizes. It increases pods' memory overhead, which impacts
| on schedulable VMIs. And finally, it does not trigger memory lock
| rlimit configuration.
Let me get this right:
- You need a dynamic overhead that is a function of the guest memory, while `ComputeResourceOverhead` is a fixed size.
- The current calculation of the RLIMIT takes `vmi.Spec.Domain.Memory.Guest` or `vmi.Spec.Domain.Resources.Requests.Memory` but ignores `ComputeResourceOverhead`, even though the container's `resources.requests` does account for it.
Regarding 1: You might as well add 0 overhead to the plugin, but add enough guest memory in the VMI. Yes, the calculation may be complex, but that can be assisted by a simple webhook that knows the algorithm.
Regarding 2: I assume that it was done this way in order to support passt, which has pod-level memory overhead but does not affect the guest in any way. This could have been a wrong decision that is only viable for passt.
Regarding:
which impacts on schedulable VMIs
It has to impact schedulable VMIs. You cannot lock memory without specifying the request to the scheduler.
- Yes.
- Yes, it's ignoring `ComputeResourceOverhead`. The reason for that is that, IIUC, `ComputeResourceOverhead` is about the overhead that the network binding introduces, but it has nothing to do with the guest memory size, which is what the devices will try to lock.
Regarding 1: You might as well add 0 overhead to the plugin, but add enough guest memory in the VMI.
Sorry, I think I didn't quite get this, could you rephrase it?
Yes the calculation may be complex, but that can be assisted by a simple webhook that knows the algorithm.
One of the alternatives presented in this VEP does take an implementation based on admission webhooks into account. I agree that outsourcing the calculation function to webhooks could make the solution more general. The problem is that even if the webhook knows the algorithm, we need somewhere else to put that information, be it an annotation, a field in the VMI spec, or whatever, and act accordingly later.
From the implementation point of view, the key aspect is that `AdjustResources` should look into that annotation or field, or even call a calculation function, to know whether a network binding plugin, or anything else, could need a non-default memory lock limit to function properly. If the limit is not adjusted properly, libvirt will try to do it later, and this will lead to an error as it lacks permissions.
You cannot lock memory without specifying the request to the scheduler.
But this does not lock memory, it "just" updates the limits of memory that virt-launcher processes could lock.
I might be missing something here as well, but `setProcessMemoryLockRLimit`, which is the function adjusting memory lock limits, does not let the scheduler know about the update.
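For context, adjusting a process' memlock limit is just a prlimit(2) call against the target pid; the scheduler never sees it and no pod resource changes. Below is a minimal sketch of such a call; it is not KubeVirt's actual `setProcessMemoryLockRLimit`, and the pid and size are made up:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// setMemlockLimit raises RLIMIT_MEMLOCK for an already running process.
// Hypothetical helper, only mirroring the role of setProcessMemoryLockRLimit.
func setMemlockLimit(pid int, bytes uint64) error {
	limit := &unix.Rlimit{Cur: bytes, Max: bytes}
	return unix.Prlimit(pid, unix.RLIMIT_MEMLOCK, limit, nil)
}

func main() {
	// Example: allow a (hypothetical) virt-launcher pid to lock up to 9 GiB.
	if err := setMemlockLimit(12345, 9<<30); err != nil {
		fmt.Println("prlimit failed:", err)
	}
}
```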
Thanks. I'm beginning to understand the difference now.
Bottom line: KubeVirt is not sufficiently flexible to let the user define lock limits separate from the guest memory.
If that is the case, it is probably not a network-specific requirement. I suppose that it needs to be solved across the board.
| guest memory size, number of network interfaces of each kind and
| network bindings' needs.
| - Adjust memory lock limits without impacting VMI scheduling capacity
|   (i.e. increasing pod memory resource limits).
I fail to understand how you can lock memory without affecting scheduling capacity.
Because it is not about setting a limit on how much memory the guest and all related infra are going to consume. It is about setting a limit on how much of that (already consumed) memory can be locked, in other words, prevented from being moved away from its physical RAM pages.
| - $A_i$ is 1 if `AdjustMemoryLockLimits` is true for the network
|   binding plugin of interface $i$ and 0 otherwise.
| - $O_{fixed}$ is a fixed offset set to $1024*1024$ kB, to mimic
|   [libvirt's expectations][libvirt-vdpa-memlock-ratio].
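From the quoted definitions, the VEP's calculation presumably has roughly this shape (a hedged reconstruction, since the full formula is not part of the quoted hunk), where $M_{guest}$ is the guest memory size:

$$\mathrm{RLIMIT} \approx \Big(\sum_i A_i\Big) \cdot M_{guest} + O_{fixed}$$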
to mimic libvirt's expectations
IMHO, instead of having to guess libvirt's expectations, we should be able to obtain them by means of a libvirt query API.
I know it's no excuse, but such an interface does not currently exist, and libvirt is also trying to do its best to guess qemu's needs. That's pretty much why it wasn't taken into consideration in the first place.
However, what about the other alternatives, such as the one relying on admission webhooks? The idea was to outsource calculation to webhooks, which wouldn't be able to reach libvirt's query APIs IIUC.
The idea was to outsource calculation to webhooks, which wouldn't be able to reach libvirt's query APIs IIUC.
That is correct. I think that the suggestion to use a webhook came from people who understood that updating existing API values with a sophisticated calculation could meet the requirements. However, that is not the case.
/cc @EdDev @orelmisan
bgartzi left a comment
Thanks @nirdothan for this first review round. I'm leaving some comments on the discussion threads.
Comment moved to a reply above.
Force-pushed 1631058 to 92a9888
/sig compute
@bgartzi Thank you for this VEP! As we discussed during the last VEP call, I agree with you that, in general, workload/VM owners should have control over the allocated memory for their workload. @EdDev and I were chatting offline to discuss the API. What I would suggest is an API perhaps as
What do you think?
| This solution introduces a new optional annotation to `VirtualMachine`s,
| `kubevirt.io/memlock-rlimit`. The annotation will be used to specify the
| memlock rlimit needs of a VM. The value of that annotation key will
| contain a [`Quantity` object][k8s-resource-quantity] from
nit: I think it has to be a string that will be marshaled to a `Quantity` object.
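For illustration of the nit above, the annotation value would be a plain string that gets parsed into a `Quantity`; below is a minimal sketch under that assumption (the helper name is hypothetical, the annotation key is the one from the quoted hunk):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// memlockRlimitFromAnnotations parses the proposed annotation into a Quantity.
// It returns nil when the annotation is absent.
func memlockRlimitFromAnnotations(annotations map[string]string) (*resource.Quantity, error) {
	raw, ok := annotations["kubevirt.io/memlock-rlimit"]
	if !ok {
		return nil, nil
	}
	q, err := resource.ParseQuantity(raw) // e.g. "9Gi"
	if err != nil {
		return nil, fmt.Errorf("invalid memlock rlimit annotation %q: %w", raw, err)
	}
	return &q, nil
}

func main() {
	q, err := memlockRlimitFromAnnotations(map[string]string{"kubevirt.io/memlock-rlimit": "9Gi"})
	fmt.Println(q, err)
}
```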
| There already are certain cases in which the `virt-handler` adjusts
| memlock rlimits. When this new mechanism is introduced, `virt-handler`
| will track the annotation, and if any, it will add it to the result of
| the calculation existing for VFIO, SEV and RT VMs.
We need to decide if it is added to or overrides the existing calculation. If added, it may lead to situations where the entity that sets the annotation value needs to be aware of the KubeVirt internal calculation, e.g. for SR-IOV it should take into consideration that we already add 1G per device internally.
The tricky part is that if it is overridden completely, without taking into account possible existing values, we would be subject to the arbitrary order in which mutating webhooks are applied, and only the last one would impact the locked memory. That last one would only know about itself, but not about the rest.
I think that the existing logic should take the value present in the annotation and add it to the internal calculations. But yes, it gets tricky with the 1G assigned per SR-IOV device due to locking the MMIO region. What if an SR-IOV device and a vdpa/whatever device get attached to a VM? I agree with you that we would be accounting the MMIO region twice.
I think that considering the approach proposed by @vladikr could help mitigate this issue (using a boolean flag to mark whether memory needs to be locked or not).
| Note that mutating admission webhooks do introduce certain latency
| during pod admission. This means that in case the number of devices and
| conditions that require special memlock rlimits is large enough, this
| could introduce a considerable latency to `VirtualMachine` admission.
Unless we start implementing common use cases in the core, and not necessarily as webhooks.
In my opinion this mechanism's main selling point is keeping that logic out of the core for new use-cases that were not already modelled in the current KubeVirt source code.
| containing a proper value was added.
| - VM migration: Memory lock rlimits are configured properly in the
|   destination `virt-launcher` and qemu process when a VMI with a
|   `kubevirt.io/memlock-rlimit` annotation is migrated.
- Tests that validate the process rlimit cannot be e2e tests. We no longer use e2e tests to check underlying technicalities unless they can be asserted inside the guest (I assume that they can't?). We can, however, test the value that we set in unit tests (see the sketch below).
- An e2e test would verify that the VM starts up successfully/fails under different scenarios.
I suggest testing scenarios of memory hotplug along with devices that require a higher rlimit.
See https://github.com/kubevirt/kubevirt/blob/v1.7.0/tests/network/sriov.go#L320 as an example.
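A sketch of the unit-level assertion suggested above, checking the value we intend to set rather than probing the real process rlimit; `calculateMemlockLimit` is a hypothetical stand-in for the real calculation:

```go
package memlock_test

import "testing"

// calculateMemlockLimit stands in for the real calculation: guest-derived base
// plus whatever extra the annotation/field requests.
func calculateMemlockLimit(guestBytes, overheadBytes int64) int64 {
	return guestBytes + overheadBytes
}

func TestMemlockRlimitCoversGuestPlusOverhead(t *testing.T) {
	got := calculateMemlockLimit(8<<30, 1<<30) // 8 GiB guest + 1 GiB overhead
	if want := int64(9 << 30); got != want {
		t.Fatalf("memlock rlimit = %d, want %d", got, want)
	}
}
```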
I updated this part in the last version, let me know if it matches your expectations.
Force-pushed 92a9888 to 3e4e55d
bgartzi left a comment
Thanks @nirdothan, I'm leaving some comments and updates
@vladikr, thanks for taking a look and participating in the discussions. I agree with both of you that the API would look much better the way you propose it, especially by putting the new field inside of the
However, I have a concern about using a boolean flag: it makes it way more explicit and easier to understand, but it would imply a limitation when attaching 2 or more devices that need to lock memory to a VM. That's a known issue: even if two devices lock the same memory pages in physical RAM, the amount of locked memory is counted once per lock. If we would like to support this scenario, we would need an alternative, such as a count of the times memory would be locked or something like that. Would that work, or would you rather still go for the boolean flag?
@bgartzi Thanks for the feedback. I think there is some misunderstanding... The
For multiple devices, e.g. several VFIO passthrough devices, each might require additional non-overlapping locked memory. However, since the kernel's mlock enforcement is on the total unique locked bytes in the process's address space, we don't need to track a "count of locks" explicitly here. Instead, users or plugins/webhooks can simply set `reservedOverhead` to a required value and KubeVirt will factor that into the overall rlimit. In the current KubeVirt implementation, we already use a fixed 1GiB overhead for any VFIO assignments, regardless of device count, but this new API would allow more control, which is perfect for dynamic plugin scenarios like in this VEP. If there's a specific use case where we will need a different behavior, we can iterate on it, but based on my understanding, this should cover multi-device attachments effectively, and keep the API hypervisor-agnostic (for the future).
Does this address your concern, or am I missing something?
@vladikr thank you. That's way simpler than what I was proposing and addresses my concerns, you're right. Sometimes you can't see it clearly even when the solution is right in front of you... I will rephrase the document and send a new version for review at the beginning of next week.
Hello again @vladikr, I've been thinking a bit more about this topic. After that, I don't clearly see how a boolean flag could help all possible use-cases. I think that a boolean flag could cover a scenario in which only n=1 device requiring to lock VM memory is attached. If n>1, we wouldn't be safe unless we ask the user to do what I would consider a workaround, by adding the expected VM memory as many times as needed to
Let's think of the following example: We configure
This is because libvirt calculates the expected rlimit considering a known locked memory accounting issue [1], which in simple terms does
With a VM-wide boolean flag, we don't know how many of the attached devices do require locking memory, so we don't know what the value of
As a side-note, in the case of SR-IOV, I don't hit this issue when
Edit: I can confirm the behavior for n>1.
[1] https://gitlab.com/libvirt/libvirt/-/blob/v11.9.0/src/qemu/qemu_domain.c#L8363-8371
Hey @bgartzi, thanks again for diving deeper and for running the experiment. I think I understand your concern. You're absolutely right about the locked memory accounting issue in libvirt/QEMU. In KubeVirt, since the virt-launcher pod (and all the processes in it) is unprivileged, libvirt can't increase the rlimit itself. It is the virt-handler that sets the limit. If libvirt attempts to set a higher value it will fail. The proposed
I don't see it as a "workaround". It's the intended flexibility for cases where core KubeVirt can't automatically detect
Does this make sense, or is there a specific aspect of the vDPA plugin where even this wouldn't be enough?
Hi @vladikr, thanks a lot again, let me eat my own words and excuse myself. This would indeed work as you explain, for vDPA and other use-cases. I was concerned about exposing
I'm rephrasing this VEP today according to the API you proposed. Now my question is, what if
Force-pushed 3e4e55d to 015c554
I just pushed a new version of the VEP that matches the design that is currently being discussed. That is, it proposes adding a new optional field named
The new section should replace
The suggestion, as I understood it from @vladikr, is this:

```yaml
spec:
  domain:
    memory:
      reservedOverhead:
        value: <Quantity>
        requiresLock: "true"
```

Make sure you do not use a Go boolean type but an enum/string; it fits the best API practices from K8s.
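For illustration only, such a field could be modeled along these lines, with an enum-like string instead of a Go bool; none of these type or field names are the final KubeVirt API:

```go
package v1 // hypothetical API package

import "k8s.io/apimachinery/pkg/api/resource"

// MemoryLockRequirement is an enum-like string, following the K8s API
// preference for strings over raw booleans.
type MemoryLockRequirement string

const (
	MemoryLockRequired    MemoryLockRequirement = "true"
	MemoryLockNotRequired MemoryLockRequirement = "false"
)

// ReservedOverhead sketches the suggested spec.domain.memory.reservedOverhead.
type ReservedOverhead struct {
	// Value is extra memory reserved on top of the guest memory.
	Value *resource.Quantity `json:"value,omitempty"`
	// RequiresLock states whether the guest memory must be lockable.
	RequiresLock MemoryLockRequirement `json:"requiresLock,omitempty"`
}
```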
| ### Alpha
|
| ### Beta
This may have been already discussed, but please express why you think we can go with Beta directly.
I think this API needs experimentation before we can be sure it can go GA. I would recommend starting with Alpha, then moving to Beta without any special requirements, except perhaps the notification that the new field will replace the previous extension knob from both the VMI spec and the network plugin parts.
In all cases, unit and e2e tests are a must from day 1, none are to be dragged to the GA stage.
You also need to bring back the FG section, as this now needs to be protected by it.
I moved it to Alpha in a version I pushed just before receiving these comments.
I'm now pushing a newer version that also considers e2e testing in Alpha, as well as documentation.
I'm also adding some comments about notifying that other memory extension mechanisms are being deprecated in favour of this in Beta, and how they would be rolled out when this feature goes into GA.
Force-pushed 015c554 to 3f4a56b
Signed-off-by: Beñat Gartzia Arruabarrena <bgartzia@redhat.com>
Okay, now I understand better. Telling whether memory could be locked or not, and telling whether some overhead is expected or not. This approach handles well the cases in which qemu/devices could lock the VM memory once plus some extra overhead memory. That overhead memory will be taken into account in the pod resource limits and also in the virt-launcher's environment memlock RLimits.
However, I don't see how this could scale to cases in which 2 (or more) devices might lock the same VM memory. VM memory will be advertised as locked once per each device that is locking it. That does not mean that we are using twice as much physical RAM as initially intended, but that the same physical memory region has been counted as locked twice. Not sure if I'm explaining my concern clearly enough; I hope I can help myself through a practical example.
Let's imagine an 8G VM that has 2 devices that will lock VM memory. Those devices need a 1G memory overhead (which will also be locked). In that case, we need 8G + 1G -> 9G of physical memory to run safely. However, each device will advertise the VM's memory as locked once, so the memlock RLimits that we need to run libvirt/qemu safely are 2 * 8G + 1G -> 17G: almost double. If we want to use
I guess that it's okay for Alpha, but in those cases the amount of VMs that we think can be scheduled in a node wouldn't be on par with the actual amount of VMs that can really be scheduled. That's one of the reasons why I was advocating for modelling
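A rough sketch of the arithmetic above, assuming the per-device accounting described in this thread; the function and constants are illustrative, not libvirt's actual code:

```go
package main

import "fmt"

// expectedMemlockKiB paraphrases the accounting concern: each locking device
// advertises the whole guest memory as locked, plus a fixed 1 GiB offset.
func expectedMemlockKiB(guestMemKiB, lockingDevices int64) int64 {
	const fixedOffsetKiB = 1024 * 1024 // 1 GiB
	return guestMemKiB*lockingDevices + fixedOffsetKiB
}

func main() {
	// The example from the comment above: an 8G VM with two locking devices.
	fmt.Println(expectedMemlockKiB(8*1024*1024, 2)) // 17825792 KiB ≈ 17G, not 9G
}
```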
Fixing that in the version I'm pushing now.
I'm updating field names in a newer version I'm pushing right now.
Force-pushed 3f4a56b to 4b689a8
@bgartzi I completely understand your concern, but the issue comes from libvirt's overaccounting.
Thanks @vladikr for the reference. In a code comment, they even write:
So there is an intention to fix this bug moving forward.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: vladikr
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
EdDev left a comment
This looks great, thank you very much!
Thanks a lot everyone for the patience, reviews and help!
Hey @enp0s3! @bgartzi recently raised the same concern here #263 (comment) which was followed up by a few comments. Do you mind adding your thoughts over there?
VEP Metadata
Tracking issue: #144
SIG label: /sig compute
What this PR does
This enhancement introduces a new field to `VirtualMachine` annotations to specify memory lock rlimit configuration requirements. That annotation can be updated by other external mechanisms, such as mutating admission webhooks, based on the VMI spec and specific conditions. The value of that annotation field will be checked by the `virt-handler` to appropriately configure the memlock rlimits of the `virt-launcher` and qemu process.
Special notes for your reviewer