VM stuck in unresponsive state and prohibits listing processes on host #389

@ddrazyk

Description

We had an issue on 3 out of 4 hosts in an oVirt cluster (4.5.4-1.el8) where one VM gets stuck in an unresponsive state. It can be neither powered down nor restarted, and as long as its qemu process is running I can't list processes on that host. The VM is unreachable over the network and through oVirt's VNC console. The only way to resolve the issue is to restart the host from the oVirt web UI (or kill the qemu process).
I can see entries like these in the vdsm logs:

2023-05-05 21:27:52,848+0200 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7fe08c0d9630>> operation failed (periodic:187)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 185, in call
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 476, in _poller
    vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 797, in _qga_call_get_vcpus
    if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable
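The TypeError above means that `vcpus` was `None` when the `'online' in vcpus` check ran. A minimal sketch of a guard for that case follows. The helper names `parse_cpulist` and `guest_cpu_count` are hypothetical and this is not the vdsm code; it also assumes the `online` value is a cpulist string such as `0-3,5`.

```python
from typing import Dict, Mapping, Optional, Set


def parse_cpulist(cpulist: str) -> Set[int]:
    """Expand a cpulist string such as "0-3,5" into a set of CPU indexes."""
    cpus: Set[int] = set()
    for part in cpulist.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def guest_cpu_count(vcpus: Optional[Mapping[str, str]]) -> Dict[str, int]:
    """Hypothetical helper: report the online vCPU count, or nothing.

    Returns an empty dict when the guest agent gave no answer (None)
    or the answer has no 'online' key, instead of raising TypeError.
    """
    if vcpus is None or "online" not in vcpus:
        return {}
    return {"guestCPUCount": len(parse_cpulist(vcpus["online"]))}
```

With a guard like this the poller would skip the sample rather than fail, but it would not fix the underlying hang: the guest agent call returning nothing is a symptom of the stuck VM.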

This eventually leads to:
2023-05-05 21:45:17,709+0200 ERROR (vm/220746d4) [virt.vm] (vmId='220746d4-56a5-40cc-8633-1285c167c4fe') Failed to update CPU set of the VM to match shared pool (cpumanagement:121)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 104, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 114, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 78, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2303, in pinVcpu
    raise libvirtError('virDomainPinVcpu() failed')
libvirt.libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/cpumanagement.py", line 108, in _assign_shared
    vm.pin_vcpu(vcpu, cpuset)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 6306, in pin_vcpu
    self._dom.pinVcpu(vcpu, cpuset)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 112, in f
    raise toe
vdsm.virt.virdomain.TimeoutError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)
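The timeout message names the job that currently holds the domain's state change lock, here `remoteDispatchConnectGetAllDomainStats`. When scanning many log lines, a small sketch like the one below could pull that holder out. The function name `lock_holder` and the regular expression are assumptions based only on the message format shown above.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the log message in this report; other libvirt
# versions may word the message differently.
_LOCK_HOLDER = re.compile(
    r"cannot acquire state change lock \(held by (\w+)=(\w+)\)"
)


def lock_holder(message: str) -> Optional[Tuple[str, str]]:
    """Return (kind, owner) of the lock holder, or None if not present."""
    match = _LOCK_HOLDER.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For the message above this yields the pair `monitor` and `remoteDispatchConnectGetAllDomainStats`, which suggests the pin request was waiting behind a stats query that never returned.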

This causes the CPU to get stuck on the qemu process. If I forcibly kill the process, everything goes back to normal, but oVirt reports the VM's state as "unresponsive", or as "powering down" if I try to shut it down from the web UI.
The hosts access storage over GlusterFS FUSE; Gluster runs on separate hosts (3 hosts with replica 3 and a JBOD setup with 6 NVMe disks).
All hosts (hypervisors and Gluster) run CentOS 8 Stream.

Version-Release number of selected component:
4.50.3.4-1.el8.x86_64
