fix segfault when exiting cage with a child still present #488
sdumetz wants to merge 1 commit into cage-kiosk:master from
Conversation
Note that #484 looks like it's addressing the same thing, with part of the same fix, but it's not sufficient to fix my exact issue.
view->surface_destroy.notify = handle_surface_destroy;
wl_signal_add(&surface->events.destroy, &view->surface_destroy);
I'm not sure I understand. wlr_surface.events.unmap is always guaranteed to be called before destroy, so I don't know how view->scene_tree can free itself before we call wlr_scene_node_destroy().
It's entirely possible that I got things wrong. I'm not at all confident in my knowledge of the wlroots or cage codebases.
What I know is that in the described test, view_unmap does raise a SIGSEGV (at least, _on my machine_™)
Adding log lines to this patch in view_unmap and handle_surface_destroy, I see the following (when running with the detached child process):
00:00:02.989 [../view.c:118] surface destroy
00:00:02.989 [../view.c:137] view_unmap: scene_tree is NULL
Which, if I understand correctly, means either there is a bug elsewhere (in wlroots?) or unmap is not guaranteed to run before destroy?
When the surface object is destroyed, this function gets called in wlroots:
The first thing it does is ensure the surface is unmapped:
Are we sure a single surface is involved?
The script linked above doesn't reproduce the bug on my setup. cage doesn't crash, it exits cleanly.
> The script linked above doesn't reproduce the bug on my setup. cage doesn't crash, it exits cleanly.
Oh, that's strange. I didn't think it would be setup-dependent, since I saw the issue in the wild on a brand new amd64 mini-PC and on a Raspberry Pi running cage over a DRM backend (admittedly, a modified version with a few patches), and I was able to reproduce it with this script on my machine (Debian/KDE, Wayland backend) using the master branch and default options.
The attached script seems to segfault 100% of the time, too.
When compiled in debug mode, the tail of the logs looks like this:
00:00:00.202 [types/wlr_compositor.c:786] New wlr_surface 0x556959ee6d10 (res 0x556959d49bb0)
00:00:00.202 [xwayland/xwm.c:2055] New xwayland surface: 0x556959ee6d10
00:00:00.202 [xwayland/xwm.c:695] XCB_ATOM_NET_STARTUP_ID: (null)
00:00:00.203 [xwayland/xwm.c:1105] unhandled X11 property 283 (WM_STATE) for window 4194314
00:00:00.203 [backend/wayland/output.c:433] Primary buffer size mismatch
00:00:00.215 [types/scene/wlr_scene.c:2189] Direct scan-out enabled
Sending SIGTERM to cage (pid 65809) only, xclock stays alive...
(EE) failed to read Wayland events: Connection reset by peer
./reproduce_crash.sh: line 46: 65809 Segmentation fault (core dumped) "$CAGE" -- xclock
X connection to :1 broken (explicit kill or server shutdown).
Pretty much everything there looks OK to me, except for the core dump. Could it be an Xwayland issue rather than a cage/wlroots issue?
I encountered a SIGSEGV when running a program with bad cleanup handling, i.e. it spawned child processes that owned the (Xwayland-based) window and, on SIGINT, exited before the child had time to exit properly. I hacked together a bash script that can reproduce the behaviour:
trigger SIGSEGV with xclock - bash script
This is not a very high-profile bug, because a program that handles signals properly wouldn't trigger it, and since it's only triggered on exit anyway it's a minor inconvenience.
But I suppose it can't hurt to make the destroy handlers a bit more robust? Anyway, I hope this helps.
The proposed fix registers a handler to nullify scene_tree after wlroots destroys the surface. cage's destroy handler is then made conditional to prevent dereferencing the destroyed tree node. Please let me know if I misunderstood something or if the fix is otherwise problematic: I'm still new to wlroots.