Skip to content

fix segfault when exitting cage with a child still present#488

Open
sdumetz wants to merge 1 commit intocage-kiosk:masterfrom
Holusion:child_segfault
Open

fix segfault when exitting cage with a child still present#488
sdumetz wants to merge 1 commit intocage-kiosk:masterfrom
Holusion:child_segfault

Conversation

@sdumetz
Copy link
Copy Markdown

@sdumetz sdumetz commented Mar 18, 2026

I encountered a SEGFAULT when running a program that had bad cleanup handling. ie: it was spawning child processes that owned the (xwayland-based) window and on SIGINT was exitting before the child had time to exit properly. I hacked-together a bash program that can reproduce the behaviour :

trigger SIGSEGV with xclock - bash script
#!/usr/bin/env bash
# Minimal reproducer for cage XWayland surface destroy crash.
#
# This script sends SIGTERM only to cage, leaving xclock alive so the
# surface is still present when wlr_xwayland_destroy() is called.
#
# Usage:
#   ./reproduce_crash.sh [path-to-cage]
#
# Expected result with unfixed cage:  exit code 139 (SIGSEGV)
# Expected result with fixed cage:    exit code 0

set -euo pipefail

CAGE=${1:-./build/cage}

if ! command -v xclock &>/dev/null; then
    echo "ERROR: xclock not found. Install with: sudo apt install x11-apps" >&2
    exit 1
fi

if [[ ! -x "$CAGE" ]]; then
    echo "ERROR: cage binary not found at $CAGE" >&2
    exit 1
fi

echo "Starting cage with xclock..."
"$CAGE" -- xclock &
CAGE_PID=$!

echo "Waiting for xclock to map its window..."
sleep 3

echo "Sending SIGTERM to cage (pid $CAGE_PID) only, xclock stays alive..."
kill -TERM "$CAGE_PID" 2>/dev/null || true

wait "$CAGE_PID"
EXIT=$?

if [[ $EXIT -eq 139 ]]; then
    echo "CRASHED (exit 139, SIGSEGV) — bug reproduced"
elif [[ $EXIT -eq 0 ]]; then
    echo "Clean exit (exit 0) — bug not present or already fixed"
else
    echo "Exited with code $EXIT"
fi

This is not a very high-profile bug, because a program that handles signals properly wouldn't trigger it. And it's triggered on exit anyways so it's only a minor inconvenience.

But I suppose it can't hurt to make destroy handlers a bit more robust? Anyways, I hope this helps.

The proposed fix registers a handler to nullify scene_tree after wlroots destroys the surface. cage's destroy handler is then made conditional to prevent dereferencing the destroyed tree node.

Please let me know if I misunderstood something or if the fix is otherwise problematic: I'm still new to wlroots.

@sdumetz
Copy link
Copy Markdown
Author

sdumetz commented Mar 18, 2026

Note that #484 looks like it's adressing the same thing, with part of the same fix but it's not sufficient to fix my exact issue.

Comment thread view.c
Comment on lines +172 to +173
view->surface_destroy.notify = handle_surface_destroy;
wl_signal_add(&surface->events.destroy, &view->surface_destroy);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. wlr_surface.events.unmap is always guaranteed to be called before destroy, so I don't know how view->scene_tree can free itself before we call wlr_scene_node_destroy().

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its entirely possible that I got things wrong. I'm not at all confident in my knowledge of the wlroots or cage codebases.

What I know is that in the described test, view_unmap does raise a SIGSEGV (at least, _on my machine_™)

Adding log lines to this patch in view_unmap, and handle_surface_destroy I see (when running with the detached child process):

00:00:02.989 [../view.c:118] surface destroy
00:00:02.989 [../view.c:137] view_unmap: scene_tree is NULL

Which if I understand correctly means either there is a bug elsewhere (in wlroots?) or unmap is not guaranteed to run before destroy?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the surface object is destroyed, this function gets called in wlroots:

https://gitlab.freedesktop.org/wlroots/wlroots/-/blob/334019f839bf0728d958c179aceed67e0e8db66a/types/wlr_compositor.c#L725

The first thing it does is ensure the surface is unmapped:

https://gitlab.freedesktop.org/wlroots/wlroots/-/blob/334019f839bf0728d958c179aceed67e0e8db66a/types/wlr_compositor.c#L934

Are we sure a single surface is involved?

The script linked above doesn't reproduce the bug on my setup. cage doesn't crash, it exits cleanly.

Copy link
Copy Markdown
Author

@sdumetz sdumetz Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script linked above doesn't reproduce the bug on my setup. cage doesn't crash, it exits cleanly.

Oh that's strange. I didn't think it would be setup-dependant since I saw the issue in the wild on a brand new amd64 mini-PC and on a raspberry PI running cage over a DRM backend (admitttedly, running a modified version with a few patches) and was able to reproduce with this script on my machine (debian/KDE, wayland backend) using the master branch and default options.

The attached script seems to segfault 100% of the time, too.

When compiled in debug mode, the logs' tail looks like this:

00:00:00.202 [types/wlr_compositor.c:786] New wlr_surface 0x556959ee6d10 (res 0x556959d49bb0)
00:00:00.202 [xwayland/xwm.c:2055] New xwayland surface: 0x556959ee6d10
00:00:00.202 [xwayland/xwm.c:695] XCB_ATOM_NET_STARTUP_ID: (null)
00:00:00.203 [xwayland/xwm.c:1105] unhandled X11 property 283 (WM_STATE) for window 4194314
00:00:00.203 [backend/wayland/output.c:433] Primary buffer size mismatch
00:00:00.215 [types/scene/wlr_scene.c:2189] Direct scan-out enabled
Sending SIGTERM to cage (pid 65809) only, xclock stays alive...
(EE) failed to read Wayland events: Connection reset by peer
./reproduce_crash.sh : ligne 46 : 65809 Erreur de segmentation  (core dumped)"$CAGE" -- xclock
X connection to :1 broken (explicit kill or server shutdown).

Where pretty much everything looks ok to me, except for the core dump. Could it be a xwayland issue rather than a cage / wlroots issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants