WebSocket connection can survive silent half-open disconnect indefinitely #577

@ku1ik

Description

After migrating asciinema-server from Cowboy to Bandit, I started seeing an accumulation of stale websocket-connected LiveViews on production nodes.

The problem reproduces when a websocket connection becomes half-open in a way that does not deliver FIN/RST to the server, while the server continues to generate outbound websocket traffic for that connection.

In my case, periodic PubSub updates trigger LiveView diffs/pushes every 5 seconds. Those outbound/local messages appear to keep the Bandit websocket process alive indefinitely, so the transport never goes away and the attached LiveViews also remain alive.
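For reference, the update pattern is roughly the following (a minimal sketch with illustrative module/assign names, not the actual asciinema-server code):

```elixir
defmodule DemoWeb.ClockLive do
  use Phoenix.LiveView

  def mount(_params, _session, socket) do
    # Start ticking only once the websocket transport is up.
    if connected?(socket), do: :timer.send_interval(5_000, :tick)
    {:ok, assign(socket, now: DateTime.utc_now())}
  end

  def handle_info(:tick, socket) do
    # Every tick produces an outbound diff over the websocket; per the
    # behavior described above, this server-side traffic keeps the Bandit
    # transport process alive even when the peer is silently gone.
    {:noreply, assign(socket, now: DateTime.utc_now())}
  end

  def render(assigns) do
    ~H"<div><%= @now %></div>"
  end
end
```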

This did not happen under Cowboy.

Environment

  • Elixir: 1.18
  • Phoenix: 1.7.21
  • Phoenix LiveView: 1.0.18
  • Plug: 1.18.1
  • Bandit: 1.10.2
  • ThousandIsland: 1.4.3
  • WebSockAdapter: 0.5.9

Expected behavior

A websocket connection that has become silently half-open should eventually be reaped, even if the server is still generating outbound traffic for it.

At a minimum, periodic server-side messages should not keep the connection alive forever in the absence of incoming client activity (e.g. LiveView heartbeats).
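One OS-level way to approximate this expectation is TCP keepalive, which probes the peer independently of application traffic. A hedged sketch of what that might look like for a Phoenix endpoint running Bandit (option names follow the Bandit / Thousand Island docs; `:my_app` / `MyAppWeb.Endpoint` are placeholders, and this should be verified against the versions in use):

```elixir
config :my_app, MyAppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  http: [
    port: 4000,
    # Passed through to Thousand Island and ultimately to :gen_tcp.
    # keepalive: true asks the kernel to probe idle peers, so a silent
    # half-open connection eventually surfaces as a socket error.
    thousand_island_options: [
      transport_options: [keepalive: true]
    ]
  ]
```

Note that the kernel's default keepalive probe schedule is long (often on the order of two hours on Linux unless tuned system-wide), so this alone would not reap connections quickly.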

Actual behavior

If the server never receives FIN/RST, and the app continues to generate server-side websocket traffic for that connection, the websocket process remains alive indefinitely.

In asciinema server this leaves transport processes alive, and the LiveViews attached to those transports remain alive as well.

Reproduction

I reproduced this on two machines on the same LAN:

  • Linux workstation running asciinema server with Bandit
  • MacBook running the browser client

The page under test mounts connected LiveViews that receive periodic updates every 5 seconds.

Steps

  1. Start a Phoenix app using Bandit.
  2. Open a page in the browser that mounts connected LiveViews and causes periodic server pushes over the websocket.
  3. Confirm on the server that the websocket and LiveView processes are present (I used Phoenix LiveDashboard).
  4. On the server, silently blackhole packets to/from the client IP for the websocket port, for example:

     sudo iptables -I INPUT  -s <client-ip> -p tcp --dport 4000 -j DROP
     sudo iptables -I OUTPUT -d <client-ip> -p tcp --sport 4000 -j DROP

This is intended to simulate a silent half-open connection where the server never sees FIN/RST.

  5. After the drop is active, close the browser tab on the client.
  6. Observe the server-side websocket / LiveView processes.
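After the test, the blackhole rules can be removed with matching `-D` deletes (same rule specs as the `-I` inserts above; `<client-ip>` is a placeholder as before):

sudo iptables -D INPUT  -s <client-ip> -p tcp --dport 4000 -j DROP
sudo iptables -D OUTPUT -d <client-ip> -p tcp --sport 4000 -j DROP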

Result

After the packet blackhole is in place and the browser tab is closed on the client side, the websocket transport process on the server remains alive instead of being reaped. The LiveViews attached to that transport also remain alive. In this setup, periodic server-side updates continue to be generated, and the connection does not time out or terminate on its own.

In production, this shows up as a gradually increasing number of stale websocket-connected LiveViews on each node. As those accumulate, CPU usage rises together with memory usage until the node eventually hits its limits and the container is restarted, often due to OOM.

Control case

If I close the page normally without blackholing packets, cleanup happens quickly as expected.
