Skip to content

Call deadlocks in HPB mode when local & remote session IDs are compared across namespaces #6169

@alauzon

Description

@alauzon

Bug summary

In PeerConnectionWrapper, the offer/offeree role is decided by a lexicographic
string comparison between the remote session ID (received over the signaling
channel) and the local session ID. In HPB (external signaling) mode these two
IDs are not in the same namespace, which makes the comparison meaningless and
can cause both peers to compute "I am NOT the offerer" simultaneously. No offer
is ever sent, no answer is ever produced, and the call hangs silently.

Steps to reproduce

  1. Configure a Nextcloud instance with the standalone signaling server (HPB).
  2. Start a P2P call (no MCU) between two Talk Android clients.
  3. Repeat until both peers happen to compute the same direction for compareTo
    (depends on the random session IDs generated by NC and HPB on each join).

The deadlock is intermittent because session IDs are random — but the
underlying logic guarantees a non-zero failure rate per call.

Expected behaviour

For every successful peer pairing, exactly one side becomes the offerer and
sends a WebRTC offer; the other side answers; media flows.

Actual behaviour

In HPB mode, both peers can simultaneously decide they are the offeree. No
offer is sent. The call stays in "connecting" state until the user gives up.

Versions

  • Talk Android app: master — bug present in this code path since commit
    b98bb93e4
    (2017-11-29). All released versions are affected.
  • Nextcloud server: 33.0.3 (also affects all earlier versions where HPB is
    supported).
  • Talk (spreed) server app: 24.0.0-dev.3.
  • Standalone signaling server (HPB): v2.1.1.
  • Custom Signaling server configured: Yes.
  • Custom TURN server configured: Yes.
  • Custom STUN server configured: Yes.

Code references

PeerConnectionWrapper.java decides offerer/offeree at construction:

https://github.com/nextcloud/talk-android/blob/master/app/src/main/java/com/nextcloud/talk/webrtc/PeerConnectionWrapper.java#L119

boolean hasInitiated = sessionId.compareTo(localSession) < 0;

The localSession argument comes from CallActivity.createPeerConnectionWrapper:

https://github.com/nextcloud/talk-android/blob/master/app/src/main/java/com/nextcloud/talk/activities/CallActivity.kt#L2390

return PeerConnectionWrapper(
    ...
    sessionId,
    callSession,    // ← Nextcloud session ID, even in HPB mode
    ...
)

callSession is set from ApplicationWideCurrentRoomHolder.session
(NC session ID, ~16 char hex token):

https://github.com/nextcloud/talk-android/blob/master/app/src/main/java/com/nextcloud/talk/activities/CallActivity.kt#L1534

The remote sessionId parameter, however, is the HPB session ID in HPB
mode. It is set from the WebSocket envelope senderWebSocketMessage.sessionId:

https://github.com/nextcloud/talk-android/blob/master/app/src/main/java/com/nextcloud/talk/webrtc/WebSocketInstance.kt#L184

ncSignalingMessage.from = callWebSocketMessage.senderWebSocketMessage!!.sessionId

The HPB session ID is a 58–70 char base64url string set in the
hello response:

https://github.com/nextcloud/talk-android/blob/master/app/src/main/java/com/nextcloud/talk/webrtc/WebSocketInstance.kt#L324

Why the comparison can deadlock

In standard internal signaling mode, both sessionId and localSession are
NC session IDs (same namespace, same length, same character set), so
compareTo defines a total order — exactly one side computes < 0.

In HPB mode, sessionId is the HPB session ID (base64url, 58–70 chars) and
localSession is still the NC session ID (~16 char hex). The two strings live
in different namespaces with different alphabets and lengths. Lexicographic
ordering between them carries no semantic meaning. There exist pairs of IDs
where:

  • Peer A computes remote_HPB.compareTo(local_NC_A) >= 0 ("A is not initiator")
  • Peer B computes remote_HPB.compareTo(local_NC_B) >= 0 ("B is not initiator")

Both peers wait for an offer that never arrives. The fix processParticipantList
in #6087 prevents the call from being torn down on the partial-update path,
but it does not unblock the offer/answer deadlock — that is a separate symptom
of the same underlying contract violation.

Suggested fix

Pass the HPB session ID (i.e. webSocketClient.sessionId) instead of
callSession to PeerConnectionWrapper when external signaling is enabled.
Both sides then compare in the same namespace.

A more defensive approach would be to rename the parameter to
localSignalingSessionId and document that it MUST be in the same namespace
as the remote sessionId.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions