Fix 500ms scouting delay regression introduced in 1.7.2 #2493

Merged
OlivierHecart merged 2 commits into eclipse-zenoh:main from JafarAbdi:fix/scouting-delay-regression
Mar 20, 2026

@JafarAbdi (Contributor) commented Mar 16, 2026

Description

While upgrading zenoh from 1.7.1 to 1.7.2 for a project I'm working on, I noticed that zenoh.open() takes ~500ms longer. I traced the delay to the gossip refactor in #2347.

What does this PR do?

Move add_peer_connector_zid inside the spawned task, after the transport existence check.
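
The effect of the reordering can be illustrated with a small toy model. This is a sketch in Python, not zenoh's Rust code: `ToyRuntime`, `on_gossip`, `_connect_task`, `leftover_connectors`, and the peer names and locator strings are all hypothetical. Only `add_peer_connector_zid` and `terminate_peer_connector_zid` are names taken from this PR. The model assumes that a connect task is spawned only for entries that carry locators and that the task returns early when a transport already exists.

```python
class ToyRuntime:
    """Toy stand-in for the runtime state; not zenoh's actual implementation."""

    def __init__(self, connected):
        self.transports = set(connected)  # peers that already have a transport
        self.pending_connectors = set()   # entries the start condition waits on

    def add_peer_connector_zid(self, zid):
        self.pending_connectors.add(zid)

    def terminate_peer_connector_zid(self, zid):
        self.pending_connectors.discard(zid)

    def _connect_task(self, zid, register_inside):
        if zid in self.transports:
            return  # short-circuit: the terminate call below is never reached
        if register_inside:
            # After the fix: register only once we know we will really connect.
            self.add_peer_connector_zid(zid)
        self.transports.add(zid)  # stand-in for the real connection attempt
        self.terminate_peer_connector_zid(zid)

    def on_gossip(self, zid, locators, fixed):
        if not fixed:
            # Before the fix: registered unconditionally, outside the task.
            self.add_peer_connector_zid(zid)
        if locators:  # assumption: no task is spawned without locators
            self._connect_task(zid, register_inside=fixed)


def leftover_connectors(fixed):
    """Return the connector entries that were registered but never terminated."""
    runtime = ToyRuntime(connected={"scouted-peer"})
    gossip = {
        "scouted-peer": ["tcp/192.0.2.2:7447"],  # already connected via scouting
        "self": [],                              # local node echoed back, no locators
        "new-peer": ["tcp/192.0.2.3:7447"],      # genuinely new peer
    }
    for zid, locators in gossip.items():
        runtime.on_gossip(zid, locators, fixed)
    return runtime.pending_connectors
```

In this model, `leftover_connectors(fixed=False)` leaves the already-connected peer and the local node registered, while `leftover_connectors(fixed=True)` leaves nothing behind.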

Why is this change needed?

Related Issues

Here is a Python script that reproduces the issue:

#!/usr/bin/env python3
# /// script
# dependencies = [
#   "eclipse-zenoh==1.7.2",
# ]
# ///
"""Reproduce: zenoh.open() takes ~500ms extra in peer mode due to gossip regression.

Introduced in 1.7.2 by commit 39a440574 ("Fix bugs in gossip (#2347)").
The gossip handler's link_states() adds peer connectors for already-connected
scouted peers that are never terminated, blocking start_conditions.notified()
for the full scouting delay (500ms).

Using eclipse-zenoh==1.7.1 does not have the delay.

Run with:
    uv run reproduce_scouting_delay.py
"""

import multiprocessing
import time

import zenoh


def peer_process():
    """Long-running peer that declares a liveliness token."""
    zenoh.init_log_from_env_or("error")
    config = zenoh.Config.from_json5('{ mode: "peer" }')
    session = zenoh.open(config)
    token = session.liveliness().declare_token("test/alive")

    try:
        while True:
            time.sleep(0.1)
    except KeyboardInterrupt:
        pass
    finally:
        token.undeclare()
        session.close()


def main():
    peer = multiprocessing.Process(target=peer_process, daemon=True)
    peer.start()
    time.sleep(2)  # Wait for peer to be discoverable

    zenoh.init_log_from_env_or("error")
    config = zenoh.Config.from_json5('{ mode: "peer" }')

    start = time.monotonic()
    session = zenoh.open(config)
    elapsed_ms = (time.monotonic() - start) * 1000

    # Verify peer was discovered
    replies = list(session.liveliness().get("test/**", timeout=1.0))
    found = any(r.ok is not None for r in replies)

    print(f"zenoh.open() took {elapsed_ms:.0f}ms (peer found: {found})")
    if elapsed_ms > 100:
        print(f"BUG: expected <100ms, got {elapsed_ms:.0f}ms — scouting delay regression")
    else:
        print("OK: no regression")

    session.close()
    peer.terminate()
    peer.join(timeout=5)


if __name__ == "__main__":
    main()
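
A single timing sample can be noisy, so when comparing 1.7.1 against 1.7.2 it may help to take the median of a few runs. The helper below is a minimal sketch under that assumption; `median_open_ms` is a hypothetical name, and `open_fn` is any zero-argument callable that returns an object with a `close()` method.

```python
import statistics
import time


def median_open_ms(open_fn, repeats=5):
    """Call open_fn() several times and return the median duration in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.monotonic()
        session = open_fn()
        samples.append((time.monotonic() - start) * 1000)
        session.close()  # close outside the timed region
    return statistics.median(samples)
```

With the script above, this could be called as `median_open_ms(lambda: zenoh.open(config))` while the peer process is running.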

🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: bug

🐛 Bug Fix Requirements

Since this PR is labeled as a bug fix, please ensure:

  • Root cause documented - Explain what caused the bug in the PR description
  • Reproduction test added - Test that fails on main branch without the fix
  • Test passes with fix - The reproduction test passes with your changes
  • Regression prevention - Test will catch if this bug reoccurs in the future
  • Fix is minimal - Changes are focused only on fixing the bug
  • Related bugs checked - Verified no similar bugs exist in related code

Why this matters: Bugs without tests often reoccur.

Instructions:

  1. Check off items as you complete them (change - [ ] to - [x])
  2. The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

The gossip refactor in 39a4405 ("Fix bugs in gossip (eclipse-zenoh#2347)") introduced
a 500ms delay in zenoh.open() by calling add_peer_connector_zid for
already-connected peers, creating unterminated entries that block
start_conditions.notified() for the full scouting delay.

add_peer_connector_zid was called unconditionally for all
autoconnectable nodes in gossip link_states, including peers
already connected via scouting. The spawned connect task
short-circuits when the transport already exists, skipping
terminate_peer_connector_zid. Entries without locators (such
as the local node echoed back in gossip) never spawn a task
at all. Both cases leave unterminated connectors that block
start_conditions.notified() for the full 500ms scouting delay.
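
Why a single unterminated connector costs the whole scouting delay can be sketched with a toy asyncio model. This is an illustration only, not zenoh's implementation: `StartConditions` here is a hypothetical Python class, and it assumes the wait completes either when every registered connector has been terminated or when the scouting delay elapses, whichever comes first.

```python
import asyncio
import time


class StartConditions:
    """Toy model: tracks pending connectors and signals when none remain."""

    def __init__(self):
        self._pending = set()
        self._all_done = asyncio.Event()

    def add(self, zid):
        self._pending.add(zid)
        self._all_done.clear()

    def terminate(self, zid):
        self._pending.discard(zid)
        if not self._pending:
            self._all_done.set()

    async def notified(self, timeout):
        """Return True if all connectors terminated, False if the timeout hit."""
        if not self._pending:
            return True
        try:
            await asyncio.wait_for(self._all_done.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def simulate_open(terminate_all, scouting_delay=0.5):
    """Return how long the simulated open waits, in seconds."""
    conditions = StartConditions()  # created inside the running event loop
    conditions.add("scouted-peer")
    if terminate_all:
        conditions.terminate("scouted-peer")
    start = time.monotonic()
    await conditions.notified(scouting_delay)
    return time.monotonic() - start
```

Running `asyncio.run(simulate_open(terminate_all=False))` waits for roughly the full delay, whereas `asyncio.run(simulate_open(terminate_all=True))` returns almost immediately, which mirrors the difference reported between 1.7.2 and 1.7.1.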

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.50%. Comparing base (94123e0) to head (e324a56).
⚠️ Report is 13 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zenoh/src/net/protocol/gossip.rs | 50.00% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2493      +/-   ##
==========================================
- Coverage   72.58%   72.50%   -0.08%     
==========================================
  Files         390      390              
  Lines       63366    63358       -8     
==========================================
- Hits        45992    45939      -53     
- Misses      17374    17419      +45     


@OlivierHecart added the bug ("Something isn't working") label Mar 20, 2026
@OlivierHecart merged commit f27282e into eclipse-zenoh:main Mar 20, 2026
33 of 38 checks passed
@JafarAbdi deleted the fix/scouting-delay-regression branch March 20, 2026 12:35