Fix 500ms scouting delay regression introduced in 1.7.2 by JafarAbdi · Pull Request #2493 · eclipse-zenoh/zenoh

JafarAbdi · 2026-03-16T10:02:49Z

Description

While upgrading a project from zenoh 1.7.1 to 1.7.2, I noticed that zenoh.open() takes ~500ms longer. I traced it to the gossip refactor in #2347.

What does this PR do?

Move add_peer_connector_zid inside the spawned task, after the transport existence check, so that a peer connector is registered only when a connection attempt actually runs.

Why is this change needed?

Related Issues

Here is a Python script that reproduces the issue:

#!/usr/bin/env python3
# /// script
# dependencies = [
#   "eclipse-zenoh==1.7.2",
# ]
# ///
"""Reproduce: zenoh.open() takes ~500ms extra in peer mode due to gossip regression.

Introduced in 1.7.2 by commit 39a440574 ("Fix bugs in gossip (#2347)").
The gossip handler's link_states() adds peer connectors for already-connected
scouted peers that are never terminated, blocking start_conditions.notified()
for the full scouting delay (500ms).

Using eclipse-zenoh==1.7.1 does not have the delay.

Run with:
    uv run reproduce_scouting_delay.py
"""

import multiprocessing
import time

import zenoh


def peer_process():
    """Long-running peer that declares a liveliness token."""
    zenoh.init_log_from_env_or("error")
    config = zenoh.Config.from_json5('{ mode: "peer" }')
    session = zenoh.open(config)
    token = session.liveliness().declare_token("test/alive")

    try:
        while True:
            time.sleep(0.1)
    except KeyboardInterrupt:
        pass
    finally:
        token.undeclare()
        session.close()


def main():
    peer = multiprocessing.Process(target=peer_process, daemon=True)
    peer.start()
    time.sleep(2)  # Wait for peer to be discoverable

    zenoh.init_log_from_env_or("error")
    config = zenoh.Config.from_json5('{ mode: "peer" }')

    start = time.monotonic()
    session = zenoh.open(config)
    elapsed_ms = (time.monotonic() - start) * 1000

    # Verify peer was discovered
    replies = list(session.liveliness().get("test/**", timeout=1.0))
    found = any(r.ok is not None for r in replies)

    print(f"zenoh.open() took {elapsed_ms:.0f}ms (peer found: {found})")
    if elapsed_ms > 100:
        print(f"BUG: expected <100ms, got {elapsed_ms:.0f}ms — scouting delay regression")
    else:
        print("OK: no regression")

    session.close()
    peer.terminate()
    peer.join(timeout=5)


if __name__ == "__main__":
    main()

🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: bug

🐛 Bug Fix Requirements

Since this PR is labeled as a bug fix, please ensure:

Root cause documented - Explain what caused the bug in the PR description
Reproduction test added - Test that fails on main branch without the fix
Test passes with fix - The reproduction test passes with your changes
Regression prevention - Test will catch if this bug reoccurs in the future
Fix is minimal - Changes are focused only on fixing the bug
Related bugs checked - Verified no similar bugs exist in related code

Why this matters: Bugs without tests often reoccur.

Instructions:

Check off items as you complete them (change - [ ] to - [x])
The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

The gossip refactor in 39a4405 ("Fix bugs in gossip (eclipse-zenoh#2347)") introduced a 500ms delay in zenoh.open() by calling add_peer_connector_zid for already-connected peers, creating unterminated entries that block start_conditions.notified() for the full scouting delay.

add_peer_connector_zid was called unconditionally for all autoconnectable nodes in gossip link_states, including peers already connected via scouting. The spawned connect task short-circuits when the transport already exists, skipping terminate_peer_connector_zid. Entries without locators (such as the local node echoed back in gossip) never spawn a task at all. Both cases leave unterminated connectors that block start_conditions.notified() for the full 500ms scouting delay.
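The mechanism described above can be illustrated with a small asyncio model. This is only a sketch, not zenoh's code: the real logic is Rust in zenoh/src/net/protocol/gossip.rs, and the StartConditions class, its Event-based implementation, the node names, the locator string, and the simulated_open helper are all assumptions made for illustration. Only the names add_peer_connector_zid, terminate_peer_connector_zid, and notified come from the description; their Python signatures here are invented.

```python
import asyncio
import time


class StartConditions:
    """Hypothetical stand-in for the pending-connector bookkeeping."""

    def __init__(self):
        self._pending = set()
        self._idle = asyncio.Event()
        self._idle.set()  # nothing is pending yet

    def add_peer_connector_zid(self, zid):
        self._pending.add(zid)
        self._idle.clear()

    def terminate_peer_connector_zid(self, zid):
        self._pending.discard(zid)
        if not self._pending:
            self._idle.set()

    async def notified(self):
        # Resolves only once every registered connector has been terminated.
        await self._idle.wait()


async def connect_task(cond, zid, connected, register_inside):
    if zid in connected:
        return  # transport already exists: short-circuit, terminate is skipped
    if register_inside:
        cond.add_peer_connector_zid(zid)  # fixed ordering: register after the check
    await asyncio.sleep(0)  # placeholder for the real connection attempt
    connected.add(zid)
    cond.terminate_peer_connector_zid(zid)


async def handle_link_states(cond, nodes, connected, fixed):
    tasks = []
    for zid, locators in nodes:
        if not fixed:
            cond.add_peer_connector_zid(zid)  # regressed ordering: unconditional
        if locators:  # entries without locators never spawn a task
            tasks.append(
                asyncio.create_task(connect_task(cond, zid, connected, fixed))
            )
    # The model awaits the tasks to stay deterministic; this is a simplification.
    await asyncio.gather(*tasks)


async def simulated_open(fixed, scouting_delay=0.5):
    cond = StartConditions()
    connected = {"peer-a"}  # assumed: already connected through scouting
    nodes = [("peer-a", ["tcp/127.0.0.1:7447"]), ("self", [])]
    start = time.monotonic()
    await handle_link_states(cond, nodes, connected, fixed)
    try:
        await asyncio.wait_for(cond.notified(), timeout=scouting_delay)
    except asyncio.TimeoutError:
        pass  # gave up after the full scouting delay
    return time.monotonic() - start


if __name__ == "__main__":
    for fixed in (False, True):
        elapsed_ms = asyncio.run(simulated_open(fixed)) * 1000
        print(f"fixed={fixed}: simulated open took {elapsed_ms:.0f}ms")
```

With the regressed ordering, both the already-connected peer and the locator-less local entry stay registered, so the wait runs until the timeout. With the fixed ordering neither is registered and the wait returns right away.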

codecov · 2026-03-17T12:51:53Z

Codecov Report

❌ Patch coverage is 50.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.50%. Comparing base (94123e0) to head (e324a56).
⚠️ Report is 13 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
zenoh/src/net/protocol/gossip.rs	50.00%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2493      +/-   ##
==========================================
- Coverage   72.58%   72.50%   -0.08%     
==========================================
  Files         390      390              
  Lines       63366    63358       -8     
==========================================
- Hits        45992    45939      -53     
- Misses      17374    17419      +45

JafarAbdi added 2 commits March 15, 2026 21:53

OlivierHecart approved these changes Mar 20, 2026

OlivierHecart added the bug Something isn't working label Mar 20, 2026

OlivierHecart merged commit f27282e into eclipse-zenoh:main Mar 20, 2026
33 of 38 checks passed

JafarAbdi deleted the fix/scouting-delay-regression branch March 20, 2026 12:35
