[lldp] Replace per-port portidsubtype loop with global ifname directive#26873
[lldp] Replace per-port portidsubtype loop with global ifname directive#26873ZhaohuiS wants to merge 2 commits intosonic-net:masterfrom
Conversation
PR sonic-net#26144 added per-port portidsubtype configuration in lldpd.conf.j2 to prevent transient MAC address as Port ID during lldpd startup. However, on high-port-count platforms (e.g. Mellanox SN5640 with 512 ports), this generates 512 config lines that take ~11 seconds to process, causing: 1. LLDP blackout during startup (no frames sent for 11+ seconds) 2. Neighbor table instability with continuous add/delete churn 3. Cascading force-repopulate storms in lldp_syncd Replace the O(N) per-port loop with a single O(1) global directive: configure lldp portidsubtype ifname This sets the default Port ID subtype to Interface Name (subtype 5), so the first LLDP frame uses the Linux interface name (e.g. Ethernet0) instead of MAC address. lldpmgrd still runs later to configure the final portidsubtype to local+alias per port. Verified on Mellanox SN5640 (512 ports, t0-isolated-d256u256s2 topology): - Config lines: 519 -> 6 (removed 513 per-port lines) - Convergence: never stable -> 259/259 in ~30 seconds, rock solid - Neighbor deletes: 74+ -> 0 - First frame Port ID: Subtype Interface Name (5): Ethernet0 (not MAC) Fixes sonic-net#26568 Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
This PR optimizes LLDP daemon startup configuration in SONiC’s docker-lldp container by replacing a per-port portidsubtype configuration loop with a single global portidsubtype ifname directive to avoid startup-time scaling issues on high-port-count platforms.
Changes:
- Removed the Jinja2 loop that emitted
configure ports <port> lldp portidsubtype local ...for every front-panel port. - Added a single global
configure lldp portidsubtype ifnamedirective to prevent MAC-based Port IDs in the first LLDP frames while keeping config load O(1).
| neighbor churn and cascading force-repopulate storms in lldp_syncd. | ||
| The single global directive achieves the same goal (no MAC-as-Port-ID) | ||
| with O(1) config processing time. lldpmgrd still runs later to set the | ||
| final portidsubtype to local+alias per port. #} | ||
| configure lldp portidsubtype ifname |
There was a problem hiding this comment.
This template change will break sonic-config-engine pytest golden-file comparisons: test_j2files.py renders dockers/docker-lldp/lldpd.conf.j2 and compares it to the sample_output lldpd-*.conf files, which currently expect only per-port configure ports ... portidsubtype local ... lines and do not include the new global configure lldp portidsubtype ifname directive. Please update the corresponding golden outputs (py2 + py3) via a sonic-config-engine submodule bump (preferred) or otherwise keep the unit tests in sync with the new rendered output.
99f60e3 to
295c408
Compare
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
295c408 to
005fdaa
Compare
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Move the detailed explanation of the portidsubtype change from the Jinja2 comment to the PR description. Keep only a one-line comment in the code. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
51c024a to
aa3506d
Compare
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Description of PR
Summary:
Replace the per-port portidsubtype configuration loop from PR #26144 with a single global
configure lldp portidsubtype ifnamedirective.Fixes #26568
Background: Why portidsubtype matters
Without any portidsubtype configuration, lldpd uses MAC address as the default Port ID. When lldpd auto-resumes after config processing, the first LLDP frame carries MAC-based Port IDs. Peers then see a transient MSAP change when
lldpmgrdlater reconfigures portidsubtype tolocal(alias), causing unnecessary neighbor flaps.PR #26144 solved this by adding a Jinja2 loop generating per-port
configure ports <name> lldp portidsubtype local <alias>lines inlldpd.conf. This worked fine on low-port-count platforms.The problem on high-port-count platforms
On platforms with many ports (e.g., Mellanox SN5640 with 512 ports), PR #26144 generates 512+ config lines. Processing these lines blocks lldpd for 11+ seconds at startup. During this blackout:
The fix: global
portidsubtype ifnameThe single global directive
configure lldp portidsubtype ifnametells lldpd to use the Linux interface name (e.g.,Ethernet0) as Port ID for all ports, instead of MAC address. This achieves the same goal as the per-port loop (no MAC-as-Port-ID) with O(1) config processing time.lldpmgrdstill runs later to set the finalportidsubtypetolocal+ alias (e.g.,etp1a) per port — this is the existing designed behavior and works correctly.Key trade-off: The first LLDP frame after restart uses the interface name (
Ethernet0) instead of the alias (etp9a). When lldpmgrd starts and reconfigures, there is a one-time MSAP transition per port. This is acceptable because:Type of change
Back port request
Approach
What is the motivation for this PR?
On high-port-count platforms (SN5640, 512 ports), the per-port portidsubtype loop from PR #26144 blocks lldpd startup for 11+ seconds, causing LLDP neighbor churn and cascading instability that never converges.
How did you do it?
Replaced the per-port Jinja2 loop (N lines) with a single global directive (1 line):
How did you verify/test it?
A/B tested on Mellanox SN5640 (str5-sn5640-6, topology t0-isolated-d256u256s2, 259 VM neighbors):
The global ifname approach was tested across multiple restart cycles with zero neighbor drops.
Any platform specific information?
Primarily affects high-port-count platforms (SN5640 with 512 ports). Lower port count platforms were not affected by the original issue but benefit from the simpler config.
Supported testbed topology if it's a new test case?
N/A (not a new test case)
Documentation
N/A