netsync: Separate mix message request. by davecgh · Pull Request #3499 · decred/dcrd

davecgh · 2025-08-08T04:55:21Z

This requires #3498.

The current logic for requesting mix messages from peers was added to the generic request from peer paths because the old async model required a lot of additional plumbing.

Now that the sync manager has been converted to a synchronous model, no additional plumbing is needed and therefore it is much simpler and more efficient to implement the request logic for mix messages independently.

Also, only a single mix message for a missing pair request is ever needed at a time, so the additional overhead of taking a slice is not needed.

Consequently, this separates the logic for requesting a mix message from a peer from the generic request from peer path into its own specialized method that is much more efficient.

Finally, it also changes the logic for determining if the mix message should be requested to make use of the same method as other paths that handle requesting mix messages. In particular, this means it will no longer attempt to request recently rejected or removed messages.

internal/netsync/manager.go

This is the first in a series of commits that will ultimately convert all of the code related to the sync manager handling events from remote peers to synchronous code that makes use of separate mutexes and atomics to protect concurrent access. The motivation for the change in architecture follows. Currently, all operations performed by the sync manager are implemented via a single event handler goroutine and a single message channel to protect concurrent access. The existing implementation has worked well for many years since syncing currently only happens via a single sync peer. However, with the goal of moving to a syncing paradigm where there is no sync peer and data is intelligently downloaded from multiple peers in parallel, the current implementation is not ideal in a few ways: - It is difficult to improve parallel processing since everything is forced into a single goroutine - It requires additional plumbing to provide access to any related information - The use of channels significantly hinders dynamically modifying requests based on state which is necessary for efficiently syncing from multiple peers Converting all the relevant code to synchronous code addresses the aforementioned concerns and has some additional benefits. For example, the new model: - Ultimately allows more code to run in parallel in the individual peer goroutines - Requires less plumbing for handling events - Makes the state available to calling code so it can make better decisions - Blocks further reads until the messages are processed by default - Provides better overall throughput One notable aspect of the overall change to the architecture is that the peer read goroutines themselves will now block while processing the messages by default as opposed to the current model where the messages are queued up to be handled asynchronously which then allows the peers to immediately read more data. This existing async queuing approach requires care to prevent malicious behavior as is evident by several of the handlers (for example, blocks, transactions, and mix messages) currently manually implementing blocking behavior by using channels that are notified once the events are eventually processed by the async handler. Moreover, since all of the events from all peers all currently go through the single sync manager goroutine, the aforementioned messages often end up blocking not only until the message itself is processed, but also until all of the other messages from all of the other peers that arrived first are processed. The existing model also results in further induced delays because most of the messages result in requests to the peer in response which obviously can't be sent until the message is processed. So, while at first glance the lack of read pipelining by default might raise some potential questions about extra latency, in practice, the new model will typically result in faster overall processing despite that. Further, it permits much more efficient per-message targeted pipelining if any major bottlenecks were to arise in new messages. With all of that in mind, this commit starts the conversion by introducing a separate mutex to protect the peers map and updates the code to protect all references to the map with that mutex. This will allow future commits to methodically refactor the various operations without introducing races.

Currently, the logic that transitions from the initial header sync to the initial chain sync is quite buried in a conditional in the header handling routine which makes it somewhat challenging to find and follow for those not familiar with the code. It also makes it more difficult to update, particularly if more than a single sync peer for the initial header sync is implemented, since it isn't concurrent safe and other code around it relies on local state it updates. This aims to improve the aforementioned by separating the handling for when the manager determines the initial header sync process completes into a separate method named onInitialHeaderSyncDone. The main benefits are: - It is much easier to find and follow the overall sequence of events that take place during the initial sync process - It is more consistent the existing onInitialChainSyncDone method that is invoked when the initial chain sync process completes - It will allow future commits to more easily add additional handling and make it concurrent safe This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

A peer that serves data is not necessarily a sync peer candidate, but serving data is a necessary condition for being a sync peer candidate. This helps make that distinction clear by adding a new flag to the Peer struct that specifies whether or not the peer serves data and updates a few places in the code that only rely on serving data to use the new flag instead of overloading the meaning of the sync candidate flag. It also has the benefit of being concurrent safe, unlike the sync candidate flag, since whether or not a peer serves data can't change while sync candidacy can. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This simplifies sync peer candidacy determination and makes it concurrent safe by removing the sync candidate flag from the Peer struct and instead basing candidacy on the new immutable serves data flag and imposing the other necessary conditions when choosing the best sync peer. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This modifies the sync height to use an atomic.Int64 instead of a mutex to protect concurrent access and introduces a new method that atomically and conditionally sets it to a higher value. This approach makes it both harder to misuse and quite a bit faster due to using non-blocking optimistic locking. It also adds a test to ensure the new method works as expected including under concurrent access. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This modifies the isCurrent flag to use an atomic.Bool instead of a mutex to protect concurrent access. This approach makes it both harder to misuse and quite a bit faster due to using non-blocking optimistic locking. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This changes the headersSynced flag to an atomic.Bool so it is safe to access concurrently. It also introduces a new method on the header sync state named InitialHeaderSyncDone to simplify and improve the readability of the various places that check its state. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This makes access to the sync peer independently concurrent safe by protecting it with a mutex and updating all of the call sites appropriately. It also ensures the methods related to starting the various chain sync steps from the sync peer are protected by the mutex since their overall logic relies on its value to determine when they need to be invoked. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This makes all accesses to the request maps independently concurrent safe by protecting them with a mutex and updating all of the call sites accordingly. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This makes all logic related to requesting the next needed blocks from the sync peer independently concurrent safe by protecting it with a mutex and updating all of the call sites accordingly. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This modifies the stall timeout logic to make it concurrent safe by slightly loosening the ordering requirements as it relates to stopping and resetting the stall timeout. The current implementation was careful to prevent stale time values from being delivered on the timer's channel, however, there really is no reason to worry about that possibility because: 1) The stall timeout logic only depends on the event as opposed to the time values 2) The timeout is sufficiently long that it is almost never hit in practice 3) It is even more rare that the timer would just happen to be in the process of firing (sending to the channel) when data simultaneously arrives and resets the timer just before it actually fires 4) Even if all of that were to happen, the only effect is the standard stall timeout handling logic which means there are no adverse effects This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to adding a new valid peer to the sync manager out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to handling valid peers disconnecting from the sync manager out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This changes the numConsecutiveOrphanHeaders counter to an atomic.Int64 so it is independently safe for concurrent access. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This makes all logic related to tracking the best announced block by each peer independently concurrent safe by protecting the related fields with a mutex and updating all of the call sites accordingly. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This splits the logic for starting the initial header sync and initial chain sync processes into separate methods and also improves the warning when there are no sync peers to be concurrent safe as well as to prevent multiple unnecessary warnings. The primary motivation for this change is to make it possible to independently make the state transitions between the modes concurrent safe in upcoming future commits. Looking even further ahead, it partially paves the way to eventually removing the entire notion of a chain sync peer (but not header sync peer) entirely. However, it also improves readability and has the nice minor benefit of slightly improved efficiency when peers connect and disconnect since it allows the logic that does not apply during the initial headers sync to more easily be skipped without needing to check the flag multiple times.

The current approach to logging during the initial headers sync and transitioning from the initial headers sync mode to the initial chain sync mode relies on the fact the entire sequence is mutually exclusive, as is currently the case, but it would result in logic races if parallel header downloading is ever implemented during the headers sync process. There are also no immediate plans to implement parallel header downloading, in part because doing so would require an entirely new and more capable wire message to be implemented first, but also because it's already such a tiny fraction of the overall sync time and headers don't grow at a very fast rate, so it will be a very long time before any bottleneck there would become an issue. However, it is worthwhile to address it now even though it's not technically required yet since it might not be super obvious later. Consequently, this reworks the logic related to transitioning to the initial headers sync done state slightly to be independently concurrent safe by adding a mutex to the header sync state and ensuring the flag that tracks its completion, which is also separately an atomic for performance, is re-checked and updated under that mutex to ensure the transition is idempotent. It also gates the header sync progress logging behind the initial header sync mode flag since it is technically more accurate even though it does not really matter so long as there is only a single header sync peer. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to handling header messages from peers out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

The current approach to detecting when the initial chain sync process is done is based on determining if the chain wasn't current prior to processing the block and becomes current after. That logic is sound so long as the entire sequence is mutually exclusive, as is currently the case, but it would result in logic races when processing multiple blocks concurrently. Consequently, this reworks the logic related to transitioning to the initial chain sync done state slightly to be independently concurrent safe by introducing a new flag and mutex to track its state and then transitioning based on the flag and status of whether the chain believes it is current. It also squashes a very minor long-standing logging bug where the transition out of the initial chain sync mode would sometimes not log the last batch of processed blocks for either an extended period or not at all. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to handling block messages from peers out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to handling inv messages from peers out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to handling notfound messages from peers out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to handling transaction messages from peers out of the event handler since it is now independently concurrent safe. It also slightly modifies the behavior to return the slice of accepted transactions to the caller so it can announce them as opposed to relying on a callback to do the announcements. The announcement callback is removed from the peer notifier interface since it is no longer needed. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to handling mix messages from peers out of the event handler since it is now independently concurrent safe. It also slightly modifies the behavior to return the slice of accepted messages to the caller so it can announce them as opposed to relying on a callback to do the announcements. The entire peer notifier interface and containing file is removed since it is no longer needed. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to requesting data from peers out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to processing blocks from other sources than the network out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This refactors the logic related to returning the current sync peer ID out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This removes the event handler message channel since it is no longer used. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This renames the event handler goroutine to stallHandler to more accurately reflect its only remaining purpose after all of the recent changes to move everything else out of the event handler. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

This simplifies the request tracking for blocks, transactions, and mix messages by making use of a single global map for each type that tracks the peer instead of a global map plus an additional per-peer map for each type. Note that this change does make it very slightly more costly in terms of CPU usage to determine the in-flight requests for a specific peer, but, in practice, the new approach ends up being slightly faster overall despite that due to the combination of less map management performed under a mutex as well as the infrequent need to determine the in-flight requests for a specific peer. It also slightly reduces memory usage due to the removal of 3 maps per connected peer which totals to between 24 and 375 maps with typical default connection settings.

The current logic for requesting mix messages from peers was added to the generic request from peer paths because the old async model required a lot of additional plumbing. Now that the sync manager has been converted to a synchronous model, no additional plumbing is needed and therefore it is much simpler and more efficient to implement the request logic for mix messages independently. Also, only a single mix message for a missing pair request is ever needed at a time, so the additional overhead of taking a slice is not needed. Consequently, this separates the logic for requesting a mix message from a peer from the generic request from peer path into its own specialized method that is much more efficient. Finally, it also changes the logic for determining if the mix message should be requested to make use of the same method as other paths that handle requesting mix messages. In particular, this means it will no longer attempt to request recently rejected or removed messages.

davecgh added this to the 2.1.0 milestone Aug 8, 2025

davecgh force-pushed the netsync_separate_mix_msg_request branch 2 times, most recently from 1a4256f to 34b5315 Compare August 8, 2025 05:10

davecgh commented Aug 8, 2025

View reviewed changes

internal/netsync/manager.go Outdated Show resolved Hide resolved

davecgh force-pushed the netsync_separate_mix_msg_request branch 2 times, most recently from 78a6642 to 67c037e Compare August 8, 2025 05:29

davecgh mentioned this pull request Aug 8, 2025

netsync: Rework external request from peer. #3500

Merged

davecgh force-pushed the netsync_separate_mix_msg_request branch from 67c037e to 099f734 Compare August 8, 2025 08:11

jrick reviewed Aug 8, 2025

View reviewed changes

internal/netsync/manager.go Show resolved Hide resolved

davecgh force-pushed the netsync_separate_mix_msg_request branch 2 times, most recently from e8e5fae to a730e0b Compare August 8, 2025 19:26

davecgh mentioned this pull request Aug 14, 2025

netsync: Convert to synchronous model. #3496

Merged

jholdstock reviewed Aug 15, 2025

View reviewed changes

internal/netsync/manager.go Outdated Show resolved Hide resolved

davecgh force-pushed the netsync_separate_mix_msg_request branch from a730e0b to a559857 Compare August 15, 2025 18:58

jholdstock approved these changes Aug 16, 2025

View reviewed changes

davecgh force-pushed the netsync_separate_mix_msg_request branch from a559857 to c728a9a Compare August 25, 2025 20:49

davecgh added 14 commits August 26, 2025 20:02

davecgh added 17 commits August 26, 2025 20:03

netsync: Remove unused msg channel.

979947d

This removes the event handler message channel since it is no longer used. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.

davecgh force-pushed the netsync_separate_mix_msg_request branch from c728a9a to 1232311 Compare August 27, 2025 01:05

jrick approved these changes Aug 28, 2025

View reviewed changes

davecgh merged commit 1232311 into decred:master Aug 29, 2025
2 checks passed

davecgh deleted the netsync_separate_mix_msg_request branch August 29, 2025 00:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

netsync: Separate mix message request.#3499

netsync: Separate mix message request.#3499
davecgh merged 31 commits intodecred:masterfrom
davecgh:netsync_separate_mix_msg_request

davecgh commented Aug 8, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

davecgh commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

davecgh commented Aug 8, 2025 •

edited

Loading