Skip to content

netsync: Separate mix message request.#3499

Merged
davecgh merged 31 commits intodecred:masterfrom
davecgh:netsync_separate_mix_msg_request
Aug 29, 2025
Merged

netsync: Separate mix message request.#3499
davecgh merged 31 commits intodecred:masterfrom
davecgh:netsync_separate_mix_msg_request

Conversation

@davecgh
Copy link
Member

@davecgh davecgh commented Aug 8, 2025

This requires #3498.

The current logic for requesting mix messages from peers was added to the generic request from peer paths because the old async model required a lot of additional plumbing.

Now that the sync manager has been converted to a synchronous model, no additional plumbing is needed and therefore it is much simpler and more efficient to implement the request logic for mix messages independently.

Also, only a single mix message for a missing pair request is ever needed at a time, so the additional overhead of taking a slice is not needed.

Consequently, this separates the logic for requesting a mix message from a peer from the generic request from peer path into its own specialized method that is much more efficient.

Finally, it also changes the logic for determining if the mix message should be requested to make use of the same method as other paths that handle requesting mix messages. In particular, this means it will no longer attempt to request recently rejected or removed messages.

@davecgh davecgh added this to the 2.1.0 milestone Aug 8, 2025
@davecgh davecgh force-pushed the netsync_separate_mix_msg_request branch 2 times, most recently from 1a4256f to 34b5315 Compare August 8, 2025 05:10
@davecgh davecgh force-pushed the netsync_separate_mix_msg_request branch 2 times, most recently from 78a6642 to 67c037e Compare August 8, 2025 05:29
@davecgh davecgh force-pushed the netsync_separate_mix_msg_request branch from 67c037e to 099f734 Compare August 8, 2025 08:11
@davecgh davecgh force-pushed the netsync_separate_mix_msg_request branch 2 times, most recently from e8e5fae to a730e0b Compare August 8, 2025 19:26
@davecgh davecgh force-pushed the netsync_separate_mix_msg_request branch from a730e0b to a559857 Compare August 15, 2025 18:58
@davecgh davecgh force-pushed the netsync_separate_mix_msg_request branch from a559857 to c728a9a Compare August 25, 2025 20:49
davecgh added 14 commits August 26, 2025 20:02
This is the first in a series of commits that will ultimately convert
all of the code related to the sync manager handling events from remote
peers to synchronous code that makes use of separate mutexes and atomics
to protect concurrent access.

The motivation for the change in architecture follows.

Currently, all operations performed by the sync manager are implemented
via a single event handler goroutine and a single message channel to
protect concurrent access.

The existing implementation has worked well for many years since syncing
currently only happens via a single sync peer.

However, with the goal of moving to a syncing paradigm where there is no
sync peer and data is intelligently downloaded from multiple peers in
parallel, the current implementation is not ideal in a few ways:

- It is difficult to improve parallel processing since everything is
  forced into a single goroutine
- It requires additional plumbing to provide access to any related
  information
- The use of channels significantly hinders dynamically modifying
  requests based on state which is necessary for efficiently syncing
  from multiple peers

Converting all the relevant code to synchronous code addresses the
aforementioned concerns and has some additional benefits.

For example, the new model:

- Ultimately allows more code to run in parallel in the individual peer
  goroutines
- Requires less plumbing for handling events
- Makes the state available to calling code so it can make better
  decisions
- Blocks further reads until the messages are processed by default
- Provides better overall throughput

One notable aspect of the overall change to the architecture is that the
peer read goroutines themselves will now block while processing the
messages by default as opposed to the current model where the messages
are queued up to be handled asynchronously which then allows the peers
to immediately read more data.

This existing async queuing approach requires care to prevent malicious
behavior as is evident by several of the handlers (for example, blocks,
transactions, and mix messages) currently manually implementing blocking
behavior by using channels that are notified once the events are
eventually processed by the async handler.

Moreover, since all of the events from all peers all currently go
through the single sync manager goroutine, the aforementioned messages
often end up blocking not only until the message itself is processed,
but also until all of the other messages from all of the other peers
that arrived first are processed.

The existing model also results in further induced delays because most
of the messages result in requests to the peer in response which
obviously can't be sent until the message is processed.

So, while at first glance the lack of read pipelining by default might
raise some potential questions about extra latency, in practice, the new
model will typically result in faster overall processing despite that.
Further, it permits much more efficient per-message targeted pipelining
if any major bottlenecks were to arise in new messages.

With all of that in mind, this commit starts the conversion by
introducing a separate mutex to protect the peers map and updates the
code to protect all references to the map with that mutex.

This will allow future commits to methodically refactor the various
operations without introducing races.
Currently, the logic that transitions from the initial header sync to
the initial chain sync is quite buried in a conditional in the header
handling routine which makes it somewhat challenging to find and follow
for those not familiar with the code.  It also makes it more difficult
to update, particularly if more than a single sync peer for the initial
header sync is implemented, since it isn't concurrent safe and other
code around it relies on local state it updates.

This aims to improve the aforementioned by separating the handling for
when the manager determines the initial header sync process completes
into a separate method named onInitialHeaderSyncDone.

The main benefits are:

- It is much easier to find and follow the overall sequence of
  events that take place during the initial sync process
- It is more consistent the existing onInitialChainSyncDone method that
  is invoked when the initial chain sync process completes
- It will allow future commits to more easily add additional handling
  and make it concurrent safe

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
A peer that serves data is not necessarily a sync peer candidate, but
serving data is a necessary condition for being a sync peer candidate.

This helps make that distinction clear by adding a new flag to the Peer
struct that specifies whether or not the peer serves data and updates a
few places in the code that only rely on serving data to use the new
flag instead of overloading the meaning of the sync candidate flag.

It also has the benefit of being concurrent safe, unlike the sync
candidate flag, since whether or not a peer serves data can't change
while sync candidacy can.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This simplifies sync peer candidacy determination and makes it
concurrent safe by removing the sync candidate flag from the Peer struct
and instead basing candidacy on the new immutable serves data flag and
imposing the other necessary conditions when choosing the best sync
peer.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This modifies the sync height to use an atomic.Int64 instead of a mutex
to protect concurrent access and introduces a new method that atomically
and conditionally sets it to a higher value.

This approach makes it both harder to misuse and quite a bit faster due
to using non-blocking optimistic locking.

It also adds a test to ensure the new method works as expected including
under concurrent access.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This modifies the isCurrent flag to use an atomic.Bool instead of a
mutex to protect concurrent access.

This approach makes it both harder to misuse and quite a bit faster due
to using non-blocking optimistic locking.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This changes the headersSynced flag to an atomic.Bool so it is safe to
access concurrently.

It also introduces a new method on the header sync state named
InitialHeaderSyncDone to simplify and improve the readability of the
various places that check its state.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes access to the sync peer independently concurrent safe by
protecting it with a mutex and updating all of the call sites
appropriately.

It also ensures the methods related to starting the various chain sync
steps from the sync peer are protected by the mutex since their overall
logic relies on its value to determine when they need to be invoked.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes all accesses to the request maps independently concurrent
safe by protecting them with a mutex and updating all of the call sites
accordingly.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes all logic related to requesting the next needed blocks from
the sync peer independently concurrent safe by protecting it with a
mutex and updating all of the call sites accordingly.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This modifies the stall timeout logic to make it concurrent safe by
slightly loosening the ordering requirements as it relates to stopping
and resetting the stall timeout.

The current implementation was careful to prevent stale time values from
being delivered on the timer's channel, however, there really is no
reason to worry about that possibility because:

1) The stall timeout logic only depends on the event as opposed to the
   time values
2) The timeout is sufficiently long that it is almost never hit in
   practice
3) It is even more rare that the timer would just happen to be in the
   process of firing (sending to the channel) when data simultaneously
   arrives and resets the timer just before it actually fires
4) Even if all of that were to happen, the only effect is the standard
   stall timeout handling logic which means there are no adverse effects

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to adding a new valid peer to the sync
manager out of the event handler since it is now independently
concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling valid peers disconnecting
from the sync manager out of the event handler since it is now
independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This changes the numConsecutiveOrphanHeaders counter to an atomic.Int64
so it is independently safe for concurrent access.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
davecgh added 17 commits August 26, 2025 20:03
This makes all logic related to tracking the best announced block by
each peer independently concurrent safe by protecting the related fields
with a mutex and updating all of the call sites accordingly.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This splits the logic for starting the initial header sync and initial
chain sync processes into separate methods and also improves the warning
when there are no sync peers to be concurrent safe as well as to prevent
multiple unnecessary warnings.

The primary motivation for this change is to make it possible to
independently make the state transitions between the modes concurrent
safe in upcoming future commits.

Looking even further ahead, it partially paves the way to eventually
removing the entire notion of a chain sync peer (but not header sync
peer) entirely.

However, it also improves readability and has the nice minor benefit of
slightly improved efficiency when peers connect and disconnect since it
allows the logic that does not apply during the initial headers sync to
more easily be skipped without needing to check the flag multiple times.
The current approach to logging during the initial headers sync and
transitioning from the initial headers sync mode to the initial chain
sync mode relies on the fact the entire sequence is mutually exclusive,
as is currently the case, but it would result in logic races if parallel
header downloading is ever implemented during the headers sync process.

There are also no immediate plans to implement parallel header
downloading, in part because doing so would require an entirely new and
more capable wire message to be implemented first, but also because it's
already such a tiny fraction of the overall sync time and headers don't
grow at a very fast rate, so it will be a very long time before any
bottleneck there would become an issue.

However, it is worthwhile to address it now even though it's not
technically required yet since it might not be super obvious later.

Consequently, this reworks the logic related to transitioning to the
initial headers sync done state slightly to be independently concurrent
safe by adding a mutex to the header sync state and ensuring the flag
that tracks its completion, which is also separately an atomic for
performance, is re-checked and updated under that mutex to ensure the
transition is idempotent.

It also gates the header sync progress logging behind the initial header
sync mode flag since it is technically more accurate even though it does
not really matter so long as there is only a single header sync peer.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling header messages from peers
out of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
The current approach to detecting when the initial chain sync process is
done is based on determining if the chain wasn't current prior to
processing the block and becomes current after.  That logic is sound so
long as the entire sequence is mutually exclusive, as is currently the
case, but it would result in logic races when processing multiple blocks
concurrently.

Consequently, this reworks the logic related to transitioning to the
initial chain sync done state slightly to be independently concurrent
safe by introducing a new flag and mutex to track its state and then
transitioning based on the flag and status of whether the chain believes
it is current.

It also squashes a very minor long-standing logging bug where the
transition out of the initial chain sync mode would sometimes not log
the last batch of processed blocks for either an extended period or not
at all.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling block messages from peers
out of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling inv messages from peers out
of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling notfound messages from
peers out of the event handler since it is now independently concurrent
safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling transaction messages from
peers out of the event handler since it is now independently concurrent
safe.

It also slightly modifies the behavior to return the slice of accepted
transactions to the caller so it can announce them as opposed to relying
on a callback to do the announcements.

The announcement callback is removed from the peer notifier interface
since it is no longer needed.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling mix messages from peers out
of the event handler since it is now independently concurrent safe.

It also slightly modifies the behavior to return the slice of accepted
messages to the caller so it can announce them as opposed to relying on
a callback to do the announcements.

The entire peer notifier interface and containing file is removed since
it is no longer needed.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to requesting data from peers out of
the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to processing blocks from other sources
than the network out of the event handler since it is now independently
concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to returning the current sync peer ID
out of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This removes the event handler message channel since it is no longer
used.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This renames the event handler goroutine to stallHandler to more
accurately reflect its only remaining purpose after all of the recent
changes to move everything else out of the event handler.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This simplifies the request tracking for blocks, transactions, and mix
messages by making use of a single global map for each type that tracks
the peer instead of a global map plus an additional per-peer map for
each type.

Note that this change does make it very slightly more costly in terms of
CPU usage to determine the in-flight requests for a specific peer, but,
in practice, the new approach ends up being slightly faster overall
despite that due to the combination of less map management performed
under a mutex as well as the infrequent need to determine the in-flight
requests for a specific peer.

It also slightly reduces memory usage due to the removal of 3 maps per
connected peer which totals to between 24 and 375 maps with typical
default connection settings.
The current logic for requesting mix messages from peers was added to
the generic request from peer paths because the old async model required
a lot of additional plumbing.

Now that the sync manager has been converted to a synchronous model, no
additional plumbing is needed and therefore it is much simpler and more
efficient to implement the request logic for mix messages independently.

Also, only a single mix message for a missing pair request is ever
needed at a time, so the additional overhead of taking a slice is not
needed.

Consequently, this separates the logic for requesting a mix message from
a peer from the generic request from peer path into its own specialized
method that is much more efficient.

Finally, it also changes the logic for determining if the mix message
should be requested to make use of the same method as other paths that
handle requesting mix messages.  In particular, this means it will no
longer attempt to request recently rejected or removed messages.
@davecgh davecgh force-pushed the netsync_separate_mix_msg_request branch from c728a9a to 1232311 Compare August 27, 2025 01:05
@davecgh davecgh merged commit 1232311 into decred:master Aug 29, 2025
2 checks passed
@davecgh davecgh deleted the netsync_separate_mix_msg_request branch August 29, 2025 00:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants