
CIP-0164 | Refine Leios protocols based on buidlerfest discussions#1167

Open
rkuhn wants to merge 1 commit into cardano-foundation:master from rkuhn:rk/cip-0164-protocol-refinements

Conversation


@rkuhn rkuhn commented Mar 25, 2026

This PR resulted from a discussion today at buidlerfest, I offered to write things down and submit
here for wider discussion. The model of information flow itself remains unchanged, the refinement
only aims at making it easier to achieve low latency or high bandwidth for the different parts of
the Leios protocols, as necessary.

Summary of the changes:

  • use Reactive Streams semantics for bounded push of block announcements and votes
  • remove votes offer & request communication cycles to cut down latency
  • remove probably premature optimisation of block / txn request by just listing txn indices
  • split LeiosNotify into LeiosAnnounce (for EBs), LeiosVotes, LeiosBlockNotify (for when EB and/or txns are available upstream) to allow independent treatment in the muxer and remove or reduce head-of-line blocking

We might also want to allocate N2N mini-protocol IDs in this CIP because multiple teams are starting to play with this spec and might want to check interoperability.

Signed-off-by: Roland Kuhn rk@rkuhn.info

Contributor

@coot coot left a comment


There is a semantic problem in the LeiosAnnounce mini-protocol. There are two MsgLeiosAnnounceRequestNext messages which start with different agencies: one starts with the agency on the client, the other starts with the agency on the server. The meaning of the first is fine: the client asks for a number of announcements. But the latter means that the server is asking itself for a number of announcements.

I don't think we actually need the MsgLeiosAnnounceRequestNext :: StBusy -> StBusy. The client can just use protocol pipelining to send MsgLeiosAnnounceRequestNext :: StClient -> StBusy ahead of receiving all responses from its previous request.

Author

rkuhn commented Mar 26, 2026

Hi @coot, sorry for the misunderstanding, I should have tried to figure out a different color right away: the RequestNext(N) message is only sent by the client; in state StBusy both sides have agency. This is the same as with pipelining, I just prefer to make it explicit and specify it rather than leaving implicit.

Please do read the RS background I linked to. The intended usage is that the client will ask for 1000 items (which I think is roughly the right depth for votes), and while receiving them it will request the next 100 after the first 100 are received. The goal is to always have sufficient demand signaled at the server that it can send right away. This is important because we need low latency for tiny messages. The protocol still places a strict upper bound on required receive buffer size (which would be roughly 100-200kB for 1000 votes, as I've been told).

Yes, a similar result could be achieved by round-robin multiplexing 1000 instances of the protocol onto that mini-protocol's subchannel, but it would be a lot less efficient (more messages sent, memory and processing overhead at both client and server).
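The demand accounting described here can be sketched in a few lines. This is an illustrative Python model of my own (the class names, the 1000/100 constants, and the loop structure are not from the CIP): the client signals initial demand and tops it up after every 100 deliveries, so the server always has permission to send.

```python
# Illustrative sketch (not from the CIP) of Reactive Streams demand
# accounting: the client signals initial demand of 1000 and tops it up by
# 100 after every 100 deliveries, so the server is never starved.

class Server:
    def __init__(self):
        self.demand = 0              # items requested but not yet sent

    def request_next(self, n):       # models the client's RequestNext(N)
        self.demand += n

    def try_send(self):
        """Send one item if the client has signalled demand for it."""
        if self.demand > 0:
            self.demand -= 1
            return True
        return False

class Client:
    INITIAL, TOPUP = 1000, 100

    def __init__(self, server):
        self.server = server
        self.received = 0
        server.request_next(self.INITIAL)

    def on_item(self):
        self.received += 1
        # top up demand every TOPUP deliveries so the server never stalls
        if self.received % self.TOPUP == 0:
            self.server.request_next(self.TOPUP)

server = Server()
client = Client(server)
for _ in range(2500):                # server sends whenever demand allows
    if server.try_send():
        client.on_item()
print(client.received, server.demand)
```

In this model the outstanding demand never exceeds the initial 1000, which is what bounds the required receive buffer (the 100–200 kB figure for 1000 votes), while the server always has demand to send against.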

@sandtreader

in state StBusy both sides have agency.

That goes against the whole principle of the state-agency / typed protocols scheme I'm afraid.

I understand what you're aiming for - it's similar to TCP window - but I think as @coot says the existing pipelining does most of what you want, and for votes the responses can carry multiple anyway.

@rphair rphair added labels Update (Adds content or significantly reworks an existing proposal) and State: Triage (Applied to new PR after editor cleanup on GitHub, pending CIP meeting introduction) on Mar 26, 2026
Collaborator

@rphair rphair left a comment


Thanks @rkuhn for bringing this & the discussion into the CIP stream. Mainly the CIP editors will be looking for consensus among Leios architects & agreement from other stakeholders and merge this if & when that appears to happen... so please feel free to point out the details that warrant the most consideration & that should be resolved before we merge.

Next step would be to confirm this as a proper update at the next CIP meeting in Triage (https://hackmd.io/@cip-editors/131) so please anyone for or against these changes would be welcome to attend for a light introduction (not a full review)... just enough to establish validity & general consensus behind the update.

cc @will-break-it @ch1bo @nfrisby @bwbush @WhatisRT @nhenin @dnadales @perturbing @jpraynaud

Contributor

ch1bo commented Mar 27, 2026

@rphair I specifically asked to have the conversation here on a PR and expect some more discussion on this over the next few weeks, before we seek consensus and merge it into the CIP. Does this work for you and the CIP auditors?

Collaborator

rphair commented Mar 27, 2026

of course @ch1bo - there's no rush & the editors would be following your (plural) lead & waiting for you to post your own conclusions at your own pace. We'll just have a look at the CIP meeting next week so all the editors know what's ahead & not do anything with it until the stakeholding reviewers provide their consensus.

@rkuhn rkuhn force-pushed the rk/cip-0164-protocol-refinements branch from 5de19c1 to a04b10d on March 27, 2026 at 20:28
Author

rkuhn commented Mar 27, 2026

That goes against the whole principle of the state-agency / typed protocols scheme I'm afraid.

The protocol is still typed and can still be usefully formalised, but yes, modelling the real behaviour does require mixed choice, which is not currently available in the machinery used in the Haskell implementation. This is semantically already true of the existing protocols, where in StCanAwait of chain sync the initiator may send RequestNext even though it has “no agency” — this is what you call pipelining, but since all “sessions” share the same communication channel without tagging their messages, I think it is fair to consider those pipelining instances just a matter of perspective, like the choice of gauge in quantum field theory.

What I’d like to achieve here is that we define a useful network protocol befitting the Leios information flow requirements, not moulded to or limited by the existing code structure of other mini-protocols in a particular Cardano node implementation.

It is also quite easy to conform to my proposed protocol using the existing Haskell machinery by applying the pipelining approach to run a number of instances of the protocol with a fixed choice of N (e.g. N=1 to get normal request–response). This requires slightly more resources within the node but saves complexity cost in the Haskell codebase by reusing existing infrastructure.

…, and for votes the responses can carry multiple anyway.

My proposal changes the MsgLeiosVote to carry exactly one vote because then the message has a predictable cost. Allowing a list of votes within a single message means that the responder can cause unpredictable resource usage — in my proposal an initiator may ask for 1000 votes after which it expects to receive no more than 1000 votes (until it asks for more later).


I appreciate that this deviates from the trodden path in Cardano, and I’d like to ask you to consider this with an open mind given that Reactive Streams have been around for more than a decade in the JDK, meaning that this communication principle is well proven.

Comment thread CIP-0164/README.md
size of the configured TCP send and receive buffers.

While the node is catching up with the chain after a restart, it will see Praos
blocks referencing EBs and use the MsgLeiosMultiBlockRequest to get not only
Contributor


While we want a multi-block request to bulk download everything on catch-up, we still need the body-only request/response (previously MsgLeiosBlockRequest and MsgLeiosBlock). While this might be generalized to multiple bodies, a caught-up node would not download the full block closure, as only in the worst case are all txs unique and not already known by the client.

@nfrisby You also discussed a "cancel" semantics of this protocol last week? That would be an alternative to having a two-stage download: it could be used to always request the full closure and cancel / override with a desired closure subset via MsgLeiosBlockTxsRequest (or even switch to a "fresher" block download).

Author


My idea here is to deliver the block body for MsgLeiosBlockTxsRequest(hash, []) and an array of transactions for MsgLeiosBlockTxsRequest(hash, [<txHashes>]).

Cancellation of an ongoing transfer doesn’t seem fruitful (I am happy to be proven wrong by experiment!) because the RTT-bandwidth-product P will typically not be much smaller than the maximum response size: the responder can only stop sending upon receiving the cancellation, meaning after sending the cancellation the initiator will still receive P bytes.

Collaborator

rphair commented Apr 1, 2026

Brief discussion at the CIP meeting today introduced the dialogue here & confirmed my earlier suggestion in #1167 (review) that the existing discussion simply take its natural course until apparently resolved by the participants.

@rphair rphair added label State: Confirmed (Candidate with CIP number (new PR) or update under review) and removed State: Triage (Applied to new PR after editor cleanup on GitHub, pending CIP meeting introduction) on Apr 1, 2026
Contributor

@nfrisby nfrisby left a comment


Happy to see another contributor to the CIP! 👏

Even with fait accompli and local sortition (so that some votes are bigger than others), I think it's plausible that micromanaging votes isn't worthwhile, so just sending them instead of offering them individually is very plausibly a latency optimization.

For the other changes, it's not yet clear to me that they're improvements; I commented on those parts of the PR in this review.

Comment thread CIP-0164/README.md
</div>

The purpose of this first protocol is to diffuse block announcements as fast as
possible throughout the network. Since these announcements are small and
Contributor


https://www.reactive-streams.org/ says:

The main goal of Reactive Streams is to govern the exchange of stream data across an asynchronous boundary—think passing elements on to another thread or thread-pool—while ensuring that the receiving side is not forced to buffer arbitrary amounts of data

The way we have achieved that so far in the Cardano node is to use the typed-protocols framework combined with what we call "mini protocol pipelining". I'm hesitant to explicitly bake that pipelining into the definition of the mini protocol itself (since it's technically just an optimization). But I do think the CIP text should probably discuss mini protocol pipelining in more detail than the existing (on main) "Because the client only has agency in one state, it can pipeline its requests for the sake of latency hiding" sentence.

See this excerpt from page 18 of https://ouroboros-network.cardano.intersectmbo.org/pdfs/network-design/network-design.pdf for a brief discussion.

[Image: excerpt from page 18 of the network design document]

You could also watch Duncan's 2019 presentation here, https://www.youtube.com/watch?v=kkynmgwa7gE. He starts discussing mini protocol pipelining at the 35m55s mark.

My summary:

  • Every single message in the LeiosNotify mini protocol (from the main branch) is tiny. So we don't really care that they're technically different sizes. So we can say "not forced to buffer arbitrary amounts of messages" is a fine substitute to "not forced to buffer arbitrary amounts of data" for LeiosNotify.
  • The agency status is always trivial: client sends one message, then the server sends a response, and loop. This makes it easy to write the pipelined version of this mini protocol, in which the client sends 1000 MsgLeiosNotificationRequestNext messages at the start of the connection and then conceptually sends another each time it receives a reply (or smooth that out with a low-high watermark, etc).
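The pipelined client described in the second bullet can be sketched as follows (an illustrative Python model of my own; the low/high watermark values are hypothetical, not from the CIP): send HIGH requests up front, then refill back to HIGH in one burst whenever outstanding requests drop to LOW.

```python
# Illustrative sketch (watermark values are assumptions, not from the CIP)
# of low/high-watermark pipelining: an initial burst of HIGH RequestNext
# messages, refilled in bursts of HIGH - LOW at the low watermark.

LOW, HIGH = 200, 1000

def run(replies_available):
    outstanding = HIGH           # initial burst of RequestNext messages
    requests_sent = HIGH
    received = 0
    while received < replies_available and outstanding > 0:
        received += 1            # each reply consumes one outstanding request
        outstanding -= 1
        if outstanding == LOW:   # low watermark hit: refill to the high one
            requests_sent += HIGH - LOW
            outstanding = HIGH
    return requests_sent, received

print(run(2500))
```

Compared with sending one new request per reply, the watermark scheme batches the refills, at the cost of letting outstanding demand dip to LOW between bursts.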

Contributor

@nfrisby nfrisby Apr 7, 2026


Right now, the only distinction I'm currently seeing between the content on the main branch and this CIP in terms of the LeiosNotify mini protocol(s) is:

  • You're batching the 1000 and/or 100 MsgLeiosNotifyRequestNext messages into one message. Since we are expecting 1000s of messages for each EB, it might be worth it to add this kind of "batching" in typed-protocols (could be a mux-level "trick" in the decoding, maybe?), which could look something like the explicit token counting you have on this PR.
  • You've separated block announcements, vote announcements, and block offers into separate mini protocols. I suppose that might be useful so that the requests for each could be counted differently? But do we need to count them separately? (FWIW, I don't already think this would be easy to do on top of typed-protocols.)
  • You've also inlined votes into their announcement: nodes just send the vote without offering it first. If votes are barely bigger than their names, then that seems plausible to me.

edit: I forgot to add "Is that right or am I overlooking some difference in the LeiosNotify changes?"

Author


Thanks for the review, @nfrisby! Your summary of the changes to LeiosNotify is correct. The main motivation for request batching is to conserve resources because Leios involves many more notifications than Praos. I’ll respond to @coot’s comment below regarding how to model that in terms of the existing protocol framework, we seem to be converging (and I’d love to have a more formal description of pipelining in this particular context, which I’m happy to contribute as well).

The reasoning behind splitting the notify protocol into three pieces is that we’ll have three data streams that we need to back-pressure, and if my understanding is correct then the latency requirements and processing behaviour are different:

  • block announcements shall be disseminated as quickly as possible
  • votes are processed differently and thus may experience back-pressure while announcements do not
  • block offers (incl. closures) are processed again differently, hence the separate back-pressure signal

This separation also simplifies the sending side because it can just send messages down separately back-pressured channels and let the existing multiplexer do its job. Otherwise, another level of multiplexing needs to be implemented to prioritise the block announcements.

Contributor

@ch1bo ch1bo Apr 14, 2026


I just checked the current prototype that implements LeiosNotify like it's specified in the CIP (and also now in the cardano-blueprint).

And, indeed, we seem to have a pipelined client with a pipeline depth of 300: https://github.com/IntersectMBO/ouroboros-consensus/blob/leios-prototype/ouroboros-consensus-diffusion/src/ouroboros-consensus-diffusion/Ouroboros/Consensus/Network/NodeToNode.hs#L342

I have not verified this fully, but I would expect this client to send up to 300 requests for notifications even before seeing one response.

Comment thread CIP-0164/README.md
| Client→ | MsgLeiosMultiBlockRequest | list of EB hashes | Requests the EBs and all referenced transactions for the given EB hashes |
| ←Server | MsgLeiosBlock | EB block, list of transactions | A block requested in the previous MsgLeiosMultiBlockRequest |
| ←Server | MsgLeiosNoMoreBlocks | $\emptyset$ | All blocks from the previous MsgLeiosMultiBlockRequest have been delivered |
| Client→ | MsgLeiosBlockTxsRequest | EB hash, list of integers | For the referenced EB, request a list of transactions identified by their sequence number within that EB |
Contributor


I'm guessing this change is the one the PR description describes as

remove probably premature optimisation of block / txn request by just listing txn indices

Is that right?

An EB might contain ~15000 txs (that's 512 kB divided by 34+2 B), and a node might need to request the majority of them. With either design, the node has two choices.

  • Request the whole EB.
  • Or request whichever txs are actually needed by sending a request for those tx positions.

There will be some sweet spot, but since individual txs can be 16 kB, the sweet spot plausibly involves requesting the vast majority of txs in the EB but not all of them.

Roaring Bitmap versus Simple Integer Sequence

Assuming CBOR, the following table is the size of each integer encoding.

| Interval | CBOR bytes per number | Size of interval |
|---|---|---|
| 0 - 23 | 1 | 24 |
| 24 - 255 | 2 | 232 |
| 256 - 14999 | 3 | 14744 |

The encoding of the sequence 0 - 14999 would be 1×24 + 2×232 + 3×14744 = 44720 B.

If we used plain bytes instead of CBOR (ie just one big CBOR bytestring), the size would be closer to 30 kB instead of 45 kB.

As a roaring bitmap, a 15000 tx request would instead be 234 full Word64 bitmaps plus 1 partial Word64 bitmap that only has 15000 - 64×234 = 24 bits set.

Each Word64 bitmap also gets an index, which would be 0 - 234 in this case, which would be 1×24 + 2×211 = 446 B in CBOR. The 234 full Word64s would be 9×234 = 2106 B. The 1 partial Word64 would either be 5 or 9; let's call it 9 B. That's a total of 2561 B. (Closer to 2000 B without CBOR, but nbd.)
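The two calculations above can be cross-checked with a short script (my own sketch; the figures are exactly those derived in the comment):

```python
# Cross-checking the arithmetic above: the plain CBOR integer sequence and
# the per-chunk (index + Word64) encoding, for 15000 tx indices.

def cbor_uint_size(v):
    # CBOR major-type-0 sizes for the value ranges relevant here
    if v <= 23:
        return 1
    if v <= 255:
        return 2
    if v <= 65535:
        return 3
    return 5

# Sequence 0..14999 as individual CBOR uints: 1*24 + 2*232 + 3*14744
seq_bytes = sum(cbor_uint_size(i) for i in range(15000))

# Chunked encoding: 234 full Word64 bitmaps plus 1 partial (24 bits set),
# each Word64 costed at 9 B, plus a CBOR uint index 0..234 per chunk.
full, partial_bits = divmod(15000, 64)
index_bytes = sum(cbor_uint_size(c) for c in range(full + 1))
chunk_bytes = 9 * full + 9           # the partial Word64 also costed at 9 B

print(seq_bytes, full, partial_bits, index_bytes, index_bytes + chunk_bytes)
```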

So the largest request without roaring bitmaps is ~45 kB---we could reduce that to ~30 kB---and the largest request with a roaring bitmap is ~3 kB.

That's the upshot of the roaring bitmap complexity.

Is it worth it? If you juxtapose it against the size of an EB, then saving merely 27 kB per EB seems like a distraction.

But if you compare it to "all requests are tiny" versus "some requests are about half the size of a Praos block", then the roaring bitmap is a very localized complexity cost that makes the design simpler to reason about in various scenarios. For example, it would justify simpler ingress buffer management: you can limit the count of requests rather than their total byte size, and even in the worst case that would remain true. It guarantees that "individual requests are small---approximately 2 Praos headers---so we don't need to dedicate any complexity to worrying about their overhead".

That's the argument in favor of roaring bitmaps. I don't consider myself the owner here despite authoring this section originally; it's a CIP after all! But that'd be my argument for keeping the roaring bitmap: 1) it's a bit of cruft, yeah, 2) but it's isolated here and 3) it gives a very small and tight bound on the message size---even in the worst-case---which relieves cognitive load elsewhere.

Contributor


cc @ch1bo ^^^ that's the most time I've spent trying to explain the motivation for roaring bitmaps since the original discussions last year. I know they've been on the chopping block ever since :D. So let me know what you think.

Contributor


Hehe, thanks for laying out the rationale here. I can see the appeal of requests being absolutely small. If I recall correctly, @rkuhn had an argument that any request that is orders of magnitude smaller than the response should be fine - this would be true for both schemes.

Maybe the sweet spot lies somewhere in between? e.g. a single 15k length bitfield or that bitfield run-length encoded (so we can finally put our tech interview experience to good use)

Author


Thanks for the explanation! One thing I don’t yet fully grasp is the expected behaviour of the network in terms of dissemination of transactions and EBs. Why would the node ever need to request the majority or even a sizeable fraction of an EB’s transactions after catching up? Shouldn’t the transactions be disseminated somewhat earlier than the EB? Of course the transactions aren’t guaranteed to be present, but I’d be surprised if the average fraction a node would ask for is larger than 10% (if that).

Contributor


I’d be surprised if the average fraction a node would ask for is larger than 10% (if that).

This is the average case, where mempools are largely consistent. We have done recent R&D to analyse the "mempool fragmentation" of the Cardano network under various load points. Both empirically and simulation-based. See the January monthly review and March monthly review sessions, recordings and full notes here.

In summary, if demand is sufficiently high the fragmentation increases. IIRC the analytical results were confirmed in a simulation and match our intuition, but a real world test was not (yet) performed.

Contributor


Using this thread for the roaring bitmap discussion. @rkuhn writes:

would it not also be a solution to pass the bitmap through zstd compression?

Interesting idea, and reminds me of CIP-150 (which has not been implemented yet AFAIK)

Author


@nfrisby it took me a moment to understand your modified roaring bitmap calculation: your chunks aren't 16-bit wide but only 6-bit, right? If we're designing this tailored to the maximum of 15000 txns, then I'd store the whole bitmap as a CBOR byte string, with a one-byte chunk number followed by an eight-byte chunk, resulting in max 235*9=2115 bytes plus 3 bytes CBOR header.

zstd would very likely save some more bytes, the downside would be a much larger attack surface. So I'm coming around to this scheme, mostly because this scheme would be significantly simpler than true roaring bitmaps.
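For what it's worth, a quick check of that byte count (my own sketch, assuming the 1-byte chunk number plus 8-byte chunk layout described above):

```python
# Checking the simplified scheme above: 64-bit chunks for up to 15000 txs,
# stored as one CBOR byte string of (1-byte chunk number + 8-byte chunk)
# records.
chunks = (15000 + 63) // 64          # 235 chunks of 64 bits each
payload = chunks * (1 + 8)           # chunk number byte + eight data bytes
print(chunks, payload, payload + 3)  # plus a 3-byte CBOR bstr header
```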

Contributor

@ch1bo ch1bo Apr 16, 2026


Worked through this with a 🧞 and I'd like to put forth my opinion:

TLDR I like the simplicity of just a bitfield with the option to slap zstd on it if we need to trade some cpu cycles for less bytes on the wire


This document analyses the wire-encoding options for sets of integer indices used in two Leios mini-protocols:

| Use case | Max entries (N) | Chunks (N/64) |
|---|---|---|
| EB tx fetching (LeiosFetch) | 15,000 | 235 |
| Vote fetching (LeiosVotes) | 3,000 | 47 |

All size estimates assume CBOR encoding. CBOR unsigned integers cost 1 B for values 0–23, 2 B for 24–255, 3 B for 256–65535.

Variants

1. List of integer indices

A flat sorted CBOR array of uint values. The simplest possible encoding.

For N = 15,000 almost all indices are ≥ 256, costing 3 B each in CBOR.

| k | Wire size |
|---|---|
| 100 | ~305 B |
| 625 | ~1,880 B |
| 1,000 | ~3,005 B |
| 15,000 | ~45,005 B |

Size grows linearly with k — a full request encodes as ~45 kB, which is a potential amplification vector without an explicit codec limit.

Codec complexity: O(k) trivially in both directions.

Best choice for very sparse requests (k ≲ 200 for N = 15k, k ≲ 60 for N = 3k).
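The sizing rule behind the table above can be sketched as a one-liner (my own approximation: a ~5 B array header plus 3 B per index, since nearly all indices are ≥ 256):

```python
# Approximate wire size of a flat CBOR list of k indices in [0, 15000):
# ~5 B array framing plus 3 B per index (almost all indices cost 3 B).

def list_size(k):
    return 5 + 3 * k

for k in (100, 625, 1000, 15000):
    print(k, list_size(k))
```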

2. Single bitfield (bstr)

One flat bytestring of ⌈N/8⌉ bytes, one bit per entry, encoded as a CBOR byte string.

| N | Wire size (always) |
|---|---|
| 15,000 | 1,878 B (1,875 B + 3 B CBOR header) |
| 3,000 | 378 B (375 B + 3 B CBOR header) |

Fixed size regardless of k. Trivial to encode and decode: O(N/8), one array write per selected index.

For the votes case (N = 3,000) the 378 B constant cost beats:

  • list of indices when k > 126
  • indexed bitfields when k > ~34 occupied chunks

For EB fetch (N = 15,000), the 1,878 B constant cost beats:

  • list of indices when k > 625
  • indexed bitfields when k > ~170 occupied chunks (~200 random entries)

The single bitfield is the simplest encoding with the best worst-case behaviour. Its only disadvantage is the fixed baseline cost for very sparse requests.

Codec complexity: O(N/8) always.
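The trivial codec is essentially this (an illustrative sketch; the ~3 B CBOR bstr header is not modelled):

```python
# One bit per index, packed into ceil(N/8) bytes.

def encode(indices, n=15000):
    buf = bytearray((n + 7) // 8)
    for i in indices:
        buf[i // 8] |= 1 << (i % 8)   # one array write per selected index
    return bytes(buf)

def decode(buf):
    return [i for i in range(len(buf) * 8) if buf[i // 8] >> (i % 8) & 1]

sel = [0, 7, 8, 14999]
buf = encode(sel)
print(len(buf), decode(buf) == sel)
```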

3. Indexed 64-bit bitfields — current [(Word16, Word64)]

The index space is divided into chunks of 64. Each occupied chunk is represented as a (chunk_index, bitmask) pair and the list is encoded as an indefinite-length CBOR map.

CBOR cost per occupied chunk:

  • Key (encodeWord16): 2 B for chunk indices 24–234 (all relevant ones)
  • Value (encodeWord64): 9 B for typical bitmasks (non-trivial upper bits)
  • Total: ~11 B/chunk plus 2 B for the map header and break

For a random selection of k entries the expected number of occupied chunks is

E[chunks] ≈ C × (1 − e^(−k/C))

where C = ⌈N/64⌉. This saturates quickly: for N = 15,000 roughly 80% of all 235 chunks are occupied by only k ≈ 500 random entries.

| k (N = 15k) | E[occupied chunks] | Wire size |
|---|---|---|
| 100 | ~82 | ~904 B |
| 500 | ~188 | ~2,070 B |
| 1,000 | ~218 | ~2,400 B |
| 15,000 | 235 | ~2,590 B |

The encoding is only strictly more efficient than a flat bitfield (1,875 B) when fewer than ~170 chunks are occupied, i.e. for random input at k ≲ 200.

The per-chunk overhead (11 B for 8 B of payload) is a 38% tax, but the key advantage is skipping entirely-zero chunks.

Codec complexity: O(k log k) encode (sort+group), O(C) decode.
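An encode sketch for this variant (function names are mine; the wire size is costed at the ~11 B per occupied chunk derived above):

```python
# Group indices into 64-wide chunks, keep only the occupied chunks, and
# estimate the wire size from the per-chunk CBOR cost derived above.

from collections import defaultdict

def indexed_chunks(indices):
    chunks = defaultdict(int)
    for i in indices:
        chunks[i // 64] |= 1 << (i % 64)
    return dict(chunks)

def wire_size(chunks):
    # 2 B map header/break + (2 B key + 9 B Word64 value) per occupied chunk
    return 2 + 11 * len(chunks)

# worst density for this scheme: one selected index in every chunk
c = indexed_chunks(range(0, 15000, 64))
print(len(c), wire_size(c))
```

With only one bit set per chunk, all 235 chunks are occupied, so the request already costs as much as a fully populated one.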

Aside: relationship to Roaring Bitmaps

Roaring Bitmaps (Lemire et al., 2016) is a well-known compressed bitset data structure that also partitions the index space into chunks — but with chunks of 2^16 = 65,536 entries and an adaptive per-chunk representation:

  • Sparse chunk (< 4,096 entries): sorted array of uint16 offsets → 2 B/entry
  • Dense chunk (≥ 4,096 entries): raw 8,192-byte bitset
  • (v2) Very dense chunk: run-length encoded list of ranges

The switch point (4,096 entries = 4,096 × 2 B = 8,192 B = bitset size) is where array and bitset are exactly equal in cost.

For our use cases N ≤ 15,000 < 65,536, a Roaring Bitmap would be a single chunk that switches at k = N × 4096/65536 ≈ N/16:

| N | Switch at k | Below: array (2 B/entry) | Above: bitset |
|---|---|---|---|
| 15,000 | ~938 | 2k B | 1,875 B |
| 3,000 | ~188 | 2k B | 375 B |

This is essentially the same crossover as "list of indices vs single bitfield" (see §2 above), which the Roaring Bitmap formalises as an adaptive threshold.

The current [(Word16, Word64)] encoding is not a Roaring Bitmap: its chunk granularity is 64 (not 65,536), it has no adaptive switching (always uses the bitset representation per chunk), and it carries significant per-chunk overhead without the density-based selection that makes Roaring Bitmaps efficient.

4. Run-length encoded index list

A sorted CBOR array of (start_index, run_length) pairs, one per contiguous run of selected entries. For example, indices {5, 6, 7, 100} out of 200 would encode as [[5, 3], [100, 1]].

Let r be the number of distinct contiguous runs (r ≤ k). Per run: ~3 B for the start index (≥ 256 for most of the N = 15k range) plus 1–2 B for the run length, so ~4–5 B/run.

| k (random, N = 15k) | E[runs r] | Wire size | vs list of indices |
|---|---|---|---|
| 100 | ~100 | ~400 B | 1.3× larger |
| 500 | ~500 | ~2,000 B | 1.3× larger |
| 1,000 | ~1,000 | ~4,000 B | 1.3× larger |

For uniformly random input r ≈ k (no clustering), so this is slightly worse than a plain list of indices. It only wins over the list when selections are clustered enough that the average run length exceeds ~1.3 (r < 0.75k).

For contiguous range requests it is excellent: k consecutive entries encode as a single pair (~5 B) regardless of k.

Codec complexity: O(k) encode and decode (scan sorted indices for consecutive gaps). Simpler than the indexed-bitfield variant.
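The run-length codec described above is a single pass over the sorted indices (illustrative sketch):

```python
# Sorted indices -> [start, run_length] pairs, one per contiguous run:
# {5, 6, 7, 100} becomes [[5, 3], [100, 1]].

def rle(indices):
    runs = []
    for i in sorted(indices):
        if runs and i == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1          # extends the current run
        else:
            runs.append([i, 1])       # starts a new run
    return runs

print(rle([5, 6, 7, 100]))
print(rle(range(15000)))              # one run covering everything
```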

5. Compressed bitfield (zstd)

Apply a general-purpose compression algorithm to the flat ⌈N/8⌉-byte bitfield before encoding it as a CBOR bstr.

Algorithm choice: Zstandard (zstd) at level 1 is the natural fit. It decompresses extremely fast (~1 GB/s), has very low latency at level 1, and handles both zero-heavy and one-heavy bitfields well. LZ4 is a reasonable alternative with marginally lower compression ratio and even faster decompression. For buffers this small (≤ 1,875 B), anything heavier (brotli, DEFLATE) adds latency without significant benefit.

How zstd handles sparse bitfields: zstd uses LZ77 back-references plus entropy coding (FSE/ANS). A sparse bitfield is almost entirely 0x00 bytes, which compress to a few back-references and a short entropy table. A dense (near-full) bitfield is almost entirely 0xFF bytes and compresses equally well by symmetry. The worst case is a near-random bitfield (k ≈ N/2) where entropy is maximal.

Estimated compressed sizes for N = 15,000 (1,875 B input, zstd level 1):

| k (random) | ~P(byte = 0x00) | Estimated compressed size | vs raw |
|---|---|---|---|
| 0 | 100% | ~12 B | 156× smaller |
| 100 | ~95% | ~20–50 B | ~60× smaller |
| 500 | ~76% | ~300–600 B | 3–6× smaller |
| 1,000 | ~58% | ~800–1,200 B | 1.5–2× smaller |
| 7,500 | ~0.4% | ~1,850–1,950 B | slightly larger |
| 15,000 | 0% | ~12 B | 156× smaller |

The worst case is k ≈ N/2 where the bitfield is near-random and zstd gains nothing. The overhead is then just the zstd framing: 4 B magic number + ~6 B frame header + 3 B block header = ~13–18 B over the raw input, i.e. under 1% for either use case.

Codec complexity: O(N) compression and decompression with small constants. For 1,875 B at level 1 the wall-clock time is sub-microsecond on modern hardware — not a bottleneck. libzstd is already a transitive dependency of the node binary.
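The sparse/dense symmetry is easy to demonstrate with any LZ-family compressor. This sketch uses zlib (Python stdlib) as a stand-in for zstd, since zstd bindings are not in the standard library; absolute sizes therefore differ from the zstd estimates above, but the shape is the same: mostly-0x00 and mostly-0xFF bitfields both collapse, while the k ≈ N/2 case stays near the raw 1,875 B.

```python
# zlib (DEFLATE) as an illustrative stand-in for zstd: compress a 15000-bit
# bitfield with k randomly selected bits and report the compressed size.

import random
import zlib

def compressed_size(k, n=15000, seed=42):
    buf = bytearray((n + 7) // 8)
    for i in random.Random(seed).sample(range(n), k):
        buf[i // 8] |= 1 << (i % 8)
    return len(zlib.compress(bytes(buf), 1))   # fast compression level

for k in (0, 100, 7500, 15000):
    print(k, compressed_size(k))
```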

Access patterns

Votes accumulate incrementally. A node starts with zero votes for an election and collects them from upstreams over time. Requests are spread across multiple peers to avoid redundant work, so a typical request either covers a contiguous range assigned to one peer (e.g. "give me votes 0–1499") or targets the specific stragglers still missing at the end. Both cases have small k: the range case benefits from RLE; the straggler case is sparse and scattered. Since N = 3,000 is small, the encoding choice matters less here — even a flat bitfield is only 378 B.

EB tx fetches have two distinct regimes with different latency requirements:

  • Happy path (light load): a node has seen most txs via the mempool already and is only missing ~10%, i.e. k ≈ 1,500 scattered entries. Latency is not critical because the EB can be partially validated and the missing txs arrive quickly.
  • High load or adversarial EBs: a node may be missing 50–100% of the txs (k ≈ 7,500–15,000). This is the latency-critical case — the node cannot progress until it fetches and validates the full tx set, so the request must be compact and cheap to encode. An encoding that degrades badly at large k is disqualifying here.

Comparison and recommendations

All sizes for N = 15,000 (EB fetch) unless noted.

| Encoding | k = 0 | k = 100 | k = 1,000 | k = 15,000 | Codec |
|---|---|---|---|---|---|
| List of indices | 3 B | ~305 B | ~3,005 B | ~45,005 B | Trivial |
| Single bitfield | 1,878 B | 1,878 B | 1,878 B | 1,878 B | Trivial |
| Indexed bitfields | 2 B | ~904 B | ~2,400 B | ~2,590 B | Medium |
| RLE index list | 0 B | ~400 B | ~4,000 B | ~5 B† | Simple |
| Compressed bitfield | ~15 B | ~20–50 B | ~800–1,200 B | ~15 B | Medium |

† k = 15,000 selects all entries = one run [[0, 15000]]. The worst case for RLE is k ≈ N/2 with a checkerboard pattern (~7,500 isolated runs ≈ ~37,500 B).
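The footnote's two extremes are quick to check with a minimal RLE encoder (a hypothetical helper, not the CIP's wire format): a full selection collapses to one `[start, length]` run, while a checkerboard degenerates into 7,500 isolated runs.

```python
def rle_runs(indices):
    """Collapse an ascending index iterable into [start, length] runs."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1          # extends the current run
        else:
            runs.append([i, 1])       # starts a new run
    return runs

print(rle_runs(range(15_000)))             # full selection: [[0, 15000]]
print(len(rle_runs(range(0, 15_000, 2))))  # checkerboard: 7500 runs
```

At roughly 5 B per encoded run (consistent with the ~4,000 B for k = 1,000 in the table), 7,500 runs is where the ~37,500 B worst case comes from.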

Crossover points for N = 15,000: list beats single bitfield at k = 625; indexed beats single bitfield for k ≲ 200 random entries; compressed beats list above k ≈ 30.
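The list-vs-bitfield crossover follows from rough size models (assumed here: ~3 B per CBOR-encoded index plus ~3 B of framing for the list, and 1,875 B plus ~3 B of framing for the flat bitfield; these constants approximate the table above, they are not normative).

```python
N = 15_000

def list_size(k):
    """Rough model: ~3 B per CBOR-encoded index plus ~3 B framing."""
    return 3 * k + 3

def bitfield_size():
    """Flat bitfield: 1,875 B raw plus ~3 B framing."""
    return (N + 7) // 8 + 3

# smallest k at which the flat bitfield is no larger than the index list
crossover = min(k for k in range(N) if list_size(k) >= bitfield_size())
print(crossover)  # 625
```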

Votes (N = 3,000): Use a single bitfield (378 B constant). The access patterns (range-assigned requests and sparse stragglers) vary widely, but 378 B is so small that it is never the bottleneck. The codec is two lines and there is nothing to tune.
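The "two lines" are essentially an index-to-bit pack and its inverse. A sketch (names hypothetical; the quoted 378 B presumably adds a few bytes of CBOR framing on top of the 375 B payload):

```python
N_VOTES = 3_000                      # vote indices per election

def encode_votes(indices):
    """Pack vote indices into a 375 B bitfield."""
    buf = bytearray((N_VOTES + 7) // 8)
    for i in indices:
        buf[i // 8] |= 1 << (i % 8)
    return bytes(buf)

def decode_votes(buf):
    """Recover the sorted index list from the bitfield."""
    return [i for i in range(N_VOTES) if buf[i // 8] & (1 << (i % 8))]

assert decode_votes(encode_votes({5, 42, 2_999})) == [5, 42, 2999]
```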

EB tx fetch (N = 15,000): The latency-critical case is 50–100% missing, which immediately disqualifies the list of indices (up to 45 KB) and the RLE index list (up to ~37,500 B for scattered selections). The remaining options at worst case:

| Encoding | Happy path (k ≈ 1,500) | High load (k ≈ 7,500–15,000) |
|---|---|---|
| Single bitfield | 1,878 B | 1,878 B |
| Indexed bitfields | ~2,400 B | ~2,500–2,590 B |
| Compressed bitfield | ~800–1,200 B | ~1,893 B (k = 7,500) / ~15 B (k = 15,000) |

Single bitfield is the safe default: constant 1,878 B regardless of regime, trivial codec.

Compressed bitfield is worth the extra complexity if the happy path (10% missing) is expected to dominate: it cuts the common-case request to ~1,000 B while the worst case (k ≈ N/2) stays at ~1,893 B.

The indexed bitfield (current) adds ~38% overhead over the flat bitfield at all k and saturates to near-maximum size already at k ≈ 500 random entries. It offers no advantage over the single bitfield for either regime.

Discussed this further with @nfrisby and we came up with yet another scheme: an offset variable-length bitfield. That is, a starting-offset int + a bstr bitfield of variable length, up to whatever we want to address. This allows very efficient addressing of "chunks", which we might want to do in the fetch decision logic.
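A sketch of that scheme (hypothetical helpers; the CIP would fix the actual CDDL): the request carries a starting offset plus a bitfield only as wide as the span it covers, so a request for one chunk costs the offset plus width/8 bytes regardless of N.

```python
def encode_chunk(indices):
    """Encode a sorted, non-empty index list as (offset, bitfield bytes)."""
    offset = indices[0]
    width = indices[-1] - offset + 1
    buf = bytearray((width + 7) // 8)
    for i in indices:
        buf[(i - offset) // 8] |= 1 << ((i - offset) % 8)
    return offset, bytes(buf)

def decode_chunk(offset, buf):
    """Recover the index list from an offset + variable-length bitfield."""
    return [offset + i for i in range(len(buf) * 8)
            if buf[i // 8] & (1 << (i % 8))]

# a 512-entry chunk starting at index 8,192 needs only a 64 B bitfield
offset, field = encode_chunk(list(range(8_192, 8_704)))
```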

coot commented Apr 8, 2026

I'd propose a simpler approach: a mixture of batching and pipelining, both of which are currently used in the cardano-node node-to-node protocol suite (which is well supported and well tested).

We'd have the following states (and agencies):

| State | Agency |
|---|---|
| StIdle | client |
| StBusy n for n > 0 | server |
| StDone | nobody |

And the following transitions / messages:

| transition | from | to |
|---|---|---|
| MsgLeiosAnnounceRequestNext n | StIdle | StBusy n |
| MsgLeiosBlockAnnouncement | StBusy n for n > 1 | StBusy (n-1) |
| MsgLeiosBlockAnnouncement | StBusy 1 | StIdle |
| MsgDone | StIdle | StDone |

We can saturate the server with requests by using protocol pipelining, and save on sending each request separately by batching. The client stays in control of how much batching / pipelining it is ready to accept on its ingress side.

I don't know how to draw this using mermaid.

We use a similar approach for batching block requests in the block-fetch mini-protocol (there we have a single StBusy state rather than a family of them; we just don't track at the type level how many responses are pending).
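The transition table above can be read off as a small executable sketch (illustrative only; a real implementation would live in the typed mini-protocol framework):

```python
def step(state, msg):
    """One transition of the proposed LeiosAnnounce state machine."""
    kind, n = state
    if kind == "StIdle" and msg[0] == "MsgLeiosAnnounceRequestNext" and msg[1] > 0:
        return ("StBusy", msg[1])      # client hands agency to the server
    if kind == "StIdle" and msg[0] == "MsgDone":
        return ("StDone", 0)
    if kind == "StBusy" and msg[0] == "MsgLeiosBlockAnnouncement":
        return ("StIdle", 0) if n == 1 else ("StBusy", n - 1)
    raise ValueError(f"{msg[0]} not allowed in {kind}")

# one request batching 3 announcements, then back to StIdle
s = step(("StIdle", 0), ("MsgLeiosAnnounceRequestNext", 3))
for _ in range(3):
    s = step(s, ("MsgLeiosBlockAnnouncement", None))
assert s == ("StIdle", 0)
```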

ch1bo commented Apr 9, 2026

> I don't know how to draw this using mermaid.

I think the explicit state enumeration using the type-level naturals could look like this in mermaid:

```mermaid
---
title: LeiosVotes
---
graph LR
   classDef client color:black,fill:PaleGreen,stroke:DarkGreen;
   classDef server color:black,fill:PowderBlue,stroke:DarkBlue;

   StIdle:::client --MsgDone--> StDone

   StBusy1[StBusy 1]
   StBusy2[StBusy 2]
   StBusyEtc[StBusy ..]
   StBusy1000[StBusy 1000]
   class StBusy1,StBusy2,StBusyEtc,StBusy1000 server;

   StBusy1 --MsgLeiosVote--> StIdle
   StBusy1 --MsgLeiosVotesRequestNext--> StBusy2
   StBusy2 --MsgLeiosVote--> StBusy1
   StBusy2 --MsgLeiosVotesRequestNext--> StBusyEtc
   StBusyEtc --MsgLeiosVote--> StBusy2
   StBusyEtc --MsgLeiosVotesRequestNext--> StBusy1000
   StBusy1000 --MsgLeiosVote--> StBusyEtc
```

Edit: this was not exactly the same as what @coot described above, so I made another attempt below; it is not even complete, and drawing this exhaustively is a mess, I agree :)

```mermaid
---
title: LeiosVotes
---
graph LR
   classDef client color:black,fill:PaleGreen,stroke:DarkGreen;
   classDef server color:black,fill:PowderBlue,stroke:DarkBlue;

   StIdle:::client --MsgDone--> StDone

   StBusy1[StBusy 1]
   StBusy2[StBusy 2]
   StBusyEtc[StBusy ..]
   StBusy1000[StBusy 1000]
   class StBusy1,StBusy2,StBusyEtc,StBusy1000 server;

   StBusy1 --MsgLeiosVote--> StIdle
   StBusy1 --MsgLeiosVotesRequest 1--> StBusy2
   StBusy1 --MsgLeiosVotesRequest 999--> StBusy1000
   StBusy2 --MsgLeiosVote--> StBusy1
   StBusy2 --MsgLeiosVotesRequest ...--> StBusyEtc
   StBusy2 --MsgLeiosVotesRequest 998--> StBusy1000
   StBusyEtc --MsgLeiosVote--> StBusy2
   StBusyEtc --MsgLeiosVotesRequest ...--> StBusy1000
   StBusy1000 --MsgLeiosVote--> StBusyEtc
```

rkuhn commented Apr 9, 2026

Yes, @coot that’s also a possibility — I think your proposal with pipelining is behaviourally indistinguishable from reactive streams semantics (without cancellation), so any node implementation can pick its internal representation according to personal style and inclination.

@ch1bo what you sketched is a different way of implementing reactive streams semantics, which would probably be a larger change and more difficult to integrate with the existing mini-protocol machinery.

In order to ensure convergence I’d like to describe formally how I interpret pipelining. If this matches your understanding then I’ll add it to the CIP text.


> Protocol pipelining with a factor N runs N instances of a mini-protocol on a single multiplexer subchannel for the given protocol ID. Each instance tracks its own state and agency as per the specification. One protocol state is marked as the switch state; the switch state must be one in which the initiator has agency. The subchannel is governed by a pair of multiplexers, one for sending and one for receiving, which operate in round-robin fashion across the N instances, starting at the first instance. Requests from the node are forwarded by the sending multiplexer to the currently selected instance, which emits the resulting protocol message to the network; whenever it has sent a message from the switch state, the sending multiplexer selects the next instance. The receiving multiplexer forwards messages received from the network to the currently selected protocol instance; whenever a received message transitions that instance into the switch state, the receiving multiplexer selects the next instance.


This implies that pipelining only works for mini-protocols which have a suitable switch state, in which the initiator decides what to do next and from which the responder can send one or more messages to return to the switch state. A protocol in which the initiator would need to send again from a different intermediate state would not support pipelining (such protocols don't yet exist in the Ouroboros family).

coot commented Apr 10, 2026

@ch1bo what I propose has this shape:

*(image: protocol state diagram)*

@rkuhn

> Protocol pipelining with a factor N runs N instances ...

This is the wiki page which explains protocol pipelining as I use the term. Note that one doesn't need to run N instances of a protocol; one just sends requests ahead of responses (which is enough to hide latency).

For example, with the above protocol, one can have this conversation (with pipelining depth 2):

  • client sends MsgLeiosAnnounceRequestNext 100
  • client sends MsgLeiosAnnounceRequestNext 100

Both requests are sent without waiting for any replies. The server answers them following the protocol, e.g. sends 100 MsgLeiosBlockAnnouncement messages for the first request, then another 100 for the second. In the meantime, the client can pipeline more requests to keep the server busy.

Even if the client pipelines messages, the server doesn't care: it sees requests in exactly the same order as if the client weren't pipelining at all. So protocol pipelining doesn't restrict the kinds of protocols you can use (at least within the class of protocols that can be encoded as diagrams with agencies).
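That conversation can be made concrete with a toy channel (message names from the thread; depth 2, batch 100): the server drains requests strictly in send order, so pipelining changes only when requests are sent, not what the server observes.

```python
from collections import deque

wire = deque()
# the client pipelines two batched requests before reading any reply
wire.append(("MsgLeiosAnnounceRequestNext", 100))
wire.append(("MsgLeiosAnnounceRequestNext", 100))

replies = []
while wire:                        # server sees requests in send order
    _msg, n = wire.popleft()
    replies += ["MsgLeiosBlockAnnouncement"] * n

assert len(replies) == 200         # 100 announcements per pipelined request
```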

rkuhn commented Apr 11, 2026

@coot Okay, so you’re confirming that my description is correct, including the restriction that pipelining is only defined for request–response+ shaped protocols — I am not aware of the design rules or well-formedness conditions of the mini-protocols used in Cardano specs, so I’ll take the absence of other protocol shapes as specification by example.

(Just as an illustration of a protocol that would not work, take a hypothetical variant of the block fetch protocol where the initiator sends a block range request, the responder sends NoBlocks or StartStreaming, and then before each Block can be transmitted the initiator needs to explicitly ask for it. This can obviously not be pipelined because the next block range request would arrive when the responder expects a NextBlockPlease message. My naive interpretation of agency diagrams would permit such a protocol to be specified.)


I’ll update the PR soon with the aspects already agreed here. For the roaring bitmaps I still have to understand the solution space better — would it not also be a solution to pass the bitmap through zstd compression?

coot commented Apr 11, 2026

Yes, that's a good example of when one is not able to pipeline. To use pipelining over a message from state s to t (where the client has agency in s and the server has agency in t), there must exist a state s' with client agency such that all paths from t lead to s' through states where the server has agency. This is because, when pipelining, the client needs to know ahead of time which state the server will lead it to.
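The condition can be checked mechanically on a protocol's agency graph. A sketch (all state names and both graphs below are illustrative, borrowing the hypothetical block-fetch variant from the previous comment):

```python
def pipelineable(t, transitions, agency):
    """Check the condition above: starting from server-agency state t, every
    path through server-agency states must exit at one single client-agency
    state s' (otherwise the client cannot predict where to pipeline into)."""
    frontier, seen, exits = [t], set(), set()
    while frontier:
        s = frontier.pop()
        if s in seen:
            continue
        seen.add(s)
        if agency[s] == "client":
            exits.add(s)               # client regains agency here
        else:
            frontier.extend(transitions[s])
    return len(exits) == 1

agency = {"StBusy": "server", "StStreaming": "server",
          "StIdle": "client", "StAwait": "client"}

# ordinary block-fetch shape: all server-driven paths funnel back to StIdle
ok = {"StBusy": ["StIdle", "StStreaming"],
      "StStreaming": ["StStreaming", "StIdle"]}
# hypothetical variant: StartStreaming parks the client in StAwait,
# so two different client-agency states are reachable
bad = {"StBusy": ["StIdle", "StAwait"]}

assert pipelineable("StBusy", ok, agency)
assert not pipelineable("StBusy", bad, agency)
```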

ch1bo added a commit to cardano-scaling/cardano-blueprint that referenced this pull request Apr 14, 2026
Create description pages for the three new leios protocols as proposed in https://github.com/cardano-scaling/CIPs/blob/leios/CIP-0164/README.md and also currently discussed in cardano-foundation/CIPs#1167
ch1bo commented Apr 14, 2026

Even though the discussion is not fully settled, I thought I'd give it a shot and write up the mini-protocol descriptions and message CDDLs, analogous to the existing ones in the cardano-blueprint. This already includes LeiosVotes and a few shuffled message tags, but still the unchanged block request formats (roaring bitmaps).

See PR cardano-scaling/cardano-blueprint#65 and preview https://cardano-scaling.github.io/cardano-blueprint/pr-preview/pr-65/network/node-to-node/leios-votes/index.html

- use Reactive Streams semantics for bounded push of block announcements and votes
- remove votes offer & request communication cycles to cut down latency
- remove probably premature optimisation of block / txn request by just listing txn indices
- split LeiosNotify into LeiosAnnounce (for EBs), LeiosVotes, LeiosBlockNotify (for when EB and/or txns are available upstream) to allow independent treatment in the muxer and remove or reduce head-of-line blocking

We might also want to allocate N2N mini-protocol IDs in this CIP because multiple teams are starting to play with this spec and might want to check interoperability.

Signed-off-by: Roland Kuhn <rk@rkuhn.info>
@rkuhn rkuhn force-pushed the rk/cip-0164-protocol-refinements branch from a04b10d to 90b9d50 Compare April 15, 2026 18:33
@rkuhn
Copy link
Copy Markdown
Author

rkuhn commented Apr 15, 2026

I pushed a new commit, changing the description of the mini-protocols to rely on pipelining instead of mixed choice and adding roaring-inspired bitmap encoding for MsgLeiosBlockTxsRequest. @coot @nfrisby @ch1bo @sandtreader WDYT?

ch1bo added a commit to cardano-scaling/cardano-blueprint that referenced this pull request Apr 20, 2026
Create description pages for the three new leios protocols as proposed in https://github.com/cardano-scaling/CIPs/blob/leios/CIP-0164/README.md and also currently discussed in cardano-foundation/CIPs#1167
sandtreader pushed a commit to cardano-scaling/cardano-blueprint that referenced this pull request Apr 20, 2026
Create description pages for the three new leios protocols as proposed in https://github.com/cardano-scaling/CIPs/blob/leios/CIP-0164/README.md and also currently discussed in cardano-foundation/CIPs#1167
Labels

  • State: Confirmed: Candidate with CIP number (new PR) or update under review.
  • Update: Adds content or significantly reworks an existing proposal.
