
feat(parquet): dictionary fallback heuristic based on compression efficiency#9700

Open
mzabaluev wants to merge 7 commits into apache:main from mzabaluev:parquet-dict-fallback-heuristic

Conversation

@mzabaluev
Contributor

Which issue does this PR close?

What changes are included in this PR?

Added a ColumnProperties option, dictionary_fallback, which takes a DictionaryFallback enum value.
Two behavior variants are provided (initially, the enum is non-exhaustive to allow more to be added later if necessary):

  • OnPageSizeLimit - the prior behavior and the default; triggers fallback when the dictionary page size limit is exceeded.
  • OnUnfavorableCompression - a new behavior; performs the page size limit check and additionally falls back when the dictionary-encoded size is not smaller than the plain-encoded data size.

Implemented the new optional behavior in the encoder.

Are these changes tested?

Added new tests exercising the OnUnfavorableCompression behavior.
The existing tests exercise OnPageSizeLimit.

Are there any user-facing changes?

Added API in parquet:

  • The DictionaryFallback enum
  • ColumnProperties::dictionary_fallback, ColumnProperties::set_dictionary_fallback
  • WriterPropertiesBuilder::set_dictionary_fallback, WriterPropertiesBuilder::set_column_dictionary_fallback
  • WriterProperties::dictionary_fallback

In parquet, add `dictionary_fallback` option to `ColumnProperties`.
Its value type is defined as the `DictionaryFallback` enum with
the variants `OnDataPageSize` (the previous behavior, the default)
and `OnUnfavorableCompression`. The latter replicates the behavior of
parquet-java, which falls back to non-dictionary encoding when the
estimated compressed size of the data page with dictionary encoding is
not smaller than the estimated size of the data page with
plain encoding.
The method as it is used now does not need to return the tuple of
base size and number of elements; instead, it just computes the encoded
size as appropriate for the type.
@github-actions github-actions bot added the parquet (Changes to the parquet crate) label on Apr 13, 2026
It's not really necessary because the counter is only used while
dictionary encoding is enabled, but it's good to safeguard against
future refactoring.
@etseidl
Copy link
Copy Markdown
Contributor

etseidl commented Apr 13, 2026

Thanks @mzabaluev, I was unaware of this parquet-java behavior. I wonder, however, if even the parquet-java version needs to be updated.

I looked at the java code, and it has been around for quite some time (it was added in late 2014). At that time, I believe the default page size was on the order of a megabyte, so using this heuristic after a single page was probably not a bad idea. However, when the page indexes were added, parquet-java was modified to limit pages to 20,000 rows by default (this crate adopted the 20k default quite some time later). IMO, 20,000 values is too small a sample to decide if a dictionary is having a beneficial effect. Say one has a relatively low-cardinality (32k) i64 column with a somewhat random distribution. After encoding one 20k-row page, I think the heuristic here will almost certainly choose plain over dictionary, but if one were to encode 10 pages, dictionary would then be seen to be superior by far.

I like that this is opt-in, but then wonder if a user knows this heuristic will be helpful (i.e. they know it's a high cardinality column), could they not instead simply disable dictionary encoding for the column in question.

(std::mem::size_of::<u32>(), self.len())
fn dict_encoding_size(&self) -> usize {
    4 + self.len()
}
Contributor
Use size_of perhaps?

Contributor Author

It's a u32, so it will never be different from 4.

Contributor

Yes, so you should use size_of, which makes it clear why the value is 4.

Contributor

To be fair, there's magic all over this module. But I agree with @EmilyMatt there's no need to add more.

}

fn dict_encoding_size(&self) -> usize {
    12
Contributor

Magic?

Contributor Author

This is a type that encodes to 96 bits. The compound Rust type is furthermore not declared with a repr that nails down its in-memory size (the compiler might decide to align such small arrays to 16 bytes one day and jack up the size accordingly, for example) so I'd argue using size_of would not be squeaky-clean here.

Contributor

@EmilyMatt EmilyMatt Apr 14, 2026

I understand, so a const would be enough, just not a magic number.

Contributor Author

OK, the array alignment is nailed down in the spec, but Int96 wrapping the array is a struct with a Rust repr, so... 🤷 Could be a case for declaring the struct with repr(packed), but I'd rather argue we use explicit numbers here because the encoding is not a straight memory copy.

Contributor Author

Good suggestion about the const, I've added Int96::SIZE_IN_BYTES for this purpose.

Ok(values_read)
}

fn dict_encoding_size(&self) -> usize {
Contributor

Magic?

Contributor Author

Hmm, it's not only that using size_of would be dubious, the encoding actually uses one bit per value, so this method is a leaky abstraction. I'll see about reworking it.

Contributor Author

The dictionary encoding is never used on bool types, so this just needs some clarifying comments.

Contributor Author

... or, better, a panic because this method should never be called for BoolType: the column encoder does a check to fall back to plain/RLE in this case.

@mzabaluev
Contributor Author

Good points @etseidl. Our motivation for adding this is that in some cases, e.g. with high cardinality, the Rust parquet writer produces much larger encoded Parquet files than the Spark workloads we're aiming to replace. So a default option enabling a heuristic akin to the one hardcoded into parquet-java would get us on par (or maybe better, because this implementation may choose to fall back at any page chunk in the dataframe).

@etseidl
Contributor

etseidl commented Apr 14, 2026

I wonder if this could be adapted to take a larger sample before deciding. Maybe raise the number of values per page if enabled.

In dict_encoded_size, use size_of and symbolic constants instead of
hardcoded values. For the boolean type, backstop with a panic
since any returned value would be bogus for this method's contract
(as plain encoding for BOOLEAN is bit-packed).
The dictionary encoding is never used for boolean values.
@mzabaluev
Contributor Author

I wonder if this could be adapted to take a larger sample before deciding. Maybe raise the number of values per page if enabled.

Can an educated programmer achieve this with other existing properties? There may be adverse effects to tweaking other defaults depending on this one. If you mean data_page_row_count_limit, that is a writer property, so I'd be hesitant to auto-adjust it based on any column properties.

I'd rather have the option to choose between:

  1. The default simple behavior, i.e. OnPageSizeLimit;
  2. The Java-like heuristic to closely follow Spark and other systems using parquet-java (however, a bit-for-bit workalike would require a larger rework of the logic and is perhaps not desirable);
  3. Better adaptive behaviors, if and when implemented.

For 3, I have purposefully left the DictionaryFallback enum open-ended.

Contributor

@etseidl etseidl left a comment

I sympathize with wanting to be a bit smarter about when to give up on dictionary encoding. I would, however, like to see something a bit more defensible before proceeding with this change. For example, I'd like to see examples where this heuristic outperforms the current defaults by more than 10%, say, and also outperforms disabling dictionary encoding altogether (something which is already an opt-in option, as this new heuristic would be).

/// that would be written without a dictionary.
/// If there is no dictionary, or the data size statistic is not available,
/// returns `None`.
fn uncompressed_data_size(&self) -> Option<usize>;
Contributor

To avoid confusion I'd prefer this be named plain_encoded_data_size or some such. To me, compression is performed on the page after encoding.

}

#[test]
fn test_dict_page_size_decided_by_compression_fallback() {
Contributor

@etseidl etseidl Apr 14, 2026

As a test, I saved the output from this and examined the sizing. Without the heuristic, the encoded size for col0 is 8658384 bytes (the default fallback mechanism kicked in after 7 pages). With the heuristic, col1 is 8391126 bytes, a savings of 3%.

I also modified the test to mod the index with 32767. In that instance, col1 was still 8391126 bytes, but col0 was only 2231581, nearly 4X smaller.

I know this is not entirely representative, but it does again point out the pitfalls of too simplistic an approach.

Edit: I did a test of spark with the latter file (32k cardinality). By default, it opts to fallback for all pages, so the file is even larger. If I modify the global parquet.page.row.count.limit to 132000, it then opts for dictionary encoding as it should.

/// that the writer should fall back to the plain encoding when at a certain point,
/// e.g. after encoding the first batch, the total size of unencoded data
/// is calculated as smaller than `(encodedSize + dictionarySize)`.
pub struct PlainDataSizeCounter {
Contributor

To my comment about sample size, perhaps this can be adapted to keep track of both the plain and dictionary encoded sizes, and then only make a decision after some critical number of rows or bytes have been processed.

Contributor Author

I can add a value governing this (as a minimum number of rows) directly into the OnUnfavorableCompression variant. Or perhaps add another property alongside it for future extensibility, since it's not clear that everyone would want the threshold in rows (there may later be a preference for a byte size threshold), and I don't want to introduce a nested open-ended property struct.

Contributor

Perhaps just count values rather than rows, to cover lists as well. I think as long as there are a sufficient number of samples with which to make a decision, this could be quite nice in the end.


Labels

parquet Changes to the parquet crate


Development

Successfully merging this pull request may close these issues.

Parquet dictionary fallback heuristics

3 participants