Skip to content

ORC Predicate Pushdown #48986

@cbb330

Description

@cbb330

Describe the enhancement requested

Arrow's ORC reader already supports column projection (reading only selected columns), but lacks row-level predicate pushdown. Currently, filtering rows from ORC files requires:

  1. Reading all rows from selected columns (all stripes)
  2. Applying filters post-read using Arrow compute

This is inefficient for large ORC files where only a small subset of rows match the filter criteria. ORC files store min/max statistics at the stripe level, which can be used to skip entire stripes that cannot contain matching rows—avoiding I/O for data that will be filtered out anyway.

Use Cases

  1. Data Lake Queries: Efficiently query large ORC datasets with selective predicates
  2. PyIceberg Integration: Enable predicate pushdown for Iceberg tables stored in ORC format
  3. Parity with Parquet: Match the filtering capabilities already available for Parquet files

Component(s)

Python

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions