-
Notifications
You must be signed in to change notification settings - Fork 4k
Open
Description
Describe the enhancement requested
Arrow's ORC reader already supports column projection (reading only selected columns), but lacks row-level predicate pushdown. Currently, filtering rows from ORC files requires:
- Reading all rows from selected columns (all stripes)
- Applying filters post-read using Arrow compute
This is inefficient for large ORC files where only a small subset of rows match the filter criteria. ORC files store min/max statistics at the stripe level, which can be used to skip entire stripes that cannot contain matching rows—avoiding I/O for data that will be filtered out anyway.
Use Cases
- Data Lake Queries: Efficiently query large ORC datasets with selective predicates
- PyIceberg Integration: Enable predicate pushdown for Iceberg tables stored in ORC format
- Parity with Parquet: Match the filtering capabilities already available for Parquet files
Component(s)
Python