cbb330 commented Jan 26, 2026

Summary

  • Add comprehensive predicate pushdown support for ORC reader in C++ and Python
  • Implement OrcMultiStripeReader for efficient reading of filtered stripes
  • Add detailed design documentation for ORC predicate pushdown
  • Enhance Python bindings to expose stripe filtering capabilities
  • Add extensive test coverage for predicate pushdown functionality

Changes

  • C++ Core: New OrcMultiStripeReader class handles reading from multiple selected stripes after predicate filtering
  • Python Bindings: Expose stripe filtering through updated _orc.pyx interface
  • Dataset Integration: Enhanced file_orc.cc with predicate pushdown support
  • Documentation: Comprehensive design document explaining implementation approach
  • Testing: Added unit tests for stripe filtering and predicate evaluation

Test Plan

  • New C++ tests in file_orc_test.cc verify stripe filtering behavior
  • Python tests in test_orc.py validate end-to-end predicate pushdown
  • Standalone test program cpp/test_orc_pushdown.cc for integration testing

cbb330 and others added 3 commits January 24, 2026 13:40
Add `filters` parameter to ORC reader that enables stripe skipping
for integer columns using liborc's SearchArgument mechanism.

C++ changes:
- Add ExpressionToSearchArgument() to convert Arrow Expression to
  liborc SearchArgument
- Add Read() overload accepting compute::Expression filter
- Support comparison operators (=, !=, <, <=, >, >=), is_in, is_null,
  and logical operators (AND, OR, NOT)
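
For reviewers, a rough Python sketch of the kind of rewriting a conversion like ExpressionToSearchArgument() has to do. The tuple-based expression nodes and the `to_search_argument` name are made up for illustration and are not the PR's C++ code. The sketch also assumes the target builder only offers equals and less-than style leaves, so `>`, `>=` and `!=` are expressed as negations:

```python
# Toy expression nodes (hypothetical, for illustration only):
#   ("and" | "or", child, child, ...), ("not", child),
#   or a leaf such as ("<", "x", 5), ("is_in", "x", [1, 2]), ("is_null", "x").

# Leaves the assumed target can express directly.
DIRECT = {"=", "<", "<=", "is_in", "is_null"}
# Leaves rewritten as NOT(<direct leaf>).
NEGATED = {"!=": "=", ">": "<=", ">=": "<"}


def to_search_argument(expr):
    """Rewrite a toy expression so it only uses DIRECT leaves plus and/or/not."""
    head = expr[0]
    if head in ("and", "or"):
        return (head,) + tuple(to_search_argument(child) for child in expr[1:])
    if head == "not":
        return ("not", to_search_argument(expr[1]))
    if head in DIRECT:
        return expr
    if head in NEGATED:
        _, column, value = expr
        return ("not", (NEGATED[head], column, value))
    # Anything else cannot be pushed down; the caller would fall back to
    # reading every stripe and filtering rows afterwards.
    raise ValueError(f"unsupported operator: {head!r}")


converted = to_search_argument(("and", (">", "x", 5), ("!=", "y", 0)))
```

Raising on an unsupported operator keeps the sketch conservative: a filter that cannot be translated should disable stripe skipping rather than risk dropping rows.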

Python changes:
- Add filters parameter to ORCFile.read() and read_table()
- Support both DNF format and Arrow Expression filters
- Apply row-level filtering after stripe-level pushdown
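
To illustrate why row-level filtering still runs after stripe-level pushdown, here is a small pure-Python sketch. It does not use pyarrow: stripes are plain lists of dict rows, min/max are computed on the fly instead of being read from ORC statistics, and all function names are hypothetical. It assumes the DNF format follows the same convention as the `filters` argument in `pyarrow.parquet` (a flat list of tuples is one AND group, a list of lists is an OR of AND groups):

```python
import operator

OPS = {"=": operator.eq, "==": operator.eq, "!=": operator.ne,
       "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def normalize_dnf(filters):
    """Return filters as a list of AND groups (outer list is OR)."""
    if filters and isinstance(filters[0], tuple):
        return [list(filters)]
    return [list(group) for group in filters]


def range_may_match(lo, hi, op, val):
    """Can any value in [lo, hi] satisfy `value <op> val`?"""
    if op in ("=", "=="):
        return lo <= val <= hi
    if op == "!=":
        return not (lo == hi == val)
    if op == "<":
        return lo < val
    if op == "<=":
        return lo <= val
    if op == ">":
        return hi > val
    if op == ">=":
        return hi >= val
    return True  # unknown operator: never skip on a guess


def stripe_may_match(stripe, dnf):
    def leaf(col, op, val):
        values = [row[col] for row in stripe]
        return range_may_match(min(values), max(values), op, val)
    return any(all(leaf(*term) for term in group) for group in dnf)


def row_matches(row, dnf):
    return any(all(OPS[op](row[col], val) for col, op, val in group) for group in dnf)


def read_with_filters(stripes, filters):
    dnf = normalize_dnf(filters)
    kept = [s for s in stripes if stripe_may_match(s, dnf)]         # coarse: whole stripes
    rows = [r for s in kept for r in s if row_matches(r, dnf)]       # exact: individual rows
    return rows, len(kept)


stripes = [[{"id": i} for i in range(start, start + 100)] for start in (0, 100, 200)]
rows, stripes_read = read_with_filters(stripes, [("id", ">=", 150), ("id", "<", 160)])
```

Here only the middle stripe survives the min/max check, but it still holds 100 rows, of which 10 match. The second pass is what makes the result exact.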

Tests:
- Add TestORCPredicatePushdown with 11 tests covering correctness,
  safety, operators, null handling, and Expression API

Add comprehensive design document for ORC predicate pushdown feature:
- Architecture overview and data flow
- API design with examples
- Implementation plan across multiple PRs
- Technical decisions and rationale
- Type mapping and operator support tables
- Testing strategy
- Future considerations

This document serves as a reference for community review and tracks
progress across the phased implementation.

Add OrcMultiStripeReader class to handle reading from multiple selected stripes
after predicate pushdown filtering. Update Python bindings to expose stripe
filtering capabilities and enhance test coverage for ORC predicate pushdown.
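
As a mental model only, reading from a set of selected stripes can be sketched in Python as a generator. This is not the OrcMultiStripeReader implementation: stripes are plain lists, `read_selected_stripes` is a hypothetical name, and the sketch assumes batches never span a stripe boundary, which the PR may or may not do:

```python
def read_selected_stripes(stripes, selected, batch_size):
    """Yield (stripe_index, batch) for each selected stripe, in file order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # De-duplicate and sort so stripes are visited once, in on-disk order.
    for index in sorted(set(selected)):
        rows = stripes[index]
        for start in range(0, len(rows), batch_size):
            # The last batch of a stripe may be shorter than batch_size.
            yield index, rows[start:start + batch_size]


stripes = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12]]
batches = list(read_selected_stripes(stripes, selected=[2, 0], batch_size=2))
```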

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

cbb330 changed the title from "[C++][Python] Add ORC predicate pushdown support" to "GH-48986: [C++][Python] Add ORC predicate pushdown support" Jan 26, 2026

@github-actions

⚠️ GitHub issue #48986 has been automatically assigned in GitHub to PR creator.

cbb330 and others added 2 commits January 26, 2026 15:39
Fix critical lifetime management issues in OrcMultiStripeReader where the
underlying liborc::Reader could be destroyed while stripe readers were still
in use. The reader now holds a shared_ptr to the liborc::Reader, ensuring
proper lifetime management and preventing use-after-free errors.

Changes:
- OrcMultiStripeReader now holds shared_ptr to liborc::Reader
- Add lifetime management documentation
- Ensure reader remains valid for all stripe operations
- Fix memory safety issues in multi-stripe predicate pushdown

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Change GetStripeStatsAsExpression to return std::optional<Expression> instead
of Result<Expression> for better handling of missing statistics. Return nullopt
when statistics are unavailable rather than returning literal(true) or propagating
errors, allowing stripe filtering to gracefully skip fields without statistics.
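
The intent of the optional return can be sketched in Python, with `None` standing in for `std::nullopt`. The dict-based statistics layout and both function names are assumptions made for this example, not the C++ API:

```python
from typing import Optional, Tuple


def stats_as_range(stripe_stats: dict, column: str) -> Optional[Tuple[int, int]]:
    """Return (min, max) for a column, or None when statistics are missing."""
    entry = stripe_stats.get(column)
    if not entry or entry.get("min") is None or entry.get("max") is None:
        return None
    return entry["min"], entry["max"]


def can_skip_stripe(stripe_stats: dict, column: str, lower_bound: int) -> bool:
    """True only if statistics prove that no row satisfies `column > lower_bound`."""
    value_range = stats_as_range(stripe_stats, column)
    if value_range is None:
        # No statistics means no proof, so the stripe must be read.
        return False
    _, hi = value_range
    return hi <= lower_bound


with_stats = {"id": {"min": 1, "max": 9}}
without_stats = {}
decisions = (can_skip_stripe(with_stats, "id", 9), can_skip_stripe(without_stats, "id", 9))
```

Missing statistics are an ordinary case here rather than an error: the caller simply gets no constraint for that field and keeps the stripe.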

Add comprehensive dataset API integration tests covering stripe skipping with
filters, column projection, batch size handling, and edge cases.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>