GH-48986: [C++][Python] Add ORC predicate pushdown support #48984
base: main
Add `filters` parameter to ORC reader that enables stripe skipping for integer columns using liborc's SearchArgument mechanism.

C++ changes:
- Add ExpressionToSearchArgument() to convert Arrow Expression to liborc SearchArgument
- Add Read() overload accepting compute::Expression filter
- Support comparison operators (=, !=, <, <=, >, >=), is_in, is_null, and logical operators (AND, OR, NOT)

Python changes:
- Add filters parameter to ORCFile.read() and read_table()
- Support both DNF format and Arrow Expression filters
- Apply row-level filtering after stripe-level pushdown

Tests:
- Add TestORCPredicatePushdown with 11 tests covering correctness, safety, operators, null handling, and Expression API
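To illustrate the stripe-skipping idea described in this commit, here is a minimal pure-Python sketch. It does not call liborc or pyarrow; `StripeStats`, `may_match`, and `select_stripes` are hypothetical names invented for illustration, and the filter is assumed to be given in DNF as a list of AND-groups, each a list of `(column, op, value)` tuples. A stripe is dropped only when its min/max statistics prove that no row can match; anything uncertain is kept so that row-level filtering can decide afterwards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StripeStats:
    """Hypothetical per-stripe min/max statistics for one integer column."""
    min_value: Optional[int]
    max_value: Optional[int]


def may_match(stats, op, value):
    """Return False only when the stripe provably contains no matching row."""
    if stats.min_value is None or stats.max_value is None:
        return True  # no statistics: the stripe must be read
    lo, hi = stats.min_value, stats.max_value
    if op == "=":
        return lo <= value <= hi
    if op == "!=":
        return not (lo == hi == value)  # skip only if every value equals `value`
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    if op == "in":
        return any(lo <= v <= hi for v in value)
    return True  # unknown operator: stay conservative


def select_stripes(stripes, dnf):
    """Return indices of stripes that may contain rows matching the DNF filter.

    `stripes` is a list of {column_name: StripeStats} dicts (assumed layout).
    """
    selected = []
    for index, stripe in enumerate(stripes):
        keep = any(
            all(
                may_match(stripe[col], op, val) if col in stripe else True
                for col, op, val in group
            )
            for group in dnf
        )
        if keep:
            selected.append(index)
    return selected


# Example data: three stripes, the last one without statistics.
stripes = [
    {"id": StripeStats(0, 99)},
    {"id": StripeStats(100, 199)},
    {"id": StripeStats(None, None)},
]
```

With this data, a filter such as `[[("id", ">=", 150)]]` rules out the first stripe, while the stripe without statistics is always retained. The selected stripes would still need the row-level filter applied, since min/max statistics can only exclude stripes, not confirm matches.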
Add comprehensive design document for ORC predicate pushdown feature:
- Architecture overview and data flow
- API design with examples
- Implementation plan across multiple PRs
- Technical decisions and rationale
- Type mapping and operator support tables
- Testing strategy
- Future considerations

This document serves as a reference for community review and tracks progress across the phased implementation.
Add OrcMultiStripeReader class to handle reading from multiple selected stripes after predicate pushdown filtering. Update Python bindings to expose stripe filtering capabilities and enhance test coverage for ORC predicate pushdown.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also:
Fix critical lifetime management issues in OrcMultiStripeReader where the underlying liborc::Reader could be destroyed while stripe readers were still in use. The reader now holds a shared_ptr to the liborc::Reader, ensuring proper lifetime management and preventing use-after-free errors.

Changes:
- OrcMultiStripeReader now holds shared_ptr to liborc::Reader
- Add lifetime management documentation
- Ensure reader remains valid for all stripe operations
- Fix memory safety issues in multi-stripe predicate pushdown

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Change GetStripeStatsAsExpression to return std::optional<Expression> instead of Result<Expression> for better handling of missing statistics. Return nullopt when statistics are unavailable rather than returning literal(true) or propagating errors, allowing stripe filtering to gracefully skip fields without statistics.

Add comprehensive dataset API integration tests covering stripe skipping with filters, column projection, batch size handling, and edge cases.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
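The "return nothing when statistics are missing" pattern from this commit can be sketched in Python with `Optional`, as a rough analogue of `std::optional`. This is not the C++ implementation: `stripe_bounds`, `can_skip_stripe`, and the `{"min": ..., "max": ...}` statistics layout are hypothetical, and the filter is assumed to be the single predicate `field >= lower_bound`.

```python
from typing import Optional, Tuple


def stripe_bounds(raw_stats: dict, field: str) -> Optional[Tuple[int, int]]:
    """Return (min, max) for `field`, or None when statistics are unavailable.

    Returning None (instead of a permissive placeholder or an error) lets the
    caller tell "no information" apart from "statistics say anything matches".
    """
    entry = raw_stats.get(field)
    if entry is None or entry.get("min") is None or entry.get("max") is None:
        return None
    return (entry["min"], entry["max"])


def can_skip_stripe(raw_stats: dict, field: str, lower_bound: int) -> bool:
    """Decide whether a stripe can be skipped for the filter field >= lower_bound."""
    bounds = stripe_bounds(raw_stats, field)
    if bounds is None:
        return False  # no statistics for this field: never skip on its account
    _, max_value = bounds
    return max_value < lower_bound  # every value is below the bound
```

The design point is that a missing statistic is a normal, expected case rather than a failure, so the caller handles it with a plain branch instead of error propagation.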
Summary
Changes
Test Plan