Standardizes the RecordExtractor output type contract across every input format#18378
Standardizes the RecordExtractor output type contract across every input format#18378Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Conversation
dfff63f to
36f13f3
Compare
Codecov Report✅ All modified and coverable lines are covered by tests.
Additional details and impacted files@@ Coverage Diff @@
## master #18378 +/- ##
============================================
- Coverage 63.45% 55.38% -8.07%
+ Complexity 1683 843 -840
============================================
Files 3253 2546 -707
Lines 198884 147413 -51471
Branches 30798 23738 -7060
============================================
- Hits 126196 81650 -44546
+ Misses 62620 58758 -3862
+ Partials 10068 7005 -3063
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
7411b40 to
4e21a72
Compare
There was a problem hiding this comment.
Pull request overview
Standardizes the RecordExtractor output type contract across Pinot input formats so ingestion transforms and downstream type coercion see a uniform Java type matrix independent of source format.
Changes:
- Documents a unified
RecordExtractortype contract and updatesBaseRecordExtractorconversion behavior (notably wideningByte/Short→Integerand clarifying nullability semantics). - Refactors ORC ingestion to use a dedicated
ORCRecordExtractor(moving per-row extraction logic out ofORCRecordReader). - Rewrites/introduces per-format extractor tests (Avro/JSON/CSV/ProtoBuf/Thrift/ORC/Parquet/CLP-Log) to validate contract adherence; removes old shared test base classes.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/RecordExtractor.java | Adds/expands the documented cross-format output-type contract and clarifies conversion expectations. |
| pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java | Updates base conversion semantics and documents default dispatch behavior. |
| pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/BaseRecordExtractorTest.java | Reworks tests to assert the updated base contract (multi-value, map recursion, nested-record default behavior, etc.). |
| pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordReaderTest.java | Minor cleanup (uses imported Schema type instead of FQCN). |
| pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordExtractorTest.java | Deletes unused shared extractor test base. |
| pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java | Delegates row extraction to ORCRecordExtractor; keeps file IO + include filtering. |
| pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractor.java | New ORC per-row extractor implementing the standardized contract (primitives, maps, structs, timestamps, decimals). |
| pinot-plugins/pinot-input-format/pinot-orc/src/test/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractorTest.java | New in-memory vectorized tests for ORC extraction behavior (no file IO). |
| pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractorTest.java | Rewrites Thrift tests to be type-driven and contract-focused (including nested structs, lists/sets/maps, optionals). |
| pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorTest.java | Rewrites ProtoBuf tests using dynamic descriptors/messages for isolated, type-driven verification. |
| pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorLowLevelTest.java | Removes older low-level test suite in favor of the new contract-based tests. |
| pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractorTest.java | New Parquet-Group-based tests for primitive/logical/complex mappings (LIST/MAP/struct/INT96). |
| pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractorTest.java | New focused test for Parquet-Avro INT96 override behavior. |
| pinot-plugins/pinot-input-format/pinot-json/src/test/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractorTest.java | Rewrites JSON tests to validate the realistic Jackson type matrix and recursive conversions. |
| pinot-plugins/pinot-input-format/pinot-csv/src/test/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractorTest.java | Rewrites CSV tests around CSV’s untyped nature + MV delimiter behavior (string/empty/null semantics). |
| pinot-plugins/pinot-input-format/pinot-clp-log/src/test/java/org/apache/pinot/plugin/inputformat/clplog/CLPLogRecordExtractorTest.java | Updates CLP-Log tests for encoding config parsing + topic-name preservation with cleaner parameterization. |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractorTest.java | Rewrites Avro tests to cover primitives, bytes/fixed, enums, arrays/maps/nested records, nullable unions, and logical types. |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractorComplexTypesTest.java | Removes separate complex-type IO-based test class (coverage moved into the new type-driven tests). |
0938582 to
d5ea0c2
Compare
There was a problem hiding this comment.
Pull request overview
This PR standardizes the RecordExtractor output-type contract across Pinot input formats so ingestion transforms see a consistent Java type matrix regardless of source format (including modern date/time handling).
Changes:
- Documents and enforces a unified
RecordExtractoroutput contract, with shared conversions implemented inBaseRecordExtractor(numeric widening, bytes materialization,java.timebridging). - Adds
DATE(LocalDate) andTIME(LocalTime) toPinotDataType, plus corresponding tests and type inference updates. - Refactors and aligns format-specific extractors/readers (notably ORC) and rewrites per-format extractor tests to be type-driven and self-contained.
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/RecordExtractor.java | Defines/documentes the cross-format output type matrix and makes init config nullable. |
| pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java | Implements shared conversions (small-int widening, BigInteger→BigDecimal, ByteBuffer→byte[], Temporal→Timestamp, map key dropping rules). |
| pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/BaseRecordExtractorTest.java | Adds contract-level tests for single/multi/map/temporal conversion behavior. |
| pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordReaderTest.java | Simplifies schema references (uses imported Schema). |
| pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordExtractorTest.java | Removes obsolete shared extractor test base. |
| pinot-common/src/main/java/org/apache/pinot/common/utils/PinotDataType.java | Adds DATE/TIME types and updates class-to-type inference. |
| pinot-common/src/test/java/org/apache/pinot/common/utils/PinotDataTypeTest.java | Adds unit tests for new DATE/TIME conversions and inference. |
| pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractor.java | Aligns Parquet logical types to the contract (LocalDate/LocalTime/Timestamp, INT96→Timestamp). |
| pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractor.java | Updates deprecated INT96 handling to return Timestamp via the native extractor helper. |
| pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractorTest.java | New in-memory tests for Parquet primitive/logical/list/map/struct extraction. |
| pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractorTest.java | New focused tests for parquet-avro INT96 override behavior. |
| pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetCollectionAndEquivalenceTest.java | Updates fixture expectations for date/timestamp outputs to match the new contract. |
| pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java | Delegates row extraction to a dedicated ORCRecordExtractor and updates supported type checks. |
| pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractor.java | New ORC-specific extractor enforcing the contract (Boolean/BigDecimal/LocalDate/Timestamp, complex types). |
| pinot-plugins/pinot-input-format/pinot-orc/src/test/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractorTest.java | Replaces IO-based tests with direct vectorized batch extractor tests. |
| pinot-plugins/pinot-input-format/pinot-protobuf/src/main/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractor.java | Simplifies convertSingleValue to only special-case ByteString, delegates other conversions to base; aligns map key dropping semantics. |
| pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorTest.java | Rewrites tests to be descriptor-driven and type-focused without temp files. |
| pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorLowLevelTest.java | Removes redundant low-level protobuf extractor tests. |
| pinot-plugins/pinot-input-format/pinot-thrift/src/main/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractor.java | Updates extractor documentation to match the contract (incl. small-int widening). |
| pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractorTest.java | Rewrites tests to validate the contract directly via extractor output. |
| pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordReaderTest.java | Updates reader tests for widened small ints and contract-aligned equality. |
| pinot-plugins/pinot-input-format/pinot-json/src/main/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractor.java | Documents JSON-to-Java type behavior and relies on base conversions for widening. |
| pinot-plugins/pinot-input-format/pinot-json/src/test/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractorTest.java | Rewrites tests to be contract-driven without JSON file IO. |
| pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractor.java | Documents CSV extractor semantics (untyped strings, empty→null, optional MV splitting). |
| pinot-plugins/pinot-input-format/pinot-csv/src/test/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractorTest.java | Rewrites tests to focus on extractor behavior using in-memory parsing. |
| pinot-plugins/pinot-input-format/pinot-clp-log/src/test/java/org/apache/pinot/plugin/inputformat/clplog/CLPLogRecordExtractorTest.java | Updates CLP log extractor tests to align with contract/expectations and reduces setup. |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractor.java | Updates documentation and removes now-redundant Instant→Timestamp handling (delegated to base). |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractorComplexTypesTest.java | Removes legacy complex-type tests superseded by the new test approach. |
d5ea0c2 to
15af8f5
Compare
Each per-format `*RecordExtractorTest` is rewritten to follow a self-contained, type-driven pattern: one `@Test` per format-native input type (ordered to match the format's type list), no inheritance from `AbstractRecordExtractorTest`, and Markdown Javadoc using `///` and `[X]` references. Tests cover only what is format-specific; cross-format common behavior (`init` filtering, primitive preservation) is covered once in `BaseRecordExtractorTest`. - New `ORCRecordExtractor` extracted from `ORCRecordReader`: a thin `Record` wrapper holding `VectorizedRowBatch + TypeDescription + rowId`, null/repeating handled upfront. ORC `BOOLEAN` now surfaces as `Boolean` (was `String`), `DECIMAL` as `BigDecimal` (was `String`), and `TIMESTAMP` / `TIMESTAMP_INSTANT` combine seconds + nanos into epoch-millis `Long` (was dropping sub-second precision). - `BaseRecordExtractor.convertSingleValue` now widens `Byte` / `Short` to `Integer` so all small ints unify behind a single Pinot type. - Comprehensive Markdown Javadoc on `RecordExtractor` documenting the type contract (`Boolean` / `Integer` / `Long` / `Float` / `Double` / `BigDecimal` / `String` / `byte[]` / `Object[]` / `Map`). - `convert` and `convertSingleValue` annotated `@Nullable`: a non-null Java input may still represent a logical null (format-native sentinel) and format-specific overrides are free to translate to `null`. - New `ParquetNativeRecordExtractorTest` and `ParquetAvroRecordExtractorTest` using in-memory `Group` / `GenericRecord` inputs (no temp files). - Drop `AbstractRecordExtractorTest` (no remaining users; the no-include-list path is exercised by every per-format test).
15af8f5 to
93987b8
Compare
There was a problem hiding this comment.
Pull request overview
This PR standardizes the RecordExtractor output type contract across Pinot input formats by centralizing more conversion behavior in BaseRecordExtractor, aligning format-specific extractors (notably ORC/Parquet/ProtoBuf/Thrift/JSON/CSV) to the same Java type matrix, and updating tests accordingly.
Changes:
- Expanded
BaseRecordExtractor.convertSingleValue()to unify small-int widening,BigInteger → BigDecimal, and modernjava.timehandling (LocalDate/LocalTimepassthrough;Instant/OffsetDateTime/ZonedDateTime/LocalDateTime → Timestamp). - Added
PinotDataType.DATEandPinotDataType.TIMEplus tests, and updatedRecordExtractordocumentation to formally define the contract. - Refactored ORC ingestion to use a new
ORCRecordExtractor(separating extraction from file IO), and rewrote per-format extractor tests to be type-driven/self-contained (removing legacy shared test bases).
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/BaseRecordExtractorTest.java | Rewritten to cover the updated base conversion/type contract (including temporal + shape-preservation cases). |
| pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordReaderTest.java | Cleans up schema type usage (imports Schema directly). |
| pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordExtractorTest.java | Deleted legacy shared extractor test base (no remaining users). |
| pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/RecordExtractor.java | Documents the standardized output-type contract; makes init config nullable; convert return nullable. |
| pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java | Centralizes widened numeric + modern java.time conversions; improves map key null handling; better default record error. |
| pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordReaderTest.java | Updates expectations for small-int widening and multi-value extraction shape. |
| pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractorTest.java | Rewritten as direct extractor tests with explicit type-matrix coverage (including nested/map/list/set). |
| pinot-plugins/pinot-input-format/pinot-thrift/src/main/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractor.java | Updates class-level contract documentation to match standardized outputs. |
| pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorTest.java | Rewritten to use dynamic descriptors/messages; covers presence semantics + map/repeated/nested cases. |
| pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorLowLevelTest.java | Deleted older low-level tests in favor of the new direct/type-driven suite. |
| pinot-plugins/pinot-input-format/pinot-protobuf/src/main/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractor.java | Aligns init signature; map-key null handling; delegates almost all single-value conversion to base. |
| pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractorTest.java | New in-memory Group tests covering Parquet primitives, logical types, list/map/struct, and timestamp rounding. |
| pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetCollectionAndEquivalenceTest.java | Updates fixture assertions for LocalDate/Timestamp outputs and canonicalization docs. |
| pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractorTest.java | New tests for parquet-avro INT96 override producing Timestamp. |
| pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractor.java | Implements Parquet logical-type conversions to LocalDate/LocalTime/Timestamp; preserves sub-ms nanos. |
| pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractor.java | Narrows scope to INT96 handling and aligns INT96 conversion output to Timestamp. |
| pinot-plugins/pinot-input-format/pinot-orc/src/test/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractorTest.java | Rewritten to be in-memory VectorizedRowBatch tests directly against the new ORC extractor. |
| pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java | Refactored to delegate per-row extraction to ORCRecordExtractor (reader now focuses on IO/filtering). |
| pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractor.java | New extractor implementing ORC category → standardized Java output types (incl. timestamps with nanos). |
| pinot-plugins/pinot-input-format/pinot-json/src/test/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractorTest.java | Rewritten to direct extractor tests emphasizing standardized type outputs and shape preservation. |
| pinot-plugins/pinot-input-format/pinot-json/src/main/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractor.java | Adds contract documentation for Jackson’s parsed JSON shapes and numeric behaviors. |
| pinot-plugins/pinot-input-format/pinot-csv/src/test/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractorTest.java | Rewritten to extractor-only tests (no file IO), focusing on cell handling + delimiter semantics. |
| pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractor.java | Updates contract documentation and init signature for nullable fields; behavior unchanged. |
| pinot-plugins/pinot-input-format/pinot-clp-log/src/test/java/org/apache/pinot/plugin/inputformat/clplog/CLPLogRecordExtractorTest.java | Simplifies tests and updates assertions consistent with the standardized conversion behavior. |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractorComplexTypesTest.java | Deleted legacy complex-type test suite (replaced by newer format-specific coverage elsewhere). |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractor.java | Updates Avro contract docs; relies on base for temporal conversion; keeps Avro fixed → byte[]. |
| pinot-common/src/test/java/org/apache/pinot/common/utils/PinotDataTypeTest.java | Adds DATE/TIME conversion tests and updates getSingleValueType coverage. |
| pinot-common/src/main/java/org/apache/pinot/common/utils/PinotDataType.java | Adds DATE and TIME PinotDataTypes and updates class-to-type detection. |
|
|
||
| @Override | ||
| public void init(Set<String> fields, RecordExtractorConfig recordExtractorConfig) { | ||
| public void init(@Nullable Set<String> fields, RecordExtractorConfig recordExtractorConfig) { |
| * gets content-based equality, and {@code Date}/{@code Timestamp}/{@code LocalDate}/{@code Instant} all | ||
| * collapse to the underlying raw numeric form the native reader returns. |
Summary
Standardizes the
RecordExtractoroutput type contract across every input format, including the date/time matrix. Every per-format extractor now produces the same Java type matrix, so downstream ingestion transforms see a uniform shape regardless of source.The contract (now documented on
RecordExtractor):Boolean(not stringified)Byte/Short/int8/int16/int32)Integer(small ints widen)LongFloatDoubleBigDecimal,BigInteger)BigDecimal(not stringified, not lossily converted toDouble;BigIntegerwidens since Pinot has noBigIntegertype)StringByteBuffer,ByteString,Binary, ORCBytesColumnVector, …)byte[]java.time.LocalDate(TZ-independent)java.time.LocalTime(nanosecond precision, TZ-independent)java.sql.Timestamp(sub-millisecond nanos preserved)Collection, array exceptbyte[])Object[](input shape preserved — null elements staynull, empty inputs round-trip asObject[0])Map<Object, Object>nullUtf8, ProtoBufEnumValueDescriptor, ThriftTEnum, …)value.toString()Base contract additions (
BaseRecordExtractor.convertSingleValue)The base now recognizes the full modern
java.timefamily so format-specific extractors don't each have to bridge them:LocalDate/LocalTime/Timestamppass through (preserved).Instant/OffsetDateTime/ZonedDateTimebridge toTimestampviaTimestamp.from(instant)— lossless (sub-second nanos preserved).LocalDateTimebridges toTimestampinterpreting the wall-clock as UTC (matches Icebergtimestamp, ParquetisAdjustedToUTC=false, SnowflakeTIMESTAMP_NTZ).Temporalfamily is gated behind a singleinstanceof Temporalcheck so non-temporal values (the common case) skip 5 instanceof checks on the hot path.BigIntegerwidens toBigDecimal(Pinot has noBigIntegerdata type;BigDecimalpreserves arbitrary precision and downstream transforms handle it natively). This is reachable in practice from JSON parsers (Jackson producesBigIntegerfor integer literals overflowingLong).PinotDataTypeadditionsDATE(wrapsLocalDate) andTIME(wrapsLocalTime) are added alongsideTIMESTAMP. Recognized viagetSingleValueType(LocalDate.class)/getSingleValueType(LocalTime.class). Each handlesconvert(value, sourceType)forINTEGER/LONG/STRING/JSON/TIMESTAMPsources.Production fixes to bring formats in line with the contract:
BOOLEAN→Boolean(wasString).DECIMAL→BigDecimal(wasString).DATE→LocalDateviaLocalDate.ofEpochDay(days)(wasLongdays).TIMESTAMP/TIMESTAMP_INSTANT→Timestampwith full sub-second nanos viasetNanos(wasLongepoch millis dropping sub-second precision).LISTpreserves input shape — null elements staynull, empty lists round-trip asObject[0](was filtering nulls and surfacing empty / all-null lists asnull, making "empty" and "null" indistinguishable downstream).LocalDate/LocalTimepass through (handled by base).INT32withDATEannotation →LocalDate(wasIntegerdays).INT32withTIME_MILLIS/INT64withTIME_MICROS/TIME_NANOS→LocalTime(was raw int / long).INT96andINT64withTIMESTAMP_MILLIS/TIMESTAMP_MICROS/TIMESTAMP_NANOS→Timestamp(was epoch millisLong); sub-millisecond nanos preserved.Byte/Short→Integer(small ints unify behind a single Pinot type) — lifted intoBaseRecordExtractor.convertSingleValueso all formats inherit the widening.convertSingleValuesimplified — onlyByteString → byte[]is genuinely format-specific; everything else delegates to the base.BaseRecordExtractor.convertMap(and the matching ProtoBuf override) rechecks the converted key fornullafterconvertSingleValueand drops the entry — format-specific overrides may translate format-native null sentinels into anullkey, whichHashMap.putwould otherwise accept as a single null-keyed entry.Architectural changes
ORCRecordExtractorextracted fromORCRecordReader(a thinRecordwrapper aroundVectorizedRowBatch + TypeDescription + rowId); ORC was the only format without a separate extractor class.convertandconvertSingleValueannotated@Nullable: a non-null Java input may still represent a logical null (format-native sentinel), and format-specific overrides are free to translate tonull. The base impl never returnsnull.RecordExtractor.init'srecordExtractorConfigparameter annotated@Nullable(formats that don't need config passnull).Tests
Every per-format
*RecordExtractorTestis rewritten to a self-contained, type-driven pattern:@Testper format-native input type, ordered to match the format's native type list (documented in each extractor's class Javadoc).AbstractRecordExtractorTest(deleted — no remaining users).///,[X]references, backtick code).initfiltering, primitive preservation,Temporalfamily handling, multi-value shape preservation, map null-key dropping) is covered once inBaseRecordExtractorTest.testListPreservesNullElementsandtestEmptyListExtractedAsEmptyArrayadded at the format level for inputs that can carry the shape natively (ORC, JSON, Avro, Parquet, Thrift).ParquetNativeRecordExtractorTestandParquetAvroRecordExtractorTestwith in-memoryGroup/GenericRecordinputs (no temp files).Test correctness fixes
ThriftRecordReaderTest.testPrimitiveTypePreservationwas assertingShort.classfori16elements ("stay Short — not promoted to Integer") — directly contradicting the new contract. Rewritten to expectIntegerper the widening.