Skip to content

Standardizes the RecordExtractor output type contract across every input format#18378

Open
Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Jackie-Jiang:record-extractor-tests
Open

Standardizes the RecordExtractor output type contract across every input format#18378
Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Jackie-Jiang:record-extractor-tests

Conversation

@Jackie-Jiang
Copy link
Copy Markdown
Contributor

@Jackie-Jiang Jackie-Jiang commented Apr 29, 2026

Summary

Standardizes the RecordExtractor output type contract across every input format, including the date/time matrix. Every per-format extractor now produces the same Java type matrix, so downstream ingestion transforms see a uniform shape regardless of source.

The contract (now documented on RecordExtractor):

Logical value Java output type
Boolean Boolean (not stringified)
Small int (Byte / Short / int8 / int16 / int32) Integer (small ints widen)
Long Long
Float Float
Double Double
Decimal (BigDecimal, BigInteger) BigDecimal (not stringified, not lossily converted to Double; BigInteger widens since Pinot has no BigInteger type)
String String
Bytes (ByteBuffer, ByteString, Binary, ORC BytesColumnVector, …) byte[]
Date java.time.LocalDate (TZ-independent)
Time of day java.time.LocalTime (nanosecond precision, TZ-independent)
Timestamp / instant java.sql.Timestamp (sub-millisecond nanos preserved)
Multi-value (Collection, array except byte[]) Object[] (input shape preserved — null elements stay null, empty inputs round-trip as Object[0])
Map / nested record / struct Map<Object, Object>
Null / unset null
Anything else (Avro Utf8, ProtoBuf EnumValueDescriptor, Thrift TEnum, …) value.toString()

Base contract additions (BaseRecordExtractor.convertSingleValue)

The base now recognizes the full modern java.time family so format-specific extractors don't each have to bridge them:

  • LocalDate / LocalTime / Timestamp pass through (preserved).
  • Instant / OffsetDateTime / ZonedDateTime bridge to Timestamp via Timestamp.from(instant) — lossless (sub-second nanos preserved).
  • LocalDateTime bridges to Timestamp interpreting the wall-clock as UTC (matches Iceberg timestamp, Parquet isAdjustedToUTC=false, Snowflake TIMESTAMP_NTZ).
  • The Temporal family is gated behind a single instanceof Temporal check so non-temporal values (the common case) skip 5 instanceof checks on the hot path.

BigInteger widens to BigDecimal (Pinot has no BigInteger data type; BigDecimal preserves arbitrary precision and downstream transforms handle it natively). This is reachable in practice from JSON parsers (Jackson produces BigInteger for integer literals overflowing Long).

PinotDataType additions

DATE (wraps LocalDate) and TIME (wraps LocalTime) are added alongside TIMESTAMP. Recognized via getSingleValueType(LocalDate.class) / getSingleValueType(LocalTime.class). Each handles convert(value, sourceType) for INTEGER / LONG / STRING / JSON / TIMESTAMP sources.

Production fixes to bring formats in line with the contract:

  • ORC BOOLEANBoolean (was String).
  • ORC DECIMALBigDecimal (was String).
  • ORC DATELocalDate via LocalDate.ofEpochDay(days) (was Long days).
  • ORC TIMESTAMP / TIMESTAMP_INSTANTTimestamp with full sub-second nanos via setNanos (was Long epoch millis dropping sub-second precision).
  • ORC LIST preserves input shape — null elements stay null, empty lists round-trip as Object[0] (was filtering nulls and surfacing empty / all-null lists as null, making "empty" and "null" indistinguishable downstream).
  • Avro LocalDate / LocalTime pass through (handled by base).
  • Parquet INT32 with DATE annotation → LocalDate (was Integer days).
  • Parquet INT32 with TIME_MILLIS / INT64 with TIME_MICROS / TIME_NANOSLocalTime (was raw int / long).
  • Parquet INT96 and INT64 with TIMESTAMP_MILLIS / TIMESTAMP_MICROS / TIMESTAMP_NANOSTimestamp (was epoch millis Long); sub-millisecond nanos preserved.
  • Thrift / Avro / ProtoBuf Byte / ShortInteger (small ints unify behind a single Pinot type) — lifted into BaseRecordExtractor.convertSingleValue so all formats inherit the widening.
  • ProtoBuf convertSingleValue simplified — only ByteString → byte[] is genuinely format-specific; everything else delegates to the base.
  • BaseRecordExtractor.convertMap (and the matching ProtoBuf override) rechecks the converted key for null after convertSingleValue and drops the entry — format-specific overrides may translate format-native null sentinels into a null key, which HashMap.put would otherwise accept as a single null-keyed entry.

Architectural changes

  • New ORCRecordExtractor extracted from ORCRecordReader (a thin Record wrapper around VectorizedRowBatch + TypeDescription + rowId); ORC was the only format without a separate extractor class.
  • convert and convertSingleValue annotated @Nullable: a non-null Java input may still represent a logical null (format-native sentinel), and format-specific overrides are free to translate to null. The base impl never returns null.
  • RecordExtractor.init's recordExtractorConfig parameter annotated @Nullable (formats that don't need config pass null).

Tests

Every per-format *RecordExtractorTest is rewritten to a self-contained, type-driven pattern:

  • One @Test per format-native input type, ordered to match the format's native type list (documented in each extractor's class Javadoc).
  • Self-contained: no inheritance from AbstractRecordExtractorTest (deleted — no remaining users).
  • Markdown Javadoc throughout (///, [X] references, backtick code).
  • Each per-format test focuses on what is format-specific. Cross-format common behavior (init filtering, primitive preservation, Temporal family handling, multi-value shape preservation, map null-key dropping) is covered once in BaseRecordExtractorTest.
  • testListPreservesNullElements and testEmptyListExtractedAsEmptyArray added at the format level for inputs that can carry the shape natively (ORC, JSON, Avro, Parquet, Thrift).
  • New ParquetNativeRecordExtractorTest and ParquetAvroRecordExtractorTest with in-memory Group / GenericRecord inputs (no temp files).
  • Class-level Javadoc of every extractor now documents the source type → Java input → Java output matrix, replacing the previous one-liner.

Test correctness fixes

  • ThriftRecordReaderTest.testPrimitiveTypePreservation was asserting Short.class for i16 elements ("stay Short — not promoted to Integer") — directly contradicting the new contract. Rewritten to expect Integer per the widening.

@Jackie-Jiang Jackie-Jiang added bug Something is not working as expected ingestion Related to data ingestion pipeline testing Related to tests or test infrastructure labels Apr 29, 2026
@Jackie-Jiang Jackie-Jiang force-pushed the record-extractor-tests branch from dfff63f to 36f13f3 Compare April 29, 2026 20:17
@Jackie-Jiang Jackie-Jiang changed the title Refactor RecordExtractor tests and add ORCRecordExtractor Standardizes the RecordExtractor output type contract across every input format Apr 29, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 55.38%. Comparing base (6d85277) to head (4e21a72).

❗ There is a different number of reports uploaded between BASE (6d85277) and HEAD (4e21a72). Click for more details.

HEAD has 4 uploads less than BASE
Flag BASE (6d85277) HEAD (4e21a72)
java-21 5 4
unittests 2 1
temurin 5 4
unittests2 1 0
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18378      +/-   ##
============================================
- Coverage     63.45%   55.38%   -8.07%     
+ Complexity     1683      843     -840     
============================================
  Files          3253     2546     -707     
  Lines        198884   147413   -51471     
  Branches      30798    23738    -7060     
============================================
- Hits         126196    81650   -44546     
+ Misses        62620    58758    -3862     
+ Partials      10068     7005    -3063     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 55.38% <100.00%> (-8.07%) ⬇️
temurin 55.38% <100.00%> (-8.07%) ⬇️
unittests 55.38% <100.00%> (-8.07%) ⬇️
unittests1 55.38% <100.00%> (+<0.01%) ⬆️
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Jackie-Jiang Jackie-Jiang force-pushed the record-extractor-tests branch 2 times, most recently from 7411b40 to 4e21a72 Compare April 29, 2026 22:21
@Jackie-Jiang Jackie-Jiang added the needs-attention Used for sensitive changes - allows searching PRs post release to narrow down causes for regression. label Apr 29, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Standardizes the RecordExtractor output type contract across Pinot input formats so ingestion transforms and downstream type coercion see a uniform Java type matrix independent of source format.

Changes:

  • Documents a unified RecordExtractor type contract and updates BaseRecordExtractor conversion behavior (notably widening Byte/ShortInteger and clarifying nullability semantics).
  • Refactors ORC ingestion to use a dedicated ORCRecordExtractor (moving per-row extraction logic out of ORCRecordReader).
  • Rewrites/introduces per-format extractor tests (Avro/JSON/CSV/ProtoBuf/Thrift/ORC/Parquet/CLP-Log) to validate contract adherence; removes old shared test base classes.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/RecordExtractor.java Adds/expands the documented cross-format output-type contract and clarifies conversion expectations.
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java Updates base conversion semantics and documents default dispatch behavior.
pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/BaseRecordExtractorTest.java Reworks tests to assert the updated base contract (multi-value, map recursion, nested-record default behavior, etc.).
pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordReaderTest.java Minor cleanup (uses imported Schema type instead of FQCN).
pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordExtractorTest.java Deletes unused shared extractor test base.
pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java Delegates row extraction to ORCRecordExtractor; keeps file IO + include filtering.
pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractor.java New ORC per-row extractor implementing the standardized contract (primitives, maps, structs, timestamps, decimals).
pinot-plugins/pinot-input-format/pinot-orc/src/test/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractorTest.java New in-memory vectorized tests for ORC extraction behavior (no file IO).
pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractorTest.java Rewrites Thrift tests to be type-driven and contract-focused (including nested structs, lists/sets/maps, optionals).
pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorTest.java Rewrites ProtoBuf tests using dynamic descriptors/messages for isolated, type-driven verification.
pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorLowLevelTest.java Removes older low-level test suite in favor of the new contract-based tests.
pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractorTest.java New Parquet-Group-based tests for primitive/logical/complex mappings (LIST/MAP/struct/INT96).
pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractorTest.java New focused test for Parquet-Avro INT96 override behavior.
pinot-plugins/pinot-input-format/pinot-json/src/test/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractorTest.java Rewrites JSON tests to validate the realistic Jackson type matrix and recursive conversions.
pinot-plugins/pinot-input-format/pinot-csv/src/test/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractorTest.java Rewrites CSV tests around CSV’s untyped nature + MV delimiter behavior (string/empty/null semantics).
pinot-plugins/pinot-input-format/pinot-clp-log/src/test/java/org/apache/pinot/plugin/inputformat/clplog/CLPLogRecordExtractorTest.java Updates CLP-Log tests for encoding config parsing + topic-name preservation with cleaner parameterization.
pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractorTest.java Rewrites Avro tests to cover primitives, bytes/fixed, enums, arrays/maps/nested records, nullable unions, and logical types.
pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractorComplexTypesTest.java Removes separate complex-type IO-based test class (coverage moved into the new type-driven tests).

Comment thread pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/RecordExtractor.java Outdated
@Jackie-Jiang Jackie-Jiang force-pushed the record-extractor-tests branch 2 times, most recently from 0938582 to d5ea0c2 Compare April 30, 2026 02:48
@Jackie-Jiang Jackie-Jiang requested a review from Copilot April 30, 2026 02:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR standardizes the RecordExtractor output-type contract across Pinot input formats so ingestion transforms see a consistent Java type matrix regardless of source format (including modern date/time handling).

Changes:

  • Documents and enforces a unified RecordExtractor output contract, with shared conversions implemented in BaseRecordExtractor (numeric widening, bytes materialization, java.time bridging).
  • Adds DATE (LocalDate) and TIME (LocalTime) to PinotDataType, plus corresponding tests and type inference updates.
  • Refactors and aligns format-specific extractors/readers (notably ORC) and rewrites per-format extractor tests to be type-driven and self-contained.

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/RecordExtractor.java Defines/documentes the cross-format output type matrix and makes init config nullable.
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java Implements shared conversions (small-int widening, BigInteger→BigDecimal, ByteBuffer→byte[], Temporal→Timestamp, map key dropping rules).
pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/BaseRecordExtractorTest.java Adds contract-level tests for single/multi/map/temporal conversion behavior.
pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordReaderTest.java Simplifies schema references (uses imported Schema).
pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordExtractorTest.java Removes obsolete shared extractor test base.
pinot-common/src/main/java/org/apache/pinot/common/utils/PinotDataType.java Adds DATE/TIME types and updates class-to-type inference.
pinot-common/src/test/java/org/apache/pinot/common/utils/PinotDataTypeTest.java Adds unit tests for new DATE/TIME conversions and inference.
pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractor.java Aligns Parquet logical types to the contract (LocalDate/LocalTime/Timestamp, INT96→Timestamp).
pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractor.java Updates deprecated INT96 handling to return Timestamp via the native extractor helper.
pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractorTest.java New in-memory tests for Parquet primitive/logical/list/map/struct extraction.
pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractorTest.java New focused tests for parquet-avro INT96 override behavior.
pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetCollectionAndEquivalenceTest.java Updates fixture expectations for date/timestamp outputs to match the new contract.
pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java Delegates row extraction to a dedicated ORCRecordExtractor and updates supported type checks.
pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractor.java New ORC-specific extractor enforcing the contract (Boolean/BigDecimal/LocalDate/Timestamp, complex types).
pinot-plugins/pinot-input-format/pinot-orc/src/test/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractorTest.java Replaces IO-based tests with direct vectorized batch extractor tests.
pinot-plugins/pinot-input-format/pinot-protobuf/src/main/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractor.java Simplifies convertSingleValue to only special-case ByteString, delegates other conversions to base; aligns map key dropping semantics.
pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorTest.java Rewrites tests to be descriptor-driven and type-focused without temp files.
pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorLowLevelTest.java Removes redundant low-level protobuf extractor tests.
pinot-plugins/pinot-input-format/pinot-thrift/src/main/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractor.java Updates extractor documentation to match the contract (incl. small-int widening).
pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractorTest.java Rewrites tests to validate the contract directly via extractor output.
pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordReaderTest.java Updates reader tests for widened small ints and contract-aligned equality.
pinot-plugins/pinot-input-format/pinot-json/src/main/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractor.java Documents JSON-to-Java type behavior and relies on base conversions for widening.
pinot-plugins/pinot-input-format/pinot-json/src/test/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractorTest.java Rewrites tests to be contract-driven without JSON file IO.
pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractor.java Documents CSV extractor semantics (untyped strings, empty→null, optional MV splitting).
pinot-plugins/pinot-input-format/pinot-csv/src/test/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractorTest.java Rewrites tests to focus on extractor behavior using in-memory parsing.
pinot-plugins/pinot-input-format/pinot-clp-log/src/test/java/org/apache/pinot/plugin/inputformat/clplog/CLPLogRecordExtractorTest.java Updates CLP log extractor tests to align with contract/expectations and reduces setup.
pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractor.java Updates documentation and removes now-redundant Instant→Timestamp handling (delegated to base).
pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractorComplexTypesTest.java Removes legacy complex-type tests superseded by the new test approach.

@Jackie-Jiang Jackie-Jiang force-pushed the record-extractor-tests branch from d5ea0c2 to 15af8f5 Compare April 30, 2026 03:08
Each per-format `*RecordExtractorTest` is rewritten to follow a self-contained,
type-driven pattern: one `@Test` per format-native input type (ordered to match
the format's type list), no inheritance from `AbstractRecordExtractorTest`, and
Markdown Javadoc using `///` and `[X]` references. Tests cover only what is
format-specific; cross-format common behavior (`init` filtering, primitive
preservation) is covered once in `BaseRecordExtractorTest`.

- New `ORCRecordExtractor` extracted from `ORCRecordReader`: a thin
  `Record` wrapper holding `VectorizedRowBatch + TypeDescription + rowId`,
  null/repeating handled upfront. ORC `BOOLEAN` now surfaces as `Boolean`
  (was `String`), `DECIMAL` as `BigDecimal` (was `String`), and
  `TIMESTAMP` / `TIMESTAMP_INSTANT` combine seconds + nanos into epoch-millis
  `Long` (was dropping sub-second precision).
- `BaseRecordExtractor.convertSingleValue` now widens `Byte` / `Short` to
  `Integer` so all small ints unify behind a single Pinot type.
- Comprehensive Markdown Javadoc on `RecordExtractor` documenting the type
  contract (`Boolean` / `Integer` / `Long` / `Float` / `Double` / `BigDecimal`
  / `String` / `byte[]` / `Object[]` / `Map`).
- `convert` and `convertSingleValue` annotated `@Nullable`: a non-null Java
  input may still represent a logical null (format-native sentinel) and
  format-specific overrides are free to translate to `null`.
- New `ParquetNativeRecordExtractorTest` and `ParquetAvroRecordExtractorTest`
  using in-memory `Group` / `GenericRecord` inputs (no temp files).
- Drop `AbstractRecordExtractorTest` (no remaining users; the no-include-list
  path is exercised by every per-format test).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR standardizes the RecordExtractor output type contract across Pinot input formats by centralizing more conversion behavior in BaseRecordExtractor, aligning format-specific extractors (notably ORC/Parquet/ProtoBuf/Thrift/JSON/CSV) to the same Java type matrix, and updating tests accordingly.

Changes:

  • Expanded BaseRecordExtractor.convertSingleValue() to unify small-int widening, BigInteger → BigDecimal, and modern java.time handling (LocalDate/LocalTime passthrough; Instant/OffsetDateTime/ZonedDateTime/LocalDateTime → Timestamp).
  • Added PinotDataType.DATE and PinotDataType.TIME plus tests, and updated RecordExtractor documentation to formally define the contract.
  • Refactored ORC ingestion to use a new ORCRecordExtractor (separating extraction from file IO), and rewrote per-format extractor tests to be type-driven/self-contained (removing legacy shared test bases).

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/BaseRecordExtractorTest.java Rewritten to cover the updated base conversion/type contract (including temporal + shape-preservation cases).
pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordReaderTest.java Cleans up schema type usage (imports Schema directly).
pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/AbstractRecordExtractorTest.java Deleted legacy shared extractor test base (no remaining users).
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/RecordExtractor.java Documents the standardized output-type contract; makes init config nullable; convert return nullable.
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java Centralizes widened numeric + modern java.time conversions; improves map key null handling; better default record error.
pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordReaderTest.java Updates expectations for small-int widening and multi-value extraction shape.
pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractorTest.java Rewritten as direct extractor tests with explicit type-matrix coverage (including nested/map/list/set).
pinot-plugins/pinot-input-format/pinot-thrift/src/main/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractor.java Updates class-level contract documentation to match standardized outputs.
pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorTest.java Rewritten to use dynamic descriptors/messages; covers presence semantics + map/repeated/nested cases.
pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorLowLevelTest.java Deleted older low-level tests in favor of the new direct/type-driven suite.
pinot-plugins/pinot-input-format/pinot-protobuf/src/main/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractor.java Aligns init signature; map-key null handling; delegates almost all single-value conversion to base.
pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractorTest.java New in-memory Group tests covering Parquet primitives, logical types, list/map/struct, and timestamp rounding.
pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetCollectionAndEquivalenceTest.java Updates fixture assertions for LocalDate/Timestamp outputs and canonicalization docs.
pinot-plugins/pinot-input-format/pinot-parquet/src/test/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractorTest.java New tests for parquet-avro INT96 override producing Timestamp.
pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractor.java Implements Parquet logical-type conversions to LocalDate/LocalTime/Timestamp; preserves sub-ms nanos.
pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractor.java Narrows scope to INT96 handling and aligns INT96 conversion output to Timestamp.
pinot-plugins/pinot-input-format/pinot-orc/src/test/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractorTest.java Rewritten to be in-memory VectorizedRowBatch tests directly against the new ORC extractor.
pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java Refactored to delegate per-row extraction to ORCRecordExtractor (reader now focuses on IO/filtering).
pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractor.java New extractor implementing ORC category → standardized Java output types (incl. timestamps with nanos).
pinot-plugins/pinot-input-format/pinot-json/src/test/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractorTest.java Rewritten to direct extractor tests emphasizing standardized type outputs and shape preservation.
pinot-plugins/pinot-input-format/pinot-json/src/main/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractor.java Adds contract documentation for Jackson’s parsed JSON shapes and numeric behaviors.
pinot-plugins/pinot-input-format/pinot-csv/src/test/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractorTest.java Rewritten to extractor-only tests (no file IO), focusing on cell handling + delimiter semantics.
pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractor.java Updates contract documentation and init signature for nullable fields; behavior unchanged.
pinot-plugins/pinot-input-format/pinot-clp-log/src/test/java/org/apache/pinot/plugin/inputformat/clplog/CLPLogRecordExtractorTest.java Simplifies tests and updates assertions consistent with the standardized conversion behavior.
pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractorComplexTypesTest.java Deleted legacy complex-type test suite (replaced by newer format-specific coverage elsewhere).
pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractor.java Updates Avro contract docs; relies on base for temporal conversion; keeps Avro fixed → byte[].
pinot-common/src/test/java/org/apache/pinot/common/utils/PinotDataTypeTest.java Adds DATE/TIME conversion tests and updates getSingleValueType coverage.
pinot-common/src/main/java/org/apache/pinot/common/utils/PinotDataType.java Adds DATE and TIME PinotDataTypes and updates class-to-type detection.


@Override
public void init(Set<String> fields, RecordExtractorConfig recordExtractorConfig) {
public void init(@Nullable Set<String> fields, RecordExtractorConfig recordExtractorConfig) {
Comment on lines +505 to +506
* gets content-based equality, and {@code Date}/{@code Timestamp}/{@code LocalDate}/{@code Instant} all
* collapse to the underlying raw numeric form the native reader returns.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is not working as expected ingestion Related to data ingestion pipeline needs-attention Used for sensitive changes - allows searching PRs post release to narrow down causes for regression. testing Related to tests or test infrastructure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants