[Feature Request]: add native, portable support for the Date data type within the Beam IcebergIO, ensuring it works smoothly from the Python SDK. #37823

@thanatham-google

What would you like to happen?

Issue: The Apache Beam IcebergIO connector currently does not support writing the native Date data type to Iceberg tables. This seems to affect users of the Python SDK in particular, due to cross-language transform limitations. The issue appears to be linked to the older Joda-Time library dependency in the underlying Java IO implementation.

Client Impact: A key client using Dataflow with the Apache Beam Python SDK and Iceberg tables on GCS/BigQuery is significantly impacted and unhappy. They cannot natively write Python datetime.date objects. They are forced to use workarounds like storing dates as Integers or Strings, which they find suboptimal for their data representation and query needs.

Root Cause & External Trackers: This is a known issue in the Apache Beam community, related to the need for a portable Date type and the migration from Joda-Time to java.time.

Main issue: #25946

Dependencies: #28359, #19215

Current Workarounds Considered: The client has considered treating dates as Strings (e.g., 'YYYY-MM-DD'), Integers (e.g., YYYYMMDD or epoch days), Timestamps, or using a custom cross-language transform wrapper like the one found at https://github.com/johanesalxd/beam-iceberg-date.
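For reference, the String and Integer workarounds listed above can be sketched in plain Python as follows. This is only an illustration of the encodings, not code from the client's pipeline; the function names are hypothetical and only the standard library is used.

```python
from datetime import date, timedelta

# Reference point for the "epoch days" encoding (days since 1970-01-01).
EPOCH = date(1970, 1, 1)


def date_to_iso_string(d: date) -> str:
    """String workaround: 'YYYY-MM-DD'."""
    return d.isoformat()


def date_to_yyyymmdd_int(d: date) -> int:
    """Integer workaround: YYYYMMDD packed into a single int."""
    return d.year * 10000 + d.month * 100 + d.day


def date_to_epoch_days(d: date) -> int:
    """Integer workaround: number of days since 1970-01-01."""
    return (d - EPOCH).days


def epoch_days_to_date(days: int) -> date:
    """Inverse of date_to_epoch_days, for readers of the table."""
    return EPOCH + timedelta(days=days)


if __name__ == "__main__":
    d = date(2024, 3, 15)
    print(date_to_iso_string(d))
    print(date_to_yyyymmdd_int(d))
    print(date_to_epoch_days(d))
```

Each of these loses the DATE column type on the Iceberg side, so downstream queries have to cast or decode the value, which is the drawback the client describes.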

Suggested Resolution: The request is to add native, portable support for the Date data type within the Beam IcebergIO, ensuring it works smoothly from the Python SDK.

Investigation Details:

What IO is having the issues: Apache Beam IcebergIO sink (the write transform).

What are the configurations used for this IO? The customer's specific code snippet for configuring the IcebergIO.write() transform is not available. The configuration would typically involve specifying catalog details and the target table name. The issue occurs when the pipeline data contains standard Python datetime.date objects.

The shape of the user data:

Schema: The exact schema is not provided, but the data includes fields intended to be DATE type, represented as Python datetime.date objects.

Volume: Specific details on data volume (records or bytes) are not available. The issue is type-related, so it likely affects any volume.
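Since the real schema is not available, the following is a hypothetical sketch of what such a record might look like and how the date fields could be located and stringified before the write. The `Order` type, its field names, and both helper functions are assumptions made for illustration only.

```python
from datetime import date
from typing import NamedTuple, get_type_hints


class Order(NamedTuple):
    """Hypothetical row type; the customer's real schema is unknown."""
    order_id: int
    customer: str
    order_date: date  # the field intended to become an Iceberg DATE column


def date_fields(row_type) -> list:
    """Return the names of fields annotated as datetime.date."""
    hints = get_type_hints(row_type)
    return [name for name, hint in hints.items() if hint is date]


def to_workaround_dict(row) -> dict:
    """Convert a row to a dict, replacing date fields with 'YYYY-MM-DD' strings."""
    fields = set(date_fields(type(row)))
    return {
        name: (value.isoformat() if name in fields else value)
        for name, value in row._asdict().items()
    }


if __name__ == "__main__":
    row = Order(order_id=1, customer="acme", order_date=date(2024, 3, 15))
    print(date_fields(Order))
    print(to_workaround_dict(row))
```

With native Date support in IcebergIO, a conversion step like `to_workaround_dict` would no longer be needed and the `order_date` field could be written as-is.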

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
