BUG: honor XObject /ColorSpace hint when deriving image mode#3498
BUG: honor XObject /ColorSpace hint when deriving image mode#3498eerose2603 wants to merge 6 commits intopy-pdf:mainfrom
Conversation
Preserve hinted color spaces when deriving XObject image modes
|
Thanks for the PR. Could you please elaborate on the reasons for this change? Additionally, please have a look at the CI failures for us to continue. |
|
What changed & why _get_mode_and_invert_color now treats the XObject’s /ColorSpace entry as an authoritative hint for the final Pillow mode and propagates that hint through the existing _get_imagemode(...) logic: We derive a preferred_mode from IA.COLOR_SPACE: /DeviceRGB → "RGB", /DeviceGray → "L", /DeviceCMYK → "CMYK". We pass preferred_mode as prev_mode to _get_imagemode(...) for both the <8 bpc path and the general path. If _get_imagemode(...) collapses to "" (e.g., NullObject, ambiguous Indexed/Separation/DeviceN chains), we fall back to the hint and set invert_color=True when the hint is CMYK—matching existing behavior where CMYK implies inversion. Problem this solves Some PDFs declare a reliable /ColorSpace on the image XObject, but the resolved color space passed to _get_imagemode(...) can be ambiguous or collapse (e.g., NullObject, ICCBased without usable Alternate, certain DeviceN/Indexed cases). Previously we only special-cased /DeviceRGB, so the function could return "" (unknown) and lose the author intent—leading to wrong modes (e.g., grayscale treated as RGB) or missed CMYK inversion. Compatibility If _get_imagemode(...) returns a concrete mode, nothing changes. The hint is only used as a tie-breaker / fallback and to seed prev_mode, so it doesn’t override explicit resolvable color spaces. |
|
My above requests still hold through. Where do we have such an example? Could we please have a proper test for this? (I do not think that the possibly AI-generated initial and previous comment explain this issue in a suitable way to make it clear to me.) |
Summary
Honor the XObject /ColorSpace hint when deriving Pillow image modes in _get_mode_and_invert_color, especially when nested/indirect color spaces collapse or are ambiguous. This prevents mode mis-detection and preserves CMYK inversion semantics.
Motivation / Problem
In real-world PDFs the image XObject often carries a correct /ColorSpace, while the resolved color space provided to _get_imagemode(...) may be:
NullObject (e.g., after following references),
ICCBased without a usable /Alternate,
Indexed / Separation / DeviceN chains that don’t yield a direct Device* at the point of resolution.
Previously we:
Special-cased only /DeviceRGB, ignoring /DeviceGray and /DeviceCMYK.
Didn’t pass any prior/hinted mode into _get_imagemode(...).
Had no fallback if _get_imagemode(...) returned "".
This could yield the wrong final mode and miss CMYK inversion.
Approach
Compute preferred_mode from IA.COLOR_SPACE:
/DeviceRGB → "RGB", /DeviceGray → "L", /DeviceCMYK → "CMYK".
Pass preferred_mode to _get_imagemode(...) as prev_mode for both:
the <8 bpc branch and
the general branch (including the “2 components if not explicitly gray” rule).
If the derived mode is still "", fall back to preferred_mode.
If the fallback is CMYK, set invert_color=True (consistent with current CMYK handling elsewhere).
Effects / Compatibility
No change when _get_imagemode(...) already returns a concrete mode.
Only ambiguous/degenerate cases change—and they now align with the XObject’s declared /ColorSpace.
Keeps existing decode and flate/JPX handling untouched.
Tests
Added tests asserting that, when color_space collapses (e.g., NullObject), we still:
choose "RGB" for a /DeviceRGB hint (no invert),
choose "L" for /DeviceGray (no invert),
choose "CMYK" for /DeviceCMYK and set invert_color=True.