
fix(document_loaders): handle 1-bit images in PyPDFParser.extract_images_from_page #542

Open
Sonic-79 wants to merge 1 commit into langchain-ai:main from
Sonic-79:fix/pypdf-1bit-image-reshape

Conversation

@Sonic-79

Summary

  • Fix ValueError: cannot reshape array of size N into shape (H,W,newaxis) when extract_images_from_page() encounters 1-bit monochrome images in PDFs (common in scanned documents using CCITT Fax, JBIG2, or FlateDecode compression)
  • Read /BitsPerComponent from PDF image XObject and unpack 1-bit images correctly using np.unpackbits(), accounting for row-level byte-boundary padding
  • Add ValueError catch as safety net for other unexpected data shapes
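
A minimal sketch of the unpacking step described in the bullets above. This is not the PR's actual diff: the helper name `unpack_1bit` and its parameters are hypothetical, and it assumes rows are packed most-significant-bit first and ignores any `/Decode` array on the image.

```python
import numpy as np


def unpack_1bit(data: bytes, width: int, height: int) -> np.ndarray:
    """Hypothetical helper: expand packed 1-bit rows into an (H, W, 1) uint8 array."""
    # Each row is padded to a whole number of bytes.
    row_bytes = (width + 7) // 8
    packed = np.frombuffer(data, dtype=np.uint8).reshape(height, row_bytes)
    # unpackbits yields one 0/1 value per bit, most significant bit first.
    bits = np.unpackbits(packed, axis=1)
    # Drop the padding bits at the end of each row, then scale 0/1 to 0/255.
    return (bits[:, :width] * 255)[:, :, np.newaxis]


# A 10-pixel-wide, 2-row image: each row takes 2 bytes, 6 of the bits are padding.
demo = unpack_1bit(bytes([0b10101010, 0b11000000, 0b00000000, 0b01000000]), 10, 2)
print(demo.shape)              # (2, 10, 1)
print(demo[0, :, 0].tolist())  # [255, 0, 255, 0, 255, 0, 255, 0, 255, 255]
```

Reshaping to `(height, row_bytes)` before unpacking is what keeps the padding bits from shifting later rows; unpacking the flat buffer and reshaping to `(height, width)` would only work when the width is a multiple of 8.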

Root cause: The code assumes every pixel occupies one byte (dtype=np.uint8), but 1-bit images pack 8 pixels per byte. For example, a 430×645 image expects 277,350 bytes but only has 34,830: about 1/8th, because each bit represents one pixel and each row is padded to a whole number of bytes.
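
The byte counts quoted above can be checked with a few lines of arithmetic. The sketch assumes 430 is the width and 645 the height; the variable names are illustrative only.

```python
width, height = 430, 645  # assumed orientation: width x height

bytes_at_8bit = width * height  # one byte per pixel
row_bytes = (width + 7) // 8    # 1-bit rows padded to a byte boundary
bytes_at_1bit = row_bytes * height

print(bytes_at_8bit)  # 277350
print(row_bytes)      # 54
print(bytes_at_1bit)  # 34830
```

Without row padding the packed size would be 277,350 / 8 = 34,668.75 bytes, which is why the observed 34,830 only matches once each row is rounded up to 54 bytes.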

Closes #307
Related: langchain-ai/langchain#31724

Test plan

  • test_1bit_image_does_not_raise — regression test with real-world dimensions (430×645, CCITT)
  • test_1bit_pixel_values — verifies correct bit unpacking and 0/255 scaling
  • test_1bit_row_padding — verifies correct handling of non-byte-aligned row widths
  • test_8bit_image_unaffected — confirms no regression for standard 8-bit images

AI Disclosure

This fix was developed with assistance from Claude (Anthropic). The root cause analysis, fix design, and testing were performed collaboratively between a human developer and AI.

🤖 Generated with Claude Code

…ges_from_page

Read /BitsPerComponent from PDF image XObject and unpack 1-bit monochrome
images correctly using np.unpackbits(), accounting for row-level byte-boundary
padding. Adds a ValueError catch as safety net for unexpected data shapes.

Closes langchain-ai#307

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the fix label Feb 17, 2026

Development

Successfully merging this pull request may close these issues.

PyPDFLoader + LLMImageBlobParser.extract_images_from_page() fails on Flate/LZW-compressed PDF images with ValueError: cannot reshape array
