fix(document_loaders): handle 1-bit images in PyPDFParser.extract_images_from_page #542
Open
Sonic-79 wants to merge 1 commit into langchain-ai:main from
…ges_from_page

Read /BitsPerComponent from PDF image XObject and unpack 1-bit monochrome images correctly using np.unpackbits(), accounting for row-level byte-boundary padding. Adds a ValueError catch as safety net for unexpected data shapes.

Closes langchain-ai#307

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Fix `ValueError: cannot reshape array of size N into shape (H,W,newaxis)` when `extract_images_from_page()` encounters 1-bit monochrome images in PDFs (common in scanned documents using CCITT Fax, JBIG2, or FlateDecode compression)
- Read `/BitsPerComponent` from the PDF image XObject and unpack 1-bit images correctly using `np.unpackbits()`, accounting for row-level byte-boundary padding
- Add a `ValueError` catch as a safety net for other unexpected data shapes

Root cause: The code assumes every pixel occupies one byte (`dtype=np.uint8`), but 1-bit images pack 8 pixels per byte. For example, a 430×645 image expects 277,350 bytes but only has 34,830. That is roughly 1/8th, because each bit represents one pixel; it is slightly more than 277,350 / 8 because every row is padded to a whole number of bytes.

Closes #307
Related: langchain-ai/langchain#31724
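The unpacking step described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `unpack_1bit_image` is a hypothetical helper, and it assumes the image data has already been decoded to raw samples, that bits are packed most-significant-bit first, and that each row is padded to a byte boundary. It does not handle a `/Decode` array or image masks.

```python
import numpy as np


def unpack_1bit_image(data: bytes, width: int, height: int) -> np.ndarray:
    """Unpack 1-bit-per-pixel samples into an (height, width) uint8 array.

    Hypothetical helper for illustration only. Assumes MSB-first packing
    and rows padded to a whole number of bytes.
    """
    row_bytes = (width + 7) // 8  # bytes per row, including padding bits
    expected = row_bytes * height
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")

    # One packed row per array row, then expand each byte into 8 bits.
    packed = np.frombuffer(data, dtype=np.uint8).reshape(height, row_bytes)
    bits = np.unpackbits(packed, axis=1)  # shape: (height, row_bytes * 8)

    # Drop the padding bits at the end of each row and scale 1 -> 255.
    return bits[:, :width] * np.uint8(255)


# Width 10 needs 2 bytes per row; the last 6 bits of each row are padding.
sample = bytes([0b10101010, 0b11000000, 0b00000000, 0b01000000])
image = unpack_1bit_image(sample, width=10, height=2)
print(image.shape)  # (2, 10)

# Byte count for the 430x645 example, taking 430 as the width (an assumption).
print(((430 + 7) // 8) * 645)  # 34830
```

Slicing off the padding after `np.unpackbits()` is what makes non-byte-aligned widths work: unpacking the whole buffer as one flat bit stream would shift every row after the first by the accumulated padding bits.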
Test plan
- `test_1bit_image_does_not_raise`: regression test with real-world dimensions (430×645, CCITT)
- `test_1bit_pixel_values`: verifies correct bit unpacking and 0/255 scaling
- `test_1bit_row_padding`: verifies correct handling of non-byte-aligned row widths
- `test_8bit_image_unaffected`: confirms no regression for standard 8-bit images

AI Disclosure
This fix was developed with assistance from Claude (Anthropic). The root cause analysis, fix design, and testing were performed collaboratively between a human developer and AI.
🤖 Generated with Claude Code