fix(document_loaders): handle 1-bit images in PyPDFParser.extract_images_from_page by Sonic-79 · Pull Request #542 · langchain-ai/langchain-community

Sonic-79 · 2026-02-17T14:24:55Z

Summary

Fix ValueError: cannot reshape array of size N into shape (H,W,newaxis) when extract_images_from_page() encounters 1-bit monochrome images in PDFs (common in scanned documents using CCITT Fax, JBIG2, or FlateDecode compression)
Read /BitsPerComponent from PDF image XObject and unpack 1-bit images correctly using np.unpackbits(), accounting for row-level byte-boundary padding
Add ValueError catch as safety net for other unexpected data shapes
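
The unpacking step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `unpack_1bit_image` and its parameters are hypothetical. It assumes rows are padded to a byte boundary, bits are stored most-significant-bit first, and a set bit maps to white (255). A PDF with an inverted /Decode array would need the values flipped.

```python
import numpy as np


def unpack_1bit_image(data: bytes, width: int, height: int) -> np.ndarray:
    """Hypothetical helper: expand packed 1-bit rows into a (H, W, 1) uint8 array."""
    # Each row is padded up to a whole number of bytes.
    row_bytes = (width + 7) // 8
    expected = row_bytes * height
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")

    # View the buffer as one packed row per image row.
    packed = np.frombuffer(data, dtype=np.uint8).reshape(height, row_bytes)
    # Expand every byte into 8 bits (MSB first), then drop the padding bits.
    bits = np.unpackbits(packed, axis=1)[:, :width]
    # Scale 0/1 to 0/255 and add a trailing channel axis.
    return (bits * 255)[:, :, np.newaxis]


# Two rows of three pixels, each row padded to one byte.
pixels = unpack_1bit_image(bytes([0b10100000, 0b01000000]), width=3, height=2)
print(pixels[:, :, 0].tolist())  # [[255, 0, 255], [0, 255, 0]]
```

Slicing to `width` after `np.unpackbits` is what discards the per-row padding bits. Unpacking the whole buffer as one flat bit stream would shift every row after the first whenever the width is not a multiple of 8.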

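Root cause: The code assumes every pixel occupies one byte (dtype=np.uint8), but 1-bit images pack 8 pixels per byte. For example, a 430×645 image would need 277,350 bytes at one byte per pixel but contains only 34,830. That is roughly 1/8th (277,350 / 8 = 34,668.75), because each bit represents one pixel; the small excess comes from padding each row to a whole number of bytes.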
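
The byte counts in this example can be checked with a short script. This is a sketch under assumptions: it treats 430 as the width and 645 as the height (the description does not say which is which), and it assumes the failing call is a reshape with a trailing -1 axis, which is consistent with the `newaxis` in the error message but is not confirmed here.

```python
import numpy as np

# Assumed orientation: 430 pixels wide, 645 rows high.
width, height = 430, 645

one_byte_per_pixel = width * height   # size the 8-bit code path expects
row_bytes = (width + 7) // 8          # packed bytes per row, padded to a byte boundary
packed_size = row_bytes * height      # size actually present for a 1-bit image

print(one_byte_per_pixel)  # 277350
print(row_bytes)           # 54
print(packed_size)         # 34830

# Stand-in for the raw 1-bit image data.
data = np.zeros(packed_size, dtype=np.uint8)

try:
    # Assumed failing call: one byte per pixel with an inferred channel axis.
    data.reshape(height, width, -1)
    reshape_failed = False
except ValueError as exc:
    reshape_failed = True
    print(exc)
```

The padded row size is why the packed buffer is slightly larger than a plain division by 8 would suggest: 54 bytes hold 432 bits, so each row carries 2 unused padding bits.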

Closes #307
Related: langchain-ai/langchain#31724

Test plan

test_1bit_image_does_not_raise — regression test with real-world dimensions (430×645, CCITT)
test_1bit_pixel_values — verifies correct bit unpacking and 0/255 scaling
test_1bit_row_padding — verifies correct handling of non-byte-aligned row widths
test_8bit_image_unaffected — confirms no regression for standard 8-bit images

AI Disclosure

This fix was developed with assistance from Claude (Anthropic). The root cause analysis, fix design, and testing were performed collaboratively between a human developer and AI.

🤖 Generated with Claude Code

…ges_from_page

Read /BitsPerComponent from PDF image XObject and unpack 1-bit monochrome images correctly using np.unpackbits(), accounting for row-level byte-boundary padding. Adds a ValueError catch as safety net for unexpected data shapes.

Closes langchain-ai#307

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot added the fix label Feb 17, 2026

Sonic-79 wants to merge 1 commit into langchain-ai:main from Sonic-79:fix/pypdf-1bit-image-reshape
