Decode all streams in a PDF #3635

j-t-1 · 2026-02-10T17:21:19Z

j-t-1
Feb 10, 2026

What is a good way to do the following:

Iterate over all streams (including unreferenced ones) and decode each one using its filter(s) to recover the original, unencoded data.
Once the streams are decoded, also decode any inline images. (This part is more difficult.)
Then save the result as a new PDF.

Python code in this discussion would be ideal. Preferably using this function as it already exists:

pypdf/pypdf/filters.py

Line 766 in 219153e

def decode_stream_data(stream: StreamObject) -> bytes:

Optionally, are there any files that contain all or most of the filter types and inline images to test this with?

Answered by stefan6419846


stefan6419846 · 2026-02-10T19:13:32Z

stefan6419846
Feb 10, 2026
Maintainer

As pypdf is currently written, this will only work for filters which are not image-only and thus do not rely on external libraries like Pillow or jbig2dec.

If you do not care about using internal APIs, something like this works:

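from pypdf import PdfWriter
from pypdf.generic import DecodedStreamObject, EncodedStreamObject


writer = PdfWriter(clone_from="resources/crazyones.pdf")
# writer._objects is internal API: the writer's own object list.
for index, obj in enumerate(writer._objects):
    if not isinstance(obj, EncodedStreamObject):
        continue
    new_stream = DecodedStreamObject()
    # get_data() applies the stream's filters and returns the decoded bytes.
    new_stream.set_data(obj.get_data())
    # Copy the dictionary entries, dropping those that describe the old encoding
    # (/DecodeParms has no meaning once /Filter is gone).
    for key, value in dict(obj).items():
        if key not in {"/Filter", "/DecodeParms"}:
            new_stream[key] = value
    writer._objects[index] = new_stream
writer.write("out.pdf")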

I have not tested this with inline images or similar though, and relying on internal APIs is not recommended.

Preferably using this function as it already exists:

It should not be necessary to call this explicitly; EncodedStreamObject.get_data already takes care of it.
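
For readers wondering what this decoding step amounts to, here is a minimal standard-library sketch of applying a filter chain in the order listed in /Filter. It is not pypdf's implementation: the names decode_chain, DECODERS, decode_ascii_hex and decode_ascii85 are hypothetical, only three non-image filters are covered, and /DecodeParms (for example predictors) is ignored.

```python
import base64
import zlib


def decode_ascii_hex(data: bytes) -> bytes:
    """Decode /ASCIIHexDecode: whitespace is ignored, '>' marks end of data."""
    digits = b"".join(data.split(b">", 1)[0].split()).decode("ascii")
    if len(digits) % 2:
        digits += "0"  # a trailing odd digit is treated as if followed by 0
    return bytes.fromhex(digits)


def decode_ascii85(data: bytes) -> bytes:
    """Decode /ASCII85Decode: PDF data ends with '~>' and has no '<~' prefix."""
    payload = data.strip()
    if payload.endswith(b"~>"):
        payload = payload[:-2]
    return base64.a85decode(payload)


# Hypothetical lookup table; real PDFs may also use LZW, RunLength and image filters.
DECODERS = {
    "/FlateDecode": zlib.decompress,
    "/ASCIIHexDecode": decode_ascii_hex,
    "/ASCII85Decode": decode_ascii85,
}


def decode_chain(data: bytes, filters: list[str]) -> bytes:
    """Apply each decoder in the order the filters appear in /Filter."""
    for name in filters:
        data = DECODERS[name](data)
    return data


original = b"BT /F1 12 Tf (Hello) Tj ET"
# Encoding runs in the opposite order: Deflate first, then ASCII85 on top.
encoded = base64.a85encode(zlib.compress(original)) + b"~>"
decoded = decode_chain(encoded, ["/ASCII85Decode", "/FlateDecode"])
```

With pypdf itself, none of this is needed, since get_data() performs the equivalent work internally.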

Optionally, are there any files that contain all or most of the filter types and inline images to test this with?

I am not aware of this and it is rather uncommon to have such a file - except for explicit testing purposes.

2 replies

j-t-1 Feb 10, 2026
Author

Great, this is what I was hoping for. I am fine with internal APIs.

I am not aware of this and it is rather uncommon to have such a file - except for explicit testing purposes.

Yes, such a file would likely exist only for testing purposes. Alternatively, there could be a set of small files, one for each filter type, including inline images.
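
In the absence of ready-made files, small test inputs can be generated. The sketch below builds content-stream fragments that each contain one inline image (BI ... ID ... EI) using an abbreviated filter name from the PDF specification. The helper make_inline_image and the ENCODERS table are hypothetical names, only three non-image filters are covered, and the fragments still need to be placed into a page's content stream to form a usable PDF.

```python
import base64
import zlib

# Abbreviated filter names permitted inside inline images, mapped to encoders.
ENCODERS = {
    "/Fl": zlib.compress,                                   # FlateDecode
    "/AHx": lambda raw: raw.hex().upper().encode() + b">",  # ASCIIHexDecode
    "/A85": lambda raw: base64.a85encode(raw) + b"~>",      # ASCII85Decode
}


def make_inline_image(filter_abbrev: str, width: int, height: int, gray: bytes) -> bytes:
    """Build a BI ... ID ... EI fragment for an 8-bit DeviceGray image."""
    if len(gray) != width * height:
        raise ValueError("expected exactly one byte per pixel")
    header = b"BI /W %d /H %d /CS /G /BPC 8 /F %s ID " % (
        width,
        height,
        filter_abbrev.encode("ascii"),
    )
    return header + ENCODERS[filter_abbrev](gray) + b"\nEI"


pixels = bytes([0, 85, 170, 255])  # 2x2 grayscale gradient
fragments = {name: make_inline_image(name, 2, 2, pixels) for name in ENCODERS}
```

The /Fl variant shows why inline images are the harder part: the compressed bytes are binary, so a parser cannot simply search for the next "EI" to find where the image data ends.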

I was surprised that PDF32000_2008.pdf has a CCITTFaxDecode filter. Maybe it also has other filters as it needs to demonstrate them? #3609 (comment). Where is this file located in the repository?

Aside: I think we could rename some of the files in the resources folder to be more descriptive like sample-files.

stefan6419846 Feb 11, 2026
Maintainer

Where is this file located in the repository?

In a proper development environment, you can find the file at tests/pdf_cache/PDF32000_2008.pdf. Apart from this, the GitHub search is your friend.

Aside: I think we could rename some of the files in the resources folder to be more descriptive like sample-files.

The generic files do not need to be more descriptive IMHO.

j-t-1 · 2026-03-25T17:32:45Z

j-t-1
Mar 25, 2026
Author

@stefan6419846 thank you for the code above; it is a good method.

What would be the reverse operation, to encode the streams again?

The original filter is "lost" after decoding, so let's say you want to encode all the streams using zlib/deflate (so each stream will have the filter FlateDecode).

1 reply

stefan6419846 Mar 26, 2026
Maintainer

The only officially supported encoding filter is Deflate anyway. You basically invert the decoding code, although due to current internals, some special lines are required:

from pypdf import PdfWriter
from pypdf.generic import DecodedStreamObject, EncodedStreamObject, NameObject


writer = PdfWriter(clone_from="decoded.pdf")
# As before, writer._objects is internal API.
for index, obj in enumerate(writer._objects):
    if not isinstance(obj, DecodedStreamObject):
        continue
    new_stream = EncodedStreamObject()
    # Special line needed due to current internals, as mentioned above.
    new_stream.decoded_self = DecodedStreamObject()
    # Set /Filter before set_data() so the data is stored Deflate-compressed.
    new_stream[NameObject("/Filter")] = NameObject("/FlateDecode")
    new_stream.set_data(obj.get_data())
    # Copy the remaining dictionary entries from the decoded stream.
    for key, value in dict(obj).items():
        if key not in {"/Filter"}:
            new_stream[key] = value
    writer._objects[index] = new_stream
writer.write("encoded.pdf")
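
To make the consequence of this round trip concrete, here is a small standard-library sketch, independent of pypdf. It assumes a stream that was originally ASCII85-encoded. After decoding and re-encoding with Deflate, the content is identical but the stored bytes and the filter are not, so the original file is not reproduced byte for byte. The payload is made up for illustration.

```python
import base64
import zlib

# Made-up content stream data, repeated so that compression has something to do.
payload = b"q 1 0 0 1 72 720 cm BT /F1 12 Tf (Hello) Tj ET Q\n" * 20

# Pretend the original stream used /ASCII85Decode.
original_encoded = base64.a85encode(payload) + b"~>"

# Step 1: decode, which is the role get_data() plays in the first snippet.
decoded = base64.a85decode(original_encoded[:-2])

# Step 2: re-encode with Deflate; the stream would now carry /Filter /FlateDecode.
reencoded = zlib.compress(decoded)

assert zlib.decompress(reencoded) == payload  # the content survives the round trip
assert reencoded != original_encoded          # the stored bytes do not
print(len(original_encoded), len(reencoded))
```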
