initialize_pipeline should have extract_code and extract_formulas default value as False #3109

@yqliving

Bug

When I call initialize_pipeline(InputFormat.PDF) first, the pipeline is cached. A later call to convert() should then fetch the pipeline directly from the cache, but instead it reinitializes the pipeline. The reason I found is:
When initialize_pipeline() is called, the cache hash is computed with code_formula_options.extract_code=True and extract_formulas=True (the defaults).
However, during pipeline initialization, _init_models() mutates the options:

code_formula_opts.extract_code = self.pipeline_options.do_code_enrichment
code_formula_opts.extract_formulas = self.pipeline_options.do_formula_enrichment

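Because the options change after the cache hash has been computed, the hash computed on the next call no longer matches the cached key, so the pipeline is initialized twice.
To avoid this, initialize_pipeline should use False as the default value for extract_code and extract_formulas.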
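
The mechanism described above can be sketched in isolation. The following is a minimal toy model, not docling's code: the classes CodeFormulaOptions, PipelineOptions, Pipeline and Converter, the options_hash helper, and the get_pipeline method are hypothetical stand-ins written only to show how mutating options after hashing them causes a cache miss. It assumes the cache key is derived from the current option values and that the options object is shared between the converter and the pipeline.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CodeFormulaOptions:
    # Hypothetical stand-in; the True defaults mirror those described in this report.
    extract_code: bool = True
    extract_formulas: bool = True


@dataclass
class PipelineOptions:
    # Hypothetical stand-in for the pipeline options.
    do_code_enrichment: bool = False
    do_formula_enrichment: bool = False
    code_formula_options: CodeFormulaOptions = field(default_factory=CodeFormulaOptions)


def options_hash(options: PipelineOptions) -> str:
    """Hash the current option values (assumed cache-key scheme)."""
    payload = json.dumps(asdict(options), sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class Pipeline:
    def __init__(self, options: PipelineOptions) -> None:
        self.pipeline_options = options
        self._init_models()

    def _init_models(self) -> None:
        # Same mutation as quoted in the report: the nested options are
        # overwritten in place, after the cache key was already computed.
        opts = self.pipeline_options.code_formula_options
        opts.extract_code = self.pipeline_options.do_code_enrichment
        opts.extract_formulas = self.pipeline_options.do_formula_enrichment


class Converter:
    def __init__(self, options: PipelineOptions) -> None:
        self.options = options
        self.cache: dict[str, Pipeline] = {}
        self.init_count = 0

    def get_pipeline(self) -> Pipeline:
        key = options_hash(self.options)
        if key not in self.cache:
            self.init_count += 1
            self.cache[key] = Pipeline(self.options)
        return self.cache[key]


# Defaults of True: the first call hashes True/True, then the options are
# mutated to False/False, so the second call computes a different key.
buggy = Converter(PipelineOptions())
buggy.get_pipeline()  # stands in for initialize_pipeline()
buggy.get_pipeline()  # stands in for convert()
print(buggy.init_count)  # 2

# Defaults of False: the mutation is a no-op, so the key stays stable.
fixed = Converter(
    PipelineOptions(code_formula_options=CodeFormulaOptions(False, False))
)
fixed.get_pipeline()
fixed.get_pipeline()
print(fixed.init_count)  # 1
```

In this toy model, changing the defaults to False removes the mismatch only while the enrichment flags are also False; hashing a snapshot of the options, or not mutating them in _init_models(), would be alternative ways to keep the key stable.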

Steps to reproduce

  1. Define a docling converter
  2. Call converter.initialize_pipeline(InputFormat.PDF) first
  3. Then call converter.convert()

Docling version

docling==2.73.1
...

Python version

3.11
