Bug
When I call initialize_pipeline(InputFormat.PDF) first, the pipeline is cached. A later call to convert() should then get the pipeline directly from the cache, but it reinitializes the pipeline instead. The cause appears to be the following:
When initialize_pipeline() is called, the cache hash is computed with code_formula_options.extract_code=True and code_formula_options.extract_formulas=True (the defaults).
However, during pipeline initialization, _init_models() mutates these options:
code_formula_opts.extract_code = self.pipeline_options.do_code_enrichment
code_formula_opts.extract_formulas = self.pipeline_options.do_formula_enrichment
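Because the options change after the hash has been computed, the next lookup produces a different hash and misses the cache, so the pipeline is initialized twice.
Suggested fix: extract_code and extract_formulas should default to False in initialize_pipeline.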
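To illustrate the mechanism without depending on docling itself, here is a minimal sketch of the same pattern. Everything in it is a hypothetical stand-in: the classes, options_hash, get_pipeline, and the choice of MD5 over a JSON dump are assumptions for illustration, not docling's actual implementation. Only the two assignments in _init_models mirror the lines quoted above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CodeFormulaOptions:
    # Hypothetical stand-in; defaults mirror the ones reported in this issue.
    extract_code: bool = True
    extract_formulas: bool = True


@dataclass
class PipelineOptions:
    # Hypothetical stand-in for the real pipeline options.
    do_code_enrichment: bool = False
    do_formula_enrichment: bool = False
    code_formula_options: CodeFormulaOptions = field(
        default_factory=CodeFormulaOptions
    )


def options_hash(options: PipelineOptions) -> str:
    """Hash the full options state, nested fields included (assumed scheme)."""
    payload = json.dumps(asdict(options), sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()


class Pipeline:
    def __init__(self, options: PipelineOptions) -> None:
        self.pipeline_options = options
        self._init_models()

    def _init_models(self) -> None:
        # Same mutation as quoted in the issue: the nested options are
        # overwritten in place, on the object the cache key was derived from.
        code_formula_opts = self.pipeline_options.code_formula_options
        code_formula_opts.extract_code = self.pipeline_options.do_code_enrichment
        code_formula_opts.extract_formulas = (
            self.pipeline_options.do_formula_enrichment
        )


class Converter:
    def __init__(self, options: PipelineOptions) -> None:
        self.options = options
        self.cache: dict[str, Pipeline] = {}
        self.init_count = 0

    def get_pipeline(self) -> Pipeline:
        # The key is computed BEFORE the pipeline mutates the options.
        key = options_hash(self.options)
        if key not in self.cache:
            self.init_count += 1
            self.cache[key] = Pipeline(self.options)
        return self.cache[key]


converter = Converter(PipelineOptions())
converter.get_pipeline()  # stands in for initialize_pipeline(InputFormat.PDF)
converter.get_pipeline()  # stands in for convert(): misses the cache
print(converter.init_count)  # prints 2
```

In this sketch a third call would hit the cache, because the options no longer change after the second initialization. That matches the observed behavior of exactly one redundant initialization.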
Steps to reproduce
- Define a docling converter
- Call converter.initialize_pipeline(InputFormat.PDF) first
- Then call converter.convert()
Docling version
docling==2.73.1
...
Python version
3.11