refactor(parser): replace Docling CLI subprocess with Python API (closes #222) #261
Open
Abdeltoto wants to merge 1 commit into HKUDS:main from
HKUDS#222)

Replace `DoclingParser`'s `_run_docling_command` (subprocess + disk round-trip on every call) with `_run_docling_python`, which drives `docling.document_converter.DocumentConverter` directly and feeds the exported document dict to `read_from_block_recursive` without an intermediate JSON read-back.

Key changes
-----------
- New `_get_converter()`: lazily builds a `DocumentConverter` and caches one instance per pipeline-option tuple (table_mode, do_tables, do_ocr, artifacts_path) so layout / OCR / TableFormer model weights are loaded only once per process for a given configuration.
- New `_run_docling_python()`: invokes the converter, exports the doc to a dict, and still writes the legacy `<file_stem>.json` / `<file_stem>.md` artifacts to `<output_dir>/<file_stem>/docling/` for backward compatibility with downstream tooling that expects them.
- `parse_pdf`, `parse_office_doc`, and `parse_html` now consume the in-memory dict directly instead of re-reading JSON from disk.
- `check_installation()` switches from `subprocess.run(["docling", "--version"])` to `import docling.document_converter`, which is faster, more accurate (it tests the actual import path the parser uses), and works on Windows without `CREATE_NO_WINDOW` flags.
- The legacy `env={...}` kwarg is still accepted and type-validated for backward compatibility, but now logs a debug message and is otherwise ignored; the Python API does not require subprocess environment overrides.

Backward compatibility
----------------------
- Public signatures of `parse_pdf`, `parse_office_doc`, `parse_html`, `parse_document`, and `check_installation` are unchanged.
- The on-disk layout (`<output_dir>/<file_stem>/docling/<file_stem>.json` and `.md`) is preserved.
- Image extraction continues to write PNGs into the `<file_stem>/docling/images/` directory via the existing `read_from_block` logic.
- Picture image data is now requested from the converter via `generate_picture_images=True` so that base64 picture URIs are available in the dict, mirroring what the CLI produced.

Performance
-----------
Eliminating subprocess spawn, Python re-init, and per-call model load yields large speedups on multi-document workloads; the second and subsequent calls reuse the cached converter and skip the most expensive part of the Docling pipeline.

No new required dependencies. `docling` remains an optional install (`pip install docling`).

Made-with: Cursor

Summary
Closes #222.
Replaces `DoclingParser`'s subprocess-based path that shells out to the `docling` CLI on every parse call with a direct integration through the Docling Python API (`docling.document_converter.DocumentConverter`). Eliminates process-spawn overhead, removes the JSON disk round-trip, and, most importantly, enables in-memory model reuse across consecutive parse calls via a per-pipeline-option converter cache.

What changed
- New `_get_converter(**kwargs)`: lazily builds a `DocumentConverter` and caches one instance per `(table_mode, do_tables, do_ocr, artifacts_path)` tuple, so layout / OCR / TableFormer model weights load once per process for a given configuration.
- New `_run_docling_python(...)`: invokes the cached converter, exports the document via `result.document.export_to_dict()`, still writes the legacy `<file_stem>.json` and `<file_stem>.md` artifacts to `<output_dir>/<file_stem>/docling/` for backward compatibility, and returns the in-memory dict.
- `parse_pdf` / `parse_office_doc` / `parse_html` now feed the in-memory dict directly to `read_from_block_recursive`; the JSON disk round-trip is gone.
- `check_installation()` switches from `subprocess.run(["docling", "--version"])` to `import docling.document_converter`, which is faster, more accurate (it tests the actual import path the parser uses), and removes the Windows `CREATE_NO_WINDOW` quirk.
- The legacy `env={...}` kwarg is still accepted and type-validated for backward compatibility, but now logs a debug message and is otherwise ignored; the Python API path does not require subprocess environment overrides.

Backward compatibility
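- Public signatures of `parse_pdf`, `parse_office_doc`, `parse_html`, `parse_document`, and `check_installation` are unchanged.
- The on-disk layout (`<output_dir>/<file_stem>/docling/<file_stem>.json` and `.md`) is preserved, so anything that grepped those files keeps working.
- Image extraction continues to write PNGs into `<file_stem>/docling/images/` via the existing `read_from_block` logic.
- Picture image data is now requested from the converter via `generate_picture_images=True` so that base64 picture URIs are present in the dict, mirroring what the CLI produced.
- No new required dependencies. `docling` remains an optional install (`pip install docling`).

Performance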
The big wins are not visible on a single document; they show up on multi-document workloads where the cached converter is reused.
For workloads that run dozens of `.pdf` / `.docx` ingestions back-to-back this is a substantial speedup, exactly the case that was painful with the CLI-based path.

Test plan
- `ruff format` and `ruff check --ignore=E402` pass on `raganything/parser.py`.
- Run on a `.pdf` and a `.docx` to confirm that the produced `content_list` is byte-identical (or at least equivalent) to the previous CLI-based output, and that the `<file_stem>.json` / `.md` artifacts on disk look right.

Happy to iterate on naming, the `_get_converter` cache key shape, or whether we should keep writing the legacy on-disk JSON / Markdown at all (current PR keeps it for safety).
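For reference when weighing the legacy on-disk output, this is roughly what keeping it costs: a short sketch of writing the exported dict and Markdown to `<output_dir>/<file_stem>/docling/`. The helper name `write_legacy_artifacts`, the JSON formatting options, and the sample data are assumptions for illustration, not the code in this PR.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_legacy_artifacts(
    doc_dict: Dict[str, Any], markdown: str, output_dir: Path, file_stem: str
) -> Path:
    """Write <file_stem>.json and <file_stem>.md under <output_dir>/<file_stem>/docling/."""
    target_dir = Path(output_dir) / file_stem / "docling"
    target_dir.mkdir(parents=True, exist_ok=True)
    # JSON formatting (indent, ensure_ascii) is an assumption of this sketch.
    (target_dir / f"{file_stem}.json").write_text(
        json.dumps(doc_dict, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    (target_dir / f"{file_stem}.md").write_text(markdown, encoding="utf-8")
    return target_dir


# Demo with placeholder data in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"texts": [{"text": "hello"}]}
    out = write_legacy_artifacts(sample, "hello\n", Path(tmp), "report")
    json_roundtrip = json.loads((out / "report.json").read_text(encoding="utf-8"))
    md_roundtrip = (out / "report.md").read_text(encoding="utf-8")
    relative_layout = out.relative_to(tmp).as_posix()
```

The parser itself would keep using the in-memory dict; the files exist only for downstream tooling that reads them, so dropping them later would mean removing a helper like this one.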