refactor(parser): replace Docling CLI subprocess with Python API (closes #222) by Abdeltoto · Pull Request #261 · HKUDS/RAG-Anything · GitHub

refactor(parser): replace Docling CLI subprocess with Python API (closes #222)#261

Open
Abdeltoto wants to merge 1 commit into HKUDS:main from Abdeltoto:refactor/docling-python-api

Conversation

@Abdeltoto

Summary

Closes #222.

Replaces DoclingParser's subprocess-based path, which shells out to the docling CLI on every parse call, with a direct integration through the Docling Python API (docling.document_converter.DocumentConverter). This eliminates process-spawn overhead, removes the JSON disk round-trip, and, most importantly, enables in-memory model reuse across consecutive parse calls via a converter cache keyed by pipeline options.

What changed

  • New _get_converter(**kwargs): lazily builds a DocumentConverter and caches one instance per (table_mode, do_tables, do_ocr, artifacts_path) tuple, so layout / OCR / TableFormer model weights load once per process for a given configuration.
  • New _run_docling_python(...): invokes the cached converter, exports the document via result.document.export_to_dict(), still writes the legacy <file_stem>.json and <file_stem>.md artifacts to <output_dir>/<file_stem>/docling/ for backward compatibility, and returns the in-memory dict.
  • parse_pdf / parse_office_doc / parse_html now feed the in-memory dict directly to read_from_block_recursive — the JSON disk round-trip is gone.
  • check_installation() switches from subprocess.run(["docling", "--version"]) to import docling.document_converter, which is faster and more accurate (it tests the actual import path the parser uses), and no longer needs the Windows CREATE_NO_WINDOW workaround.
  • The legacy env={...} kwarg is still accepted and type-validated for backward compatibility, but now logs a debug message and is otherwise ignored — the Python API path does not require subprocess environment overrides.
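
The caching behaviour described for _get_converter can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the ConverterCache class, the injected build callable, and the default option values are all assumptions made for the example. In the real parser the builder would construct a DocumentConverter; here it is injected so the sketch runs without Docling installed.

```python
from typing import Any, Callable, Dict, Hashable, Optional, Tuple

CacheKey = Tuple[Hashable, ...]


class ConverterCache:
    """Build one converter per option tuple and reuse it afterwards.

    Hypothetical helper: `build` stands in for whatever constructs the
    real converter from the pipeline options.
    """

    def __init__(self, build: Callable[..., Any]) -> None:
        self._build = build
        self._converters: Dict[CacheKey, Any] = {}

    def get(
        self,
        table_mode: str = "accurate",  # assumed default, for illustration only
        do_tables: bool = True,
        do_ocr: bool = True,
        artifacts_path: Optional[str] = None,
    ) -> Any:
        # The key mirrors the tuple named in the PR description.
        key = (table_mode, do_tables, do_ocr, artifacts_path)
        if key not in self._converters:
            # First request for this configuration: pay the build cost once.
            self._converters[key] = self._build(
                table_mode=table_mode,
                do_tables=do_tables,
                do_ocr=do_ocr,
                artifacts_path=artifacts_path,
            )
        return self._converters[key]


# Usage with a stand-in builder that records how often it is called.
builds = []


def fake_build(**options: Any) -> object:
    builds.append(options)
    return object()


cache = ConverterCache(fake_build)
first = cache.get(do_ocr=True)
second = cache.get(do_ocr=True)   # same options, so the cached instance is reused
third = cache.get(do_ocr=False)   # different options, so a new instance is built
```

Note that a plain dict like this is not thread-safe; if parse calls can run concurrently, the lookup-and-build step would need a lock.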

Backward compatibility

  • Public signatures of parse_pdf, parse_office_doc, parse_html, parse_document, and check_installation are unchanged.
  • The on-disk layout (<output_dir>/<file_stem>/docling/<file_stem>.json and .md) is preserved; any tooling that reads those files keeps working.
  • Image extraction continues to write PNGs into <file_stem>/docling/images/ via the existing read_from_block logic.
  • Picture image data is now requested from the converter via generate_picture_images=True so that base64 picture URIs are present in the dict, mirroring what the CLI produced.
  • No new required dependencies. docling remains an optional install (pip install docling).
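
For reference, keeping the legacy on-disk layout while returning the in-memory dict could look something like the sketch below. The function name write_legacy_artifacts and its parameters are hypothetical; only the directory layout is taken from the description above.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def write_legacy_artifacts(
    output_dir: Union[str, Path],
    file_stem: str,
    doc_dict: Dict[str, Any],
    markdown: str,
) -> Path:
    """Write <output_dir>/<file_stem>/docling/<file_stem>.json and .md.

    Hypothetical helper: the caller keeps using `doc_dict` in memory, and
    the files exist only for tooling that still reads them from disk.
    """
    target = Path(output_dir) / file_stem / "docling"
    target.mkdir(parents=True, exist_ok=True)

    json_path = target / f"{file_stem}.json"
    json_path.write_text(
        json.dumps(doc_dict, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    md_path = target / f"{file_stem}.md"
    md_path.write_text(markdown, encoding="utf-8")
    return target
```

The parser would then pass the same doc_dict straight to read_from_block_recursive rather than reading the JSON file back.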

Performance

The big wins are not visible on a single document — they show up on multi-document workloads where the cached converter is reused:

  1. No subprocess fork per call.
  2. No Python interpreter re-init per call.
  3. Layout / OCR / TableFormer models loaded once instead of N times.
  4. No JSON disk round-trip between parsing and content-list construction.

For workloads that run dozens of .pdf / .docx ingestions back-to-back this is a substantial speedup — exactly the case that was painful with the CLI-based path.
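
To make the reasoning concrete, here is a back-of-the-envelope cost model for the four points above. It is a sketch under simplifying assumptions (every document costs the same, and the cached converter is reused for all of them). The parameter names are invented for illustration, and no measured numbers are implied.

```python
def cli_total_seconds(
    n_docs: int,
    spawn_s: float,       # subprocess fork plus interpreter re-init, per call
    model_load_s: float,  # layout / OCR / TableFormer load, per call
    convert_s: float,     # actual conversion work, per document
    json_io_s: float = 0.0,  # reading the JSON back from disk, per call
) -> float:
    """CLI path: every call pays spawn, model load, conversion and read-back."""
    return n_docs * (spawn_s + model_load_s + convert_s + json_io_s)


def api_total_seconds(n_docs: int, model_load_s: float, convert_s: float) -> float:
    """Python API path with a cached converter: models load once in total."""
    if n_docs == 0:
        return 0.0
    return model_load_s + n_docs * convert_s
```

With a single document the two totals differ only by the spawn and read-back terms, which is why the gain is hard to see there. The gap grows linearly with the number of documents because model_load_s is paid once instead of n_docs times.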

Test plan

  • ruff format and ruff check --ignore=E402 pass on raganything/parser.py.
  • The file parses to an AST cleanly and no lint errors are reported.
  • Maintainer-side: end-to-end run on a representative .pdf and .docx to confirm the produced content_list is byte-identical (or at least equivalent) to the previous CLI-based output, and that the <file_stem>.json / .md artifacts on disk look right.

Happy to iterate on naming, the _get_converter cache key shape, or whether we should keep writing the legacy on-disk JSON / Markdown at all (current PR keeps it for safety).

 HKUDS#222)

Replace `DoclingParser`'s `_run_docling_command` (subprocess + disk
round-trip on every call) with `_run_docling_python`, which drives
`docling.document_converter.DocumentConverter` directly and feeds the
exported document dict to `read_from_block_recursive` without an
intermediate JSON read-back.

Key changes
-----------
- New `_get_converter()`: lazily builds a `DocumentConverter` and caches
  one instance per pipeline-option tuple (table_mode, do_tables, do_ocr,
  artifacts_path) so layout / OCR / TableFormer model weights are loaded
  only once per process for a given configuration.
- New `_run_docling_python()`: invokes the converter, exports the doc
  to a dict, and still writes the legacy `<file_stem>.json` /
  `<file_stem>.md` artifacts to `<output_dir>/<file_stem>/docling/`
  for backward compatibility with downstream tooling that expects them.
- `parse_pdf`, `parse_office_doc`, and `parse_html` now consume the
  in-memory dict directly instead of re-reading JSON from disk.
- `check_installation()` switches from `subprocess.run(["docling",
  "--version"])` to `import docling.document_converter`, which is
  faster, more accurate (it tests the actual import path the parser
  uses), and works on Windows without `CREATE_NO_WINDOW` flags.
- The legacy `env={...}` kwarg is still accepted and type-validated for
  backward compatibility, but now logs a debug message and is otherwise
  ignored — the Python API does not require subprocess environment
  overrides.
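
Two of the smaller changes above, the import-based installation check and the ignored `env` kwarg, might look roughly like this. The function names, the Mapping check, and the log message are assumptions for illustration; the PR's actual validation rules are not shown in this description.

```python
import importlib
import logging
from collections.abc import Mapping
from typing import Any, Optional

logger = logging.getLogger(__name__)


def is_importable(module_name: str = "docling.document_converter") -> bool:
    """Return True if the module the parser relies on can be imported."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Also covers ModuleNotFoundError, which subclasses ImportError.
        return False
    return True


def accept_legacy_env(env: Optional[Any]) -> None:
    """Validate the legacy `env` kwarg, then ignore it.

    Assumed rule: `env` must be a mapping if provided. It is never applied,
    because no subprocess is spawned any more.
    """
    if env is None:
        return
    if not isinstance(env, Mapping):
        raise TypeError(f"env must be a mapping, got {type(env).__name__}")
    logger.debug("Ignoring env override with %d entries (no subprocess)", len(env))
```

Probing the import exercises the same path the parser uses at run time, whereas a `docling --version` check only proves that an executable is on PATH.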

Backward compatibility
----------------------
- Public signatures of `parse_pdf`, `parse_office_doc`, `parse_html`,
  `parse_document`, and `check_installation` are unchanged.
- The on-disk layout (`<output_dir>/<file_stem>/docling/<file_stem>.json`
  and `.md`) is preserved.
- Image extraction continues to write PNGs into the
  `<file_stem>/docling/images/` directory via the existing
  `read_from_block` logic.
- Picture image data is now requested from the converter via
  `generate_picture_images=True` so that base64 picture URIs are
  available in the dict, mirroring what the CLI produced.
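
As an illustration of why the base64 URIs matter for image extraction, turning one such URI into a file under `images/` can be sketched as below. The helper name save_data_uri is hypothetical and this is not the `read_from_block` code; it only shows the decoding step under the assumption that pictures arrive as `data:...;base64,` URIs.

```python
import base64
from pathlib import Path


def save_data_uri(uri: str, destination: Path) -> Path:
    """Decode a `data:<mime>;base64,<payload>` URI and write it to disk."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(base64.b64decode(payload))
    return destination
```

Without `generate_picture_images=True` the exported dict would carry no picture payload to decode, so this step would have nothing to write.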

Performance
-----------
Eliminating subprocess spawn, Python re-init, and per-call model load
yields large speedups on multi-document workloads — the second and
subsequent calls reuse the cached converter and skip the most expensive
part of the Docling pipeline.

No new required dependencies. `docling` remains an optional install
(`pip install docling`).

Made-with: Cursor
@Abdeltoto Abdeltoto marked this pull request as ready for review April 22, 2026 03:45

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Replace Docling Parser's CLI subprocess with Python API

1 participant