
Abdeltoto · 2026-04-22T03:31:09Z

Summary

Closes #222.

Replaces DoclingParser's subprocess-based path, which shells out to the docling CLI on every parse call, with a direct integration through the Docling Python API (docling.document_converter.DocumentConverter). This eliminates process-spawn overhead, removes the JSON read-back from disk, and, most importantly, enables in-memory model reuse across consecutive parse calls via a converter cache keyed on pipeline options.

What changed

New _get_converter(**kwargs): lazily builds a DocumentConverter and caches one instance per (table_mode, do_tables, do_ocr, artifacts_path) tuple, so layout / OCR / TableFormer model weights load once per process for a given configuration.
New _run_docling_python(...): invokes the cached converter, exports the document via result.document.export_to_dict(), still writes the legacy <file_stem>.json and <file_stem>.md artifacts to <output_dir>/<file_stem>/docling/ for backward compatibility, and returns the in-memory dict.
parse_pdf / parse_office_doc / parse_html now feed the in-memory dict directly to read_from_block_recursive; the JSON file is still written for compatibility but is no longer read back from disk.
check_installation() switches from subprocess.run(["docling", "--version"]) to import docling.document_converter, which is faster, more accurate (it tests the actual import path the parser uses), and removes the Windows CREATE_NO_WINDOW quirk.
The legacy env={...} kwarg is still accepted and type-validated for backward compatibility, but now logs a debug message and is otherwise ignored — the Python API path does not require subprocess environment overrides.
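The caching behaviour described for _get_converter can be sketched as follows. This is a minimal illustration, not the PR's code: the class name ConverterCache, the default option values, and the injected factory are all assumptions. In the PR the factory would construct a docling DocumentConverter; a stub is used here so the sketch runs without docling installed.

```python
from typing import Any, Callable, Dict, Hashable, Optional, Tuple

# Hypothetical key shape, mirroring the tuple named in the PR description.
CacheKey = Tuple[Hashable, bool, bool, Optional[str]]


class ConverterCache:
    """Lazily build and reuse one converter per pipeline-option tuple."""

    def __init__(self, factory: Callable[..., Any]) -> None:
        # `factory` stands in for whatever builds the real converter.
        self._factory = factory
        self._cache: Dict[CacheKey, Any] = {}

    def get(
        self,
        table_mode: Hashable = "accurate",  # assumed default
        do_tables: bool = True,             # assumed default
        do_ocr: bool = True,                # assumed default
        artifacts_path: Optional[str] = None,
    ) -> Any:
        key: CacheKey = (table_mode, do_tables, do_ocr, artifacts_path)
        if key not in self._cache:
            # First request for this configuration: pay the build cost once.
            self._cache[key] = self._factory(
                table_mode=table_mode,
                do_tables=do_tables,
                do_ocr=do_ocr,
                artifacts_path=artifacts_path,
            )
        return self._cache[key]


build_count = 0


def fake_factory(**options: Any) -> Dict[str, Any]:
    """Stub builder that counts how often a 'converter' is constructed."""
    global build_count
    build_count += 1
    return dict(options)


cache = ConverterCache(fake_factory)
first = cache.get(do_ocr=True)
second = cache.get(do_ocr=True)   # same options: cached instance is reused
third = cache.get(do_ocr=False)   # different options: a new instance is built
```

The sketch is not thread-safe; if parse calls can run concurrently, the lookup-and-build step would likely need a lock.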

Backward compatibility

Public signatures of parse_pdf, parse_office_doc, parse_html, parse_document, and check_installation are unchanged.
The on-disk layout (<output_dir>/<file_stem>/docling/<file_stem>.json and .md) is preserved — anything that grepped those files keeps working.
Image extraction continues to write PNGs into <file_stem>/docling/images/ via the existing read_from_block logic.
Picture image data is now requested from the converter via generate_picture_images=True so that base64 picture URIs are present in the dict, mirroring what the CLI produced.
No new required dependencies. docling remains an optional install (pip install docling).
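Because docling stays optional, the import-based availability check can be sketched like this. The helper name is_importable is hypothetical, and a standard-library module is probed so the example runs anywhere; the PR describes check_installation() importing docling.document_converter instead.

```python
import importlib


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be imported in this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError as well, since it subclasses ImportError.
        return False
    return True


# With docling installed, the equivalent call would be
# is_importable("docling.document_converter").
stdlib_available = is_importable("json")
missing_available = is_importable("no_such_package_for_this_sketch")
```

An import can also fail with other exception types (for example when a native dependency is broken), so a real check may want to catch more broadly and log the cause.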

Performance

The big wins are not visible on a single document — they show up on multi-document workloads where the cached converter is reused:

No subprocess fork per call.
No Python interpreter re-init per call.
Layout / OCR / TableFormer models loaded once instead of N times.
No JSON disk round-trip between parsing and content-list construction.

For workloads that run dozens of .pdf / .docx ingestions back-to-back, this is a substantial speedup, and that is exactly the case that was painful with the CLI-based path.

Test plan

ruff format and ruff check --ignore=E402 pass on raganything/parser.py.
The file parses cleanly to an AST and no lints are reported.
Maintainer-side: end-to-end run on a representative .pdf and .docx to confirm the produced content_list is byte-identical (or at least equivalent) to the previous CLI-based output, and that the <file_stem>.json / .md artifacts on disk look right.
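One way to run the "byte-identical or at least equivalent" comparison is sketched below. It assumes content_list is a JSON-serializable list of dicts; the helper names and the sample keys ("type", "text") are illustrative and not taken from the project.

```python
import json
from typing import Any, List


def canonical_json(content_list: List[Any]) -> str:
    """Serialize with sorted keys so dict ordering does not affect the result."""
    return json.dumps(
        content_list, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )


def compare_content_lists(old: List[Any], new: List[Any]) -> str:
    """Classify two content lists as byte-identical, equivalent, or different."""
    if json.dumps(old, ensure_ascii=False) == json.dumps(new, ensure_ascii=False):
        return "byte-identical"
    if canonical_json(old) == canonical_json(new):
        return "equivalent"
    return "different"


# Hypothetical sample data standing in for CLI-based and API-based output.
cli_output = [{"type": "text", "text": "Hello"}]
api_same = [{"type": "text", "text": "Hello"}]
api_reordered = [{"text": "Hello", "type": "text"}]
api_changed = [{"type": "text", "text": "Hello!"}]

same_result = compare_content_lists(cli_output, api_same)
reordered_result = compare_content_lists(cli_output, api_reordered)
changed_result = compare_content_lists(cli_output, api_changed)
```

Fields that legitimately differ between runs, such as absolute image paths, would need to be normalized before comparing.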

Happy to iterate on naming, the _get_converter cache key shape, or whether we should keep writing the legacy on-disk JSON / Markdown at all (current PR keeps it for safety).
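For reference while discussing the legacy artifacts, the write step might look roughly like the sketch below. The function name write_legacy_artifacts and its parameters are hypothetical; only the directory layout (<output_dir>/<file_stem>/docling/<file_stem>.json and .md) comes from the description above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Tuple


def write_legacy_artifacts(
    output_dir: str, file_stem: str, doc_dict: Dict[str, Any], markdown: str
) -> Tuple[Path, Path]:
    """Write <output_dir>/<file_stem>/docling/<file_stem>.json and .md."""
    target_dir = Path(output_dir) / file_stem / "docling"
    target_dir.mkdir(parents=True, exist_ok=True)

    json_path = target_dir / f"{file_stem}.json"
    md_path = target_dir / f"{file_stem}.md"

    json_path.write_text(
        json.dumps(doc_dict, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    md_path.write_text(markdown, encoding="utf-8")
    return json_path, md_path


# Demonstration in a temporary directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    json_path, md_path = write_legacy_artifacts(
        tmp, "report", {"name": "report"}, "# Report\n"
    )
    round_tripped = json.loads(json_path.read_text(encoding="utf-8"))
    markdown_text = md_path.read_text(encoding="utf-8")
    relative_json = json_path.relative_to(tmp).as_posix()
```

If the maintainers prefer to drop the artifacts later, a step like this could be gated behind an opt-in flag without touching the in-memory path.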

HKUDS#222)

Replace `DoclingParser`'s `_run_docling_command` (subprocess + disk round-trip on every call) with `_run_docling_python`, which drives `docling.document_converter.DocumentConverter` directly and feeds the exported document dict to `read_from_block_recursive` without an intermediate JSON read-back.

Key changes
-----------
- New `_get_converter()`: lazily builds a `DocumentConverter` and caches one instance per pipeline-option tuple (table_mode, do_tables, do_ocr, artifacts_path) so layout / OCR / TableFormer model weights are loaded only once per process for a given configuration.
- New `_run_docling_python()`: invokes the converter, exports the doc to a dict, and still writes the legacy `<file_stem>.json` / `<file_stem>.md` artifacts to `<output_dir>/<file_stem>/docling/` for backward compatibility with downstream tooling that expects them.
- `parse_pdf`, `parse_office_doc`, and `parse_html` now consume the in-memory dict directly instead of re-reading JSON from disk.
- `check_installation()` switches from `subprocess.run(["docling", "--version"])` to `import docling.document_converter`, which is faster, more accurate (it tests the actual import path the parser uses), and works on Windows without `CREATE_NO_WINDOW` flags.
- The legacy `env={...}` kwarg is still accepted and type-validated for backward compatibility, but now logs a debug message and is otherwise ignored; the Python API does not require subprocess environment overrides.

Backward compatibility
----------------------
- Public signatures of `parse_pdf`, `parse_office_doc`, `parse_html`, `parse_document`, and `check_installation` are unchanged.
- The on-disk layout (`<output_dir>/<file_stem>/docling/<file_stem>.json` and `.md`) is preserved.
- Image extraction continues to write PNGs into the `<file_stem>/docling/images/` directory via the existing `read_from_block` logic.
- Picture image data is now requested from the converter via `generate_picture_images=True` so that base64 picture URIs are available in the dict, mirroring what the CLI produced.

Performance
-----------
Eliminating subprocess spawn, Python re-init, and per-call model load yields large speedups on multi-document workloads: the second and subsequent calls reuse the cached converter and skip the most expensive part of the Docling pipeline.

No new required dependencies. `docling` remains an optional install (`pip install docling`).

Made-with: Cursor

Abdeltoto marked this pull request as ready for review April 22, 2026 03:45

refactor(parser): replace Docling CLI subprocess with Python API (closes #222) #261

Abdeltoto wants to merge 1 commit into HKUDS:main from Abdeltoto:refactor/docling-python-api

Abdeltoto commented Apr 22, 2026
