A small command-line tool that extracts the text from a PDF with PyMuPDF and then cleans it up using an OpenAI-compatible chat model. PDF text extraction often introduces broken line breaks, stray hyphenation, and layout artifacts; PdfCleaner streams the raw text through an LLM that rewrites it into a clean, faithful verbatim version.
- Reads a PDF you provide and extracts its raw text (PyMuPDF).
- Saves that raw text alongside the PDF as
<name>.raw.txt(unless skipped). - Sends the raw text to an OpenAI-compatible endpoint and streams the cleaned
result to
<name>.txt.
- Python 3.8+
- An OpenAI-compatible chat endpoint. This can be a hosted API or a local
server such as Ollama or
LM Studio that exposes an OpenAI-compatible
/v1route.
pip install -r requirements.txtPdfCleaner reads its endpoint settings from two environment variables:
Copy .env.example to .env and adjust as needed (it is loaded automatically),
or export the variables in your shell:
export OPENAI_BASE_URL="http://127.0.0.1:11434/v1" # e.g. Ollama
export OPENAI_API_KEY="not-needed"
export OPENAI_MODEL="qwen2.5-coder:7b" # a model you have pulledIf you use a local server such as Ollama or LM Studio, start it and make sure
the configured OPENAI_MODEL is available there.
Point the tool at your own PDF:
./pdf_cleaner.py path/to/your.pdfThis writes path/to/your.raw.txt (raw extracted text) and
path/to/your.txt (LLM-cleaned text).
Options:
# Choose a custom output path for the cleaned text
./pdf_cleaner.py path/to/your.pdf --output cleaned.txt
# Skip writing the raw extracted text
./pdf_cleaner.py path/to/your.pdf --skip-rawYou can also invoke it through Python directly:
python pdf_cleaner.py path/to/your.pdf