GitHub - Anoopnk/PdfCleaner: CLI tool that extracts PDF text with PyMuPDF and cleans it via an OpenAI-compatible LLM. · GitHub
Skip to content

Anoopnk/PdfCleaner

Folders and files

Repository files navigation

PdfCleaner

A small command-line tool that extracts the text from a PDF with PyMuPDF and then cleans it up using an OpenAI-compatible chat model. PDF text extraction often introduces broken line breaks, stray hyphenation, and layout artifacts; PdfCleaner streams the raw text through an LLM that rewrites it into a clean, faithful verbatim version.

What it does

  1. Reads a PDF you provide and extracts its raw text (PyMuPDF).
  2. Saves that raw text alongside the PDF as <name>.raw.txt (unless skipped).
  3. Sends the raw text to an OpenAI-compatible endpoint and streams the cleaned result to <name>.txt.

Requirements

  • Python 3.8+
  • An OpenAI-compatible chat endpoint. This can be a hosted API or a local server such as Ollama or LM Studio that exposes an OpenAI-compatible /v1 route.

Install

pip install -r requirements.txt

Configuration

PdfCleaner reads its endpoint settings from two environment variables:

Variable Default Purpose
OPENAI_API_KEY not-needed API key. Local servers usually ignore it.
OPENAI_BASE_URL http://127.0.0.1:9740/v1 Base URL of the OpenAI-compatible endpoint.
OPENAI_MODEL qwen2.5-7b-instruct-1m Model to request from the endpoint.

Copy .env.example to .env and adjust as needed (it is loaded automatically), or export the variables in your shell:

export OPENAI_BASE_URL="http://127.0.0.1:11434/v1"   # e.g. Ollama
export OPENAI_API_KEY="not-needed"
export OPENAI_MODEL="qwen2.5-coder:7b"               # a model you have pulled

If you use a local server such as Ollama or LM Studio, start it and make sure the configured OPENAI_MODEL is available there.

Usage

Point the tool at your own PDF:

./pdf_cleaner.py path/to/your.pdf

This writes path/to/your.raw.txt (raw extracted text) and path/to/your.txt (LLM-cleaned text).

Options:

# Choose a custom output path for the cleaned text
./pdf_cleaner.py path/to/your.pdf --output cleaned.txt

# Skip writing the raw extracted text
./pdf_cleaner.py path/to/your.pdf --skip-raw

You can also invoke it through Python directly:

python pdf_cleaner.py path/to/your.pdf

License

MIT

About

CLI tool that extracts PDF text with PyMuPDF and cleans it via an OpenAI-compatible LLM.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

Contributors

Languages