Sunbelt Computer Software

PdfCleaner

A small command-line tool that extracts the text from a PDF with PyMuPDF and then cleans it up using an OpenAI-compatible chat model. PDF text extraction often introduces broken line breaks, stray hyphenation, and layout artifacts; PdfCleaner streams the raw text through an LLM that repairs these artifacts while keeping the original wording intact.

What it does

Reads a PDF you provide and extracts its raw text (PyMuPDF).
Saves that raw text alongside the PDF as <name>.raw.txt (unless skipped).
Sends the raw text to an OpenAI-compatible endpoint and streams the cleaned result to <name>.txt.
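The streaming step can be sketched as below. This is an illustration under stated assumptions, not the contents of pdf_cleaner.py: the names iter_deltas and stream_to_file are hypothetical. It assumes the endpoint returns OpenAI-style streaming chunks, where each chunk carries its text in chunk.choices[0].delta.content (as produced by the openai package when a chat completion is requested with stream=True).

```python
from pathlib import Path
from typing import Iterable, Iterator, Optional


def iter_deltas(stream: Iterable) -> Iterator[Optional[str]]:
    """Yield the text delta of each OpenAI-style streaming chunk (hypothetical helper)."""
    for chunk in stream:
        if chunk.choices:  # some servers send chunks with no choices
            yield chunk.choices[0].delta.content


def stream_to_file(chunks: Iterable[Optional[str]], out_path: Path) -> int:
    """Write text chunks to out_path as they arrive; return characters written."""
    written = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for chunk in chunks:
            if not chunk:  # deltas may be None or empty
                continue
            fh.write(chunk)
            fh.flush()  # keep partial output on disk while the model is generating
            written += len(chunk)
    return written
```

Writing and flushing per chunk means an interrupted run still leaves the text received so far in the output file.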

Requirements

Python 3.8+
An OpenAI-compatible chat endpoint. This can be a hosted API or a local server such as Ollama or LM Studio that exposes an OpenAI-compatible /v1 route.

Install

pip install -r requirements.txt

Configuration

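PdfCleaner reads its endpoint settings from three environment variables: OPENAI_API_KEY, OPENAI_BASE_URL, and OPENAI_MODEL. Their defaults are listed in the variable table below.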

Copy .env.example to .env and adjust as needed (it is loaded automatically), or export the variables in your shell:

export OPENAI_BASE_URL="http://127.0.0.1:11434/v1"   # e.g. Ollama
export OPENAI_API_KEY="not-needed"
export OPENAI_MODEL="qwen2.5-coder:7b"               # a model you have pulled

If you use a local server such as Ollama or LM Studio, start it and make sure the configured OPENAI_MODEL is available there.
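As a rough sketch of how these settings could be resolved, the snippet below falls back to the defaults from this README's variable table when a variable is unset. The function name load_settings and the choice to treat an empty value as unset are assumptions made for illustration; the tool's actual lookup logic may differ.

```python
import os
from typing import Dict, Mapping

# Defaults copied from this README's variable table.
DEFAULTS: Dict[str, str] = {
    "OPENAI_API_KEY": "not-needed",
    "OPENAI_BASE_URL": "http://127.0.0.1:9740/v1",
    "OPENAI_MODEL": "qwen2.5-7b-instruct-1m",
}


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return endpoint settings, using a default for any unset or empty variable."""
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in load_settings().items():
        # Avoid printing the API key itself.
        shown = "(set)" if name == "OPENAI_API_KEY" else value
        print(f"{name}={shown}")
```

Passing the environment as a mapping keeps the lookup easy to check without modifying the real process environment.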

Usage

Point the tool at your own PDF:

./pdf_cleaner.py path/to/your.pdf

This writes path/to/your.raw.txt (raw extracted text) and path/to/your.txt (LLM-cleaned text).
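The naming rule above can be expressed with pathlib. This is a minimal sketch; output_paths is a hypothetical helper, and it assumes the raw file always stays next to the PDF even when a custom path is given for the cleaned text.

```python
from pathlib import Path
from typing import Optional, Tuple


def output_paths(pdf_path: str, output: Optional[str] = None) -> Tuple[Path, Path]:
    """Return (raw_path, cleaned_path) for a given PDF path."""
    pdf = Path(pdf_path)
    raw_path = pdf.with_suffix(".raw.txt")  # your.pdf -> your.raw.txt
    # An explicit output path overrides the default your.pdf -> your.txt mapping.
    cleaned_path = Path(output) if output else pdf.with_suffix(".txt")
    return raw_path, cleaned_path
```

Only the final extension is replaced, so a file such as report.v2.pdf would map to report.v2.raw.txt and report.v2.txt.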

Options:

# Choose a custom output path for the cleaned text
./pdf_cleaner.py path/to/your.pdf --output cleaned.txt

# Skip writing the raw extracted text
./pdf_cleaner.py path/to/your.pdf --skip-raw
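For readers curious how a command line like this is typically declared, here is a sketch using argparse. The flag names --output and --skip-raw come from the examples above; the positional name pdf, the help strings, and the build_parser function are assumptions and may not match pdf_cleaner.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options shown in this README (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdf_cleaner.py",
        description="Extract text from a PDF and clean it with a chat model.",
    )
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("--output", default=None,
                        help="custom path for the cleaned text")
    parser.add_argument("--skip-raw", action="store_true",
                        help="do not write the raw extracted text")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.pdf, args.output, args.skip_raw)
```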

You can also invoke it through Python directly:

python pdf_cleaner.py path/to/your.pdf

License

MIT

Environment variables (see Configuration):

Variable	Default	Purpose
`OPENAI_API_KEY`	`not-needed`	API key. Local servers usually ignore it.
`OPENAI_BASE_URL`	`http://127.0.0.1:9740/v1`	Base URL of the OpenAI-compatible endpoint.
`OPENAI_MODEL`	`qwen2.5-7b-instruct-1m`	Model to request from the endpoint.
