Sunbelt Computer Software

LLM-Aided OCR

Live site: omprxkash.github.io/llm-aided-ocr

Turn scanned PDFs into clean, accurate text. This project takes the raw output from Tesseract OCR and passes it through a large language model to fix the usual OCR errors — split words, misread characters, stray page numbers — and reformat the result as either plain text or proper Markdown.

It works with OpenAI or a fully local Llama model. There's a Flask web app for uploads, a CLI for batch jobs, and a Docker setup if you want GPU acceleration.

The project website (index.html at the root) walks through the pipeline, shows a live before/after comparison from a 1975 Warren Buffett letter, and links to every artefact in docs/.

Why this exists

Tesseract is good, but it isn't perfect. Scanned documents usually come back with errors like fo instead of To, hank instead of bank, and broken page breaks scattered throughout. Cleaning that up by hand is painful for anything longer than a few pages.

An LLM is well-suited to fixing exactly this kind of noise. It understands context, recognises that fo control promises should be To control promises, and can reformat the whole document at the same time.

That's the whole idea.

What it does

Converts each page of a PDF into an image with pdf2image + Poppler.
Pre-processes the image (greyscale, Otsu threshold, dilation) with OpenCV before OCR.
Runs Tesseract on every page in parallel via a thread pool.
Splits the combined OCR output into overlapping chunks at paragraph and sentence boundaries.
Sends each chunk to the LLM with a structured correction prompt — fixes typos, joins broken words, removes page numbers, keeps headings intact.
Optionally runs a second pass to reformat the corrected text as Markdown.
Reassembles the chunks and writes both the raw OCR (*__raw_ocr_output.txt) and the cleaned result (*_llm_corrected.md) to disk.
Asks the LLM to score the final output 0–100 against the raw text and gives a short explanation.

The Flask web app also produces a summary, a French translation, a named-entity list, and audio versions of all of these via gTTS.

Repo layout

llm-aided-ocr/
├── src/                            # Python source
│   ├── llm_aided_ocr.py            # Pipeline + Flask web app
│   └── llm-aided-ocr-cli.py        # Command-line wrapper
├── notebooks/
│   └── OCR_phase1.ipynb            # Early experiments
├── docs/
│   ├── research/
│   │   ├── Research_paper_OCR_LLM.docx
│   │   └── 21BCE1950_53074.pdf
│   ├── patents/
│   │   └── PATENT_OCR_LLM_17_mar.docx
│   └── presentations/
│       ├── capstone_ppt.pptx
│       └── poster_llm_ocr.pptx
├── samples/                        # A Warren Buffett letter, raw OCR, and LLM-corrected output
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── index.html                      # Project website
└── README.md

Requirements

You need three things installed on your machine before the Python deps will be useful.

Python 3.8 or higher.
Tesseract OCR. Install from the Tesseract releases and make sure tesseract is on your PATH. Grab the English language pack (eng.traineddata) at minimum.
Poppler.
- Linux: sudo apt-get install poppler-utils
- macOS: brew install poppler
- Windows: download the Poppler binaries, unzip, and add the bin/ folder to your PATH.

If you want to run the local Llama option on a GPU, you'll also need CUDA and llama-cpp-python built with CUDA support:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

Setup

git clone https://github.com/omprxkash/llm-aided-ocr.git
cd llm-aided-ocr

python -m venv venv
source venv/bin/activate            # Windows: venv\Scripts\activate

pip install -r requirements.txt

Then create a .env file in the repo root with your provider settings.

# Pick one: API or local
USE_LOCAL_LLM=False
API_PROVIDER="OPENAI"

# OpenAI
OPENAI_API_KEY="sk-..."
OPENAI_COMPLETION_MODEL="gpt-4o-mini"

# Local Llama (only if USE_LOCAL_LLM=True)
DEFAULT_LOCAL_MODEL_NAME="Llama-3.1-8B-Lexi-Uncensored_Q5_fixedrope.gguf"
LOCAL_LLM_CONTEXT_SIZE_IN_TOKENS=2048
USE_VERBOSE=False

The local model file will be downloaded automatically on first run if it isn't already in ./models/.

Running it

Web app

python src/llm_aided_ocr.py

Open http://127.0.0.1:5000 in your browser, upload a PDF, and you'll get back the cleaned text, a summary, a French translation, the named entities, and audio playback for each.

CLI

# Whole document, default settings
python src/llm-aided-ocr-cli.py document.pdf

# Pages 2–10 only, no Markdown formatting
python src/llm-aided-ocr-cli.py document.pdf --skip-pages 1 --max-pages 9 --no-markdown

# Tune the hallucination filter
python src/llm-aided-ocr-cli.py document.pdf --threshold 0.35 --test-filtering

Available flags:

Docker

docker-compose.yml is configured for NVIDIA GPU passthrough and ships with Tesseract, Poppler, Python 3.12, and the Llama dependencies baked in.

docker compose up -d

The container exposes SSH on port 3232 and shares 32 GB of shared memory.

How the pipeline works

Configuration — reads .env.
PDF → images — pdf2image rasterises each page using Poppler.
OCR — Tesseract runs over every page in a thread pool.
Chunking — the combined text is split into ~8000-character chunks that respect paragraph and sentence boundaries, with overlap between chunks so the LLM never loses context.
LLM correction — each chunk is sent to the configured LLM with a prompt that asks it to fix OCR errors without changing the meaning. The OpenAI path runs chunks concurrently with asyncio.gather(); the local Llama path runs sequentially.
Markdown pass (optional) — a second prompt converts the corrected text into properly structured Markdown.
Reassembly + save — chunks are stitched back together and written to disk.
Quality score — a final LLM call compares the raw OCR and the cleaned output and returns a score out of 100 with an explanation.

A worked example

The samples/ folder has a real before/after pair from a 1975 Warren Buffett letter to Katharine Graham about pension fund management.

Raw OCR (a few lines):

fo control promises rationally, it is necessary to
understand the basic arithmetic and practical rules
governing pension plans.

-2-

they had $2.2 billion in the hank", whose economic
results would impact Future values for the shareholders.

This requires a very wise and informed
elient ~ and even then is not free from pitfaLLs.

After LLM correction:

## The Irreversible Nature of Pension Promises

To control promises rationally, it is necessary to understand
the basic arithmetic and practical rules governing pension plans.

they had $2.2 billion in the bank, whose economic results would
impact future values for the shareholders.

This requires a very wise and informed client — and even then
is not free from pitfalls.

fo → To, hank → bank, elient → client, pitfaLLs → pitfalls, the stray -2- page number is gone, and the heading is properly formatted.

Project artefacts

The docs/ folder contains the academic work that goes with this project:

Research paper — full write-up of the pipeline and results
Patent filing — covers the chunking strategy and quality scoring approach
Capstone presentation — defence slides
Conference poster — visual summary

The samples/ folder has the sample document, the raw Tesseract output, and the LLM-corrected version side by side.

The project website (index.html) ties all of it together with a walk-through of the pipeline, a live before/after comparison, and links to every artefact.

License

See the repository for license details.

Flag	Default	What it does
`--max-pages`	`0` (all)	How many pages to process
`--skip-pages`	`0`	How many pages to skip at the start
`--threshold`	`0.40`	Starting hallucination similarity threshold
`--check-english`	off	Validate that extracted text is English
`--no-markdown`	off	Output plain text instead of Markdown
`--db-path`	`./sentence_embeddings.sqlite`	Embeddings DB path
`--test-filtering`	off	Test hallucination filtering on existing output

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM-Aided OCR

Why this exists

What it does

Repo layout

Requirements

Setup

Running it

Web app

CLI

Docker

How the pipeline works

A worked example

Project artefacts

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
docs		docs
notebooks		notebooks
samples		samples
src		src
.gitignore		.gitignore
.nojekyll		.nojekyll
.python-version		.python-version
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
index.html		index.html
requirements.txt		requirements.txt

Sunbelt Computer Software

PL/B Language Development and Support

Folders and files

Latest commit

History

Repository files navigation

LLM-Aided OCR

Why this exists

What it does

Repo layout

Requirements

Setup

Running it

Web app

CLI

Docker

How the pipeline works

A worked example

Project artefacts

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages