loganpowell/MinerOS

 
 

MinerOS

MinerOS is an Apache 2.0-licensed fork of MinerU — a high-accuracy document parsing engine that converts PDF, Word, PPT, and images into structured Markdown/JSON for LLM · RAG · Agent workflows.

Why "OS"?

MinerU's upstream license history is complicated: the project briefly adopted AGPLv3 before reverting. MinerOS is pinned to the last clean Apache 2.0 commit (e148afa9) and kept Apache 2.0 going forward — making it safe to embed in commercial and government applications without AGPL copyleft concerns.

The OS suffix signals:

  • Open Source — fully Apache 2.0, no AGPL, no CC-BY-NC
  • Open Standard — suitable for government procurement, regulated industries, and open-data pipelines
  • OS-level reliability — designed to run as infrastructure, not just a script

What MinerOS adds over upstream MinerU

| Feature | MinerU upstream | MinerOS |
|---|---|---|
| License | Briefly AGPLv3, reverted | Apache 2.0 throughout |
| Tracked-changes / strikethrough detection | | ✅ (`~~struck text~~` in Markdown) |
| Remote VLM via .env auto-config | manual | .env loaded automatically |
| Package name | mineru | mineros |

Core Parsing Capabilities

  • PDF · DOCX · PPTX · Images → Markdown + JSON
  • Tracked-changes detection — renders struck-through text as ~~...~~ in Markdown output (critical for government contracts, legislative drafts, redlined legal documents)
  • Formulas → LaTeX · Tables → HTML · accurate layout reconstruction
  • Scanned docs, handwriting, multi-column layouts, cross-page table merging
  • Output follows human reading order with automatic header/footer removal
  • VLM + OCR dual engine, 109-language OCR recognition
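
Because struck-through text is emitted as `~~...~~`, downstream code can separate deleted wording from surviving wording with a regular expression. The following is a minimal sketch, assuming the Markdown output has already been read into a string. The names `struck_spans` and `without_struck` are illustrative helpers and are not part of MinerOS.

```python
import re

# Matches ~~struck text~~; "." does not cross line breaks, so a marker pair
# must open and close on the same line.
STRUCK = re.compile(r"~~(.+?)~~")


def struck_spans(markdown: str) -> list[str]:
    """Return every struck-through fragment in document order."""
    return STRUCK.findall(markdown)


def without_struck(markdown: str) -> str:
    """Drop struck-through fragments and collapse the doubled spaces left behind."""
    cleaned = STRUCK.sub("", markdown)
    return re.sub(r"[ \t]{2,}", " ", cleaned)


sample = "The fee is ~~$500~~ $750, due within ~~30~~ 45 days."
print(struck_spans(sample))
print(without_struck(sample))
```

This simple pattern would also match `~~` sequences inside code spans or HTML tables, so a real pipeline may need to exclude those regions first.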

Deployment Backends

| Backend | Best For |
|---|---|
| pipeline | Fast & stable, no hallucination, runs on CPU or GPU |
| vlm-http-client | High accuracy via remote OpenAI-compatible VLM server (e.g., Azure llama.cpp) |
| hybrid-http-client | High accuracy + local OCR, minimal local VRAM |
| vlm-auto-engine | High accuracy via local vLLM / LMDeploy / mlx |
| hybrid-auto-engine | Best accuracy, native text extraction, low hallucination |

Quick Start

Install

pip install uv
uv pip install -e ".[core]"

Or from PyPI (once published):

uv pip install "mineros[core]"

Run

# Basic parsing (auto-selects best available backend)
mineros -p <input.pdf> -o <output_dir>

# CPU-only (pipeline backend)
mineros -p <input.pdf> -o <output_dir> -b pipeline

# Remote VLM server (reads MINERU_VL_SERVER / MINERU_VL_API_KEY / MINERU_VL_MODEL_NAME from .env)
mineros -p <input.pdf> -o <output_dir> -b vlm-http-client

# Specific page range
mineros -p <input.pdf> -o <output_dir> -b vlm-http-client -s 0 -e 3
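
For batch jobs it can be convenient to drive the CLI from Python. The sketch below only assembles the flags shown above (`-p`, `-o`, `-b`, `-s`, `-e`) and hands them to `subprocess`. The helper names `build_command` and `run_mineros` are hypothetical, and the sketch assumes the `mineros` executable is on `PATH`.

```python
import subprocess
from typing import Optional


def build_command(
    pdf: str,
    out_dir: str,
    backend: Optional[str] = None,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> list[str]:
    """Assemble the argument list for one `mineros` invocation."""
    cmd = ["mineros", "-p", str(pdf), "-o", str(out_dir)]
    if backend is not None:
        cmd += ["-b", backend]      # e.g. "pipeline" or "vlm-http-client"
    if start is not None:
        cmd += ["-s", str(start)]   # first page of the range
    if end is not None:
        cmd += ["-e", str(end)]     # last page of the range
    return cmd


def run_mineros(*args, **kwargs) -> subprocess.CompletedProcess:
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(*args, **kwargs), check=True)


print(build_command("report.pdf", "out", backend="pipeline", start=0, end=3))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.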

Environment Variables (.env)

# Remote VLM server (OpenAI-compatible)
MINERU_VL_SERVER=https://your-llm-server.example.com
MINERU_VL_API_KEY=your-api-key
MINERU_VL_MODEL_NAME=your-model-name

# Required when server n_ctx is small (e.g., llama.cpp with 8192 context)
MINEROS_PROCESSING_WINDOW_SIZE=1

The .env file is loaded automatically via python-dotenv — no manual export needed.
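
A quick pre-flight check can catch an incomplete `.env` before a long parsing run starts. The sketch below is a rough illustration using only the standard library: it assumes plain `KEY=VALUE` lines (python-dotenv itself accepts richer syntax), and it assumes that all three `MINERU_VL_*` variables are needed for the `vlm-http-client` backend. The function names are hypothetical.

```python
REQUIRED_VLM_VARS = (
    "MINERU_VL_SERVER",
    "MINERU_VL_API_KEY",
    "MINERU_VL_MODEL_NAME",
)


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vlm_settings(env: dict[str, str]) -> list[str]:
    """List required remote-VLM variables that are absent or empty."""
    return [name for name in REQUIRED_VLM_VARS if not env.get(name, "").strip()]


example = """\
# Remote VLM server (OpenAI-compatible)
MINERU_VL_SERVER=https://your-llm-server.example.com
MINERU_VL_API_KEY=
"""
print(missing_vlm_settings(parse_env_text(example)))
```

To check a real file, pass its contents instead, for example `parse_env_text(open(".env", encoding="utf-8").read())`.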

Hardware Requirements

| Backend | pipeline | hybrid-auto-engine | vlm-auto-engine | hybrid-http-client | vlm-http-client |
|---|---|---|---|---|---|
| Pure CPU | | | | | |
| Min VRAM | 4 GB | 8 GB | 8 GB | 2 GB | None |
Min RAM 16 GB (32 GB recommended) 16 GB
Python 3.10 – 3.13
OS Linux (2019+) · Windows · macOS 14+
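
The VRAM figures can be turned into a small lookup for choosing a backend on a given machine. This is only a sketch: it assumes the Min VRAM values map to the backends in header order (pipeline, hybrid-auto-engine, vlm-auto-engine, hybrid-http-client, vlm-http-client), it treats "None" as 0 GB, and it ignores the CPU-only path that the pipeline backend also supports. `backends_within_vram` is a hypothetical helper, not a MinerOS function.

```python
# Minimum GPU memory per backend in GB, read from the Min VRAM row
# (column order is an assumption; "None" is represented as 0).
MIN_VRAM_GB = {
    "pipeline": 4,
    "hybrid-auto-engine": 8,
    "vlm-auto-engine": 8,
    "hybrid-http-client": 2,
    "vlm-http-client": 0,
}


def backends_within_vram(available_gb: float) -> list[str]:
    """Return the backends whose minimum VRAM fits in the available memory."""
    return [name for name, need in MIN_VRAM_GB.items() if need <= available_gb]


print(backends_within_vram(6))
```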

Docker

# Build
docker build -f docker/global/Dockerfile -t mineros:latest .

# Run via Compose
docker compose -f docker/compose.yaml up

Known Issues

  • Reading order may be out of sequence in extremely complex multi-column layouts.
  • Strikethrough detection relies on the VLM visually identifying struck text — accuracy depends on model capability and image resolution.
  • Tables of contents and lists are recognized via rules; uncommon formats may be missed.
  • Comic books, art albums, and heavily stylized documents parse poorly.
  • OCR may produce inaccurate characters for lesser-known languages.

License

Apache 2.0

MinerOS is a derivative of MinerU (opendatalab), used and redistributed under the terms of the Apache 2.0 license as it existed at commit e148afa9. All modifications are also released under Apache 2.0.

Acknowledgments

MinerOS stands on the shoulders of MinerU and its dependencies.

About

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.
