Sunbelt Computer Software

MinerOS

MinerOS is an Apache 2.0-licensed fork of MinerU — a high-accuracy document parsing engine that converts PDF, Word, PPT, and images into structured Markdown/JSON for LLM · RAG · Agent workflows.

Why "OS"?

MinerU's upstream license history is complicated: the project briefly adopted AGPLv3 before reverting. MinerOS is pinned to the last clean Apache 2.0 commit (e148afa9) and kept Apache 2.0 going forward — making it safe to embed in commercial and government applications without AGPL copyleft concerns.

The OS suffix signals:

Open Source — fully Apache 2.0, no AGPL, no CC-BY-NC
Open Standard — suitable for government procurement, regulated industries, and open-data pipelines
OS-level reliability — designed to run as infrastructure, not just a script

What MinerOS adds over upstream MinerU

Feature	MinerU upstream	MinerOS
License	Briefly AGPLv3, reverted	Apache 2.0 throughout
Tracked-changes / strikethrough detection	❌	✅ (`~~struck text~~` in Markdown)
Remote VLM via `.env` auto-config	manual	`.env` loaded automatically
Package name	`mineru`	`mineros`

Core Parsing Capabilities

PDF · DOCX · PPTX · Images → Markdown + JSON
Tracked-changes detection — renders struck-through text as ~~...~~ in Markdown output (critical for government contracts, legislative drafts, redlined legal documents)
Formulas → LaTeX · Tables → HTML · accurate layout reconstruction
Scanned docs, handwriting, multi-column layouts, cross-page table merging
Output follows human reading order with automatic header/footer removal
VLM + OCR dual engine, 109-language OCR recognition
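
Because struck-through text is emitted as `~~...~~`, downstream code can separate deleted wording from surviving wording with a plain regular expression. The following is a minimal sketch, not part of MinerOS: the function name `split_tracked_changes` and the sample sentence are hypothetical, and it assumes struck spans never contain a literal `~~`.

```python
import re

# Matches text wrapped in double tildes, e.g. "~~old clause~~".
# Non-greedy so that two struck spans on one line stay separate.
STRUCK = re.compile(r"~~(.+?)~~", re.DOTALL)


def split_tracked_changes(markdown: str) -> tuple[list[str], str]:
    """Return (struck spans, text with the struck spans removed)."""
    struck = [m.group(1) for m in STRUCK.finditer(markdown)]
    surviving = STRUCK.sub("", markdown)
    # Collapse the runs of spaces left behind where a span was removed.
    surviving = re.sub(r"[ \t]{2,}", " ", surviving)
    return struck, surviving


sample = "The fee is ~~$500~~ $650, due within ~~30~~ 45 days."
struck, surviving = split_tracked_changes(sample)
print(struck)     # ['$500', '30']
print(surviving)  # The fee is $650, due within 45 days.
```

Keeping the struck spans as a separate list, rather than discarding them, lets a review tool show what was deleted alongside the final wording.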

Deployment Backends

Backend	Best For
`pipeline`	Fast & stable, no hallucination, runs on CPU or GPU
`vlm-http-client`	High accuracy via remote OpenAI-compatible VLM server (e.g., a llama.cpp server hosted on Azure)
`hybrid-http-client`	High accuracy + local OCR, minimal local VRAM
`vlm-auto-engine`	High accuracy via local vLLM / LMDeploy / mlx
`hybrid-auto-engine`	Best accuracy, native text extraction, low hallucination

Quick Start

Install

pip install uv
uv pip install -e ".[core]"

Or from PyPI (once published):

uv pip install "mineros[core]"

Run

# Basic parsing (auto-selects best available backend)
mineros -p <input.pdf> -o <output_dir>

# CPU-only (pipeline backend)
mineros -p <input.pdf> -o <output_dir> -b pipeline

# Remote VLM server (reads MINERU_VL_SERVER / MINERU_VL_API_KEY / MINERU_VL_MODEL_NAME from .env)
mineros -p <input.pdf> -o <output_dir> -b vlm-http-client

# Specific page range (-s start page, -e end page; 0-based in upstream MinerU)
mineros -p <input.pdf> -o <output_dir> -b vlm-http-client -s 0 -e 3
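
When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags shown above (`-p`, `-o`, `-b`, `-s`, `-e`) and the backend names from the Deployment Backends table. The helper `build_command` and the file names are hypothetical, not part of the MinerOS package.

```python
import shlex

# Backend names as listed in the Deployment Backends table.
BACKENDS = {
    "pipeline",
    "vlm-http-client",
    "hybrid-http-client",
    "vlm-auto-engine",
    "hybrid-auto-engine",
}


def build_command(input_path, output_dir, backend=None, start=None, end=None):
    """Assemble a mineros argument list from the flags used above."""
    if backend is not None and backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    cmd = ["mineros", "-p", str(input_path), "-o", str(output_dir)]
    if backend is not None:
        cmd += ["-b", backend]
    if start is not None:
        cmd += ["-s", str(start)]
    if end is not None:
        cmd += ["-e", str(end)]
    return cmd


cmd = build_command("contract.pdf", "out", backend="pipeline", start=0, end=3)
print(shlex.join(cmd))  # mineros -p contract.pdf -o out -b pipeline -s 0 -e 3
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` instead of a shell string avoids quoting problems with paths that contain spaces.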

Environment Variables (`.env`)

# Remote VLM server (OpenAI-compatible)
MINERU_VL_SERVER=https://your-llm-server.example.com
MINERU_VL_API_KEY=your-api-key
MINERU_VL_MODEL_NAME=your-model-name

# Required when server n_ctx is small (e.g., llama.cpp with 8192 context)
MINEROS_PROCESSING_WINDOW_SIZE=1

The .env file is loaded automatically via python-dotenv — no manual export needed.
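
To illustrate what automatic loading amounts to, here is a stdlib-only sketch of reading `KEY=VALUE` lines and checking that the three remote-VLM variables are present. It is a simplified stand-in, not python-dotenv's parser and not MinerOS's own code. The helper names `parse_env_text` and `apply_env` are hypothetical. It follows python-dotenv's default of leaving already-set variables untouched.

```python
import os

REQUIRED = ("MINERU_VL_SERVER", "MINERU_VL_API_KEY", "MINERU_VL_MODEL_NAME")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def apply_env(values: dict[str, str], environ=os.environ) -> list[str]:
    """Set variables that are not already set; return missing required names."""
    for key, value in values.items():
        environ.setdefault(key, value)  # existing variables win
    return [name for name in REQUIRED if not environ.get(name)]


sample = """\
# Remote VLM server (OpenAI-compatible)
MINERU_VL_SERVER=https://your-llm-server.example.com
MINERU_VL_API_KEY=your-api-key
MINEROS_PROCESSING_WINDOW_SIZE=1
"""
env = {}  # stand-in for os.environ so the demo has no side effects
missing = apply_env(parse_env_text(sample), env)
print(missing)  # ['MINERU_VL_MODEL_NAME']
```

A check like this before starting a long batch run surfaces a missing model name immediately rather than at the first request to the server.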

Hardware Requirements

Docker

# Build
docker build -f docker/global/Dockerfile -t mineros:latest .

# Run via Compose
docker compose -f docker/compose.yaml up

Known Issues

Reading order may be out of sequence in extremely complex multi-column layouts.
Strikethrough detection relies on the VLM visually identifying struck text — accuracy depends on model capability and image resolution.
Tables of contents and lists are recognized via rules; uncommon formats may be missed.
Comic books, art albums, and heavily stylized documents parse poorly.
OCR may produce inaccurate characters for lesser-known languages.

License

Apache 2.0

MinerOS is a derivative of MinerU (opendatalab), used and redistributed under the terms of the Apache 2.0 license as it existed at commit e148afa9. All modifications are also released under Apache 2.0.

Acknowledgments

MinerOS stands on the shoulders of MinerU and its dependencies.

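Hardware requirements by backend:

Backend	pipeline	hybrid-auto-engine	vlm-auto-engine	hybrid-http-client	vlm-http-client
Pure CPU	✅	❌	❌	✅	✅
Min VRAM	4 GB	8 GB	8 GB	2 GB	None
Min RAM	16 GB (32 GB recommended)	16 GB (32 GB recommended)	16 GB (32 GB recommended)	16 GB	16 GB
Python	3.10 – 3.13 (all backends)
OS	Linux (2019+) · Windows · macOS 14+ (all backends)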
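
The table above can be encoded as data to check which backends a given machine could run. This is a rough sketch under stated assumptions: the per-backend values reflect one reading of the table's merged cells, the VRAM minimum is treated as binding only for backends that cannot run on pure CPU, and `Requirement` and `usable_backends` are hypothetical names rather than part of MinerOS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Requirement:
    cpu_only: bool              # "Pure CPU" row
    min_vram_gb: Optional[int]  # "Min VRAM" row; None means no local VRAM needed
    min_ram_gb: int             # "Min RAM" row (minimum, not the recommended value)


# Assumed reading of the hardware table, one entry per backend.
REQUIREMENTS = {
    "pipeline": Requirement(True, 4, 16),
    "hybrid-auto-engine": Requirement(False, 8, 16),
    "vlm-auto-engine": Requirement(False, 8, 16),
    "hybrid-http-client": Requirement(True, 2, 16),
    "vlm-http-client": Requirement(True, None, 16),
}


def usable_backends(ram_gb: float, vram_gb: float = 0.0) -> list[str]:
    """List backends whose minimums are met by the given machine."""
    ok = []
    for name, req in REQUIREMENTS.items():
        if ram_gb < req.min_ram_gb:
            continue
        has_enough_vram = req.min_vram_gb is None or vram_gb >= req.min_vram_gb
        # Assumption: a CPU-capable backend does not need the VRAM minimum.
        if req.cpu_only or has_enough_vram:
            ok.append(name)
    return ok


print(usable_backends(ram_gb=16))
# ['pipeline', 'hybrid-http-client', 'vlm-http-client']
print(len(usable_backends(ram_gb=32, vram_gb=8)))  # 5
print(usable_backends(ram_gb=8))  # []
```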
