Sunbelt Computer Software

LabelForge

Deterministic, replayable, ablation-friendly multimodal labeling and synthetic-data pipeline built on Ray + vLLM.

Features

Multimodal Labeling: VLM-based image captioning, attribute tagging, and text classification
Deterministic Pipelines: Prompt/model pinning, stable row IDs, and explicit seed management
Replayable Runs: Content-addressed caching and comprehensive manifests for full reproducibility
Ablation-Friendly: Matrix experiments with shared caches and run comparison tools
Rubric Scoring: Large-batch structured evaluation with guided decoding
Hard-Negative Mining: Embedding + reranking pipeline for high-quality training data
Synthetic Data Generation: Text and VLM-grounded conversation synthesis with quality filtering

Architecture

LabelForge uses Ray Data LLM (build_processor + vLLMEngineProcessorConfig) as the inference backbone, providing:

Efficient batch inference across multiple GPUs
Per-row sampling parameters for ablations
Native VLM support with PIL image inputs
Guided decoding for structured JSON outputs

Key Components

labelforge/
├── core/           # Schemas, hashing, seeds, environment capture
├── io/             # Dataset I/O, JSONL manifests, images
├── llm/            # Ray Data LLM processor factory, determinism toggles
├── pipelines/      # Stage abstraction, DAG, runner
├── cache/          # Content-addressed row/stage caching
├── mining/         # Hard-negative candidate generation and selection
├── synth/          # Synthetic data generation and deduplication
├── eval/           # Score normalization and metrics
└── cli/            # Command-line interface

Installation

# Basic installation
pip install labelforge

# With S3 cache backend
pip install labelforge[s3]

# Development
pip install -e ".[dev]"

Quick Start

# Run a labeling pipeline
labelforge run --config configs/mvp.yaml

# Replay a previous run
labelforge replay --manifest runs/<run_id>/manifest.jsonl

# Compare two runs
labelforge diff runs/<run_a> runs/<run_b>

# Inspect run artifacts
labelforge inspect runs/<run_id>

Determinism

LabelForge provides two determinism modes:

Standard Mode (default): Uses VLLM_BATCH_INVARIANT=1 for scheduling-insensitive outputs. Best throughput with reproducible results.
Strict Mode: Additionally enables Ray Data preserve_order and disables vLLM multiprocessing. Maximum reproducibility at the cost of throughput.

Requirements for Reproducibility

Same code revision and config
Pinned prompt pack version
Pinned model revision
Fixed seeds
Same hardware profile (GPU type, count)
Same Ray + vLLM versions

See docs/determinism.md for detailed caveats.

Documentation

License and Usage

This software is proprietary under a Portfolio/Research-Only License.

No commercial or professional use permitted
Research use requires citation — see CITATION.cff
No external contributions accepted

See LICENSE for full terms.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github		.github
configs		configs
docs		docs
labelforge		labelforge
prompts/mvp		prompts/mvp
tests		tests
.gitignore		.gitignore
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
pyproject.toml		pyproject.toml

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LabelForge

Features

Architecture

Key Components

Installation

Quick Start

Determinism

Requirements for Reproducibility

Documentation

License and Usage

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Sunbelt Computer Software

PL/B Language Development and Support

Folders and files

Latest commit

History

Repository files navigation

LabelForge

Features

Architecture

Key Components

Installation

Quick Start

Determinism

Requirements for Reproducibility

Documentation

License and Usage

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages