Distributed reliability layer for agentic AI. Raft-inspired, step-level consensus for multi-step agent pipelines — verify every step before it commits, and roll back to the last good checkpoint on failure.
Agent reliability compounds in the wrong direction. At 95% per-step reliability, a 20-step pipeline succeeds only ~36% of the time (0.95²⁰ ≈ 0.36). Errors propagate confidently and silently, which is why enterprises can't ship agentic AI for critical workflows.
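The compounding arithmetic is easy to check by hand. A minimal sketch, assuming each step succeeds independently with the same probability (`pipeline_success` is an illustrative helper, not part of AgentRaft):

```python
def pipeline_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


# Longer pipelines decay quickly even at a high per-step rate.
for n in (5, 10, 20):
    print(f"{n:>2} steps at 95% per step: {pipeline_success(0.95, n):.0%}")
```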
AgentRaft borrows from the Raft consensus algorithm: nothing is committed to the log without agreement. Here, a cheap verifier stands in for the quorum — it judges each step's output before it's committed to a checkpoint store. Verification is far cheaper than generation (verification asymmetry), so a small 3–7B model can guard the work of a much larger agent at 10–100× lower cost than re-running the pipeline.
run step → verify → ✓ commit checkpoint
→ ✗ classify error → rollback → retry with typed hint
→ circuit breaker if it cascades
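The loop above can be sketched in plain Python. This is a simplified illustration of the control flow only, not AgentRaft's Coordinator: `Verdict`, `RunLog`, and `run_with_verification` are hypothetical names, the circuit breaker is omitted, and "rollback" is modelled as re-reading the last committed checkpoint before a retry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Verdict:
    ok: bool
    error_class: Optional[str] = None  # e.g. "GOAL_DRIFT"


@dataclass
class RunLog:
    checkpoints: list = field(default_factory=list)  # append-only verified outputs
    rollbacks: int = 0


def run_with_verification(steps, verify, max_retries=3):
    """steps: list of (name, fn(prev_output, hint) -> output)."""
    log = RunLog()
    for name, fn in steps:
        hint = None
        for _attempt in range(max_retries + 1):
            # Every attempt starts from the last committed checkpoint.
            prev = log.checkpoints[-1] if log.checkpoints else None
            output = fn(prev, hint)
            verdict = verify(name, output)
            if verdict.ok:
                log.checkpoints.append(output)  # commit only verified output
                break
            log.rollbacks += 1  # rejected output is never written to the log
            hint = f"previous attempt was classified as {verdict.error_class}"
        else:
            raise RuntimeError(f"step {name!r} exhausted its retry budget")
    return log


# Usage: the draft step drifts once, then recovers when given a hint.
def draft(prev, hint):
    return "off-topic text" if hint is None else "memo draft"


def verify(name, output):
    return Verdict(ok=output != "off-topic text", error_class="GOAL_DRIFT")


log = run_with_verification(
    [("research", lambda prev, hint: "sources"), ("draft", draft)], verify
)
print(log.checkpoints, log.rollbacks)
```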
pip install agentraft # core (zero deps, rules-only verifier)
pip install "agentraft[bedrock]" # + Amazon Bedrock verifier (Converse API)
pip install "agentraft[openai]" # + OpenAI verifier
pip install "agentraft[anthropic]" # + Anthropic verifier
pip install "agentraft[google]" # + Google Gemini verifier
pip install "agentraft[redis]" # + durable Redis checkpoint store
pip install "agentraft[all]" # everything

Most enterprise agents run on Amazon Bedrock, so it's a first-class provider. AgentRaft uses the unified Bedrock Converse API, which means a single verifier code path supports every chat model on Bedrock; just change the model id:
from agentraft import wrap
from agentraft.verifier import LLMVerifier, TieredVerifier, RulesVerifier
# Amazon Bedrock — Claude, Llama, Mistral, Amazon Nova, Cohere, AI21
verifier = TieredVerifier(l1=RulesVerifier(), l2=LLMVerifier.bedrock(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",  # or meta.llama3-1-70b-instruct-v1:0, amazon.nova-lite-v1:0, …
    region="us-east-1",
))
coordinator = wrap(pipeline, verifier=verifier)

| Provider | Constructor | Models |
|---|---|---|
| Amazon Bedrock | LLMVerifier.bedrock(model=…) | Claude · Llama · Mistral · Amazon Nova/Titan · Cohere · AI21 |
| OpenAI | LLMVerifier.openai(model=…) | GPT-4o, GPT-4o-mini, … |
| Anthropic | LLMVerifier.anthropic(model=…) | Claude (direct API) |
| Google Gemini | LLMVerifier.gemini(model=…) | Gemini 1.5/2.x |

wrap() auto-detects the provider from the environment: AWS credentials → Bedrock, else OPENAI_API_KEY, ANTHROPIC_API_KEY, or GOOGLE_API_KEY. Force one with AGENTRAFT_VERIFIER_PROVIDER=bedrock and AGENTRAFT_VERIFIER_MODEL=….
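The precedence described above could look roughly like the following. This is a sketch of the stated order, not AgentRaft's actual detection code: `detect_provider` is a hypothetical helper, "AWS credentials" is approximated by two common environment variables, and the returned provider strings are assumptions.

```python
import os
from typing import Mapping, Optional


def detect_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a provider name, or None when only the rules verifier applies."""
    # An explicit override wins over any auto-detection.
    forced = env.get("AGENTRAFT_VERIFIER_PROVIDER")
    if forced:
        return forced
    # Assumption: real credential discovery also covers config files and
    # instance roles; only two environment variables are checked here.
    if env.get("AWS_ACCESS_KEY_ID") or env.get("AWS_PROFILE"):
        return "bedrock"
    # Fall back to the first API key found, in the documented order.
    for var, provider in (
        ("OPENAI_API_KEY", "openai"),
        ("ANTHROPIC_API_KEY", "anthropic"),
        ("GOOGLE_API_KEY", "google"),  # provider string is an assumption
    ):
        if env.get(var):
            return provider
    return None


print(detect_provider({"ANTHROPIC_API_KEY": "sk-example"}))
```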
import asyncio
from agentraft import wrap, Pipeline, Step, Task, Criticality
async def research(ctx): return f"Sources on: {ctx.task.goal}"
async def draft(ctx): return "Board memo draft …"
async def review(ctx): return "Reviewed and approved."
pipeline = Pipeline([
    Step("research", research, goal="Gather relevant sources"),
    Step("draft", draft, goal="Write an on-topic memo", criticality=Criticality.HIGH),
    Step("review", review, goal="Check accuracy and tone"),
])
async def main():
    result = await wrap(pipeline).run(Task(goal="Write the Q3 board memo"))
    print(result.summary())  # {'success': True, 'verified': '3/3', 'rollbacks': 0, ...}
    print(result.output)

asyncio.run(main())

Set OPENAI_API_KEY or ANTHROPIC_API_KEY and wrap() automatically uses a tiered verifier (L1 rules → LLM). With no key, it runs rules-only.
python -m examples.document_workflow # scripted: step 3 drifts, then recovers
python -m examples.document_workflow --live # uses a real LLM verifier

▶ research_agent
✓ research_agent COMMITTED
✓ analysis_agent COMMITTED
✗ draft_agent GOAL_DRIFT
↺ draft_agent rollback → checkpoint_2
⟳ draft_agent retry with hint
✓ draft_agent COMMITTED
✓ review_agent COMMITTED
✓ publish_agent COMMITTED
🎉 run_success verified 5/5 · rollbacks 1 · reliability 1.0
AgentRaft is composed of five replaceable components:
| Component | Role | Default impl |
|---|---|---|
| Coordinator | Runs the consensus loop — sequence, verify, commit, rollback, retry | coordinator.py |
| Worker Agent | Your existing pipeline steps — unchanged | your code |
| Verifier | Judges each step against its goal; assigns a typed error class | RulesVerifier + LLMVerifier, routed by TieredVerifier |
| Checkpoint Store | Append-only log of verified outputs; rollback target | InMemoryCheckpointStore / RedisCheckpointStore |
| Circuit Breaker | Stops error cascades and runaway cost | CircuitBreaker + RetryPolicy |
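To make the Circuit Breaker row concrete, here is a minimal sketch of a breaker that opens after a run of consecutive failures and closes again after a cooldown. It is not AgentRaft's `CircuitBreaker`: the class `SimpleBreaker` and its methods are hypothetical, and only the parameter names `failure_threshold` and `cooldown_seconds` are borrowed from the configuration options.

```python
import time


class SimpleBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock          # injectable so tests need not sleep
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # timestamp when the breaker opened

    def allow(self) -> bool:
        """Return True if a step may run right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        """Feed the outcome of a verified step back into the breaker."""
        if success:
            self._failures = 0
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


breaker = SimpleBreaker(failure_threshold=2, cooldown_seconds=30.0)
breaker.record(False)
breaker.record(False)
print(breaker.allow())  # the breaker is open, so further steps are blocked
```

Counting only consecutive failures means a single success resets the breaker, which keeps occasional retries from tripping it while still stopping a cascade.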
Verification isn't binary. Each failure is classified (for example GOAL_DRIFT or INCOMPLETE), and the class maps to a typed correction hint that is injected into the retry.
TieredVerifier runs the cheap L1 rules gate on every step, then escalates by step criticality:
- Criticality.LOW → L1 rules only
- Criticality.MEDIUM → L2 (small LLM verifier)
- Criticality.HIGH → L3 (large LLM verifier)
Most outputs clear at L1 for free; only critical or borderline ones pay for a model call.
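The routing can be sketched as a small function. This is an illustration under assumptions, not the real `TieredVerifier`: `Criticality` here is a local stand-in enum, verifiers are plain callables returning a bool, and escalation of "borderline" outputs is left out because the rule for it is not described above.

```python
from enum import Enum
from typing import Callable, Tuple

Check = Callable[[str], bool]


class Criticality(Enum):  # local stand-in for agentraft.Criticality
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def tiered_verify(output: str, criticality: Criticality,
                  l1: Check, l2: Check, l3: Check) -> Tuple[bool, str]:
    """Return (passed, tier_that_decided)."""
    # The cheap rules gate runs on every step and can reject outright.
    if not l1(output):
        return False, "L1"
    if criticality is Criticality.LOW:
        return True, "L1"  # no model call needed
    if criticality is Criticality.MEDIUM:
        return l2(output), "L2"
    return l3(output), "L3"


# Usage with stub verifiers: non-empty rule, then keyword checks.
non_empty = lambda text: bool(text.strip())
small_llm = lambda text: "memo" in text
large_llm = lambda text: "Q3" in text

print(tiered_verify("Q3 board memo", Criticality.HIGH, non_empty, small_llm, large_llm))
```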
Pass an on_event hook to stream protocol events (STEP_COMMITTED, STEP_ROLLBACK, …) into a dashboard, logger, or the live monitor on agentraft.io:
from agentraft import wrap, EventType
def on_event(e):
    if e.type == EventType.STEP_ROLLBACK:
        print("rolled back", e.step_name)

coordinator = wrap(pipeline, on_event=on_event)

from agentraft import wrap, RulesVerifier, TieredVerifier
from agentraft.verifier import LLMVerifier
coordinator = wrap(
    pipeline,
    verifier=TieredVerifier(l1=RulesVerifier(), l2=LLMVerifier(provider="anthropic")),
    max_retries=3,             # per-step retry budget
    failure_threshold=5,       # consecutive failures before the breaker opens
    cooldown_seconds=30,       # breaker cooldown
    rollback_on_failure=True,  # revert checkpoints before retrying
)

How much reliability does AgentRaft actually buy? The benchmark measures it via
controlled fault injection: agents fail at a tunable per-step rate and emit
taxonomy-typed bad outputs, so ground truth is known exactly and runs go through the
real Coordinator.
python -m benchmarks --quick # fast smoke run
python -m benchmarks --trials 1000 # tighter numbers
python -m benchmarks --live --provider bedrock \
  --model anthropic.claude-3-5-sonnet-20241022-v2:0 # measure a real verifier

It reports three things:
- Baseline vs AgentRaft: success and silent-corruption rate (a wrong result shipped undetected, the metric AgentRaft is built to crush).
- Length sweep: the baseline follows the 0.9ⁿ reliability-compounding decay (59% → 35% → 21% → 12% at 5/10/15/20 steps) while AgentRaft stays flat.
- Verifier-quality sweep: end-to-end reliability tracks verifier recall, which is the quantitative case for the fine-tuned verifier as the moat.
In --live mode it also prints a per-error-class confusion table for a real verifier
— the rules gate catches INCOMPLETE but misses the semantic classes, which is exactly
why the LLM/fine-tuned verifier matters. Full methodology and honest limitations:
benchmarks/README.md.
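A back-of-the-envelope model shows why the length sweep and the verifier-quality sweep behave as described. It is a simplifying assumption, not the benchmark's methodology: each step independently emits a bad output with probability `fault_rate`, the verifier catches a bad output with probability `recall`, and a caught fault is assumed to be repaired on retry. `clean_run_probability` is a hypothetical helper.

```python
def clean_run_probability(fault_rate: float, recall: float, steps: int) -> float:
    """Probability that no bad output slips through undetected in `steps` steps."""
    undetected_per_step = fault_rate * (1.0 - recall)
    return (1.0 - undetected_per_step) ** steps


# recall=0.0 reproduces the unverified 0.9**n baseline; higher recall flattens the curve.
for n in (5, 10, 15, 20):
    baseline = clean_run_probability(0.10, 0.0, n)
    guarded = clean_run_probability(0.10, 0.95, n)
    print(f"{n:>2} steps  baseline {baseline:.0%}  verifier recall 0.95 {guarded:.0%}")
```

Under this model the only faults that reach the output are the ones the verifier misses, so the end-to-end figure is driven by recall rather than by the per-step fault rate alone.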
pip install -e ".[dev]"
pytest # run the test suite (SDK + benchmark)
ruff check . # lint

Early alpha. The Python SDK is the reference implementation of the protocol. A high-performance Go Coordinator, a fine-tuned verifier model served via vLLM, and a Kubernetes operator are on the roadmap.
Apache 2.0 © 2026 AgentRaft
