Sunbelt Computer Software

TraceOS

The Operating System for AI Experiments.

Turn every experiment into a standardized, reproducible asset — run, track, compare, and analyze with one workflow.

What is TraceOS

TraceOS is a lightweight experiment runtime system for AI research. It unifies the full lifecycle of an experiment:

Run experiments
Track execution history
Generate structured reports
Analyze results (capabilities, failures, recommendations)
Compare runs side by side

Every experiment becomes a standardized, traceable asset.

Quick Start

git clone https://github.com/aaa-mvc/traceos.git
cd traceos
pip install -e .
experiment run configs/mock_bottles.yaml

No GPU required. You will see:

TraceOS v0.1.0
Experiment: mock-bottles-v1
Run: run_9392e4aecb07
Plugin: mock

[1/4] Preparing dataset...     Done
[2/4] Training...              Completed
[3/4] Evaluating...            85.0% success
[4/4] Generating report...     Done

Done. Run: run_9392e4aecb07

What you got:
  report.html   --  outputs/run_9392e4aecb07/report.html
  analysis.json --  outputs/run_9392e4aecb07/analysis.json
  events.jsonl  --  outputs/run_9392e4aecb07/events.jsonl (17 lifecycle events)

Why this matters:
  experiment runs list              # All your experiments, searchable
  experiment compare <id1> <id2>    # Compare any two runs side by side
  experiment analyze <run-id>       # Capability scores + failure analysis + recommendations
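
The event trace can also be inspected outside the CLI. The following is a minimal sketch, assuming only that events.jsonl holds one JSON object per line; the `type` field name and the `summarize_events` helper are illustrative assumptions, not part of the TraceOS API.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_events(path):
    """Count events per type in a JSON Lines trace file.

    Assumption: each event carries a "type" field. The real TraceOS
    event schema may use a different key; adjust as needed.
    """
    counts = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            counts[event.get("type", "unknown")] += 1
    return counts
```

Calling `summarize_events` on a run's events.jsonl returns a `Counter`, so `sum(counts.values())` gives the total number of lifecycle events recorded for that run.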

Analyze a Run

experiment analyze <run-id>

Example analysis.json:

{
  "capability": {
    "cap_success":   { "label": "Success Rate", "value": 0.85 },
    "cap_precision": { "label": "Precision",    "value": 0.765 },
    "cap_speed":     { "label": "Speed",        "value": 0.636 },
    "cap_robustness":{ "label": "Robustness",   "value": 0.75 }
  },
  "failure": {
    "total_failures": 1,
    "categories": { "grasp_failure": 1 }
  },
  "recommendations": [
    {
      "priority": "high",
      "description": "Success rate is 85.0%. 1/5 episodes failed. Consider increasing training data.",
      "evidence": "success_rate=0.85, failures=1"
    }
  ]
}
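
Because analysis.json is plain JSON, it can be post-processed with a few lines of Python. The sketch below reuses the capability values from the example above and lists capabilities that fall under a cutoff; the `weakest_capabilities` helper and the 0.7 threshold are assumptions chosen for illustration.

```python
import json

# Capability section copied from the example analysis.json above.
ANALYSIS_JSON = """
{
  "capability": {
    "cap_success":    { "label": "Success Rate", "value": 0.85 },
    "cap_precision":  { "label": "Precision",    "value": 0.765 },
    "cap_speed":      { "label": "Speed",        "value": 0.636 },
    "cap_robustness": { "label": "Robustness",   "value": 0.75 }
  }
}
"""


def weakest_capabilities(analysis, threshold=0.7):
    """Return (label, value) pairs below `threshold`, lowest first.

    The threshold is a hypothetical cutoff, not a TraceOS default.
    """
    below = [
        (cap["label"], cap["value"])
        for cap in analysis["capability"].values()
        if cap["value"] < threshold
    ]
    return sorted(below, key=lambda item: item[1])


analysis = json.loads(ANALYSIS_JSON)
print(weakest_capabilities(analysis))  # → [('Speed', 0.636)]
```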

Compare Runs

experiment compare <run-id-1> <run-id-2>

Metric          Baseline   Current    Delta      Winner
----------------------------------------------------------
Success Rate    0.850      0.850      0.0%       tie
Precision       0.765      0.765      0.0%       tie
Speed           0.636      0.636      0.0%       tie
Robustness      0.750      0.750      0.0%       tie
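
The table above compares a run against itself, so every row is a tie. A rough sketch of how such a table could be derived from two sets of capability values follows. It assumes the delta is a relative change in percent and that higher values win; neither is confirmed by the source, and `compare_capabilities` is a hypothetical helper.

```python
def compare_capabilities(baseline, current, tolerance=1e-9):
    """Compare two {metric: value} mappings row by row.

    Assumptions (not confirmed by TraceOS): delta is the relative
    change in percent, and a higher value is always better.
    """
    rows = []
    for name in sorted(baseline.keys() & current.keys()):
        base, cur = baseline[name], current[name]
        diff = cur - base
        # A relative delta is undefined for a zero baseline.
        delta_pct = diff / base * 100.0 if base else None
        if abs(diff) <= tolerance:
            winner = "tie"
        elif diff > 0:
            winner = "current"
        else:
            winner = "baseline"
        rows.append((name, base, cur, delta_pct, winner))
    return rows


run = {"Success Rate": 0.85, "Precision": 0.765,
       "Speed": 0.636, "Robustness": 0.75}
for name, base, cur, delta, winner in compare_capabilities(run, run):
    print(f"{name:<15} {base:.3f}      {cur:.3f}      {delta:.1f}%       {winner}")
```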

Output Structure

Every experiment produces a standardized artifact directory:

outputs/<run-id>/
├── report.html          # Self-contained HTML report
├── analysis.json        # Structured analysis (RFC-0005)
├── events.jsonl         # Full execution trace (17+ events)
├── experiment.json      # Frozen config snapshot
├── artifacts.json       # Output index
├── train/
│   ├── checkpoint/last.pt
│   ├── metrics.jsonl
│   └── stdout.log
└── eval/
    └── summary.json
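
Since the layout is fixed, a run directory can be checked for completeness with a short script. This is a sketch only: the file list is copied from the tree above, and `missing_artifacts` is a hypothetical helper rather than a TraceOS command.

```python
from pathlib import Path

# Relative paths taken from the artifact tree above.
EXPECTED_FILES = [
    "report.html",
    "analysis.json",
    "events.jsonl",
    "experiment.json",
    "artifacts.json",
    "train/checkpoint/last.pt",
    "train/metrics.jsonl",
    "train/stdout.log",
    "eval/summary.json",
]


def missing_artifacts(run_dir):
    """Return the expected files that are absent from `run_dir`."""
    run_dir = Path(run_dir)
    return [rel for rel in EXPECTED_FILES if not (run_dir / rel).is_file()]
```

An empty list means every standard artifact is present; otherwise the returned paths show what the run failed to produce.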

Architecture

CLI (10 commands)
    ↓
Runner (lifecycle + event emission)
    ↓
Adapter Layer (ABC / Mock / Dummy)
    ↓
Artifact Layer (standard outputs + registry)
    ↓
Analysis Engine (capability / failure / compare / recommend)
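
To make the layering concrete, here is a simplified sketch of how a runner might drive an adapter and emit lifecycle events. All class, method, and event names (`BaseAdapter`, `MockAdapter`, `Runner`, `stage_started`, `stage_finished`) are assumptions made for illustration and do not reflect the actual TraceOS SDK. Python's `abc` module used here is unrelated to the ABC adapter.

```python
from abc import ABC, abstractmethod  # Python's abstract base classes


class BaseAdapter(ABC):
    """Hypothetical adapter interface: one method per lifecycle stage."""

    @abstractmethod
    def prepare(self, config): ...

    @abstractmethod
    def train(self, config): ...

    @abstractmethod
    def evaluate(self, config): ...


class MockAdapter(BaseAdapter):
    """Deterministic stand-in that needs no GPU."""

    def prepare(self, config):
        return {"status": "done"}

    def train(self, config):
        return {"status": "completed"}

    def evaluate(self, config):
        # "success_rate" is an assumed config key for this sketch.
        return {"success_rate": config.get("success_rate", 0.85)}


class Runner:
    """Runs the stages in order and records an event around each one."""

    STAGES = ("prepare", "train", "evaluate")

    def __init__(self, adapter):
        self.adapter = adapter
        self.events = []

    def _emit(self, kind, **payload):
        self.events.append({"type": kind, **payload})

    def run(self, config):
        results = {}
        for stage in self.STAGES:
            self._emit("stage_started", stage=stage)
            results[stage] = getattr(self.adapter, stage)(config)
            self._emit("stage_finished", stage=stage)
        return results


runner = Runner(MockAdapter())
results = runner.run({"name": "mock-bottles-v1"})
```

Keeping the runner unaware of what each stage does is what would let a subprocess-based adapter and a mock adapter share the same event trace and artifact layout.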

Why TraceOS

Without TraceOS	With TraceOS
Scripts scattered across directories	One command per experiment
Results hard to compare	`experiment compare` built in
No execution trace	Full event log per run
Manual analysis	Automated capability + failure analysis
No run history	Searchable registry

Adapters

Adapter	Config	GPU	Notes
mock	`configs/mock_bottles.yaml`	No	Instant demo, deterministic
dummy	`configs/dummy.yaml`	No	Minimal SDK example (45 lines)
abc	`configs/bottle.yaml`	Yes (8x)	Full ABC-130K DiT pipeline

Standards

TraceOS capability analysis follows the Capability Schema Spec, a lightweight standard that defines what a capability is, how metrics witness it, and how evidence supports scores. The schema is consumed by schema_adapter.py, which only transforms the schema and takes no part in the experiment runtime.
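
The relationship between a capability, its witnessing metrics, and its evidence can be pictured roughly as follows. This is an illustrative sketch, not the Capability Schema Spec itself: the `CapabilityDef` fields, the `score` function, and the choice of averaging the witness metrics are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class CapabilityDef:
    """Hypothetical capability definition."""
    key: str              # e.g. "cap_success"
    label: str            # human-readable name
    witnesses: List[str]  # metric names that witness the capability


def score(cap: CapabilityDef, metrics: Dict[str, float]) -> dict:
    """Score a capability from the metrics that witness it.

    Assumption: the score is the plain mean of the observed witness
    metrics. The real spec may define scoring differently.
    """
    observed = {m: metrics[m] for m in cap.witnesses if m in metrics}
    if not observed:
        raise ValueError(f"no witness metrics available for {cap.key}")
    return {
        "label": cap.label,
        "value": mean(observed.values()),
        # Evidence lists the raw metric values behind the score.
        "evidence": ", ".join(f"{k}={v}" for k, v in observed.items()),
    }


success = CapabilityDef("cap_success", "Success Rate", ["success_rate"])
result = score(success, {"success_rate": 0.85, "latency_s": 1.2})
```

Here `result["evidence"]` is the string `success_rate=0.85`, mirroring the style of the `evidence` field in the analysis.json example.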

Status

Roadmap (v0.2)

entry_points plugin discovery
OpenVLA / ACT / Pi0 adapters
Multi-domain capability schemas (agent, ocr, llm)
Improved failure analysis accuracy
Cloud runner abstraction

Acknowledgments

TraceOS was inspired by the ABC-130K project and was initially built as an experiment wrapper for it.

Paper: Scalable Behavior Cloning with Open Data, Training, and Evaluation — Allshire et al., arXiv 2606.27375 (2026)
Repository: amazon-far/abc
Dataset: XDOF/ABC-130k

The ABC Adapter wraps the original scripts via subprocess — zero lines of ABC code are modified. TraceOS is not affiliated with the ABC authors or Amazon.

License

Apache 2.0




Metric	Value
Version	v0.1.0
Tests	97 passed
CLI commands	10
Adapters	3
RFCs	5
License	Apache 2.0
