aaa-mvc/traceos: TraceOS standardizes AI experiments into reproducible, searchable, and comparable assets. One command runs experiments, generates reports, and produces structured analysis: capability vectors, failure taxonomy, and recommendations. Every run is tracked, traceable, and comparable. Built on ABC-130K (amazon-far/abc). Apache 2.0.

TraceOS

The Operating System for AI Experiments.

Turn every experiment into a standardized, reproducible asset — run, track, compare, and analyze with one workflow.


What is TraceOS

TraceOS is a lightweight experiment runtime system for AI research. It unifies the full lifecycle of an experiment:

  • Run experiments
  • Track execution history
  • Generate structured reports
  • Analyze results (capabilities, failures, recommendations)
  • Compare runs side by side

Every experiment becomes a standardized, traceable asset.


Quick Start

git clone https://github.com/aaa-mvc/traceos.git
cd traceos
pip install -e .
experiment run configs/mock_bottles.yaml

No GPU required. You will see:

TraceOS v0.1.0
Experiment: mock-bottles-v1
Run: run_9392e4aecb07
Plugin: mock

[1/4] Preparing dataset...     Done
[2/4] Training...              Completed
[3/4] Evaluating...            85.0% success
[4/4] Generating report...     Done

Done. Run: run_9392e4aecb07

What you got:
  report.html   --  outputs/run_9392e4aecb07/report.html
  analysis.json --  outputs/run_9392e4aecb07/analysis.json
  events.jsonl  --  outputs/run_9392e4aecb07/events.jsonl (17 lifecycle events)

Why this matters:
  experiment runs list              # All your experiments, searchable
  experiment compare <id1> <id2>    # Compare any two runs side by side
  experiment analyze <run-id>       # Capability scores + failure analysis + recommendations
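
The `events.jsonl` file is a JSON Lines trace, so it can be inspected with a few lines of Python. The sketch below is a minimal, schema-agnostic reader. The helper names (`load_events`, `count_by`) and the sample field names (`event`, `stage`) are assumptions made for illustration; the real keys in a TraceOS event may differ.

```python
import json
from collections import Counter


def load_events(lines):
    """Parse JSON Lines input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def count_by(events, key):
    """Tally events by the value of `key`; events lacking it go under '<missing>'."""
    return Counter(e.get(key, "<missing>") for e in events)


# Hypothetical sample lines. Real field names in events.jsonl may differ.
SAMPLE_LINES = [
    '{"event": "stage_started", "stage": "train"}',
    '{"event": "stage_finished", "stage": "train"}',
    "",
    '{"event": "stage_started", "stage": "eval"}',
]

if __name__ == "__main__":
    events = load_events(SAMPLE_LINES)
    print(len(events), "events")
    print(dict(count_by(events, "stage")))
```

For a real run, pass an open file handle instead of `SAMPLE_LINES`, for example `load_events(open("outputs/<run-id>/events.jsonl"))`, and pick a key that actually appears in the trace.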

Analyze a Run

experiment analyze <run-id>

Example analysis.json:

{
  "capability": {
    "cap_success":   { "label": "Success Rate", "value": 0.85 },
    "cap_precision": { "label": "Precision",    "value": 0.765 },
    "cap_speed":     { "label": "Speed",        "value": 0.636 },
    "cap_robustness":{ "label": "Robustness",   "value": 0.75 }
  },
  "failure": {
    "total_failures": 1,
    "categories": { "grasp_failure": 1 }
  },
  "recommendations": [
    {
      "priority": "high",
      "description": "Success rate is 85.0%. 1/5 episodes failed. Consider increasing training data.",
      "evidence": "success_rate=0.85, failures=1"
    }
  ]
}
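
Because the analysis is plain JSON, downstream tooling can read it directly. The following sketch ranks capabilities from weakest to strongest. It relies only on the fields shown in the example above; the function name `weakest_capabilities` and the inline sample are illustrative and not part of TraceOS.

```python
import json

# Inline sample mirroring the analysis.json example (only fields shown there).
SAMPLE = """
{
  "capability": {
    "cap_success":    {"label": "Success Rate", "value": 0.85},
    "cap_precision":  {"label": "Precision",    "value": 0.765},
    "cap_speed":      {"label": "Speed",        "value": 0.636},
    "cap_robustness": {"label": "Robustness",   "value": 0.75}
  },
  "failure": {"total_failures": 1, "categories": {"grasp_failure": 1}},
  "recommendations": []
}
"""


def weakest_capabilities(analysis: dict, n: int = 2) -> list[tuple[str, float]]:
    """Return the n lowest-scoring capabilities as (label, value) pairs."""
    caps = analysis.get("capability", {})
    pairs = [(c["label"], c["value"]) for c in caps.values()]
    return sorted(pairs, key=lambda p: p[1])[:n]


if __name__ == "__main__":
    analysis = json.loads(SAMPLE)
    for label, value in weakest_capabilities(analysis):
        print(f"{label}: {value:.3f}")
```

In practice you would load `outputs/<run-id>/analysis.json` with `json.load` rather than the inline sample.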

Compare Runs

experiment compare <run-id-1> <run-id-2>

Example output:

Metric          Baseline   Current    Delta      Winner
----------------------------------------------------------
Success Rate    0.850      0.850      0.0%       tie
Precision       0.765      0.765      0.0%       tie
Speed           0.636      0.636      0.0%       tie
Robustness      0.750      0.750      0.0%       tie
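
The same kind of table can be approximated from two `analysis.json` capability sections. This is a sketch under stated assumptions, not the logic TraceOS uses: the README does not define how Delta is computed, so here it is taken as the difference in percentage points, higher values are assumed to be better, and `compare_capabilities` is a hypothetical helper name.

```python
def compare_capabilities(baseline: dict, current: dict, tol: float = 1e-9) -> list[dict]:
    """Compare two capability dicts shaped like the "capability" section of analysis.json."""
    rows = []
    for key, base in baseline.items():
        if key not in current:
            continue  # only compare capabilities present in both runs
        diff = current[key]["value"] - base["value"]
        # Assumption: higher is better; differences within tol count as a tie.
        if abs(diff) <= tol:
            winner = "tie"
        elif diff > 0:
            winner = "current"
        else:
            winner = "baseline"
        rows.append({
            "label": base["label"],
            "baseline": base["value"],
            "current": current[key]["value"],
            "delta_pct": round(diff * 100, 1),  # assumed: percentage points
            "winner": winner,
        })
    return rows


if __name__ == "__main__":
    run_a = {"cap_speed": {"label": "Speed", "value": 0.636}}
    run_b = {"cap_speed": {"label": "Speed", "value": 0.700}}
    for row in compare_capabilities(run_a, run_b):
        print(row)
```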

Output Structure

Every experiment produces a standardized artifact directory:

outputs/<run-id>/
├── report.html          # Self-contained HTML report
├── analysis.json        # Structured analysis (RFC-0005)
├── events.jsonl         # Full execution trace (17+ events)
├── experiment.json      # Frozen config snapshot
├── artifacts.json       # Output index
├── train/
│   ├── checkpoint/last.pt
│   ├── metrics.jsonl
│   └── stdout.log
└── eval/
    └── summary.json
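
Since the layout is fixed, a run directory can be sanity-checked before it is archived or shared. The sketch below lists which of the files from the tree above are missing. The helper name `missing_artifacts` is hypothetical, and the check only tests for file existence, not content.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the artifact tree above.
EXPECTED = [
    "report.html",
    "analysis.json",
    "events.jsonl",
    "experiment.json",
    "artifacts.json",
    "train/checkpoint/last.pt",
    "train/metrics.jsonl",
    "train/stdout.log",
    "eval/summary.json",
]


def missing_artifacts(run_dir) -> list[str]:
    """Return the expected artifact paths that do not exist under run_dir."""
    root = Path(run_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    # Demo against a throwaway directory containing a single artifact.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "report.html").write_text("<html></html>")
        print("missing:", missing_artifacts(tmp))
```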

Architecture

CLI (10 commands)
    ↓
Runner (lifecycle + event emission)
    ↓
Adapter Layer (ABC / Mock / Dummy)
    ↓
Artifact Layer (standard outputs + registry)
    ↓
Analysis Engine (capability / failure / compare / recommend)
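
To make the Runner layer concrete, here is a toy sketch of a lifecycle loop that emits an event before and after each stage. Everything in it is an assumption for illustration: the stage names are inferred from the four steps printed in Quick Start, and `run_lifecycle`, the event field names, and the callback signatures are hypothetical rather than TraceOS's actual interfaces.

```python
import time
from typing import Callable

# Stage names assumed from the four steps printed in Quick Start.
STAGES = ["prepare", "train", "evaluate", "report"]


def run_lifecycle(
    stage_fns: dict[str, Callable[[], dict]],
    emit: Callable[[dict], None],
) -> dict:
    """Run each stage in order, emitting start/finish events around it."""
    results = {}
    for name in STAGES:
        emit({"event": "stage_started", "stage": name, "ts": time.time()})
        results[name] = stage_fns[name]()  # an adapter would do the real work here
        emit({"event": "stage_finished", "stage": name, "ts": time.time()})
    return results


if __name__ == "__main__":
    trace = []
    fns = {name: (lambda n=name: {"stage": n, "ok": True}) for name in STAGES}
    run_lifecycle(fns, trace.append)
    for event in trace:
        print(event["event"], event["stage"])
```

Passing `emit` as a callback keeps the loop independent of where events end up: the same function could append to a list in tests or write lines to a file in a real run.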

Why TraceOS

Without TraceOS                         With TraceOS
-------------------------------------------------------------------------------
Scripts scattered across directories    One command per experiment
Results hard to compare                 experiment compare built in
No execution trace                      Full event log per run
Manual analysis                         Automated capability + failure analysis
No run history                          Searchable registry

Adapters

Adapter   Config                      GPU        Notes
------------------------------------------------------------------------------
mock      configs/mock_bottles.yaml   No         Instant demo, deterministic
dummy     configs/dummy.yaml          No         Minimal SDK example (45 lines)
abc       configs/bottle.yaml         Yes (8x)   Full ABC-130K DiT pipeline

Standards

TraceOS capability analysis follows the Capability Schema Spec — a lightweight standard for defining what a capability is, how metrics witness it, and how evidence supports scores. The schema is consumed by schema_adapter.py (transformation-only, no runtime participation).


Status

Metric         Value
-------------------------
Version        v0.1.0
Tests          97 passed
CLI commands   10
Adapters       3
RFCs           5
License        Apache 2.0

Roadmap (v0.2)

  • entry_points plugin discovery
  • OpenVLA / ACT / Pi0 adapters
  • Multi-domain capability schemas (agent, ocr, llm)
  • Improved failure analysis accuracy
  • Cloud runner abstraction

Acknowledgments

TraceOS was inspired by and initially built as an experiment wrapper for the ABC-130K project.

The ABC Adapter wraps the original scripts via subprocess — zero lines of ABC code are modified. TraceOS is not affiliated with the ABC authors or Amazon.
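
The subprocess-wrapping approach can be sketched as follows. This is a generic illustration and not the ABC Adapter's code: `run_wrapped` is a hypothetical helper, and the demo command is a stand-in Python one-liner rather than a real ABC script. The wrapped command runs unmodified while its output is captured to a log file such as `train/stdout.log`.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_wrapped(cmd: list[str], log_path: Path) -> int:
    """Run an external command unchanged, writing stdout and stderr to log_path."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("w", encoding="utf-8") as log:
        # check=False: return the exit code instead of raising on failure.
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=False)
    return proc.returncode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_file = Path(tmp) / "train" / "stdout.log"
        # Stand-in for an external training script.
        code = run_wrapped([sys.executable, "-c", "print('training step ok')"], log_file)
        print("exit code:", code)
        print(log_file.read_text().strip())
```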


License

Apache 2.0
