OpenRouter provider, cross-model compare + handoff, and a redesigned UI #1
OpenRouter joins anthropic/openai/mock (one key, many models) over the OpenAI-compatible REST API, so the zero-dependency story holds. The engine now resolves a model per call instead of binding one provider for the whole run. On top of the existing 44 frameworks that unlocks compare (run the same framework or chain on several models in parallel columns, via --models and the web Models box) and handoff (assign a model per chain stage with id:slug, or per orchestration role with --roles, so models hand work to each other within one run). CLI, server, tests, CI smokes, .env.example and the README cover all three.
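As an illustration of why no SDK is needed, the request to OpenRouter's OpenAI-compatible chat endpoint can be shaped with plain objects and sent with the global fetch. This is a minimal sketch, not the PR's provider code: `buildOpenRouterRequest` and `completeViaOpenRouter` are hypothetical names, and the argument shape (`model`, `system`, `prompt`) is assumed.

```javascript
// Sketch only: function names and argument shape are assumptions for illustration.
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

// Builds the fetch arguments without performing any network I/O, so it is easy to test.
function buildOpenRouterRequest(env, { model, system, prompt }) {
  if (!env.OPENROUTER_API_KEY) throw new Error("OPENROUTER_API_KEY not set");
  if (!model) throw new Error("model slug required");
  const messages = [];
  if (system) messages.push({ role: "system", content: system });
  messages.push({ role: "user", content: prompt });
  return {
    url: OPENROUTER_URL,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${env.OPENROUTER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}

// Sends the request with the built-in fetch (Node 18+) and returns the reply text.
async function completeViaOpenRouter(env, args) {
  const { url, init } = buildOpenRouterRequest(env, args);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`OpenRouter HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Keeping request shaping separate from the network call is one way to test the provider offline, in the same spirit as the mock provider.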
Editorial light theme: warm paper, system serif display type (no web-font dependency), the catalog as a numbered index with section rules, a single red accent, and results as serif cards that show which model ran each step. The compare and handoff controls are wired into the composer. The catalog fetch now fails loudly with setup instructions instead of a blank sidebar when nothing answers /api/catalog, and the header shows a live framework count. Adds .claude/launch.json so the app can be previewed.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8fb2b8300
Keep per-run model overrides out of comparison columns
When opts.model or opts.roleModels is supplied together with models (possible through /api/run, and through CLI --models ... --roles ...), forwarding the unchanged options lets OpenRouter's complete() override the provider's column-bound model. Consequently, columns can execute on the same override or on role-specific models while still being labeled as their requested comparison model, corrupting comparison results. Comparison mode should remove these overrides or explicitly force every call to the current column model.
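One way to apply the first option (removing the overrides) is to strip them from the options before each column runs. The sketch below assumes the overrides live on `opts.model` and `opts.roleModels` as described above; `optsForColumn` is a hypothetical helper name, not something in the PR.

```javascript
// Hypothetical helper: returns a copy of opts without per-run model overrides,
// so each comparison column can only run on its own column-bound model.
function optsForColumn(opts = {}) {
  const { model, roleModels, ...rest } = opts; // drop the two override fields
  return rest; // the caller's object is left untouched
}
```

In the comparison runner, each call would then pass `optsForColumn(opts)` instead of `opts`, for example `engine.runSingle(ids[0], inputs, optsForColumn(opts))`.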
Code Review
This pull request introduces multi-model comparison and handoff capabilities to the framework-lab test bench, adding OpenRouter support, a matrix runner for executing pipelines across multiple models, and a redesigned web UI to configure these options. The review feedback highlights a potential memory leak and state-sharing issue in server.js due to the global engine singleton caching providers, a bug in src/providers/index.js where missing API keys for explicitly requested providers silently fall back to the mock provider instead of throwing errors, and an opportunity to run model comparisons in parallel using Promise.all in src/matrix.js to improve performance.
      // Handoff / single: opts.model + opts.roleModels + per-stage models flow through the engine.
      if (mode === "single") return json(res, 200, await engine.runSingle(ids[0], inputs, opts));
      if (mode === "all") return json(res, 200, { results: await engine.runAll(ids, inputs, opts) });
    - if (mode === "chain") return json(res, 200, await engine.runChain(ids, inputs, opts));
    + if (mode === "chain") return json(res, 200, await engine.runChain(stages || ids, inputs, opts));
The global engine singleton defined on line 20 caches resolved providers in a Map that is never cleared. If clients send requests with unique or arbitrary model names, this cache will grow indefinitely, causing a memory leak in the long-running server process. Additionally, sharing provider instances across concurrent requests means that stateful providers (like the mock provider with its call counter n) will share state, leading to non-deterministic behavior.

To fix this, instantiate the engine dynamically per-request inside the POST handler, and remove the global engine declaration on line 20.
Suggested change:

    // Handoff / single: opts.model + opts.roleModels + per-stage models flow through the engine.
    const requestEngine = createEngine(model => getProvider(process.env, { model }));
    if (mode === "single") return json(res, 200, await requestEngine.runSingle(ids[0], inputs, opts));
    if (mode === "all") return json(res, 200, { results: await requestEngine.runAll(ids, inputs, opts) });
    if (mode === "chain") return json(res, 200, await requestEngine.runChain(stages || ids, inputs, opts));
      if (want === "openrouter" || (!want && env.OPENROUTER_API_KEY)) {
        if (env.OPENROUTER_API_KEY) return openrouter(env, model);
      }
      if (want === "anthropic" || (!want && env.ANTHROPIC_API_KEY)) {
    -   if (env.ANTHROPIC_API_KEY) return anthropic(env);
    +   if (env.ANTHROPIC_API_KEY) return anthropic(env, model);
      }
      if (want === "openai" || (!want && env.OPENAI_API_KEY)) {
    -   if (env.OPENAI_API_KEY) return openai(env);
    +   if (env.OPENAI_API_KEY) return openai(env, model);
      }
There is a bug in the provider resolution logic when a provider is explicitly requested via PROVIDER but its corresponding API key is missing.

For example, if PROVIDER=openrouter is set but OPENROUTER_API_KEY is undefined, the outer condition want === "openrouter" is true, but the inner condition if (env.OPENROUTER_API_KEY) is false. This causes the function to silently fall through to the next provider checks, eventually returning the mock provider. This completely bypasses the helpful configuration error checks defined inside the provider modules (e.g., throw new Error("OPENROUTER_API_KEY not set")).

Removing the inner key checks allows the requested provider to be returned directly, so it can throw a clear and actionable error message.

    if (want === "openrouter" || (!want && env.OPENROUTER_API_KEY)) {
      return openrouter(env, model);
    }
    if (want === "anthropic" || (!want && env.ANTHROPIC_API_KEY)) {
      return anthropic(env, model);
    }
    if (want === "openai" || (!want && env.OPENAI_API_KEY)) {
      return openai(env, model);
    }
The comparison mode currently runs sequentially because of the await inside the for...of loop. This means if a user compares multiple models, they must wait for each model's pipeline to completely finish before the next one starts. For multi-call orchestrations or chains, this can be extremely slow.

Running the models in parallel using Promise.all would significantly improve performance (especially for the web UI). Note that if you run them in parallel, the onEvent console logs for the CLI might interleave, but this can be managed or accepted as a trade-off for the massive speedup.

    const tasks = models.map(async (model) => {
      onEvent({ type: "model:start", model });
      const provider = getProvider(env, { model });
      const engine = createEngine(provider, onEvent);
      let result = null, error = null;
      try {
        if (mode === "single") result = await engine.runSingle(ids[0], inputs, opts);
        else if (mode === "chain") result = await engine.runChain(ids, inputs, opts);
        else result = { results: await engine.runAll(ids, inputs, opts) };
      } catch (e) {
        error = e.message;
      }
      onEvent({ type: "model:end", model });
      return { model, provider: provider.name, mode, result, error };
    });
    const columns = await Promise.all(tasks);
Adds OpenRouter as a provider and turns the bench into something you can point at many models — two ways — plus a UI redesign.
The bench could already run any of the 44 frameworks on a task, but only one model at a time, baked in via env var. To compare reasoning frameworks you want to vary the model too; to test agent-style coordination you want different models handing work to each other in one run. OpenRouter (one key, OpenAI-compatible) makes every model reachable, so neither needed a per-provider integration.
The core change is in the engine: it resolves a model per call instead of binding one provider for the whole run. A plain provider instance still works for single-model runs and existing callers, but a caller can pass a resolver so each step picks its own model. That's what makes both new modes work across every framework and orchestration, not just chains.
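The provider-or-resolver idea can be sketched roughly as follows. This is an illustration under assumptions, not the engine in this PR: `createSketchEngine`, `toResolver`, the `{ id, model }` stage shape, and the provider interface (`name`, `complete(prompt)`) are all hypothetical.

```javascript
// Sketch only: every name here is an assumption for illustration.
// Accepts either a provider object or a function (model) => provider.
function toResolver(providerOrResolver) {
  return typeof providerOrResolver === "function"
    ? providerOrResolver
    : () => providerOrResolver; // a plain provider serves every call
}

function createSketchEngine(providerOrResolver) {
  const resolve = toResolver(providerOrResolver);
  return {
    // stages: [{ id, model }]; each stage's output feeds the next stage.
    async runChain(stages, input) {
      const trace = [];
      let current = input;
      for (const stage of stages) {
        const provider = resolve(stage.model); // the model is picked per call
        current = await provider.complete(`${stage.id}: ${current}`);
        trace.push({ id: stage.id, model: provider.name, output: current });
      }
      return { output: current, trace };
    },
  };
}
```

Passing a single provider keeps the old behaviour, while passing a function lets each step name its own model, and the trace records which provider answered.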
On top of it:
- Compare (src/matrix.js): run the same framework or chain on several models in parallel, isolated columns, so one model failing doesn't abort the others. CLI --models a,b,c, or the web Models box.
- Handoff: assign a model per chain stage (chain plan:claude,pot:gpt) or per orchestration role (--roles advocate,opponent,judge). The trace records which model produced each step.

The provider talks to OpenRouter over plain fetch, so the zero-dependency promise holds (no SDK), and keys stay server-side.

The UI moves from cramped dev styling to a light editorial layout: the catalog is a searchable index by category, the composer holds the mode and model controls, and results render per framework with the model trace. It also fails loudly with setup steps when the catalog can't load (it was a silent blank sidebar before) and shows a live framework count. System serif fonts only, so it stays offline.
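For the per-stage handoff syntax (plan:claude,pot:gpt), a stage list could be parsed along these lines. `parseStages` is a hypothetical helper and the `{ id, model }` result shape is an assumption; the PR's own parser may differ.

```javascript
// Hypothetical parser: "plan:claude,pot:gpt" -> [{ id, model }, ...].
// A stage without ":slug" gets model null, meaning "use the run's default model".
function parseStages(spec) {
  return spec
    .split(",")
    .map((part) => part.trim())
    .filter(Boolean) // ignore empty entries such as a trailing comma
    .map((part) => {
      // Split on the first colon only, so a slug that itself contains a colon is kept whole.
      const i = part.indexOf(":");
      return i === -1
        ? { id: part, model: null }
        : { id: part.slice(0, i), model: part.slice(i + 1) };
    });
}
```

Splitting on the first colon only is a deliberate choice in this sketch, since framework ids are short names while model slugs may carry extra punctuation.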
Notes:
- With PROVIDER=anthropic/openai, slugs must be valid for that provider.
- /health route for deploy checks.

Tests cover provider request shaping, model-per-call routing, compare isolation, and handoff ordering; CI smokes exercise compare and handoff. Everything runs offline against the mock provider.