CLI for evaluating the GoodData AI agent against a dataset of natural-language questions on a chosen workspace and LLM, including multi-model comparison.

Add it to a project with uv:

    uv add gooddata-eval

Or install `gd-eval` as a standalone tool:

    uv tool install gooddata-eval

| Command | Description |
|---|---|
| `gd-eval run` | Run an evaluation dataset against one or more models. |
| `gd-eval models` | List LLM providers and models configured in the org. |

    export GOODDATA_TOKEN='your-api-token'
    gd-eval run \
      --host https://your.gooddata.cloud \
      --workspace ecommerce_demo \
      --dataset ./my-dataset \
      --model gpt-5.2 \
      --runs 1 \
      --json results.json

Pass `--model` multiple times to evaluate the same dataset against several models and get a side-by-side comparison:

    gd-eval run \
      --host https://your.gooddata.cloud \
      --workspace ecommerce_demo \
      --dataset ./my-dataset \
      --model gpt-5.2 \
      --model claude-opus-4-7 \
      --runs 1 \
      --json comparison.json

When the same model id is offered by multiple providers, use the `provider/model` syntax to disambiguate:

      --model "Foundry4o_4.1_5.2/gpt-5.2" \
      --model "HN_Anthropic/claude-opus-4-7"

Both provider name and provider id are accepted as the prefix.

| Flag | Env var | Description |
|---|---|---|
| `--host HOST` | — | GoodData host URL. |
| `--token TOKEN` | `GOODDATA_TOKEN` | API token. Pass via flag or env var. |
| `--profile NAME` | — | Profile name in `~/.gooddata/profiles.yaml` (same file as the gdc CLI). |
| `--workspace ID` | — | Required. Workspace id to evaluate against. |

| Flag | Description |
|---|---|
| `--dataset PATH` | Flat folder of JSON files — one question per file. |
| `--langfuse-dataset NAME` | Pull items by name from a Langfuse dataset. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. |

| Flag | Description |
|---|---|
| `--model MODEL` | Model id to evaluate. Repeat to compare multiple models. Accepts `provider/model` syntax to disambiguate when a model is offered by multiple providers (e.g. `--model "Foundry4o/gpt-5.2"`). Defaults to the workspace's current active model. |

| Flag | Default | Description |
|---|---|---|
| `--runs K` | 2 | Independent runs per item (pass@K). An item passes if any run passes. |
| `--concurrency K` | 1 | Number of items evaluated concurrently. 1 = sequential (default). Increase to load-test the agent under simultaneous requests. Progress output interleaves when K > 1. |
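
The pass@K rule from the `--runs` row can be sketched in a few lines. This is only an illustration of the stated rule, not gd-eval's own code: `item_passes` and `pass_rate` are hypothetical helper names, the item ids are made up, and each run is reduced to a plain boolean.

```python
from typing import Mapping, Sequence


def item_passes(run_results: Sequence[bool]) -> bool:
    """pass@K: an item passes if at least one of its K runs passed."""
    return any(run_results)


def pass_rate(results: Mapping[str, Sequence[bool]]) -> float:
    """Fraction of items that pass under the pass@K rule."""
    if not results:
        return 0.0
    passed = sum(item_passes(runs) for runs in results.values())
    return passed / len(results)


# Hypothetical item ids, each evaluated with --runs 2.
results = {
    "q1": [False, True],   # passes: the second run succeeded
    "q2": [False, False],  # fails: no run succeeded
    "q3": [True, True],
    "q4": [True, False],
}
print(pass_rate(results))  # 0.75
```

Under this rule, raising `--runs` can only keep or increase the pass rate, so results are comparable only between evaluations that use the same K.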

| Flag | Description |
|---|---|
| `--json PATH` | Write a JSON report to this path. Always uses the nested `{models, runs, comparison}` shape even for a single model. |
| `--quiet` | Suppress per-item progress. Per-model result tables and the comparison summary are still printed. |

| Flag | Description |
|---|---|
| `--langfuse` | Log scores and traces to Langfuse after each item. Requires `--langfuse-dataset`. Creates one named experiment run per model (`gd-eval-{timestamp}-{model}`). Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. |

The JSON report always uses the nested multi-model shape:

    {
      "models": ["gpt-5.2", "claude-opus-4-7"],
      "runs": {
        "gpt-5.2": { "summary": { "passed": 22, ... }, "items": { ... } },
        "claude-opus-4-7": { "summary": { "passed": 18, ... }, "items": { ... } }
      },
      "comparison": {
        "gpt-5.2": { "passed": 22, "total": 31, "pass_rate": 0.71, "avg_quality_score": 0.81, ... },
        "claude-opus-4-7": { "passed": 18, "total": 31, "pass_rate": 0.58, "avg_quality_score": 0.72, ... }
      }
    }

Winner is selected by pass rate → quality score → latency (lower latency wins all-equal ties).
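
The ranking order can be reproduced from the `comparison` block of a report. The sketch below is an assumption-laden illustration rather than gd-eval's implementation: `pick_winner` is a hypothetical helper, and the latency field name `avg_latency_s` and its values are invented for the example, since the report's latency key is not shown above.

```python
from typing import Any, Mapping


def pick_winner(comparison: Mapping[str, Mapping[str, Any]],
                latency_key: str = "avg_latency_s") -> str:
    """Rank by pass rate, then quality score, then lower latency."""
    def sort_key(model: str):
        stats = comparison[model]
        return (
            -stats["pass_rate"],          # higher pass rate first
            -stats["avg_quality_score"],  # then higher quality score
            stats.get(latency_key, float("inf")),  # then lower latency
        )
    return min(comparison, key=sort_key)


# pass_rate and avg_quality_score mirror the report example;
# the latency values are illustrative only.
comparison = {
    "gpt-5.2": {"pass_rate": 0.71, "avg_quality_score": 0.81, "avg_latency_s": 4.2},
    "claude-opus-4-7": {"pass_rate": 0.58, "avg_quality_score": 0.72, "avg_latency_s": 3.1},
}
print(pick_winner(comparison))  # gpt-5.2
```

To apply it to a real report, load the file written by `--json` with `json.load` and pass its `"comparison"` object, adjusting `latency_key` to whatever field the report actually contains.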

List all LLM providers and their models in the org. Marks the active model for a workspace when `--workspace` is given:

    gd-eval models \
      --host https://your.gooddata.cloud \
      --workspace ecommerce_demo

Example output:

    ┃ Provider     ┃ Provider ID ┃ Model ID        ┃ Family    ┃ Active   ┃
    │ Foundry4o    │ foundry_…   │ gpt-5.2         │ OPENAI    │ ◀ active │
    │              │             │ gpt-4o          │ OPENAI    │          │
    │ HN_Anthropic │ hn_anthr_…  │ claude-opus-4-7 │ ANTHROPIC │          │

A dataset is a folder of `.json` files, one per question:

    {
      "id": "stable-unique-id",
      "dataset_name": "my_dataset",
      "test_kind": "visualization",
      "question": "Show revenue by quarter",
      "expected_output": { }
    }

Supported `test_kind` values: `visualization`, `metric_skill`, `alert_skill`, `search_tool`, `general_question`, `guardrail`, `dashboard_summary`.
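
A dataset folder in this layout can be generated from a script. The following is a minimal sketch, assuming that naming each file after the item `id` is acceptable (the file naming is a choice made here, not a documented requirement); `write_dataset` is a hypothetical helper and `expected_output` is left empty because its schema depends on the `test_kind`.

```python
import json
import tempfile
from pathlib import Path

SUPPORTED_TEST_KINDS = {
    "visualization", "metric_skill", "alert_skill", "search_tool",
    "general_question", "guardrail", "dashboard_summary",
}


def write_dataset(folder: Path, items: list[dict]) -> list[Path]:
    """Write one JSON file per item into a flat folder, named after the item id."""
    folder.mkdir(parents=True, exist_ok=True)
    seen_ids: set[str] = set()
    written = []
    for item in items:
        if item["test_kind"] not in SUPPORTED_TEST_KINDS:
            raise ValueError(f"unsupported test_kind: {item['test_kind']!r}")
        if item["id"] in seen_ids:
            raise ValueError(f"duplicate id: {item['id']!r}")
        seen_ids.add(item["id"])
        path = folder / f"{item['id']}.json"
        path.write_text(json.dumps(item, indent=2), encoding="utf-8")
        written.append(path)
    return written


items = [
    {
        "id": "revenue-by-quarter",
        "dataset_name": "my_dataset",
        "test_kind": "visualization",
        "question": "Show revenue by quarter",
        "expected_output": {},  # intentionally empty in this sketch
    },
]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_dataset(Path(tmp) / "my-dataset", items)
    print([p.name for p in paths])  # ['revenue-by-quarter.json']
```

Rejecting unknown `test_kind` values and duplicate ids at write time catches dataset mistakes before an evaluation run is started.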

Summary items call the dedicated summary endpoint (`POST /api/v1/ai/workspaces/{ws}/summary`) instead of the chat endpoint, so they carry an extra `summary_input` block, and the `expected_output` is a rubric rather than an exact answer (summaries are free text):

    {
      "id": "summary-001",
      "dataset_name": "summary_pilot",
      "test_kind": "dashboard_summary",
      "question": "Summarize the Sales Overview dashboard.",
      "summary_input": {
        "dashboard_id": "sales_overview"
      },
      "expected_output": {
        "must_include": ["States the overall revenue trend", "Identifies the top segment"],
        "must_not_include": ["Numbers or segments not present in the visualizations"],
        "rubric": ["Reads as a coherent business summary"]
      }
    }

`summary_input` requires only `dashboard_id` (the endpoint summarizes the whole dashboard). Optional fields narrow the scope: `visualizations` (list of ids), `filter_context` (AFM filters), `tab_id`, and `format_hint`.

The `expected_output` rubric:

- `must_include`: facts a good summary must contain; all must pass for the item to pass.
- `must_not_include`: hallucination/accuracy guards; any violation fails the item.
- `rubric`: soft quality dimensions; they affect `quality_score` but do not gate pass/fail.

Each criterion is scored independently by the LLM judge, so quality_score
is the fraction of satisfied criteria.
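
How the three rubric groups combine into a pass/fail result and a `quality_score` can be sketched as follows. This is an interpretation of the rules above, not the tool's scoring code: `score_summary` is a hypothetical helper, the judge verdicts are given as plain booleans, and the handling of an empty rubric (score 0.0) is an assumption.

```python
from typing import Mapping, Sequence


def score_summary(verdicts: Mapping[str, Sequence[bool]]) -> tuple[bool, float]:
    """Return (passed, quality_score) from per-criterion judge verdicts.

    True means the criterion is satisfied; for must_not_include that
    means the forbidden content is absent from the summary.
    """
    gating = [*verdicts.get("must_include", []),
              *verdicts.get("must_not_include", [])]
    all_criteria = [*gating, *verdicts.get("rubric", [])]
    passed = all(gating)  # rubric entries never gate pass/fail
    quality = sum(all_criteria) / len(all_criteria) if all_criteria else 0.0
    return passed, quality


verdicts = {
    "must_include": [True, True],
    "must_not_include": [True],
    "rubric": [False],  # soft criterion missed: lowers the score only
}
print(score_summary(verdicts))  # (True, 0.75)
```

The example shows the intended asymmetry: a missed `rubric` entry lowers `quality_score` while the item still passes, whereas a single failed `must_include` or `must_not_include` entry would fail it.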

| test_kind | What the agent must produce | Extra required |
|---|---|---|
| `visualization` | Correct AAC visualization (metrics, dimensions, filters, type) | — |
| `metric_skill` | `create_metric` tool call with correct MAQL and format | — |
| `alert_skill` | `create_metric_alert` tool call with correct operator, threshold, trigger, filters, metric, recipients | — |
| `search_tool` | `search_objects` tool call (correct function called = pass; correct arguments = quality score) | — |
| `general_question` | Text answer judged by LLM | `[llm-judge]` |
| `guardrail` | Refusal/redirect (visualization response auto-fails) | `[llm-judge]` |
| `dashboard_summary` | Dashboard summary (via `/summary` endpoint) scored against a rubric by LLM | `[llm-judge]` |

`general_question` and `guardrail` items are scored by a GPT-4o judge. Requires the OpenAI package and `OPENAI_API_KEY`:

    uv add 'gooddata-eval[llm-judge]'
    # or for the standalone tool:
    uv tool install 'gooddata-eval[llm-judge]'

Without `[llm-judge]`, those items are skipped.
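
Before a long run, it can help to confirm that the judge prerequisites are in place. The check below is a small sketch under the assumption that the extra installs the package importable as `openai`; `judge_prerequisites_missing` is a hypothetical helper and is not part of gd-eval.

```python
import importlib.util
import os
from typing import Mapping, Optional


def judge_prerequisites_missing(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the missing prerequisites for LLM-judged items (empty if none)."""
    env = os.environ if env is None else env
    missing = []
    # find_spec returns None when the top-level module cannot be found.
    if importlib.util.find_spec("openai") is None:
        missing.append("openai package (install the [llm-judge] extra)")
    if not env.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY environment variable")
    return missing


problems = judge_prerequisites_missing()
if problems:
    print("LLM judge prerequisites missing:", ", ".join(problems))
else:
    print("LLM judge prerequisites look satisfied.")
```

Run it with the same interpreter and environment that `gd-eval` uses, otherwise the package check reflects a different installation.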

| Code | Meaning |
|---|---|
| 0 | Run completed. Evaluation failures do not cause a non-zero exit. |
| 2 | Operational error: bad connection, missing model, unreadable dataset, missing credentials. |
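
Because a completed run exits with 0 even when items fail, a CI job that should fail on poor results needs its own check on the JSON report. A minimal sketch of such a gate, assuming the `comparison` block shown in the report example; `gate` and the threshold are hypothetical, and exit code 1 is chosen here only to stay distinct from gd-eval's own 0 and 2.

```python
import sys


def gate(report: dict, min_pass_rate: float) -> int:
    """Return 0 if every model meets the threshold, 1 otherwise."""
    failing = {
        model: stats["pass_rate"]
        for model, stats in report["comparison"].items()
        if stats["pass_rate"] < min_pass_rate
    }
    for model, rate in failing.items():
        print(f"{model}: pass_rate {rate:.2f} < {min_pass_rate:.2f}", file=sys.stderr)
    return 1 if failing else 0


# Inline stand-in for a report written by --json (pass rates from the example above).
report = {
    "comparison": {
        "gpt-5.2": {"pass_rate": 0.71},
        "claude-opus-4-7": {"pass_rate": 0.58},
    }
}
print(gate(report, min_pass_rate=0.6))  # 1
```

In a pipeline, the report would be loaded with `json.load` from the path given to `--json`, and the return value passed to `sys.exit`.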
