Sunbelt Computer Software

gooddata-eval

CLI for evaluating the GoodData AI agent against a dataset of natural-language questions on a chosen workspace and LLM, with support for multi-model comparison.

Install

uv add gooddata-eval

Or install gd-eval as a standalone tool:

uv tool install gooddata-eval

Commands

Command	Description
`gd-eval run`	Run an evaluation dataset against one or more models.
`gd-eval models`	List LLM providers and models configured in the org.

`gd-eval run`

Quick start — single model

export GOODDATA_TOKEN='your-api-token'

gd-eval run \
  --host  https://your.gooddata.cloud \
  --workspace  ecommerce_demo \
  --dataset  ./my-dataset \
  --model  gpt-5.2 \
  --runs  1 \
  --json  results.json

Multi-model comparison

Pass --model multiple times to evaluate the same dataset against several models and get a side-by-side comparison:

gd-eval run \
  --host  https://your.gooddata.cloud \
  --workspace  ecommerce_demo \
  --dataset  ./my-dataset \
  --model  gpt-5.2 \
  --model  claude-opus-4-7 \
  --runs  1 \
  --json  comparison.json

When the same model id is offered by multiple providers, use the provider/model syntax to disambiguate:

  --model  "Foundry4o_4.1_5.2/gpt-5.2" \
  --model  "HN_Anthropic/claude-opus-4-7"

Both provider name and provider id are accepted as the prefix.
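
The disambiguation rule above can be illustrated with a small parser. This is only a sketch of how a provider/model spec might be split and matched, not gd-eval's actual implementation; `ModelSpec`, `split_model_spec`, `resolve`, and the catalog layout (including the second provider) are hypothetical names chosen for the example.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    provider: Optional[str]  # provider name or id; None when no prefix was given
    model: str


def split_model_spec(spec: str) -> ModelSpec:
    """Split 'provider/model' on the first slash; a bare model id has no provider."""
    provider, sep, model = spec.partition("/")
    if not sep:
        return ModelSpec(None, spec)
    return ModelSpec(provider, model)


def resolve(spec: str, catalog: list[dict]) -> dict:
    """Return the single catalog entry matching spec; raise if none or several match."""
    want = split_model_spec(spec)
    matches = [
        entry for entry in catalog
        if entry["model_id"] == want.model
        and (want.provider is None
             or want.provider in (entry["provider_name"], entry["provider_id"]))
    ]
    if len(matches) != 1:
        raise ValueError(f"{spec!r} matched {len(matches)} models, expected exactly 1")
    return matches[0]


# Hypothetical catalog: the same model id offered by two providers.
catalog = [
    {"provider_name": "Foundry4o", "provider_id": "foundry_1", "model_id": "gpt-5.2"},
    {"provider_name": "OtherGateway", "provider_id": "other_1", "model_id": "gpt-5.2"},
]
```

Here `resolve("gpt-5.2", catalog)` raises because the bare id is ambiguous, while `resolve("Foundry4o/gpt-5.2", catalog)` and `resolve("foundry_1/gpt-5.2", catalog)` both return the first entry.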

All flags

Connection

Flag	Env var	Description
`--host HOST`	—	GoodData host URL.
`--token TOKEN`	`GOODDATA_TOKEN`	API token. Pass via flag or env var.
`--profile NAME`	—	Profile name in `~/.gooddata/profiles.yaml` (same file as the `gdc` CLI).
`--workspace ID`	—	Required. Workspace id to evaluate against.

Dataset source (pick one)

Flag	Description
`--dataset PATH`	Flat folder of JSON files — one question per file.
`--langfuse-dataset NAME`	Pull items by name from a Langfuse dataset. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`.

Model selection

Flag	Description
`--model MODEL`	Model id to evaluate. Repeat to compare multiple models. Accepts `provider/model` syntax to disambiguate when a model is offered by multiple providers (e.g. `--model "Foundry4o/gpt-5.2"`). Defaults to the workspace's current active model.

Evaluation

Flag	Default	Description
`--runs K`	`2`	Independent runs per item (pass@K). An item passes if any run passes.
`--concurrency K`	`1`	Number of items evaluated concurrently. `1` = sequential (default). Increase to load-test the agent under simultaneous requests. Progress output interleaves when K > 1.
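
The pass@K rule for `--runs` can be written in a few lines. The following is a minimal sketch, assuming each run of an item yields a boolean; the function names and the sample results are hypothetical and do not come from gd-eval's source.

```python
def pass_at_k(run_results: list[bool]) -> int:
    """Return 1 if any of the K independent runs passed, else 0."""
    return int(any(run_results))


def dataset_pass_rate(items: dict[str, list[bool]]) -> float:
    """Fraction of items whose pass@K is 1 (0.0 for an empty dataset)."""
    if not items:
        return 0.0
    return sum(pass_at_k(runs) for runs in items.values()) / len(items)


# Two runs per item (--runs 2): q1 passes on its second run, q2 never passes.
results = {"q1": [False, True], "q2": [False, False]}
print(dataset_pass_rate(results))  # 0.5
```

Raising K can only keep or raise an item's pass@K, which is worth keeping in mind when comparing reports produced with different `--runs` values.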

Output

Flag	Description
`--json PATH`	Write a JSON report to this path. Always uses the nested `{models, runs, comparison}` shape even for a single model.
`--quiet`	Suppress per-item progress. Per-model result tables and the comparison summary are still printed.

Langfuse sink

Flag	Description
`--langfuse`	Log scores and traces to Langfuse after each item. Requires `--langfuse-dataset`. Creates one named experiment run per model (`gd-eval-{timestamp}-{model}`). Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`.

JSON report shape

The JSON report always uses the nested multi-model shape:

{
  "models": ["gpt-5.2", "claude-opus-4-7"],
  "runs": {
    "gpt-5.2":        { "summary": { "passed": 22, ... }, "items": { ... } },
    "claude-opus-4-7": { "summary": { "passed": 18, ... }, "items": { ... } }
  },
  "comparison": {
    "gpt-5.2":        { "passed": 22, "total": 31, "pass_rate": 0.71, "avg_quality_score": 0.81, ... },
    "claude-opus-4-7": { "passed": 18, "total": 31, "pass_rate": 0.58, "avg_quality_score": 0.72, ... }
  }
}

Winner is selected by pass rate → quality score → latency (lower latency wins all-equal ties).
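
The tie-breaking order can be expressed as a sort key. This is a sketch under assumptions rather than the tool's own code: `pass_rate` and `avg_quality_score` appear in the report above, but `avg_latency_s` is a hypothetical field name (the latency key is elided in the example), and the latency values below are illustrative placeholders.

```python
def pick_winner(comparison: dict[str, dict]) -> str:
    """Rank by pass rate (higher), then quality (higher), then latency (lower)."""

    def rank(model: str) -> tuple[float, float, float]:
        row = comparison[model]
        # Negate the "higher is better" fields so that min() picks the best model.
        # "avg_latency_s" is an assumed key name, not confirmed by the report shape.
        return (-row["pass_rate"], -row["avg_quality_score"], row["avg_latency_s"])

    return min(comparison, key=rank)


# pass_rate and avg_quality_score are taken from the example report;
# the latency numbers are made up for illustration.
comparison = {
    "gpt-5.2": {"pass_rate": 0.71, "avg_quality_score": 0.81, "avg_latency_s": 9.0},
    "claude-opus-4-7": {"pass_rate": 0.58, "avg_quality_score": 0.72, "avg_latency_s": 7.5},
}
print(pick_winner(comparison))  # gpt-5.2
```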

`gd-eval models`

List all LLM providers and their models in the org. Marks the active model for a workspace when --workspace is given:

gd-eval models \
  --host  https://your.gooddata.cloud \
  --workspace  ecommerce_demo

┃ Provider       ┃ Provider ID ┃ Model ID          ┃ Family    ┃ Active   ┃
│ Foundry4o      │ foundry_…   │ gpt-5.2           │ OPENAI    │ ◀ active │
│                │             │ gpt-4o            │ OPENAI    │          │
│ HN_Anthropic   │ hn_anthr_…  │ claude-opus-4-7   │ ANTHROPIC │          │

Dataset format

A dataset is a folder of .json files, one per question:

{
  "id":           "stable-unique-id",
  "dataset_name": "my_dataset",
  "test_kind":    "visualization",
  "question":     "Show revenue by quarter",
  "expected_output": { }
}

Supported test_kind values: visualization, metric_skill, alert_skill, search_tool, general_question, guardrail, dashboard_summary.
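
Before running an evaluation it can help to sanity-check the dataset folder. The sketch below is a hypothetical pre-flight check, not part of gd-eval; treating all five fields from the example as required is an assumption, and `check_item` and `check_dataset` are names invented for illustration.

```python
import json
from pathlib import Path

# Assumption: every field shown in the example item is required.
REQUIRED_KEYS = {"id", "dataset_name", "test_kind", "question", "expected_output"}
TEST_KINDS = {
    "visualization", "metric_skill", "alert_skill", "search_tool",
    "general_question", "guardrail", "dashboard_summary",
}


def check_item(item: dict) -> list[str]:
    """Return a list of human-readable problems for one dataset item."""
    problems = []
    missing = REQUIRED_KEYS - item.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "test_kind" in item and item["test_kind"] not in TEST_KINDS:
        problems.append(f"unsupported test_kind: {item['test_kind']!r}")
    return problems


def check_dataset(folder: str) -> dict[str, list[str]]:
    """Map each problematic file name to its problems; an empty dict means no issues found."""
    report = {}
    for path in sorted(Path(folder).glob("*.json")):
        problems = check_item(json.loads(path.read_text(encoding="utf-8")))
        if problems:
            report[path.name] = problems
    return report
```

Calling `check_dataset("./my-dataset")` returns an empty dict when every file has the expected keys and a supported test_kind.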

`dashboard_summary` items

Summary items call the dedicated summary endpoint (POST /api/v1/ai/workspaces/{ws}/summary) instead of the chat endpoint, so each item carries an extra summary_input block. Because summaries are free text, expected_output is a rubric rather than an exact answer:

{
  "id": "summary-001",
  "dataset_name": "summary_pilot",
  "test_kind": "dashboard_summary",
  "question": "Summarize the Sales Overview dashboard.",
  "summary_input": {
    "dashboard_id": "sales_overview"
  },
  "expected_output": {
    "must_include":     ["States the overall revenue trend", "Identifies the top segment"],
    "must_not_include": ["Numbers or segments not present in the visualizations"],
    "rubric":           ["Reads as a coherent business summary"]
  }
}

summary_input requires only dashboard_id (the endpoint summarizes the whole dashboard). Optional fields narrow the scope: visualizations (list of ids), filter_context (AFM filters), tab_id, and format_hint.

The expected_output rubric:

must_include — facts a good summary must contain; all must pass for the item to pass.
must_not_include — hallucination/accuracy guards; any violation fails the item.
rubric — soft quality dimensions; they affect quality_score but do not gate pass/fail.

Each criterion is scored independently by the LLM judge, so quality_score is the fraction of satisfied criteria.
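
The gating and scoring rules above can be sketched as follows. This assumes the judge returns one boolean per criterion, where True means the criterion is satisfied (for must_not_include, True means the summary did not violate it), and that quality_score counts criteria from all three groups. `score_summary` is a hypothetical helper, not gd-eval's API.

```python
def score_summary(verdicts: dict[str, list[bool]]) -> tuple[bool, float]:
    """Return (passed, quality_score) from per-criterion judge verdicts.

    verdicts maps 'must_include', 'must_not_include', and 'rubric' to lists
    of booleans, where True means the criterion was satisfied.
    """
    gating = verdicts.get("must_include", []) + verdicts.get("must_not_include", [])
    everything = gating + verdicts.get("rubric", [])
    passed = all(gating)  # rubric entries never gate pass/fail
    quality = sum(everything) / len(everything) if everything else 0.0
    return passed, quality


# Both required facts present, no hallucination, but the soft rubric item is missed.
print(score_summary({
    "must_include": [True, True],
    "must_not_include": [True],
    "rubric": [False],
}))  # (True, 0.75)
```

An item can therefore pass with a quality_score below 1.0, and a single must_not_include violation fails the item even when every other criterion is met.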

Supported test kinds

test_kind	What the agent must produce	Extra required
`visualization`	Correct AAC visualization (metrics, dimensions, filters, type)	—
`metric_skill`	`create_metric` tool call with correct MAQL and format	—
`alert_skill`	`create_metric_alert` tool call with correct operator, threshold, trigger, filters, metric, recipients	—
`search_tool`	`search_objects` tool call (correct function called = pass; correct arguments = quality score)	—
`general_question`	Text answer judged by LLM	`[llm-judge]`
`guardrail`	Refusal/redirect (visualization response auto-fails)	`[llm-judge]`
`dashboard_summary`	Dashboard summary (via `/summary` endpoint) scored against a rubric by LLM	`[llm-judge]`

Optional extras

`[llm-judge]` — LLM-as-judge evaluators

general_question and guardrail items are scored by a GPT-4o judge, and dashboard_summary items also need this extra (see the table above). The judge requires the OpenAI package and OPENAI_API_KEY:

uv add 'gooddata-eval[llm-judge]'
# or for the standalone tool:
uv tool install 'gooddata-eval[llm-judge]'

Without [llm-judge], those items are skipped.

Exit codes

Code	Meaning
`0`	Run completed. Evaluation failures do not cause a non-zero exit.
`2`	Operational error: bad connection, missing model, unreadable dataset, missing credentials.
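
Because evaluation failures still exit with `0`, a CI job that should fail on poor results has to inspect the JSON report itself. The sketch below shows one way to do that using the `comparison` block documented above; the `gate` function, the threshold, and the choice of return code `1` are assumptions for illustration, not behavior of gd-eval.

```python
import sys


def gate(report: dict, min_pass_rate: float) -> int:
    """Return 0 if every model in report['comparison'] meets the threshold, else 1."""
    failing = {
        model: row["pass_rate"]
        for model, row in report["comparison"].items()
        if row["pass_rate"] < min_pass_rate
    }
    for model, rate in failing.items():
        print(f"{model}: pass_rate {rate:.2f} < {min_pass_rate:.2f}", file=sys.stderr)
    # 1 is chosen so it cannot be confused with gd-eval's operational-error code 2.
    return 1 if failing else 0


# Pass rates taken from the example report in this README.
report = {
    "comparison": {
        "gpt-5.2": {"pass_rate": 0.71},
        "claude-opus-4-7": {"pass_rate": 0.58},
    }
}
print(gate(report, min_pass_rate=0.60))  # 1
```

In a pipeline, the report would be loaded with `json.load` from the path given to `--json`, and the return value passed to `sys.exit`.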

Scores (in JSON report and Langfuse)

Score	Description
`pass_at_k`	1 if any of the K runs passed strict checks, else 0.
`quality_score`	Fraction of strict check flags that are `True` (0.0–1.0). Shown in CLI as a percentage.
`value_score`	Weighted blend: 0.6 × quality + 0.2 × speed (speed = max(0, 1 − latency/60s)).
`latency_s`	Average per-run latency in seconds.
`provider_type`	Model vendor + gateway label (e.g. `ANTHROPIC`, `BEDROCK/ANTHROPIC`, `AZURE/OPENAI`). Stored in Langfuse trace metadata and tags.
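
The `value_score` formula can be worked through directly. This sketch takes the weights and the 60-second speed budget exactly as stated in the table; the function names and the sample inputs are hypothetical.

```python
def speed_score(latency_s: float, budget_s: float = 60.0) -> float:
    """speed = max(0, 1 - latency / 60s): 1.0 at zero latency, 0.0 at 60 s or slower."""
    return max(0.0, 1.0 - latency_s / budget_s)


def value_score(quality: float, latency_s: float) -> float:
    """Weighted blend as written in the table: 0.6 * quality + 0.2 * speed."""
    return 0.6 * quality + 0.2 * speed_score(latency_s)


# Illustrative inputs: quality 0.81 and an average latency of 15 s.
# speed = 1 - 15/60 = 0.75, so value = 0.6 * 0.81 + 0.2 * 0.75 = 0.486 + 0.15
print(round(value_score(0.81, 15.0), 3))  # 0.636
```

With the weights as written, a perfect run (quality 1.0, zero latency) scores 0.8, and any latency of 60 s or more contributes nothing to the speed term.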

PL/B Language Development and Support
