dd-llmo-eval-bootstrap
Analyzes production LLM traces from Datadog and generates ready-to-use evaluator code using the Datadog Evals SDK.
What This Skill Does
This skill samples real production traffic, identifies quality dimensions worth measuring, and outputs BaseEvaluator subclasses or LLMJudge instances you can plug into LLM Experiments.
Writing evaluators from scratch means guessing which quality dimensions matter. This skill instead samples actual production traces and proposes evals grounded in evidence, so you start from observed behavior rather than assumptions.
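As a rough sketch of the kind of evaluator this skill generates: the class and method names below stand in for the real Datadog Evals SDK (the actual import path, base-class signature, and result type are not shown in this doc, so a minimal placeholder base class is defined locally). The check itself, a citation-format assertion for a RAG app, is an invented example of a "format and correctness" eval.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Placeholder result type; the real SDK's return shape may differ."""
    score: float
    reason: str


class BaseEvaluator:
    """Stand-in for the Evals SDK's BaseEvaluator (assumed interface)."""
    def evaluate(self, input_text: str, output_text: str) -> EvalResult:
        raise NotImplementedError


class AnswerCitesContextEvaluator(BaseEvaluator):
    """Example of a generated format check for a RAG app: flags answers
    that contain no [source: ...] marker referencing retrieved context."""

    def evaluate(self, input_text: str, output_text: str) -> EvalResult:
        has_citation = "[source:" in output_text.lower()
        return EvalResult(
            score=1.0 if has_citation else 0.0,
            reason=(
                "answer cites retrieved context"
                if has_citation
                else "no [source: ...] marker found in answer"
            ),
        )


checker = AnswerCitesContextEvaluator()
result = checker.evaluate(
    "What is our refund policy?",
    "Refunds are processed within 5 days [source: policy.md].",
)
print(result.score)  # 1.0
```

In the real workflow, the skill would emit a subclass like this against the SDK's actual base class, ready to register with an LLM Experiments run.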
When to use it
- Generating evaluators for a RAG app after noticing answer quality drift
- Building a test suite from production traces of a customer support chatbot
- Creating LLM judge prompts grounded in real failure patterns from an RCA report
- Auditing an agent app for scope violations using trace-based safety evals
- Bootstrapping format and correctness checks for a new ml_app with no existing coverage
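To illustrate the judge-prompt use case above, here is a hedged sketch of how failure patterns mined from sampled traces could be folded into an LLM judge rubric. The pattern strings are invented examples, and the real skill would hand a prompt like this to an LLMJudge instance (whose constructor is not shown here), so only the prompt assembly is sketched.

```python
# Hypothetical failure patterns, as might be surfaced from sampled traces
# or an RCA report (these strings are invented for illustration).
FAILURE_PATTERNS = [
    "answers questions outside the support domain (scope violation)",
    "invents order numbers not present in the retrieved context",
]


def build_judge_prompt(app_name: str, patterns: list[str]) -> str:
    """Assemble a judge rubric grounded in observed failure patterns."""
    rubric = "\n".join(f"- {p}" for p in patterns)
    return (
        f"You are grading a response from the '{app_name}' application.\n"
        "Fail the response if it exhibits any of these production "
        "failure patterns:\n"
        f"{rubric}\n"
        "Answer PASS or FAIL, then give one sentence of justification."
    )


prompt = build_judge_prompt("support-chatbot", FAILURE_PATTERNS)
print(prompt)
```

Grounding the rubric in named failure patterns, rather than a generic "rate the quality" instruction, is what makes the generated judge reflect real production behavior.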
