A real-time compute-resource monitoring workspace for a large-model training platform —
cross-Region, multi-accelerator (NVIDIA H200 / RTX PRO 5000 Blackwell / Alibaba Zhenwu PPU).
Built to the uploaded monitoring_prd_en.md product spec and DESIGN.md Notion visual system.
- Vue 3 with
<script setup>composition API - Tailwind CSS (design tokens from
DESIGN.md+ a dark-cyber palette) - Lucide icons via
lucide-vue-next - Vite build, hand-rolled SVG charts (no chart dependency)
npm install # already installed in this folder
npm run dev # http://localhost:5173
npm run build # production build to dist/
npm test # vitest — dictionary integrity + PRD coverage + UI tooltip coverage
npm run preview # serve the production build (http://localhost:4173)If npm run build or npm test fails with EMFILE: too many open files, that is a
file-descriptor limit on the host filesystem, not a code error — raise it first with
ulimit -n 8192 (or build on a local disk).
src/data/metrics.js documents every metric named in the PRD — all 12 L0 homepage
metrics (§7.2), the full L1 hardware/system/network/storage/training/cost/scheduling set
(§7.3), the L2 expert metrics (§7.4), and the §8 confused-metrics table — 76 entries total.
Each record carries the complete tooltip payload: definition, calculation logic,
significance, related metrics, easily-confused metrics, and notes, plus level (L0/L1/L2),
priority (P0–P3), unit, layer, default aggregation and collection sources.
Every metric surfaced in the UI shows a (?) icon (MetricTooltip) that, on hover, opens
a card teleported to <body> at a top-most z-index (never clipped or obscured) rendering
all of the above. The Settings → Metric Dictionary tab lists all 76 metrics grouped by
level and layer with live coverage counts.
Three vitest suites, 111 assertions:
- dictionary.test.js — every record has the six required doc fields, valid level/priority/layer, unique ids, and related-metric ids that resolve.
- coverage.test.js — a word-for-word checklist of every PRD metric mapped to a dictionary id; fails if any metric is dropped. Also enforces the §8 confused-metrics list.
- ui-coverage.test.js — scans every
.vue/.jsfile: each staticmetric-idused in the UI must resolve in the dictionary, and every metric-bearing page must wire up a tooltip.
The UI is a two-layer system, exactly as requested:
- Notion light shell — white canvas,
#f6f5f4surfaces, hairline borders, the signature purple (#5645d4) CTA, Notion-Sans typography, 8px buttons / 12px cards, navy (#0a1530) hero band. This is the chrome: top bar, left rail, tabs, filters. - Dark industrial-cyber metric dashboards — Slate/Zinc panels (
#11161f), a faint cyan grid, glowing accent edges, monospaced tabular readouts, live pulse indicators, animated sparklines. This is every data surface: KPI tiles, matrix, tables, charts, drawers.
- Global filters (Region / Accelerator / Tenant) + metric-value presets (allocated-low-util, high-mem-low-compute, thermal throttle, hardware risk) with a natural-language match summary (PRD §9.4).
- Live mode — the fleet jitters every 4s; toggle Live / Paused in the top bar; manual refresh.
- Metric tooltips — hover any P0/P1 metric for definition, difference-from-similar-metrics, and default aggregation (PRD §8).
- Multi-accelerator model — H200, RTX PRO 5000 Blackwell, Zhenwu 810E/M890 share one
unified
acceleratorabstraction; PPU-native vs NVIDIA-only fields handled per adapter.
src/
├─ App.vue # shell: tabs + page switch + drawers
├─ data/
│ ├─ catalog.js # regions, accelerator types, metric dictionary, tabs
│ ├─ generate.js # deterministic fleet generator + live jitter + series
│ └─ jobAnalytics.js # step-time breakdown + accelerator skew matrix
├─ store/useMonitor.js # reactive store: filters, KPIs, matrix, TopN, cost…
├─ components/
│ ├─ shell/ # TopBar, LeftRail, FilterBar
│ ├─ common/ # KpiCard, Sparkline, LineChart, Drawer, StatusBadge, MetricTooltip
│ ├─ pages/ # Overview, Resources, Jobs, Trends, Alerts, Cost, Settings
│ └─ drawers/ # AcceleratorDrawer, JobDrawer
All data flows through src/store/useMonitor.js, which is shaped to mirror the PRD API
map (§12.4): QueryOverviewSummary, QueryRegionModelMatrix, QueryTopN,
QueryResourceInventory, GetAcceleratorDetail, QueryJobStepBreakdown,
QueryJobAcceleratorSkew, SearchAlerts, QueryCostCapacitySummary, etc.
Replace the generators in data/generate.js with fetch() calls to
/api/v1/monitor/* and keep the computed selectors as-is.
