- Terminal Bench — Benchmark for LLMs on complex terminal tasks. [paper]
- SkillsBench — Benchmark evaluating how well skills work and how effectively agents use them. [paper]
- LMCache — The fastest KV cache layer for LLMs. [paper]
- OT Agent — Open-source terminal agent from the Open Thoughts team. [blog]
- ClawsBench — Benchmark for claw-like agents. [paper]
- Harbor — Agent evaluation framework and RL environment toolkit. Contributor.
- lmcache-agent-trace — Agent application, benchmark, and workload traces for LLM serving research.
- claude-code-tracing — Tracing tooling for Claude Code agent runs. [blog]
- vLLM / production-stack — High-throughput LLM inference engine and its K8s-native serving stack. Contributor.
- inference-engine-arena — A "Postman meets Chatbot Arena" for inference benchmarking. (Open-sourced ~3 months before SemiAnalysisAI/InferenceX.)
- cacheserve — KV-cache-aware serving experiments. [paper]
- lmcache-trace-analysis / mooncake-trace-replayer — Trace analysis and replay tooling for LLM inference workloads.
- Continuum — Multi-turn LLM agent scheduling with KV-cache time-to-live for efficient serving. Contributor. [paper]
- VidGen — Diffusion + autoregressive models for interactive video/game generation (Diffusive AI).
- LAG — Research experiments.
- citation-verifier — Verifying citations produced by LLM agents (TypeScript).
