DigitalOcean Inference Engine
The Inference Engine for production AI: every model, every modality, on one platform, with the DigitalOcean Inference Router to optimize every call.
Three problems with fragmented inference
Model obsolescence
Models evolve quickly, and the one you chose months ago is rarely the best option today. Without intelligent routing, staying current means repeated migrations, rewrites, and vendor churn.
Cost that compounds at every layer
Inference costs don't just grow with tokens. Requests pass through multiple vendors, each adding markup on compute and orchestration, and every hop between services incurs egress. Teams end up overpaying for simple workloads or building complex routing systems just to stay efficient.
Operational blind spots
When inference runs across fragmented services, observability becomes an afterthought. Teams lose end-to-end visibility into latency by model, cost per request, and error rates, and can't optimize what they can't measure.
Control and Evaluate AI Inference
A unified control plane for AI inference. Define routing policies, evaluate model performance, and run experiments to continuously optimize how models behave in production.
Replace fragmented tooling with a single system for policy definition, output validation, and model testing.
DigitalOcean Inference Router (Public Preview)
The Inference Router is the control plane for production AI systems, unifying how models are selected and optimized across every inference call. It replaces manual routing logic with policy-driven control that adapts in real time.
Teams define routing behavior with simple policies in natural language or structured rules, enabling intent-based control over cost and latency, without hardcoding models.
Auto-route by cost or latency
Override any request at runtime
Failover that just works
Pin trajectories for agent consistency
Runs on Serverless and Dedicated
Full Model Catalog, one endpoint
Evaluate routers like models
Model Evaluations (Public Preview)
For teams validating model performance using real datasets before deploying to production.
Model Evaluations enables structured testing of catalog and Bring Your Own Models as well as inference routers. It utilizes LLM-as-a-judge to offer unified visibility into quality and latency.
Evaluate anything: catalog, BYOM, and routers
Real datasets and LLM-as-a-Judge scoring
Correctness, completeness, faithfulness, and safety
Latency, tokens, and cost per run
Compare everything side by side
Re-run as models evolve
Model Playground
For rapid experimentation and comparison across all model types.
The Model Playground lets teams test text, image, audio, and video models side by side and export production-ready API code directly from their configuration.
Every modality, side by side
Live parameter controls
Real-time inference with any catalog model
Zero code to test
Export curl or SDK instantly
Playground to production in one click
Run AI Inference in Production
The runtime for AI inference. Execute real-time, batch, and dedicated workloads through a single system that abstracts infrastructure complexity.
Serverless Inference
For production APIs, agents, and applications that require real-time responses.
Real-time text generation, image generation, audio, and video inference
55+ curated open-source and frontier models
Day 0 access to select OpenAI and Anthropic model releases
Intelligent routing for cost and latency optimization
Built-in observability (tokens, latency, errors, spend)
Multimodal generation (text to image, video, speech)
Agentic workflows via Messages API
Batch Inference
For large-scale workloads that do not require real-time latency.
Async job-based inference via API or SDK
24-hour result delivery SLA
Up to 50% cost reduction vs real-time inference
Isolated rate limits from production workloads
Transparent job lifecycle tracking (queued → processing → complete)
OpenAI and Anthropic-compatible batch schema for easy migration
Large-scale evaluation, enrichment, and moderation pipelines
Dedicated Inference
For sustained workloads requiring infrastructure-level control and performance guarantees.
Dedicated GPU endpoints in selected regions
Bring Your Own Model deployment
Custom GPU type and scaling configuration
Pre-tuned inference stack with optimized performance defaults
Managed orchestration and scaling without Kubernetes complexity
High-throughput production workloads and agent systems
Fine-tuned control over latency and performance profiles
Serverless Inference is fantastic because we can make as many calls as we need without worrying about provisioning infrastructure. It just scales automatically.
Carlo Ruiz
Infrastructure Engineer, Traversal
Built for the multimodal era
Modern AI applications are not text-only. The Inference Engine natively supports:
Text generation
Image generation
Video generation
Speech generation
Vision-language understanding
All through a single API key. No separate vendors. No fragmented billing. No additional infrastructure.
The latest models — by design
Weekly open-source refreshes, one-line model switching, and Day 0 access to select frontier releases keep production teams moving without migrations.
Operational intelligence at every step
Cost optimization
Delivers 3.9x throughput vs. AWS Bedrock, $0.65/M serverless tokens, and $6/hr dedicated inference.
Observability built in
Track token usage, TTFT, latency, errors, spend, and batch lifecycle without external tooling. Ranked #1 on Artificial Analysis for performance efficiency across leading inference providers.
Platform native
One security model, one billing system, and one infrastructure layer from GPU to API.
Your inference layer is part of your stack
Run your inference workloads alongside your existing infrastructure with no stitched-together vendors, fragmented billing, or hidden complexity.
FAQs about Inference Engine
What is the Inference Engine?
The Inference Engine is DigitalOcean's production system for serving AI models at scale. It brings together Serverless, Batch, and Dedicated Inference under a single OpenAI and Anthropic-compatible endpoint so developers can run real-time, asynchronous, or reserved workloads without managing infrastructure.
How does DigitalOcean Inference Router work?
Instead of manually choosing models for each request, developers can rely on system-level routing or presets that automatically match requests to the most appropriate model based on task type, cost, and performance needs. This reduces the need to hardcode model decisions and helps optimize inference in production.
What models are available on DigitalOcean?
You can also import models from Hugging Face and bring your own custom models from Spaces into your Model Catalog, giving you a single pane of glass to manage and deploy everything in one place.
What is Multimodal Inference?
Multimodal Inference allows developers to generate and process images, video, and audio directly through DigitalOcean’s API. It includes capabilities like text-to-image, text-to-video, and text-to-speech, all running natively within the same platform as text models.
How is Batch Inference different from real-time inference?
Batch Inference is designed for large, asynchronous workloads that do not require immediate responses. It allows developers to submit large job sets and receive results within 24 hours at significantly lower cost than real-time inference.
What is the Model Playground used for?
The Model Playground is an interactive environment for testing text, image, audio, and video models side by side. It allows developers to adjust parameters and export ready-to-use API code directly from their configurations.
How is pricing handled across the platform?
DigitalOcean uses a pay-as-you-go model with spend-based limits rather than fixed token caps. Certain workloads also benefit from features like off-peak discounts and batch pricing to reduce overall inference costs.
Who is the Inference Engine built for?
It is designed for AI engineers and technical teams building production AI applications at scale. This includes AI-native companies, enterprise teams modernizing workflows, and developers who need flexibility across models, modalities, and deployment types.
Start building with the Inference Engine
One platform for every model. One system for every workload. One engine for production AI.

