Ouranos by Exla

Turn your GPU cluster into an OpenAI-style API.

A Kubernetes-native LLM inference control plane that turns your GPU cluster into an OpenAI-style API with built-in cost optimization, routing, and SLO guarantees.

If you already run or plan to run your own GPUs, Ouranos gives you the OpenAI experience without sacrificing control, compliance, or efficiency.

  • Stable, OpenAI-compatible endpoints every internal team can build on.
  • Central routing, autoscaling, and multi-tenant guardrails tuned for LLM workloads.
  • LLM-specific efficiency features—KV reuse, LoRA multiplexing, heterogeneous GPU placement—built in.

What the product is

A control plane that makes your GPUs feel like an OpenAI region

Under the hood, Ouranos is a Kubernetes-native control plane for LLM inference: it fronts your GPU cluster with an OpenAI-style API and bakes in cost optimization, routing, and SLO guarantees.

Plain English

You give it a K8s cluster + some GPUs + some models, and it gives you:

  • A stable OpenAI-compatible endpoint.
  • Central routing and autoscaling.
  • Multi-model, multi-tenant control.
  • LLM-specific efficiency (KV reuse, LoRA, hetero GPUs, etc.).

Who it’s for & jobs it does

Built for platform teams that own their GPU clusters

Primary persona: Head of AI Platform / Infra Lead at a company that runs, or plans to run, its own GPU clusters.

Jobs-to-be-done
  • “Give all our internal teams a single, reliable endpoint for LLMs, like OpenAI, but running on our GPUs.”
  • “Let me add or swap models without rewriting infrastructure.”
  • “Keep latency SLOs while not torching GPU spend.”
  • “Make it easy to onboard new workloads or teams without bespoke infra each time.”

What the product includes

Four layers that ship together

Think in four layers: API, Control Plane, Runtime, and Ops.

API Layer (Developer-facing)

Drop-in OpenAI-compatible endpoints with per-tenant controls.

  • Supports /v1/chat/completions, /v1/completions, /v1/embeddings.
  • Simple config for model, tenant/workspace, and routing hints (latency vs cost).
  • Change the base URL and API key, and you’re done (see the client sketch below).
  • Per-tenant API keys with quotas and rate limits enforced by the control plane.
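
As a concrete sketch of the drop-in claim above, the snippet below points the standard OpenAI Python SDK at a hypothetical Ouranos gateway; the base URL, API key, and model name are placeholders, not values Ouranos prescribes.

    from openai import OpenAI

    # Hypothetical values: your Ouranos gateway host and a per-tenant API key
    # issued by the control plane. Nothing else in the client changes.
    client = OpenAI(
        base_url="https://ouranos.internal.example.com/v1",
        api_key="ouranos-tenant-key",
    )

    response = client.chat.completions.create(
        model="llama-3-70b-instruct",  # any model registered in the control plane
        messages=[{"role": "user", "content": "Summarize our Q3 incident report."}],
    )
    print(response.choices[0].message.content)

Per-tenant quotas and rate limits are enforced by the control plane behind that key, so existing OpenAI-based code paths keep working unchanged.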

Control Plane (Platform brain)

Everything you need to register models, route traffic, and stay compliant.

  • Model & deployment registry with versions, weight locations, and runtime profiles (see the registration sketch after this list).
  • Routing engine for canaries, team allowlists, geo routing, and multi-model failover.
  • LLM-aware autoscaler that watches tokens/sec, TTFT (time to first token), queue depth, and KV cache pressure across cost/latency profiles.
  • Tenant & app management with quotas, model allowlists, and SLO tiers.
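
As a sketch of what Kubernetes-native registration could look like, the snippet below applies a hypothetical ModelDeployment custom resource with the official Kubernetes Python client; the API group, kind, and spec fields are illustrative assumptions, not Ouranos’s documented schema.

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when running in-cluster

    # Hypothetical custom resource; group/version/kind and spec fields are assumptions.
    model_deployment = {
        "apiVersion": "ouranos.exla.ai/v1alpha1",
        "kind": "ModelDeployment",
        "metadata": {"name": "llama-3-70b-instruct"},
        "spec": {
            "weights": "s3://models/llama-3-70b-instruct",  # weight location
            "runtimeProfile": "vllm-a100-80gb",             # runtime profile
            "slo": {"ttftMs": 500, "profile": "latency"},   # SLO tier / routing hint
            "tenants": ["search-team", "support-bot"],      # tenant allowlist
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="ouranos.exla.ai",
        version="v1alpha1",
        namespace="ouranos-system",
        plural="modeldeployments",
        body=model_deployment,
    )

The point is the shape of the workflow: a model version, its weight location, and its SLO tier registered as declarative config that the routing engine and autoscaler can act on.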

Runtime Layer (On your cluster)

Gateway, model servers, and runtime services tuned for dense LLM workloads.

  • Gateway ingress for auth, rate limiting, and smart routing to backends.
  • Model servers (vLLM, etc.) with batching, LoRA adapters, and KV reuse (see the sketch after this list).
  • Distributed KV cache and heterogeneous GPU placement for long contexts and mixed workloads.
  • Unified AI runtime sidecar handling model downloads, metrics, health checks, and restarts.
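
To ground the runtime features above, here is a minimal vLLM sketch that enables prefix caching (KV reuse across shared prompt prefixes) and serves a LoRA adapter; it illustrates the open-source capabilities Ouranos orchestrates, not Ouranos’s own API, and the model and adapter paths are placeholders.

    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
        enable_prefix_caching=True,  # reuse KV cache across requests sharing a prefix
        enable_lora=True,            # allow per-request LoRA adapter multiplexing
    )

    params = SamplingParams(temperature=0.2, max_tokens=256)
    outputs = llm.generate(
        ["You are a support assistant. Summarize this ticket: ..."],
        params,
        lora_request=LoRARequest("support-adapter", 1, "/adapters/support-lora"),
    )
    print(outputs[0].outputs[0].text)

Ouranos’s role at this layer, per the list above, is running and coordinating many such servers: gateway routing, distributed KV cache, and adapter placement across heterogeneous GPUs.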

Ops & UI Layer

Turn open-source building blocks into a product your teams can trust.

  • Web console for models, traffic, and GPU utilization with a routing & autoscaling editor.
  • Prometheus and Grafana out of the box, with optional Datadog or New Relic integrations (see the query sketch after this list).
  • Governance with SSO, RBAC, and audit logs for every change.
  • Lifecycle management: safe rolling upgrades, versioned configs, and one-click rollbacks.
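
Since metrics land in Prometheus by default, platform teams can query them directly. The sketch below calls the standard Prometheus HTTP API for a hypothetical time-to-first-token histogram; the metric and label names are assumptions, since the exact names Ouranos exports aren’t listed here.

    import requests

    PROMETHEUS = "http://prometheus.ouranos-system:9090"  # assumption: in-cluster Prometheus

    # Hypothetical metric/label names; substitute whatever the gateway actually exports.
    query = (
        "histogram_quantile(0.95, "
        "sum(rate(ouranos_ttft_seconds_bucket[5m])) by (le, model))"
    )

    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        model = series["metric"].get("model", "unknown")
        print(f'{model}: p95 TTFT = {float(series["value"][1]):.3f}s')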

How it differs

Why Ouranos beats the usual options

OpenAI / Anthropic / Bedrock

They hand you black-box APIs with zero infra to manage—but no control over cost structure, observability, or residency.

  • Ouranos runs on your GPUs with OpenAI-compatible ergonomics.
  • Full visibility into logs, metrics, traces, and which model handled each call.
  • Choose and customize models, including internal fine-tunes and LoRA adapters.

Ray Serve

A general-purpose serving framework built on Ray’s distributed compute engine; it excels at Python workloads but isn’t K8s-native or LLM-specialized by default.

  • Ouranos is Kubernetes-native from day one.
  • LLM-specific routing, autoscaling, and KV cache coordination are built in.
  • Delivers a managed control plane, not a framework you still have to extend.

KServe

Great for generic ML inference via CRDs, but it lacks opinionated support for complex LLM needs like KV reuse or token-centric autoscaling.

  • Ouranos is purpose-built for LLM inference, not arbitrary ML models.
  • Ships an LLM gateway, SLO-aware autoscaling, and KV reuse out of the box.
  • Provides strongly opinionated workflows tuned to LLM-specific requirements.

Roll-your-own vLLM stack

Helm charts and scripts work for one model, but every new workload means bespoke routing, autoscaling, and governance duct-tape.

  • Ouranos unifies configs, routing, autoscaling, and tenancy in one control plane.
  • Productizes upgrades, observability, and guardrails you’d otherwise hand-build.
  • Gives platform teams a supported path instead of a brittle DIY stack.

Ready to give your GPU cluster an OpenAI-quality control plane?