NVIDIA Dynamo 1.0: The Distributed Inference OS Reshaping AI Factories

When NVIDIA announced Dynamo 1.0 on March 16, 2026 at GTC, the headline performance figure was a 7x improvement in inference throughput on Blackwell GPUs, achieved purely through software orchestration without adding a single GPU to the cluster. That number matters not because it is a marketing claim but because it quantifies something the industry has struggled to articulate clearly: the bottleneck in production AI deployment is no longer the model and no longer the hardware. It is the coordination layer that decides how requests flow across GPUs, how memory is allocated and reused between inference phases, and how compute resources are dynamically balanced in response to unpredictable workloads. Dynamo 1.0 is NVIDIA's answer to that coordination problem, and it arrives as a free, open-source framework already adopted in production by companies including AstraZeneca, BlackRock, ByteDance, Coupang, Instacart, Meituan, PayPal, and Pinterest, alongside all four major hyperscalers.
Understanding why NVIDIA Dynamo distributed inference matters requires understanding the specific failure mode it addresses. Existing inference engines such as vLLM, SGLang, and TensorRT-LLM are highly capable at single-node or small multi-GPU serving, but they treat the prefill and decode phases of large language model inference as a coupled, colocated operation. At the scales demanded by agentic AI workflows, reasoning models, and enterprise deployments handling hundreds of millions of requests monthly, that coupling becomes a structural bottleneck. Dynamo decouples these phases, routes them intelligently across GPU pools, and manages the KV cache state that flows between them, functioning as what Jensen Huang described as the first operating system for AI factories. This blog covers the core architecture, how disaggregated serving works at a mechanical level, what the production deployment picture looks like, where the framework fits relative to existing inference engines, the real performance characteristics and their limitations, and how enterprises should evaluate adoption.
Why Traditional Inference Serving Could Not Scale
The standard model of LLM inference serving before disaggregated architectures treated every inference request as a two-phase sequential operation executed on the same set of GPUs. The prefill phase processes the entire input prompt, building the key-value cache that represents the model's compressed understanding of the context. The decode phase then autoregressively generates output tokens, attending to the KV cache on each step. Both phases share the same GPU memory, the same compute resources, and the same scheduling queue.
This coupling creates a fundamental resource mismatch. The prefill phase is compute-bound. It parallelises across the full prompt length in a single forward pass, creating intense demand for FLOPS. The decode phase is memory-bandwidth-bound. It makes a single-token forward pass on each step, reading the full KV cache and model weights repeatedly with comparatively light arithmetic. Running both phases on the same GPU means neither phase can be optimised for its actual resource profile. The compute-optimised configuration that accelerates prefill wastes memory bandwidth during decode, and the memory-bandwidth-optimised configuration that accelerates decode under-utilises FLOPS during prefill.
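A back-of-envelope calculation makes the mismatch concrete. The sketch below uses the common ~2 × parameters FLOPs-per-token approximation for a forward pass and counts only weight reads (ignoring KV traffic for brevity); the model size and prompt length are arbitrary illustrations, not benchmark figures:

```python
# Back-of-envelope arithmetic intensity for the two inference phases.
# All model numbers are illustrative, not measurements.

def phase_profile(params_b: float, prompt_tokens: int, bytes_per_param: int = 2):
    """Return (prefill_intensity, decode_intensity) in FLOPs per byte.

    Uses the common ~2 * params FLOPs-per-token approximation and assumes
    every weight is read once per forward pass (KV traffic ignored).
    """
    params = params_b * 1e9
    weight_bytes = params * bytes_per_param  # fp16 weights

    # Prefill: one forward pass over the whole prompt, weights read once.
    prefill_flops = 2 * params * prompt_tokens
    prefill_intensity = prefill_flops / weight_bytes

    # Decode: one forward pass per single token, weights re-read every step.
    decode_flops = 2 * params * 1
    decode_intensity = decode_flops / weight_bytes

    return prefill_intensity, decode_intensity

pre, dec = phase_profile(params_b=70, prompt_tokens=4000)
print(f"prefill: ~{pre:.0f} FLOPs/byte, decode: ~{dec:.0f} FLOPs/byte")
```

With a 4,000-token prompt the prefill pass performs thousands of FLOPs per byte of weights moved, while decode performs roughly one, which is why the same GPU configuration cannot be optimal for both.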
This mismatch is tolerable at small scale but becomes increasingly costly as request volumes grow and models get larger. Reasoning models such as DeepSeek-R1 and later generation models from major labs generate tens of thousands of tokens per request, dramatically lengthening the decode phase and exacerbating the imbalance. Agentic workflows, where multiple models invoke one another in loops, send constant bursts of inference requests that stress prefill capacity unpredictably. Neither problem has a clean solution within a coupled serving architecture.
The Predecessor: Triton Inference Server and Its Limits
NVIDIA Triton Inference Server, released in 2018, established the standard for multi-framework model serving. It unified inference across TensorFlow, PyTorch, ONNX, and OpenVINO into a single API surface and drove down inference costs significantly by enabling dynamic batching and concurrent model execution. Triton was designed for the era when inference meant single-model, single-node serving with predictable request patterns. It had no concept of disaggregating inference phases, no cluster-wide KV cache visibility, and no mechanism for routing requests based on which GPU already held the relevant cached context from a prior request. Dynamo is Triton's architectural successor, built for a fundamentally different era where models span hundreds of billions of parameters, requests arrive in heterogeneous bursts across multi-agent pipelines, and the KV cache from prior requests is a first-class resource to be managed rather than discarded.
The Core Architecture of NVIDIA Dynamo 1.0

NVIDIA Dynamo is best understood as an orchestration layer that sits above existing inference engines. It does not replace vLLM, SGLang, or TensorRT-LLM. Instead, it turns those engines into coordinated components of a multi-node inference system, adding the disaggregation, routing, scheduling, and memory management capabilities that those engines lack natively.
The framework is built in Rust for its performance-critical components and Python for extensibility and developer accessibility. Its four primary architectural innovations are disaggregated prefill and decode serving, KV-aware intelligent request routing, dynamic GPU worker allocation and replanning, and hierarchical KV cache offloading. These components are modular. Operators can deploy them individually or in combination, and each component is available as a standalone module for teams that want to integrate specific capabilities into existing systems.
Disaggregated Serving: Splitting Prefill and Decode Pools
Traditional LLM deployments placed both the prefill and decode phases of inference on a single GPU or node, despite each phase having fundamentally different resource requirements. The prefill phase processes user input to generate the first output token and is compute-bound, while the decode phase generates subsequent tokens and is memory-bound. Dynamo's disaggregated serving separates these phases into distinct GPU pools, allowing each pool to be independently sized, configured, and optimised for its actual workload.
The mechanics of disaggregation in Dynamo follow a precise sequence. When a request arrives at the Dynamo frontend, the disaggregated router decides whether the prefill should be computed remotely on a dedicated prefill worker or locally in the decode worker. This decision is runtime-dynamic, based on the absolute prefill length after accounting for any prefix cache hits, and on the current queue depth of the prefill workers. If the prefill length without cache reuse exceeds a configured threshold and prefill workers have capacity, the request is routed to a prefill pool. The prefill worker computes the forward pass, writes the resulting KV cache blocks to GPU memory, and then transfers those blocks to the decode worker via RDMA using NVIDIA NIXL (NVIDIA Inference Xfer Library). The decode worker receives the KV cache and begins token generation without repeating the prefill computation. The transfer is non-blocking, allowing both the prefill worker and the decode worker to continue processing other requests while the KV data moves.
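The routing decision described above can be sketched as a small Python function. The threshold value, field names, and fallback behaviour are illustrative assumptions for exposition, not Dynamo's actual API:

```python
from dataclasses import dataclass

@dataclass
class PrefillWorker:
    queue_depth: int
    max_queue: int

def route_prefill(prompt_tokens: int, cached_prefix_tokens: int,
                  workers: list[PrefillWorker],
                  remote_threshold: int = 512) -> str:
    """Decide local vs remote prefill using the two signals described above:
    effective prefill length after prefix-cache hits, and prefill-pool
    queue depth. Threshold and fields are illustrative, not Dynamo's API."""
    effective_len = prompt_tokens - cached_prefix_tokens
    if effective_len < remote_threshold:
        return "local"   # short prefill: transfer cost would outweigh savings
    if any(w.queue_depth < w.max_queue for w in workers):
        return "remote"  # a dedicated prefill worker has capacity
    return "local"       # prefill pool saturated: fall back to decode worker

pool = [PrefillWorker(queue_depth=3, max_queue=8)]
print(route_prefill(4000, 1000, pool))  # long effective prefill -> remote
print(route_prefill(600, 400, pool))    # 200 effective tokens -> local
```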
Separating prefill and decode pools lets operators use cheaper GPUs for decode, cutting cost per token by 3 to 5 times for chat workloads. This asymmetry in hardware cost between prefill and decode pools is one of the most consequential operational advantages disaggregated serving provides, because decode processing dominates request volume in conversational and agentic workloads by a wide margin.
KV-Aware Intelligent Routing
KV cache reuse is the second major mechanism through which Dynamo recovers compute that monolithic serving wastes. In production workloads, many requests share a common prefix: a system prompt defining an agent's role, a shared document in a RAG pipeline, a compliance disclaimer, or the accumulated history of a multi-turn conversation. If the KV cache for that prefix has already been computed by a prior request on a specific worker, routing the next request to that same worker eliminates the need to recompute the prefix entirely.
Standard load balancing strategies such as round-robin distribute requests without regard to KV cache state, forcing workers to repeatedly compute the same prefix independently. In tests on Azure Kubernetes Service, the Dynamo KV Router demonstrated a 20x reduction in time to first token and 4x faster end-to-end latency by eliminating redundant prefix recomputation. The KV router makes a two-signal routing decision for every request: which worker has the most relevant cached KV blocks, and whether that worker has the capacity to accept the request without creating queue buildup that would offset the prefill savings. Dynamo tracks KV cache state across the entire cluster using an etcd distributed key-value store for metadata and NATS for prefix caching coordination, giving the router cluster-wide visibility into which workers hold which KV blocks at any moment.
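A minimal sketch of that two-signal decision, assuming a block-hash representation of cached prefixes. The scoring weights, names, and linear trade-off are invented for illustration and are not the router's actual internals:

```python
def score_worker(cached_blocks: set, request_blocks: list,
                 active_requests: int, capacity: int,
                 overlap_weight: float = 1.0, load_weight: float = 1.0) -> float:
    """Combine the two routing signals: cached-prefix overlap (prefill saved)
    minus load (queueing risk). Block-hash sets stand in for the cluster-wide
    metadata Dynamo keeps in etcd/NATS; the weights are illustrative."""
    # Only the contiguous prefix already cached on the worker counts.
    overlap = 0
    for block in request_blocks:
        if block in cached_blocks:
            overlap += 1
        else:
            break
    load = active_requests / capacity
    return overlap_weight * overlap - load_weight * load * len(request_blocks)

workers = {
    "gpu0": ({101, 102, 103}, 6),  # holds the shared prefix, but busier
    "gpu1": (set(), 1),            # cold cache, nearly idle
}
req = [101, 102, 103, 104]
best = max(workers, key=lambda w: score_worker(workers[w][0], req,
                                               workers[w][1], capacity=8))
print(best)  # the cached prefix outweighs the extra load here
```

Even a busy worker wins when it holds a long enough cached prefix, which is exactly the trade the production router has to balance against queue buildup.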
The Dynamo Planner: Dynamic GPU Allocation
The Dynamo Planner is the continuous-optimisation component that responds to workload fluctuations in real time. It monitors queue depth, GPU utilisation, time-to-first-token SLOs, and inter-token latency SLOs across all workers in a cluster, and makes dynamic decisions about whether incoming requests should be served with disaggregated or aggregated execution, and whether additional GPU workers should be added to the prefill or decode pool. When a surge of long-input summarisation requests arrives and overwhelms prefill workers, the Planner can reallocate decode GPUs to perform prefill work, sacrificing some decode throughput temporarily to prevent prefill becoming a bottleneck for all users. This adaptive behaviour is impossible in static serving configurations and is one of the capabilities that makes Dynamo suitable for the unpredictable traffic patterns of agentic workloads.
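One Planner-style decision step might look like the following simplified sketch. The monitored signals match those named above (TTFT, inter-token latency, pool utilisation), but the rule itself is an illustration rather than Dynamo's algorithm, and every threshold is an assumption:

```python
def plan(ttft_ms: float, ttft_slo_ms: float,
         itl_ms: float, itl_slo_ms: float,
         prefill_util: float, decode_util: float,
         shift_threshold: float = 0.9) -> str:
    """One planning step: compare measured TTFT / inter-token latency against
    SLOs and suggest a pool adjustment. Simplified illustration only."""
    if ttft_ms > ttft_slo_ms and decode_util < shift_threshold:
        return "shift decode GPU to prefill pool"   # prefill is the bottleneck
    if itl_ms > itl_slo_ms and prefill_util < shift_threshold:
        return "shift prefill GPU to decode pool"   # decode is the bottleneck
    if ttft_ms > ttft_slo_ms and itl_ms > itl_slo_ms:
        return "scale out: add workers"             # both pools saturated
    return "hold current allocation"

# A prefill surge: TTFT is blown, and the decode pool has slack to lend.
print(plan(ttft_ms=900, ttft_slo_ms=300, itl_ms=25, itl_slo_ms=50,
           prefill_util=0.98, decode_util=0.55))
```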
KVBM, NIXL, and Grove: The Supporting Modules
Dynamo's modular architecture exposes three core building blocks as standalone components that can be adopted independently.
KVBM (KV Block Manager) manages the allocation, eviction, and offloading of KV cache blocks across GPU HBM, CPU DRAM, and object storage tiers such as S3 and Azure Blob. It supports cache pinning for frequently reused prefixes, evicting less-relevant blocks to cheaper storage while keeping hot prefixes resident in GPU memory.
NIXL (NVIDIA Inference Xfer Library) is the low-level transport layer for KV cache movement. It supports RDMA over InfiniBand, RoCE via UCX, TCP fallback, NVMe-oF, and S3-compatible storage, and uses zero-copy GPU-to-GPU transfers with minimal CPU involvement.
Grove is the Kubernetes orchestration extension that handles topology-aware gang scheduling, autoscaling of disaggregated components, and NVLink fabric-aware pod placement on GB300 NVL72 racks. It integrates with the standard Kubernetes Inference Gateway plugin to make KV-aware routing available inside standard Kubernetes networking primitives.
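As a rough illustration of the KVBM behaviour described above, tiered eviction with pinned hot prefixes, here is a minimal Python sketch. The class and method names are invented for illustration and do not reflect KVBM's actual API:

```python
from collections import OrderedDict

class TieredKVCache:
    """LRU KV-block cache with pinning and offload to a slower tier,
    sketching KVBM-style HBM -> DRAM/object-storage behaviour.
    Names and structure are illustrative, not KVBM's API."""
    def __init__(self, hbm_capacity: int):
        self.hbm = OrderedDict()  # block_id -> payload (hot tier)
        self.offloaded = {}       # block_id -> payload (cold tier)
        self.pinned = set()
        self.capacity = hbm_capacity

    def put(self, block_id, payload, pin: bool = False):
        if pin:
            self.pinned.add(block_id)
        self.hbm[block_id] = payload
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.capacity:
            # Evict the least-recently-used unpinned block to cold storage.
            for victim in self.hbm:
                if victim not in self.pinned:
                    self.offloaded[victim] = self.hbm.pop(victim)
                    break
            else:
                break  # everything pinned: nothing can be evicted

    def get(self, block_id):
        if block_id in self.hbm:
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id], "hbm"
        if block_id in self.offloaded:
            return self.offloaded[block_id], "offloaded"
        return None, "miss"

cache = TieredKVCache(hbm_capacity=2)
cache.put("system_prompt", b"...", pin=True)  # hot prefix stays resident
cache.put("doc_a", b"...")
cache.put("doc_b", b"...")                    # evicts doc_a, not the pin
print(cache.get("system_prompt")[1], cache.get("doc_a")[1])  # hbm offloaded
```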
Performance Characteristics: What the Numbers Actually Mean
Dynamo 1.0 boosts inference performance of NVIDIA Blackwell GPUs by up to 7x through free, open-source software, lowering cost per token and increasing the revenue opportunity of installed GPU fleets. Understanding what that figure actually represents requires knowing its measurement conditions.
The 7x figure was measured for disaggregated serving combined with wide expert parallelism on Blackwell GB200 NVL72 hardware. The throughput gap grows with prompt length, because long prefill is where coupled setups suffer most: at 4,000-token inputs, a monolithic deployment stalls decode work while the prompt is being prefilled, whereas a disaggregated setup keeps both pools productive simultaneously. On Hopper-generation H100 hardware, gains from disaggregation alone are meaningful but more modest, and teams should benchmark on their specific hardware before making infrastructure decisions.
The framework boosts the number of requests served by up to 30x when running the open-source DeepSeek-R1 model on a large cluster of GB200 NVL72 racks, because reasoning models generate tens of thousands of tokens per request, making the decode phase even more dominant and the hardware mismatch between coupled prefill and decode even more severe.
For multimodal workloads, Dynamo 1.0 introduces a three-stage disaggregated encode/prefill/decode pipeline. With an embedding cache, this approach achieves 30% faster time to first token on image workloads by allowing the encode stage to run on hardware optimised for vision embedding computation, independently from the text prefill and decode stages.
Where Performance Gains Are Real and Where They Are Not
The performance improvements are most significant in three scenarios: long-prompt workloads where prefill is computationally expensive, high-concurrency workloads where KV cache reuse across shared system prompts is frequent, and reasoning model deployments where decode phases are dramatically longer than in standard instruction-following models. Gains are less significant for short-context chat applications with low concurrency, where the overhead of disaggregation and KV transfer can exceed the compute savings. Teams operating at modest scale on a single H100 node may find that vLLM with paged attention and continuous batching delivers adequate throughput without the operational complexity of a disaggregated multi-node deployment. The decision point for adopting Dynamo is generally when the workload requires coordination across multiple GPU nodes, when KV cache recomputation of shared prefixes becomes measurable in latency or cost, or when agentic workflows create bursty, heterogeneous request patterns that exceed what static resource allocation can handle efficiently.
NVIDIA Dynamo Distributed Inference vs. Existing Frameworks
Dynamo is the orchestration layer above inference engines. It does not replace SGLang, TensorRT-LLM, or vLLM. Instead, it turns them into a coordinated multi-node inference system. This positioning is architecturally distinct from what those engines provide individually, and understanding the layered relationship is essential for practitioners evaluating the stack.
vLLM provides paged attention, continuous batching, and a clean OpenAI-compatible API surface. It is the fastest path to production, starts in minutes, and supports the widest range of Hugging Face models. vLLM serves as one of the inference backends that Dynamo orchestrates, and Dynamo's KV-aware router and NIXL-based KV transfer are available as plugins within vLLM's own disaggregated prefill implementation.
SGLang provides RadixAttention, which is a shared prefix caching mechanism particularly valuable for structured generation and agent-loop workloads. When Dynamo orchestrates SGLang workers, it adds cluster-wide KV routing and cross-node disaggregation capabilities that SGLang's native prefix caching cannot provide on its own.
TensorRT-LLM provides the deepest hardware-level optimisation for NVIDIA GPUs, with CUDA graph fusion, FP8 quantization running natively on Tensor Cores, and speculative decoding. It achieves 15 to 30% higher throughput than vLLM on H100s under optimal conditions. When combined with Dynamo, TensorRT-LLM serves as the decode or prefill worker backend, gaining the orchestration, disaggregation, and dynamic scheduling capabilities it does not provide natively.
The practical recommendation for teams evaluating this stack is to treat Dynamo as the production infrastructure decision and choose between vLLM, SGLang, or TensorRT-LLM as the per-node engine decision based on workload characteristics, model stability, and acceptable setup complexity.
Ecosystem Integration and Production Adoption
The breadth of adoption at Dynamo 1.0's production launch is unusual for an infrastructure framework of this technical depth. The NVIDIA inference platform is integrated by cloud service providers AWS, Microsoft Azure, Google Cloud, and Oracle Cloud Infrastructure, alongside cloud partners Alibaba Cloud, CoreWeave, Together AI, and Nebius, and adopted by enterprises including ByteDance, Meituan, PayPal, Pinterest, AstraZeneca, BlackRock, Coupang, Instacart, Shopee, and SoftBank Corp.
Dynamo and TensorRT-LLM optimizations integrate natively into open source frameworks including LangChain, llm-d, LMCache, SGLang, and vLLM. The LangChain integration, specifically ChatNVIDIADynamo, enables agent workflows built in LangChain to pass per-request metadata hints to Dynamo that influence routing decisions, cache pinning TTL, and expected output length, allowing the application layer to communicate workload intent to the infrastructure layer in a way that was not previously possible.
The framework is available under Apache 2.0 license at the ai-dynamo GitHub organisation, and production multi-node deployments are also available via NVIDIA AI Enterprise for teams requiring support, security backporting, and stability guarantees. NVIDIA NIM microservices will bundle Dynamo capabilities as a deployment option for enterprises who prefer a container-packaged path rather than a self-managed installation.
KriraAI, an AI solutions company focused on building and deploying production-grade AI systems for enterprise clients, has closely tracked the development of Dynamo since its initial GTC 2025 preview. For clients operating at the scale where inference infrastructure decisions directly affect operational margins and user experience quality, the architectural shift that Dynamo 1.0 represents is one of the most consequential infrastructure developments of 2026.
Practical Adoption: When and How to Deploy Dynamo
The decision to adopt Dynamo is fundamentally a function of workload characteristics and deployment scale. The framework adds operational complexity that is not justified for all deployments, and KriraAI consistently advises enterprise clients to match infrastructure investment to the specific constraints that are actually limiting their systems.
The clearest signals that Dynamo adoption is warranted include:
Your workload requires serving requests across multiple GPU nodes and you need to coordinate prefill and decode across those nodes rather than running them independently.
You can identify frequent shared prefixes in your workload (system prompts, RAG document contexts, conversation history) and are currently paying the recomputation cost on every request.
You are deploying reasoning models such as DeepSeek-R1, Llama 4, or similar architectures where decode phases generate thousands of tokens and the memory-bandwidth demands of the decode stage are significantly different from the compute demands of prefill.
You are running agentic workflows where multiple model calls compose within a single user-facing pipeline, creating heterogeneous, bursty request traffic that static allocation cannot serve efficiently.
You are deploying multimodal models where image encoding adds a third distinct resource profile to the prefill-decode split.
For teams beginning deployment, the zero-config deployment feature introduced in Dynamo 1.0 via the DynamoGraphDeploymentRequest Kubernetes custom resource simplifies initial setup significantly. Teams specify model, hardware, and SLA targets in a single YAML manifest; Dynamo's AIConfigurator profiles the workload offline, the Planner optimises the disaggregated topology, and Dynamo handles deployment automatically. This removes much of the manual parallelism configuration that previously made distributed inference deployments require days or weeks of expert tuning.
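For orientation, a manifest of the kind described might look roughly like the sketch below. The CRD kind comes from the text above, but every field name in the body is an assumption for illustration rather than the documented schema; consult the Dynamo documentation for the actual spec:

```yaml
# Illustrative sketch only -- field names below are assumptions,
# not the documented DynamoGraphDeploymentRequest schema.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-disagg
spec:
  model: meta-llama/Llama-3.1-70B-Instruct
  hardware:
    gpu: H100
    count: 16
  sla:
    ttft_ms: 300   # time-to-first-token target
    itl_ms: 50     # inter-token latency target
```

From a manifest like this, the AIConfigurator and Planner would derive the prefill/decode pool split and parallelism settings that previously required manual tuning.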
Integration with Existing Kubernetes Infrastructure
For teams running existing Kubernetes-based AI infrastructure, Dynamo's Grove component integrates with the standard Kubernetes Inference Gateway through a plugin that adds KV-aware routing to native Kubernetes service routing. The integration preserves existing cluster management tooling while adding Dynamo's routing intelligence. The KAI scheduler integration enables topology-aware pod placement that accounts for NVLink fabric proximity, ensuring that prefill and decode workers that communicate frequently are placed on nodes with the lowest inter-GPU latency for KV cache transfers.
Security and compliance have been active blockers for enterprise AI inference deployments. Many enterprises have been stuck in pilot phases because self-built RAG implementations often lack the policy enforcement needed to protect regulated data across business units. Dynamo's integration with NVIDIA's pipeline deployment framework, combined with the policy enforcement available through NVIDIA AI Enterprise, provides a path for regulated industries to deploy disaggregated inference without requiring custom security tooling.
Current Limitations and Open Technical Challenges
Honest evaluation of Dynamo 1.0 requires naming the genuine limitations that practitioners will encounter in production.
The KV cache transfer latency introduced by disaggregation adds overhead to every request that uses remote prefill. For short prompts where the prefill computation itself takes less time than the RDMA transfer of the resulting KV cache, disaggregation reduces performance rather than improving it. The disaggregated router mitigates this by dynamically choosing local prefill for short requests, but the threshold tuning requires workload-specific benchmarking and will need adjustment as workload composition changes.
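The break-even point behind that threshold can be estimated with a simple model: remote prefill pays off once the fixed per-request transfer setup cost is amortised by the per-token compute saved. All inputs below are illustrative placeholders for workload-specific benchmarks, not Dynamo defaults:

```python
def breakeven_tokens(kv_bytes_per_token: float, link_gbps: float,
                     prefill_tokens_per_s: float, setup_s: float) -> float:
    """Prompt length where RDMA transfer time equals the prefill compute
    time it saves, per the heuristic above. Inputs are illustrative
    placeholders for workload-specific benchmarks."""
    transfer_per_token = kv_bytes_per_token / (link_gbps * 1e9 / 8)
    compute_per_token = 1.0 / prefill_tokens_per_s
    margin = compute_per_token - transfer_per_token
    if margin <= 0:
        return float("inf")  # link too slow: remote prefill never pays off
    return setup_s / margin

# e.g. ~160 KB of KV per token, 400 Gb/s link, 20k prefill tok/s, 2 ms setup
n = breakeven_tokens(160e3, 400, 20_000, 0.002)
print(f"remote prefill pays off above ~{n:.0f} tokens")
```

Because the right threshold depends on KV size per token (model- and precision-specific), link speed, and prefill throughput, the same deployment needs re-tuning whenever any of these change.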
TensorRT-LLM's integration with Dynamo's KVBM for disaggregated serving carried known issues at the time of the 1.0 release, with specific build dependencies required to avoid request-hang behaviour in certain configurations. Teams deploying TensorRT-LLM backends with KVBM should follow the specific container build guidance in the documentation rather than using the latest TensorRT-LLM release without verification.
Multi-datacenter and cross-region deployments are not a supported use case in the current architecture. Dynamo is designed for intra-datacenter deployments where NVLink, InfiniBand, or RoCE connectivity provides the bandwidth and latency characteristics that make KV cache transfer cost-effective. Cross-region links typically offer two to three orders of magnitude less bandwidth than intra-rack NVLink, which makes disaggregating across geographic regions impractical for latency-sensitive workloads.
The operational complexity of a multi-pool disaggregated deployment is substantially higher than a single-node vLLM deployment. Teams need to manage prefill pool sizing, decode pool sizing, KV transfer bandwidth, and router configuration, and each of these dimensions interacts with the others. Because shifts in workload composition can dramatically change the optimal pool ratios, weekly configuration reviews are appropriate for catching cost and latency regressions before they compound.
Conclusion
Three things stand out from a careful analysis of NVIDIA Dynamo 1.0. First, at an architectural level, Dynamo's core insight is that the prefill and decode phases of LLM inference are fundamentally different resource problems that should be matched to different hardware and scheduled independently, and the disaggregated serving architecture that follows from this insight is not an incremental optimisation but a structural redesign of how inference at scale must be built. Second, the technology matters most for organisations deploying reasoning models, RAG-intensive applications, or multi-agent workflows at the scale where inference infrastructure costs are a primary operational variable. The performance improvements are real but workload-dependent, and the gains are most substantial precisely in the classes of workloads that enterprise AI adoption is accelerating toward fastest in 2026. Third, the practical response for AI engineering teams is to evaluate Dynamo against their specific workload profile, prioritise workloads where KV cache reuse and disaggregated scaling would provide measurable benefit, and adopt the modular components incrementally rather than treating Dynamo as an all-or-nothing infrastructure replacement.
FAQs
What is NVIDIA Dynamo, and how does it differ from inference engines such as vLLM and TensorRT-LLM?
NVIDIA Dynamo is an open-source, distributed inference orchestration framework that operates as a layer above existing inference engines rather than replacing them. Where vLLM and TensorRT-LLM serve inference requests within a single node or GPU group using continuous batching and paged attention, Dynamo coordinates inference work across entire multi-node GPU clusters by disaggregating the prefill and decode phases of LLM inference onto separate worker pools, routing requests based on KV cache state across the cluster, and dynamically reallocating GPU resources in response to workload changes. Dynamo integrates with vLLM, SGLang, and TensorRT-LLM as its backend engines and adds the coordination, disaggregation, and memory management capabilities those engines lack. It was released as version 1.0 on March 16, 2026 and is available under the Apache 2.0 license.
How does disaggregated inference work in Dynamo?
Disaggregated inference splits the two phases of LLM request processing onto dedicated GPU pools. The prefill phase, which processes the input prompt to compute the KV cache and is compute-bound, runs on prefill workers optimised for high FLOPS utilisation. The decode phase, which autoregressively generates output tokens by attending to the KV cache and is memory-bandwidth-bound, runs on decode workers optimised for high memory bandwidth. When a request arrives, Dynamo's router decides whether to perform remote prefill based on prompt length and worker queue depth. If remote prefill is selected, the prefill worker computes the forward pass, then transfers the resulting KV cache tensors to the decode worker's GPU memory via RDMA using NVIDIA's NIXL library. The transfer is non-blocking, allowing parallel processing of other requests on both workers during the transfer. This separation allows each phase to be hardware-matched and independently scaled, which is the source of the measured throughput gains.
What scale and hardware does Dynamo need to deliver its headline performance gains?
Dynamo supports deployments from a single GPU to thousands of GPUs. For disaggregated serving to provide measurable benefit, teams need at least two GPU nodes with high-bandwidth interconnect, ideally InfiniBand or NVLink, to make KV cache transfers fast enough to offset the coordination overhead. On Hopper-generation H100 hardware, disaggregated serving provides throughput improvements that vary significantly with prompt length and model size, with the largest gains observed on long-prompt, high-concurrency workloads. On Blackwell GB200 NVL72 hardware, independent benchmarks validated the 7x throughput improvement for disaggregated serving with wide expert parallelism on mixture-of-experts models. Enterprises should run their specific workloads through Dynamo's AIConfigurator profiling tool before making infrastructure decisions, because the performance characteristics are highly workload-dependent.
How does KV-aware routing reduce inference costs?
KV-aware routing reduces costs by eliminating redundant KV cache recomputation for requests that share common prefixes. In production RAG pipelines, agentic workflows, and multi-turn conversational applications, a substantial fraction of requests share an initial token sequence: a shared system prompt, a document context, or conversation history. Without KV-aware routing, each request recomputes the KV cache for those shared tokens from scratch on whichever GPU happens to receive the request under round-robin or load-based routing. Dynamo tracks KV cache state across the cluster via etcd metadata storage and NATS coordination, giving the router visibility into which workers already hold which cached KV blocks. By routing new requests to workers with matching cached prefixes, Dynamo eliminates the prefill computation for those tokens entirely, reducing time to first token and reducing the GPU-seconds consumed per request. The Azure Kubernetes Service deployment guide documented a 20x improvement in time to first token from KV routing in workloads with high prefix reuse.
Who should adopt NVIDIA Dynamo?
NVIDIA Dynamo is designed primarily for organisations deploying AI inference across multiple GPU nodes or managing workloads where inference cost and latency at scale are measurable business concerns. The framework adds operational complexity that is not justified for small-scale or single-node deployments. Organisations serving inference from a single GPU or a small GPU cluster with predictable, uniform request patterns will typically find that vLLM with continuous batching and prefix caching provides adequate performance without the overhead of a disaggregated orchestration layer. The relevant threshold is roughly when an organisation needs to coordinate inference work across multiple nodes, when inference infrastructure costs are large enough that a 3x to 7x efficiency improvement meaningfully affects operating economics, or when agentic workload patterns create the heterogeneous request dynamics that disaggregated scheduling is built to handle. KriraAI, which builds and operates production AI systems for enterprise clients across diverse scale requirements, evaluates Dynamo on a workload-by-workload basis precisely because the infrastructure investment only delivers returns above a specific operational threshold.
CEO
Divyang Mandani is the CEO of KriraAI, driving innovative AI and IT solutions with a focus on transformative technology, ethical AI, and impactful digital strategies for businesses worldwide.