Unpacking NVIDIA Dynamo LLM Inference Framework
Everything you need to know about Dynamo. Code, components, concepts with diagrams and details.
In the AI world, this week was NVIDIA’s GTC 2025 week.
Amongst the many announcements Jensen made in his keynote, the open-sourcing of the Dynamo Framework is the one that will have the most impact on AI/ML Engineers.
After researching it for the past 8 hours, I can say it's an elegant and powerful approach to solving LLM inference optimally and at scale, and it genuinely surprised me.
This article is deeply technical, as we're going to unpack each technical module of NVIDIA Dynamo, the same way we did with NIM (NVIDIA Inference Microservices) in a past article; if you're curious, find it here:
For Dynamo, we're going to go over all the major components, their inner workings, what the framework provides, and why AI Engineers should be excited about it.

Table of Contents
Recap on the LLM Inference Process
NVIDIA Dynamo Inference Framework
What is Disaggregated Serving and the Role of the Prefill Queue
NVIDIA Inference Transfer Library (NIXL) and KV Cache Transfer
Distributed Load Balancing and Smart Routing
Memory Manager and Dynamo Events
Installing and serving LLMs with Dynamo
Conclusion
1. Recap on the LLM Inference Process
From the previous article, we learned that LLMs are trained on a large text corpus (e.g., trillions of tokens). The pretraining dataset is composed of text passed through a tokenizer with a fixed vocabulary size. The LLM then learns to predict the next token from the preceding sequence of tokens.
Due to this auto-regressive nature, the inference process for LLMs splits into two stages: prefill and generation.

The prefill phase can be considered a warm-up: it is the first pass of the user's query through the model, and it generates the first token. This pass populates the attention matrices, the KV values, GPU memory, etc. Prefill is compute-bound and efficiently saturates the GPU because the model processes the entire input context in parallel in a single pass.
Next, let's look at the decoding (i.e. generation) phase:

The generation phase is the iterative process of sampling the next token until inference reaches a stop condition, be it predicting a special token that tells the LLM to stop, or hitting the MAX_NEW_TOKENS limit that we've set.
Generation is memory-bound, meaning that the time spent moving weights, keys, values, and activations from GPU memory to the compute units is longer than the time spent on the actual computation for each token.
For more details on this topic, here is the full article.
With this summary in mind, we can conclude two things:
Firstly, the prefill happens once per request, and it’s highly parallelizable.
Secondly, the decoding is iterative, and each new token requires the KV states of all previous tokens in the sequence. To avoid recomputing them, the KV pairs can be cached in GPU memory and loaded on demand, a concept known as KV Caching.
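To make this concrete, here is a minimal sketch of the two phases using the Hugging Face transformers library and greedy decoding; gpt2 is just a stand-in checkpoint, and production engines like vLLM implement this loop far more efficiently.
# Minimal prefill + decode loop with KV caching (conceptual sketch, not Dynamo code)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

# Prefill: one pass over the whole prompt, which also builds the KV cache
with torch.no_grad():
    out = model(input_ids, use_cache=True)
past_key_values = out.past_key_values
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Decode: feed only the newest token and reuse the cached K/V states,
# so earlier positions are never recomputed
generated = [next_token]
for _ in range(16):  # MAX_NEW_TOKENS
    with torch.no_grad():
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))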
KV Caching is a core optimization that many LLM serving engines support, for example vLLM through its PagedAttention design or SGLang through RadixAttention.
With NVIDIA Dynamo, the KV Caching mechanism becomes way more interesting due to the Smart Router and Cache distribution across GPUs.
2. NVIDIA Dynamo Inference Framework
Just 2 days ago (March 17), at GTC, Jensen announced that the NVIDIA Inference team is open-sourcing Dynamo, an inference framework for accelerating and scaling AI reasoning models at the lowest cost with the highest efficiency.
With the current surge of reasoning models such as DeepSeek R1 and R1-Zero, the challenges of serving LLMs at scale have grown more complex due to the dynamic, hard-to-predict behavior of these models: they generate a long sequence of "reasoning" tokens before providing an output, which takes a toll on system performance.
For that use case, and for many other challenges of LLMs at scale, Dynamo proposes an innovative approach that builds on and takes inspiration from existing inference engines.
Dynamo is built mainly in Rust (55%) for the performance-critical modules, thanks to Rust's memory safety and concurrency; Go (28%) for the deployment module, including the Operator, Helm charts, and API Server; and Python (10%) for its flexibility and ease of customization, tying it all together.
Let’s see what Dynamo is composed of, how it works, and what problems it aims to solve.

Dynamo brings 4 key components, which we’ll describe in detail. These are:
(1) GPU Planner - a planning engine that dynamically adds and removes GPUs to adjust to fluctuating user demand, avoiding GPU over- or under-provisioning.
(2) Smart Router - an LLM-aware router that directs requests across large GPU fleets to minimize costly GPU recomputations of repeat or overlapping requests, freeing up GPUs to respond to new incoming requests.
(3) NVIDIA Inference Transfer Library (NIXL) - an inference-optimized library that supports state-of-the-art GPU-to-GPU communication and abstracts the complexity of data exchange across heterogeneous devices, accelerating data transfer.
(4) Memory Manager - an engine that intelligently offloads and reloads inference data to and from lower-cost memory and storage devices without impacting user experience.
All the components are robust, modular, and independently scalable.
Dynamo Disaggregation
The Disaggregated Serving module is inspired by the DistServe (2024) paper, which proposes separating prefill and decode onto different GPUs so that resource allocation and the parallelism strategy can be co-optimized for each phase. Separating prefill and decode increases efficiency because the two phases have different memory and compute footprints.
The execution of a request follows these main steps:
The prefill engine computes the prefill phase and generates the KV cache
The prefill engine transfers the KV cache to the decode engine
The decode engine computes the decode phase
If the prefill is short (a short prompt) or the decode engine has a high cache-hit rate, the prefill can be done locally in the decode engine by piggybacking chunked prefill requests onto ongoing decode requests.
Dynamo handles all of these scenarios by design, whether the prefill runs locally or on a remote prefill worker, as seen in the following diagram:

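As a rough illustration of that local-vs-remote decision, here is a conceptual Python sketch; the thresholds, field names, and function are assumptions made for this example, not Dynamo's actual policy.
# Conceptual sketch of the local-vs-remote prefill decision (illustrative only)
from dataclasses import dataclass

@dataclass
class PrefillRequest:
    request_id: str
    prompt_tokens: int
    cached_tokens: int  # prompt tokens already covered by the decode worker's KV cache

LOCAL_PREFILL_MAX_TOKENS = 512  # "short prompt" cutoff (assumed value)
CACHE_HIT_THRESHOLD = 0.8       # fraction of the prompt already cached (assumed value)

def should_prefill_locally(req: PrefillRequest) -> bool:
    uncached = req.prompt_tokens - req.cached_tokens
    cache_hit_ratio = req.cached_tokens / max(req.prompt_tokens, 1)
    # Short or mostly-cached prompts are piggybacked onto ongoing decode iterations;
    # everything else is pushed to the global prefill queue
    return uncached <= LOCAL_PREFILL_MAX_TOKENS or cache_hit_ratio >= CACHE_HIT_THRESHOLD

req = PrefillRequest("r1", prompt_tokens=4096, cached_tokens=3900)
print("local" if should_prefill_locally(req) else "remote")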
The Prefill Queue
If prefill requests can't be executed locally in the decode workers, as described above, they are executed in dedicated prefill iterations, without any other requests batched alongside, to ensure optimal TTFT (Time To First Token).
To balance the load across multiple prefill engines, Dynamo adopts a global prefill queue: workers push remote prefill requests, and prefill workers pull and complete the requests one by one. The global prefill queue is implemented on top of a NATS stream to ensure high performance and availability.
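Dynamo's own queue lives in its Rust core, but the push/pull pattern can be sketched with the nats-py client; the stream, subject, and payload fields below are made up for illustration.
# Conceptual sketch of a NATS JetStream-backed prefill work queue (not Dynamo's code)
import asyncio
import json
import nats

async def main():
    nc = await nats.connect("nats://127.0.0.1:4222")
    js = nc.jetstream()

    # Persistent stream backing the (illustrative) global prefill queue
    await js.add_stream(name="PREFILL", subjects=["prefill.requests"])

    # Decode worker side: push a remote prefill request into the queue
    request = {"request_id": "r1", "token_ids": [101, 2023, 2003], "kv_block_ids": [0, 1]}
    await js.publish("prefill.requests", json.dumps(request).encode())

    # Prefill worker side: pull and complete requests one by one;
    # the shared durable consumer load-balances work across prefill workers
    sub = await js.pull_subscribe("prefill.requests", durable="prefill-workers")
    for msg in await sub.fetch(1, timeout=5):
        print("prefilling", json.loads(msg.data)["request_id"])
        await msg.ack()

    await nc.close()

asyncio.run(main())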
NIXL and KV Cache Transfer
The LLM-aware request routing module can target the workers with the best KV match in the distributed KV Cache setup across multiple nodes, and execute the inference stages there.
After the KV blocks are allocated, the worker scheduler sends the remote prefill requests, which contain the memory descriptors for the allocated KV blocks, to the prefill worker scheduler via the prefill queue.
This allows the prefill worker to read and write from the remote KV blocks without explicit handling in the remote worker engine, thanks to the RDMA read-and-write NIXL operations.
If you're curious how the KV Cache, which is composed of tensor states, is shared through NIXL: NIXL offers generic interfaces that support data transfers in the form of tensors, bytes, or objects across multiple nodes.
Once the remote prefill is done, the worker scheduler simply adds the decode request to the worker's in-flight batch. This allows workers to keep executing forward passes of ongoing decode/prefill requests while waiting for the remote prefill to finish.

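To make the flow more tangible, here is a hypothetical sketch of what a remote prefill request might carry; the KVBlockDescriptor, RemotePrefillRequest, and TransferEngine types are invented for this example and are not the real NIXL or Dynamo APIs.
# Hypothetical illustration of remote prefill writing into pre-allocated KV blocks
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class KVBlockDescriptor:
    worker_id: str  # decode worker that owns the block
    block_id: int   # identifies pre-allocated GPU memory on that worker

@dataclass
class RemotePrefillRequest:
    request_id: str
    token_ids: List[int]
    kv_blocks: List[KVBlockDescriptor]  # where the prefill results must land

class TransferEngine:
    """Stand-in for NIXL: one-sided writes into a remote worker's KV blocks."""
    def __init__(self) -> None:
        self.remote_memory: Dict[Tuple[str, int], bytes] = {}
    def write(self, desc: KVBlockDescriptor, payload: bytes) -> None:
        # With NIXL this would be an RDMA write; here we just record it
        self.remote_memory[(desc.worker_id, desc.block_id)] = payload

def run_remote_prefill(req: RemotePrefillRequest, engine: TransferEngine) -> None:
    # Placeholder for the model's prefill forward pass: one fake KV block per descriptor
    fake_kv_blocks = [bytes(16) for _ in req.kv_blocks]
    for desc, block in zip(req.kv_blocks, fake_kv_blocks):
        engine.write(desc, block)  # the decode worker never has to handle the transfer

engine = TransferEngine()
run_remote_prefill(
    RemotePrefillRequest("r1", token_ids=[1, 2, 3], kv_blocks=[KVBlockDescriptor("decode-0", 0)]),
    engine,
)
print(list(engine.remote_memory.keys()))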
KV Cache and Load Balancing in Dynamo
In vLLM, PagedAttention is the mechanism for managing and reusing the KV Cache. In SGLang, RadixAttention, with its radix tree and LRU eviction policy, enhances cache reuse.
In Dynamo, a KVPublisher is introduced, which emits the KV Cache events that occur at each worker, and a KVIndexer, which keeps track of these events globally.
Similarly to SGLang, the KV Cache eviction is based on a RadixTree approach.
Because the router has a global view of the KV Cache and of worker load, it can select a worker based on a cost function that combines load and KV match.
For example, in the following image the cost is (KV match - load). Worker 3 has the best KV Cache match, but it's also at 80% load. Given these conditions, Worker 2 is selected: even though its KV match is lower than Worker 3's, its load is almost half.

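A minimal sketch of this selection logic is shown below; the KV-match numbers are assumptions that roughly mirror the figure, and Dynamo's real cost function is configurable rather than hard-coded like this.
# Conceptual KV-aware worker selection using the (KV match - load) cost function
from dataclasses import dataclass
from typing import List

@dataclass
class WorkerState:
    name: str
    kv_match: float  # fraction of the request's prefix already cached on this worker
    load: float      # current utilization, 0.0 to 1.0

def select_worker(workers: List[WorkerState]) -> WorkerState:
    # A longer cached prefix is better, a higher load is worse
    return max(workers, key=lambda w: w.kv_match - w.load)

workers = [
    WorkerState("worker-1", kv_match=0.1, load=0.30),
    WorkerState("worker-2", kv_match=0.5, load=0.40),
    WorkerState("worker-3", kv_match=0.7, load=0.80),
]
print(select_worker(workers).name)  # worker-2: slightly lower match, much lower load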
GPU Planner
The GPU Planner takes care of elastic GPU resources, maximizing GPU throughput within a cluster of nodes. Based on the metrics gathered by the Event Plane, Dynamo adjusts the cluster's resources: it can add, remove, or reallocate GPUs in response to fluctuating request volumes and types, or route specific requests to target GPUs (for example, a very long reasoning request routed to an H200 node, since it requires more VRAM for the prefill/decode phases).
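As a sketch of the kind of decision this implies (the PoolMetrics fields, thresholds, and plan function are assumptions for illustration, not the planner's real API):
# Conceptual sketch of a scaling decision driven by Event Plane-style metrics
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PoolMetrics:
    prefill_queue_depth: int       # pending remote prefill requests
    avg_decode_utilization: float  # average decode-worker utilization, 0.0 to 1.0

def plan(metrics: PoolMetrics, prefill_gpus: int, decode_gpus: int) -> Tuple[int, int]:
    # Grow prefill capacity when the global prefill queue backs up,
    # and shrink the decode pool when it sits mostly idle
    if metrics.prefill_queue_depth > 100:
        prefill_gpus += 1
    if metrics.avg_decode_utilization < 0.3 and decode_gpus > 1:
        decode_gpus -= 1
    return prefill_gpus, decode_gpus

print(plan(PoolMetrics(prefill_queue_depth=250, avg_decode_utilization=0.25), 2, 4))  # (3, 3)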
Dynamo Events
In the documentation, the team stated they want to support KV Cache Routing and load balancing for many backends that have different implementations of KV Cache and record different metrics. With that purpose in mind, the KVPublisher can be plugged into any framework to publish KV Events and a KvMetricsPublisher can publish Metric Events.
For example, following this design, vLLM could publish and update the PagedAttention table of logical-physical memory blocks it uses to store the KV Cache through the KVPublisher.
Similarly, the same could be done for SGLang’s RadixAttention approach of storing KVCache into a RadixTree, by controlling the KVCache layout through KVPublisher events.
On the receiving side, the KVIndexer accepts events from the KVPublisher and inserts them into a global prefix tree, while a KvMetricsAggregator aggregates the metric events per worker.

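To illustrate the event flow end to end, here is a simplified sketch; the class names mirror Dynamo's KVPublisher and KVIndexer, but the methods and the flat dictionary standing in for the global prefix tree are invented for this example.
# Simplified sketch of KV cache events flowing from workers to a global index
from collections import defaultdict
from typing import List, Set, Tuple

class KVIndexer:
    """Router side: keeps a global view of which worker caches which prefix."""
    def __init__(self) -> None:
        self.prefix_owners = defaultdict(set)  # flat stand-in for the global prefix tree
    def on_event(self, worker_id: str, kind: str, prefix: Tuple[int, ...]) -> None:
        if kind == "stored":
            self.prefix_owners[prefix].add(worker_id)
        else:
            self.prefix_owners[prefix].discard(worker_id)
    def best_match(self, tokens: List[int]) -> Tuple[Set[str], int]:
        # Longest cached prefix wins; the router combines this with load metrics
        for end in range(len(tokens), 0, -1):
            owners = self.prefix_owners.get(tuple(tokens[:end]))
            if owners:
                return owners, end
        return set(), 0

class KVPublisher:
    """Worker side: emits an event whenever a KV block is stored or evicted."""
    def __init__(self, worker_id: str, indexer: KVIndexer) -> None:
        self.worker_id = worker_id
        self.indexer = indexer
    def block_stored(self, token_prefix: List[int]) -> None:
        self.indexer.on_event(self.worker_id, "stored", tuple(token_prefix))
    def block_removed(self, token_prefix: List[int]) -> None:
        self.indexer.on_event(self.worker_id, "removed", tuple(token_prefix))

indexer = KVIndexer()
KVPublisher("worker-2", indexer).block_stored([1, 2, 3])
print(indexer.best_match([1, 2, 3, 4]))  # ({'worker-2'}, 3)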
Installing and serving LLMs with Dynamo
Done with the theory, let’s get practical and deploy a model using Dynamo Framework. Remember that Dynamo is an inference framework, and acts as the backbone for LLM serving engines, such as vLLM, SGLang, or TensorRT-LLM.
In this example, we’ll serve a HuggingFace model using the vLLM backend.
Installing Dynamo (Ubuntu 20.04 + x86_64 CPU)
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
python3 -m venv venv
source venv/bin/activate
pip install ai-dynamo[all]
Download a model from Hugging Face
curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true"
Serve the model locally
# As a text CLI interface
dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf
# As an HTTP interface
dynamo run in=http out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf
## Use the OpenAI API spec to send requests
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "max_completion_tokens": 7
  }'
If serving in a multi-node setup, use the following:
Make sure etcd and NATS are installed and running on all nodes
On node 1, run
dynamo run in=http out=dyn://llama3B_pool
On the rest of the nodes, run
dynamo run in=dyn://llama3B_pool out=vllm ~/llm_models/Llama-3.2-3B-Instruct
This video walkthrough from NVIDIA Developer’s YouTube Channel showcases how to run disaggregated serving using vLLM and Dynamo.
Conclusion
In this article, we've explained the core phases of the LLM inference process, prefill and decode, and unpacked the new NVIDIA Dynamo Framework, announced at GTC 2025 just two days before this article was posted.
After reading it, you should have a clear understanding of the limitations and challenges LLMs face when deployed at scale, and of how Dynamo tackles and solves these problems.
Thank you for reading, see you in the next one!
References:
ai-dynamo/dynamo: A Datacenter Scale Distributed Inference Serving Framework (GitHub) - https://github.com/ai-dynamo/dynamo
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (arXiv:2401.09670, 2024) - https://arxiv.org/abs/2401.09670
NATS Protocol (docs.nats.io) - https://docs.nats.io/reference/reference-protocols/nats-protocol
Dynamo Disaggregated Serving (GitHub: ai-dynamo/dynamo, disagg_serving.md) - https://github.com/ai-dynamo/dynamo/blob/main/docs/disagg_serving.md
Dynamo KV Cache Manager (GitHub: ai-dynamo/dynamo, kv_cache_manager.md) - https://github.com/ai-dynamo/dynamo/blob/main/docs/kv_cache_manager.md
NVIDIA NIXL (GitHub: ai-dynamo/nixl, nixl.md) - https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md
If not specified otherwise, all diagrams are made by the author.