Mastering KV-Cache-Aware Routing with llm-d
Introduction
Large-scale AI inference makes efficiency across distributed environments a necessity, not an option. As workloads grow, so does the need for smarter scheduling and memory reuse. Enter llm-d, a Kubernetes-native framework for scalable, intelligent LLM inference. One of its most powerful capabilities is KV-cache-aware routing, which reduces latency and improves throughput by directing requests to pods that already hold the relevant context in GPU memory.
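To make the core idea concrete before we dig into the details, here is a minimal sketch of cache-aware scoring. This is not llm-d's scheduler code: the `Pod`, `block_hashes`, and `pick_pod` names, the block size, and the prefix-hashing scheme are simplified assumptions, used only to illustrate how a router might prefer a pod that already caches a prompt's prefix and fall back to the least-loaded pod otherwise.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

@dataclass
class Pod:
    name: str
    active_requests: int
    cached_block_hashes: set = field(default_factory=set)

def block_hashes(prompt_tokens, block_size=BLOCK_SIZE):
    """Hash each full prompt block together with everything before it,
    so a hash only matches when the entire preceding context is identical."""
    hashes, prefix = [], ()
    for i in range(0, len(prompt_tokens) - block_size + 1, block_size):
        prefix = prefix + tuple(prompt_tokens[i:i + block_size])
        hashes.append(hash(prefix))
    return hashes

def pick_pod(pods, prompt_tokens):
    """Prefer the pod that already caches the longest prefix of the prompt;
    break ties by choosing the pod with the fewest active requests."""
    prompt_hashes = block_hashes(prompt_tokens)

    def cached_prefix_len(pod):
        n = 0
        for h in prompt_hashes:
            if h not in pod.cached_block_hashes:
                break
            n += 1
        return n

    return max(pods, key=lambda p: (cached_prefix_len(p), -p.active_requests))
```

In llm-d this scoring happens inside the inference gateway's endpoint picker rather than in application code, but the intuition is the same: requests sharing a prefix land on the pod that can reuse the existing KV cache instead of recomputing it.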
This blog post is written for llm-d v0.2.0. For detailed release information and installation instructions, see the v0.2.0 release notes.
In this blog post, we'll cover:
- What KV-cache-aware routing is and why it matters
- How llm-d implements this feature with EPPs, Redis, and NIXL
- The critical Kubernetes YAML assets that make it work
- A test case demonstrating an 87.4% KV-cache hit rate
- Where to go to learn more and get started