
3 posts tagged with "llm-d release news!"



Mastering KV-Cache-Aware Routing with llm-d

· 9 min read
Christopher Nuland
Principal Technical Marketing Manager for AI, Red Hat

Introduction

In the era of large-scale AI inference, ensuring efficiency across distributed environments is no longer optional—it's a necessity. As workloads grow, so does the need for smarter scheduling and memory reuse strategies. Enter llm-d, a Kubernetes-native framework for scalable, intelligent LLM inference. One of its most powerful capabilities is KV-cache-aware routing, which reduces latency and improves throughput by directing requests to pods that already hold relevant context in GPU memory.

Version Note

This blog post is written for llm-d v0.2.0. For detailed release information and installation instructions, see the v0.2.0 release notes.

In this blog post, we'll cover:

  • What KV-cache-aware routing is and why it matters (see the sketch after this list)
  • How llm-d implements this feature with EPPs, Redis, and NIXL
  • The critical Kubernetes YAML assets that make it work
  • A test case showing our latest 87.4% cache hit rate
  • Where to go to learn more and get started
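To ground the idea before diving into the full post, here is a minimal, illustrative Python sketch of cache-aware scoring: hash the prompt into fixed-size KV blocks, prefer the pod whose reported cache covers the longest prefix, and discount by load. This is not the actual EPP scorer; the block size, hash scheme, load weighting, and pod names are all assumptions made for illustration.

```python
# Illustrative sketch of KV-cache-aware routing (not the llm-d EPP code).
# Idea: prefer the pod whose GPU memory already holds the longest prefix
# of the incoming request's KV blocks, then discount by current load.
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative; real block sizes differ)

def block_hashes(token_ids):
    """Chain-hash the prompt in fixed-size blocks, the way prefix caching keys KV blocks."""
    hashes, prev = [], b""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE  # only full blocks are cacheable
    for i in range(0, usable, BLOCK_SIZE):
        prev = hashlib.sha256(prev + str(token_ids[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(prev)
    return hashes

def score_pod(request_blocks, pod_blocks, pod_load, load_weight=0.2):
    """Score = fraction of the request's leading blocks already cached, minus a load penalty."""
    hits = 0
    for h in request_blocks:
        if h not in pod_blocks:
            break
        hits += 1
    prefix_ratio = hits / max(len(request_blocks), 1)
    return prefix_ratio - load_weight * pod_load

def pick_pod(token_ids, pod_cache_index, pod_loads):
    """pod_cache_index: {pod_name: set of block hashes that pod reports as resident}."""
    request_blocks = block_hashes(token_ids)
    return max(pod_cache_index,
               key=lambda pod: score_pod(request_blocks, pod_cache_index[pod], pod_loads[pod]))

# Example: pod-a already holds the shared system-prompt blocks, so it wins
# despite being busier than pod-b.
prompt = list(range(64))                                 # 4 blocks of 16 tokens
pod_caches = {"pod-a": set(block_hashes(prompt[:48])),   # first 3 blocks cached
              "pod-b": set()}                            # cold cache
print(pick_pod(prompt, pod_caches, {"pod-a": 0.5, "pod-b": 0.1}))  # -> pod-a
```

The full post walks through how llm-d builds the real version of this index with EPPs, Redis, and NIXL, and wires it together with Kubernetes YAML assets.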

llm-d 0.2: Our first well-lit paths (mind the tree roots!)

· 10 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

Our 0.2 release delivers progress against our three well-lit paths to accelerate deploying large-scale inference on Kubernetes - better load balancing, lower latency with disaggregation, and native vLLM support for very large Mixture of Experts models like DeepSeek-R1.

We’ve also enhanced our deployment and benchmarking tooling, incorporating lessons from real-world infrastructure deployments and addressing key antipatterns. This release gives llm-d users, contributors, researchers, and operators clearer guides for efficient use in tested, reproducible scenarios.

Announcing the llm-d community!

· 11 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM


llm-d is a Kubernetes-native high-performance distributed LLM inference framework - a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

With llm-d, users can operationalize gen AI deployments with a modular, high-performance, end-to-end serving solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW).
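To make the disaggregated-serving idea mentioned above concrete, here is a deliberately simplified Python sketch of the prefill/decode split: one worker runs the prompt once and keeps the resulting KV cache, and a second worker reuses that cache to generate tokens without re-running the prompt. The classes, the `KVHandle` type, and the dummy token values are all illustrative assumptions; in llm-d the roles are played by vLLM workers with KV transfer handled by components like NIXL, not by code like this.

```python
# Toy illustration of disaggregated (prefill/decode) serving.
# Conceptual only: real deployments move KV blocks between GPU workers.
from dataclasses import dataclass

@dataclass
class KVHandle:
    """Stand-in for a reference to KV blocks resident on a prefill worker."""
    request_id: str
    num_prompt_tokens: int

class PrefillWorker:
    def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVHandle:
        # Run the full prompt once, keep the KV cache resident, hand back a handle.
        return KVHandle(request_id, len(prompt_tokens))

class DecodeWorker:
    def decode(self, handle: KVHandle, max_new_tokens: int) -> list[int]:
        # Reuse the prefilled KV cache and generate one token at a time,
        # without recomputing attention over the prompt.
        return [handle.num_prompt_tokens + i for i in range(max_new_tokens)]  # dummy tokens

prefiller, decoder = PrefillWorker(), DecodeWorker()
handle = prefiller.prefill("req-1", list(range(512)))
print(decoder.decode(handle, max_new_tokens=4))
```

Splitting the compute-heavy prefill phase from the latency-sensitive decode phase is what lets each pool be sized and scheduled independently, which is the motivation behind pairing disaggregation with the routing logic in Inference Gateway.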