
3 posts tagged with "llm-d release news!"



Mastering KV-Cache-Aware Routing with llm-d

· 9 min read
Christopher Nuland
Principal Technical Marketing Manager for AI, Red Hat

Introduction

In the era of large-scale AI inference, ensuring efficiency across distributed environments is no longer optional—it's a necessity. As workloads grow, so does the need for smarter scheduling and memory reuse strategies. Enter llm-d, a Kubernetes-native framework for scalable, intelligent LLM inference. One of its most powerful capabilities is KV-cache-aware routing, which reduces latency and improves throughput by directing requests to pods that already hold relevant context in GPU memory.

Version Note

This blog post is written for llm-d v0.2.0. For detailed release information and installation instructions, see the v0.2.0 release notes.

In this blog post, we'll cover:

  • What KV-cache-aware routing is and why it matters (see the sketch after this list)
  • How llm-d implements this feature with EPPs, Redis, and NIXL
  • The critical Kubernetes YAML assets that make it work
  • A test case showing our latest 87.4% cache hit rate
  • Where to go to learn more and get started
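To ground the idea before diving into the full post, here is a minimal, illustrative Python sketch of cache-aware scoring: hash the prompt into fixed-size KV blocks, prefer the pod whose reported cache covers the longest prefix, and discount by load. This is not the actual EPP scorer; the block size, hash scheme, load weighting, and pod names are all assumptions made for illustration.

```python
# Illustrative sketch of KV-cache-aware routing (not the llm-d EPP code).
# Idea: prefer the pod whose GPU memory already holds the longest prefix
# of the incoming request's KV blocks, then discount by current load.
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative; real block sizes differ)

def block_hashes(token_ids):
    """Chain-hash the prompt in fixed-size blocks, the way prefix caching keys KV blocks."""
    hashes, prev = [], b""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE  # only full blocks are cacheable
    for i in range(0, usable, BLOCK_SIZE):
        prev = hashlib.sha256(prev + str(token_ids[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(prev)
    return hashes

def score_pod(request_blocks, pod_blocks, pod_load, load_weight=0.2):
    """Score = fraction of the request's leading blocks already cached, minus a load penalty."""
    hits = 0
    for h in request_blocks:
        if h not in pod_blocks:
            break
        hits += 1
    prefix_ratio = hits / max(len(request_blocks), 1)
    return prefix_ratio - load_weight * pod_load

def pick_pod(token_ids, pod_cache_index, pod_loads):
    """pod_cache_index: {pod_name: set of block hashes that pod reports as resident}."""
    request_blocks = block_hashes(token_ids)
    return max(pod_cache_index,
               key=lambda pod: score_pod(request_blocks, pod_cache_index[pod], pod_loads[pod]))

# Example: pod-a already holds the shared system-prompt blocks, so it wins
# despite being busier than pod-b.
prompt = list(range(64))                                 # 4 blocks of 16 tokens
pod_caches = {"pod-a": set(block_hashes(prompt[:48])),   # first 3 blocks cached
              "pod-b": set()}                            # cold cache
print(pick_pod(prompt, pod_caches, {"pod-a": 0.5, "pod-b": 0.1}))  # -> pod-a
```

The full post walks through how llm-d builds the real version of this index with EPPs, Redis, and NIXL, and wires it together with Kubernetes YAML assets.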

llm-d 0.2: Our first well-lit paths (mind the tree roots!)

· 10 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

Our 0.2 release delivers progress against our three well-lit paths to accelerate deploying large-scale inference on Kubernetes - better load balancing, lower latency with disaggregation, and native vLLM support for very large Mixture of Experts models like DeepSeek-R1.

We’ve also enhanced our deployment and benchmarking tooling, incorporating lessons from real-world infrastructure deployments and addressing key antipatterns. This release gives llm-d users, contributors, researchers, and operators clearer guides for efficient use in tested, reproducible scenarios.

Announcing the llm-d community!

· 11 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM


llm-d is a Kubernetes-native high-performance distributed LLM inference framework - a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

With llm-d, users can operationalize gen AI deployments with a modular, high-performance, end-to-end serving solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW).
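To make the disaggregated-serving idea mentioned above concrete, here is a deliberately simplified Python sketch of the prefill/decode split: one worker runs the prompt once and keeps the resulting KV cache, and a second worker reuses that cache to generate tokens without re-running the prompt. The classes, the `KVHandle` type, and the dummy token values are all illustrative assumptions; in llm-d the roles are played by vLLM workers with KV transfer handled by components like NIXL, not by code like this.

```python
# Toy illustration of disaggregated (prefill/decode) serving.
# Conceptual only: real deployments move KV blocks between GPU workers.
from dataclasses import dataclass

@dataclass
class KVHandle:
    """Stand-in for a reference to KV blocks resident on a prefill worker."""
    request_id: str
    num_prompt_tokens: int

class PrefillWorker:
    def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVHandle:
        # Run the full prompt once, keep the KV cache resident, hand back a handle.
        return KVHandle(request_id, len(prompt_tokens))

class DecodeWorker:
    def decode(self, handle: KVHandle, max_new_tokens: int) -> list[int]:
        # Reuse the prefilled KV cache and generate one token at a time,
        # without recomputing attention over the prompt.
        return [handle.num_prompt_tokens + i for i in range(max_new_tokens)]  # dummy tokens

prefiller, decoder = PrefillWorker(), DecodeWorker()
handle = prefiller.prefill("req-1", list(range(512)))
print(decoder.decode(handle, max_new_tokens=4))
```

Splitting the compute-heavy prefill phase from the latency-sensitive decode phase is what lets each pool be sized and scheduled independently, which is the motivation behind pairing disaggregation with the routing logic in Inference Gateway.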