KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
Extensive Reading

Author Info
IPADS; Alibaba Group

Background
Large-scale services go further and maintain a KVCache cache (prefix/prompt cache) that reuses KV blocks across different requests sharing a common prefix. Most deployed KV eviction strategies reuse general-purpose cache policies: recency-based (LRU, FIFO) and frequency-based (LFU), sometimes combined (e.g., GDSF-style recency-frequency-size heuristics). These policies are workload-agnostic and overlook several KV-specific realities:
- KV blocks often have short, bursty lifespans, and past frequency is a poor predictor of future reuse.
- Different request categories (API vs. chat, first turn vs. later turns) have very different reuse patterns that generic policies cannot distinguish.
- Spatial locality is highly asymmetric: the early "head" blocks of a prompt are far more valuable than the late "tail" blocks, but standard policies treat all blocks alike.

Observations
Trace A: To-C workload, a consumer-facing trace including: ...
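To make the generic baseline concrete, here is a minimal sketch of a GDSF-style (Greedy-Dual-Size-Frequency) eviction policy of the kind the notes mention. This is an illustrative toy, not the paper's system: the class name, the unit cost, and the size accounting are all assumptions; real KV caches evict fixed-size blocks and track far more state.

```python
import heapq

class GDSFCache:
    """Toy GDSF-style cache: priority = clock + freq * cost / size.

    Evicts the lowest-priority entry and advances the clock to its
    priority (aging). Illustrative only; not the paper's policy.
    """

    def __init__(self, capacity):
        self.capacity = capacity   # total size budget
        self.used = 0
        self.clock = 0.0           # inflation term for aging
        self.entries = {}          # key -> (priority, freq, size)
        self.heap = []             # (priority, key); stale entries skipped lazily

    def _priority(self, freq, size, cost=1.0):
        # Higher frequency and smaller size -> higher priority to keep.
        return self.clock + freq * cost / size

    def access(self, key, size):
        """Record one access; return True on hit, False on miss."""
        if key in self.entries:
            _, freq, size = self.entries[key]
            pri = self._priority(freq + 1, size)
            self.entries[key] = (pri, freq + 1, size)
            heapq.heappush(self.heap, (pri, key))
            return True
        # Miss: evict lowest-priority entries until the new one fits.
        while self.used + size > self.capacity and self.entries:
            pri, victim = heapq.heappop(self.heap)
            cur = self.entries.get(victim)
            if cur is None or cur[0] != pri:
                continue           # stale heap record; skip it
            del self.entries[victim]
            self.used -= cur[2]
            self.clock = pri       # aging: future priorities start here
        pri = self._priority(1, size)
        self.entries[key] = (pri, 1, size)
        self.used += size
        heapq.heappush(self.heap, (pri, key))
        return False
```

Note how the policy only sees (recency via the clock, frequency, size); it has no notion of request category or of a block's position within a prompt, which is exactly the gap the observations below exploit.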