Efficient Memory Management for Large Language Model Serving with PagedAttention
Extensive Reading

Author Info
Woosuk Kwon
Zhuohan Li

Background
Existing systems suffer from internal and external memory fragmentation. Three primary sources of memory waste:
Reserved slots: space held in advance for tokens that a request has not generated yet.
Internal fragmentation: space within an allocated memory block that will never be used.
External fragmentation: unused space between memory blocks.

Existing systems cannot exploit the opportunities for memory sharing. Parallel sampling, beam search, and shared prefixes can all share parts of the KV cache to reduce the memory footprint.
...
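To make the two background points concrete, here is a minimal back-of-the-envelope sketch in Python. It is not code from the paper or from any serving system. The function names, the slot-level accounting (one slot per token, ignoring block granularity), and the assumption that every sample produces the same number of output tokens are all illustrative assumptions.

```python
def internal_fragmentation(max_len: int, prompt_len: int, output_len: int) -> int:
    """Slots that are never written when a request gets one contiguous
    reservation of max_len slots (hypothetical allocator, one slot per token)."""
    final_len = prompt_len + output_len
    if final_len > max_len:
        raise ValueError("sequence does not fit in the reservation")
    return max_len - final_len


def kv_slots_for_parallel_sampling(
    prompt_len: int, output_len: int, num_samples: int, share_prompt: bool
) -> int:
    """KV-cache slots needed to produce num_samples outputs from one prompt.

    Assumption: every sample generates exactly output_len tokens.
    With share_prompt=True the prompt's KV cache is stored once and reused;
    otherwise each sample keeps its own copy.
    """
    prompt_copies = 1 if share_prompt else num_samples
    return prompt_copies * prompt_len + num_samples * output_len


# A request reserved for 2048 tokens that ends after 100 + 50 tokens.
wasted = internal_fragmentation(max_len=2048, prompt_len=100, output_len=50)
print(wasted)  # 1898

# Four samples from a 1000-token prompt, 100 output tokens each.
without_sharing = kv_slots_for_parallel_sampling(1000, 100, 4, share_prompt=False)
with_sharing = kv_slots_for_parallel_sampling(1000, 100, 4, share_prompt=True)
print(without_sharing, with_sharing)  # 4400 1400
```

In this toy accounting, the over-sized reservation leaves most of its slots unused, and sharing the prompt's KV cache removes three of the four prompt copies. Real systems allocate in blocks rather than single slots, so actual numbers would differ.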