A Survey on Efficient Inference for Large Language Models
General Background

Resources LLMs typically demand:
- Higher computational cost
- Higher memory access cost
- Higher memory cost

Inference Process of LLMs: auto-regressive generation. In each generation step, the LLM takes the whole token sequence as input, including the input tokens and the previously generated tokens, and generates the next token. As the sequence length grows, the time cost of the generation process increases rapidly. The KV cache technique stores and reuses the previous key and value pairs within the Multi-Head Self-Attention block, so they do not have to be recomputed at every step (see the sketch below). ...
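The following is a minimal sketch, not the survey's code, of auto-regressive decoding with a KV cache for a single self-attention head. The toy dimensions, random projection matrices (W_q, W_k, W_v, W_out), the embedding table, and the greedy next-token rule are all illustrative assumptions; a real LLM stacks many such heads and layers.

```python
# Minimal sketch (assumptions throughout): one attention head, toy weights,
# greedy decoding. Illustrates how cached keys/values are appended and reused.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 32

# Toy parameters for one attention head plus an output head (assumed, random).
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_out = rng.standard_normal((d_model, vocab_size)) / np.sqrt(d_model)
embed = rng.standard_normal((vocab_size, d_model))

def decode_step(token_id, k_cache, v_cache):
    """One generation step: compute q/k/v only for the new token and attend
    over the cached keys/values instead of re-encoding the whole sequence."""
    x = embed[token_id]                      # (d_model,)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # KV cache: append this step's key/value so later steps can reuse them.
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)                    # (t, d_model)
    V = np.stack(v_cache)                    # (t, d_model)
    scores = K @ q / np.sqrt(d_model)        # attention over all past positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax
    context = weights @ V                    # (d_model,)
    logits = context @ W_out                 # (vocab_size,)
    return int(np.argmax(logits))            # greedy next-token choice

# Prefill the cache with the prompt, then generate tokens auto-regressively.
prompt = [3, 7, 1]
k_cache, v_cache = [], []
next_token = None
for tok in prompt:
    next_token = decode_step(tok, k_cache, v_cache)

generated = []
for _ in range(5):
    generated.append(next_token)
    next_token = decode_step(next_token, k_cache, v_cache)
print("generated token ids:", generated)
```

The point of the cache is visible in decode_step: each step computes attention inputs only for the newest token and reads earlier keys/values from k_cache and v_cache, trading extra memory for avoided recomputation over the growing sequence.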