A Survey of Resource-efficient LLM and Multimodal Foundation Models

Extensive Reading

Goal: The goal of this survey is to provide an overarching understanding of how current approaches tackle the resource challenges posed by large foundation models, and to inspire future breakthroughs in this field. ...

August 21, 2025 · Last updated on August 26, 2025 · 3 min · KKKZOZ

A Survey on Efficient Inference for Large Language Models

General Background

Resources LLMs typically demand:
- Higher Computational Cost
- Higher Memory Access Cost
- Higher Memory Cost

Inference Process of LLMs

Auto-regressive generation: in each generation step, the LLM takes the whole token sequence as input, including the input tokens and the previously generated tokens, and generates the next token. As the sequence length increases, the time cost of the generation process grows rapidly. The KV cache technique stores and reuses previous key and value pairs within the Multi-Head Self-Attention block. ...
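To make the KV cache idea concrete, here is a minimal single-head sketch in NumPy (the class name, random weights, and dimensions are illustrative assumptions, not taken from the surveyed paper): each decoding step projects only the newly generated token and appends its key and value to a cache, so attention can reuse all past projections instead of recomputing them from the full sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CachedSelfAttention:
    """Single-head self-attention with a KV cache (illustrative sketch).

    Per decoding step, only the new token is projected; its key and value
    are appended to the cache so past projections are never recomputed.
    """

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.d = d_model
        # Hypothetical random projection weights; a real model loads trained ones.
        self.Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.k_cache = np.empty((0, d_model))  # grows by one row per step
        self.v_cache = np.empty((0, d_model))

    def step(self, x):
        """x: (d_model,) embedding of the newly generated token."""
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        self.k_cache = np.vstack([self.k_cache, k])  # reuse all previous keys
        self.v_cache = np.vstack([self.v_cache, v])  # reuse all previous values
        scores = self.k_cache @ q / np.sqrt(self.d)  # (seq_len,)
        return softmax(scores) @ self.v_cache        # attention output, (d_model,)

# Usage: feed token embeddings one at a time; with the cache, each step
# costs one projection plus attention over seq_len cached entries, rather
# than re-projecting the entire sequence.
attn = CachedSelfAttention(d_model=8)
for token_embedding in np.random.default_rng(1).standard_normal((5, 8)):
    out = attn.step(token_embedding)
print(out.shape)  # (8,)
```

The trade-off this illustrates is the one the survey's framing implies: caching removes redundant computation at each step, but the cache itself grows linearly with sequence length, which is why KV-cache memory becomes a dominant inference cost.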

July 20, 2025 · Last updated on August 25, 2025 · 21 min · KKKZOZ