ArXiv-24

A Survey of Resource-efficient LLM and Multimodal Foundation Models

Extensive Reading
Goal: The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field. ...

LLM as a System Service on Mobile Devices

Intensive Reading
Author Info: Wangsong Yin - Google Scholar, Mengwei Xu
Background: The paper first proposes LLMaaS, LLM as a system service on mobile devices: The mobile OS exposes an LLM and its inference infrastructure as a system feature to mobile apps, akin to the location or notification services.
LLMaaS is motivated mainly by the following two reasons.
First, LLMaaS needs only one copy of LLM weights in memory. Different apps should call the same system-maintained model instead of each loading its own copy.
Second, a system-level LLM can be better customized for on-device accelerator and enjoy the performance gain over commodity hardware. Managing and running the model at the system level is closer to the hardware, so it can make better use of the underlying hardware resources.
The core problem this paper sets out to solve is: How to efficiently manage the LLM contexts ...

Striped Attention: Faster Ring Attention for Causal Transformers

Skimming
Author Info
Implementation and Benchmark: zhuzilin/ring-flash-attention: Ring attention implementation with flash attention. Corresponding virtualization is here
Background
Challenges
Insights: Ring attention suffers from workload imbalance. Due to the causal mask mechanism, some devices perform meaningless computations in some iterations while other devices stay busy all the time. Striped Attention proposes another way to distribute the workload across devices to eliminate the imbalance.
Approaches: With Striped Attention, each device holds non-contiguous tokens that are evenly distributed across the original sequence.
Example
Important: The most important point for understanding this example is that neither Ring Attention nor Striped Attention uses naive attention computation ...
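
The imbalance described under Insights can be made concrete with a small counting sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the ring direction (device d receives the key/value block first owned by device d - r in round r) is an assumption, and the count of unmasked (query, key) pairs is only a proxy for work, since real implementations run blockwise kernels rather than naive attention.

```python
def ring_partition(seq_len, n_dev):
    """Contiguous layout: device d holds tokens [d*b, (d+1)*b)."""
    b = seq_len // n_dev
    return [list(range(d * b, (d + 1) * b)) for d in range(n_dev)]


def striped_partition(seq_len, n_dev):
    """Striped layout: device d holds tokens d, d+n_dev, d+2*n_dev, ..."""
    return [list(range(d, seq_len, n_dev)) for d in range(n_dev)]


def work_per_round(parts):
    """Count unmasked (query, key) pairs per device for every ring round.

    Assumption: in round r, device d keeps its own queries and holds the
    key/value block that was originally owned by device (d - r) mod n_dev.
    Under a causal mask, query q may attend to key k only if k <= q.
    """
    n = len(parts)
    rounds = []
    for r in range(n):
        row = []
        for d in range(n):
            queries = parts[d]
            keys = parts[(d - r) % n]
            row.append(sum(1 for q in queries for k in keys if k <= q))
        rounds.append(row)
    return rounds


if __name__ == "__main__":
    for name, partition in (("ring", ring_partition), ("striped", striped_partition)):
        print(name)
        for r, row in enumerate(work_per_round(partition(16, 4))):
            print("  round", r, row)
```

With 16 tokens on 4 devices, round 1 gives per-device counts of [0, 16, 16, 16] for the contiguous layout and [6, 10, 10, 10] for the striped layout. The total number of unmasked pairs is the same in both cases (136). Only its distribution across devices changes.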

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

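Intensive Reading
Author Info: Zhenliang Xue: From IPADS. Yixin Song: First author of PowerInfer. Zeyu Mi (糜泽羽): Associate professor at the School of Software, Shanghai Jiao Tong University (SJTU). Haibo Chen [IPADS]: Director of the Institute of Parallel and Distributed Systems.
Background
Sparsity: FFN layers hold a large share of the parameters and are markedly sparse (especially when ReLU is used), so a predictor can be run before the computation to predict which neurons will be activated, reducing both compute and I/O overhead. PowerInfer-2 also explores dynamic sparsity during LLM inference: when the batch is large, any neuron that is activated by at least one input in the batch is not sparse in that step. Because different inputs activate different neurons, the aggregate effect is that many neurons are activated, forming stable, dense "hot spots", and the overall sparsity drops significantly. Since some sequences terminate earlier than others, the effective batch size also fluctuates dynamically. Because of this real-time change, the model's computation pattern moves smoothly from a nearly dense mode to a highly sparse mode over the lifetime of a task.
Mobile Hardware Characteristics: Compared with PCs, smartphone hardware has two distinguishing characteristics.
Heterogeneous computing capabilities with distinct sparse computation characteristics. The CPU is better at sparse computation. The NPU is better at dense computation. The GPU is slower than both the CPU and the NPU, and using the GPU for inference hurts the device's rendering frame rate. A mobile LLM inference framework should use the heterogeneous processors together to make the most of the shared memory bandwidth.
Distinct storage architecture with unique I/O characteristics. The larger the read block size, the higher the throughput. The smaller the data range, the higher the throughput. Reads issued from higher-frequency CPU cores achieve higher throughput. UFS has limited concurrency ...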
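
The batch-size effect on sparsity described above can be sketched with a toy simulation. This is not PowerInfer-2's predictor or code: the function names are hypothetical, the per-neuron activation probability of 0.1 is an arbitrary illustrative value, and activations are drawn independently per sequence, which ignores the correlated "hot spots" the note mentions.

```python
import random


def active_neurons(batch_masks):
    """A neuron must be computed if at least one sequence in the batch activates it."""
    n = len(batch_masks[0])
    return [any(mask[i] for mask in batch_masks) for i in range(n)]


def batch_sparsity(num_neurons, batch_size, p_active, rng):
    """Fraction of neurons that no sequence in the batch activates.

    Assumption: every sequence activates every neuron independently
    with probability p_active (a simplification for illustration).
    """
    masks = [
        [rng.random() < p_active for _ in range(num_neurons)]
        for _ in range(batch_size)
    ]
    union = active_neurons(masks)
    return 1.0 - sum(union) / num_neurons


if __name__ == "__main__":
    rng = random.Random(0)
    # The effective batch shrinks as sequences finish, so sparsity rises.
    for batch_size in (32, 16, 8, 4, 1):
        print(batch_size, round(batch_sparsity(4096, batch_size, 0.1, rng), 3))
```

Under the independence assumption the expected sparsity is (1 - p_active) ** batch_size, so it rises as sequences finish and the effective batch shrinks, which matches the dense-to-sparse transition described in the note.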

A Survey on Efficient Inference for Large Language Models

General Background
Resources: LLMs typically demand higher computational cost, higher memory access cost, and higher memory cost.
Inference Process of LLMs: Generation is auto-regressive. In each generation step, the LLM takes as input the whole token sequence, including the input tokens and previously generated tokens, and generates the next token. As the sequence length increases, the time cost of the generation process grows rapidly. The KV cache technique stores and reuses previous key and value pairs within the Multi-Head Self-Attention block. ...
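
The effect of the KV cache on auto-regressive decoding can be sketched as follows. This is a minimal single-head illustration, not any library's API: `project` is a hypothetical stand-in for the learned Q/K/V projections (plain scalar scaling), and the weights `W_Q`, `W_K`, `W_V` are made-up values.

```python
import math

W_Q, W_K, W_V = 0.5, 2.0, -1.0  # hypothetical projection "weights"


def project(x, scale):
    """Stand-in for a learned linear projection (assumption: scalar scaling)."""
    return [scale * xi for xi in x]


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def decode_without_cache(tokens):
    """Recompute K and V for the whole prefix at every generation step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, W_K) for x in prefix]
        values = [project(x, W_V) for x in prefix]
        projections += 2 * t
        outputs.append(attend(project(prefix[-1], W_Q), keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Project only the newest token and reuse cached K and V for the rest."""
    keys, values, outputs, projections = [], [], [], 0
    for x in tokens:
        keys.append(project(x, W_K))
        values.append(project(x, W_V))
        projections += 2
        outputs.append(attend(project(x, W_Q), keys, values))
    return outputs, projections


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    print(decode_without_cache(toks)[1], decode_with_cache(toks)[1])
```

Both functions produce the same attention outputs. The difference is the number of key/value projections, which grows quadratically with sequence length without the cache and linearly with it. The cache trades this compute for memory, which ties in with the memory cost listed under Resources.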