HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators
Extensive Reading

Author Info
- Le Chen - Google Scholar
- Haibo Chen [IPADS]: Director of the Institute of Parallel and Distributed Systems

Background
Existing LLM inference engines typically use only one of the on-chip accelerators (e.g., only the GPU or only the NPU), which leads to two main problems:
- Wasted resources: the compute power of all on-chip units is not fully utilized.
- Performance bottleneck: each single accelerator has inherent weaknesses and cannot deliver optimal performance in every scenario.

Challenges

Insights
Design an LLM inference engine that can use the GPU and NPU simultaneously and efficiently for cooperative computation, maximizing LLM inference speed on mobile devices.

The NPU serves as the primary computing unit, handling the majority of computing tasks, while the GPU acts as a secondary computing unit to raise the lower bound of NPU performance.

GPU Characteristics
- Linear Performance: The GPU's performance scales linearly as the tensor size increases. It transitions from being memory-bound on small tensors to compute-bound on large ones, where its performance plateaus.
- High-Cost Synchronization: There are two main types of synchronization overhead (see the sketch after this list).
  - Data Copy: API calls that transfer data between CPU and GPU buffers, such as clEnqueueWriteBuffer, incur a fixed latency of about 400 microseconds, irrespective of the data size.
  - Kernel Submission: Submitting a kernel to an active, non-empty GPU queue has negligible overhead (10-20 microseconds). However, after a synchronization event empties the queue, submitting the next kernel incurs a much higher "startup" latency of 50-100 microseconds.
...
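To make the two synchronization costs concrete, here is a minimal OpenCL host-code sketch (not the paper's benchmark) that times a blocking clEnqueueWriteBuffer, a kernel submission into a just-drained queue, and a submission into a still-busy queue. The buffer size, the trivial `noop` kernel, and the use of clCreateCommandQueueWithProperties (OpenCL 2.0+) are illustrative assumptions.

```cpp
// Sketch: measure (1) fixed CPU->GPU copy latency, (2) kernel-submit latency
// right after a sync drained the queue, (3) submit latency into a busy queue.
#include <CL/cl.h>
#include <chrono>
#include <cstdio>
#include <vector>

static const char* kSrc =
    "__kernel void noop(__global float* p) { p[get_global_id(0)] += 1.0f; }";

static double now_us() {
    using namespace std::chrono;
    return duration<double, std::micro>(steady_clock::now().time_since_epoch()).count();
}

int main() {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, nullptr, nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "noop", nullptr);

    const size_t n = 1 << 20;  // ~4 MiB of floats; size is an assumption
    std::vector<float> host(n, 0.0f);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), nullptr, nullptr);
    clSetKernelArg(k, 0, sizeof(buf), &buf);

    // (1) Data copy: blocking write from a CPU buffer to a GPU buffer.
    double t0 = now_us();
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), host.data(), 0, nullptr, nullptr);
    printf("blocking write:                 %.1f us\n", now_us() - t0);

    // (2) Kernel submission into an empty queue, immediately after clFinish.
    clFinish(q);
    t0 = now_us();
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    printf("submit after sync (empty queue): %.1f us\n", now_us() - t0);

    // (3) Kernel submission while the previous kernel may still be in flight.
    t0 = now_us();
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    printf("submit into busy queue:          %.1f us\n", now_us() - t0);

    clFinish(q);
    clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```

On the paper's target SoC, case (1) should sit near the reported ~400 us floor regardless of size, and case (2) should be noticeably slower than case (3), which motivates keeping the GPU queue non-empty and minimizing CPU-GPU synchronization points.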