ArXiv-25

Estimating LLM Uncertainty with Evidence

Extensive Reading

Author Info

Background

Large Language Models (LLMs) hallucinate: they generate unreliable responses when they lack the relevant knowledge. Existing methods that estimate uncertainty to detect hallucinations are flawed:

Failure of Probability-Based Methods: Traditional methods rely on softmax probabilities, but softmax normalization discards “evidence strength” information. A high probability does not always mean the model is knowledgeable; it may only mean that one token is slightly better than the others in a low-knowledge scenario. Conversely, a low probability does not necessarily mean ignorance; the model may know multiple valid answers (e.g., synonyms).

Limitations of Sampling-Based Methods: Methods like Semantic Entropy require multiple sampling iterations, which is computationally expensive, and they fail to capture the model’s inherent epistemic uncertainty (e.g., when the model consistently produces the same incorrect answer due to a lack of training data).

Insights

Probability-based methods fail to identify reliability because probability is normalized. ...
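The point about normalization can be made concrete with a minimal sketch. It relies only on the fact that softmax is invariant to adding a constant to every logit; treating the logit magnitude as a stand-in for "evidence strength" is an assumption made for illustration here, not the paper's definition, and the logit values are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical next-token logit vectors: the gaps between tokens are the
# same, but the overall magnitudes differ by a constant offset of 9.
strong_evidence = [12.0, 10.0, 9.0]
weak_evidence = [3.0, 1.0, 0.0]

p_strong = softmax(strong_evidence)
p_weak = softmax(weak_evidence)

# After normalization the two cases are indistinguishable.
same = all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(p_strong, p_weak))
print(same)
```

Any uncertainty score computed from `p_strong` or `p_weak` alone would be identical for both cases, which is the information loss the excerpt refers to.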

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Extensive Reading

Author Info

MIT HAN Lab

Background

Long-context LLM serving is bottlenecked by attention and KV caches. Prefilling has quadratic attention cost in sequence length, while decoding is memory-bound due to ever-growing KV caches; this makes 128k–512k contexts and long reasoning traces (e.g., 20k-token CoT) slow and expensive in practice.

Existing KV cache optimizations are incomplete. Quantization and compression methods (e.g., KV quantization, paged KV cache) reduce memory and bandwidth but do not change the asymptotic attention complexity, so latency still grows linearly (decoding) or quadratically (prefilling) with context length. ...
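The scaling argument can be checked with back-of-the-envelope arithmetic. The sketch below is not from the paper: the function names and the model configuration (32 layers, 32 query heads, 8 KV heads, head dimension 128, 2-byte elements) are hypothetical, and the FLOP count covers only dense QK^T and PV products without causal-masking savings.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

def prefill_attention_flops(seq_len, num_layers, num_heads, head_dim):
    """Dense attention FLOPs: QK^T and PV each cost about 2 * n^2 * d per head."""
    return num_layers * num_heads * 2 * (2 * seq_len * seq_len * head_dim)

# Hypothetical configuration, chosen only to make the numbers easy to read.
LAYERS, HEADS, KV_HEADS, HEAD_DIM = 32, 32, 8, 128

n = 128 * 1024
fp16_cache = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM)
int8_cache = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)
print(fp16_cache / 2**30, int8_cache / 2**30)  # cache size in GiB

# Quantization halves the constant, but growth with context length is unchanged.
print(kv_cache_bytes(2 * n, LAYERS, KV_HEADS, HEAD_DIM) / fp16_cache)
print(prefill_attention_flops(2 * n, LAYERS, HEADS, HEAD_DIM)
      / prefill_attention_flops(n, LAYERS, HEADS, HEAD_DIM))
```

Under these assumptions, doubling the context doubles the cache that decoding must read and quadruples the prefill attention work, regardless of the element width.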

Dynamic Sparse Attention on Mobile SoCs

Extensive Reading

Author Info

Wangsong Yin - Google Scholar
Daliang Xu (徐大亮) - Daliang Xu’s Website
Mengwei Xu

Background

State-of-the-art on-device inference frameworks fall back to the CPU/GPU for the attention operation, which is necessary for accuracy but causes resource contention and degrades the user experience.

Running the full attention operation directly on the NPU is not a viable alternative: attention is highly sensitive to quantization, so the NPU’s low-precision integer compute causes significant accuracy degradation (an average drop of 18 percentage points).

Applying traditional sparse attention on the CPU/GPU to lessen the workload yields minimal performance gain, because the estimation stage required to find important tokens becomes the new computational bottleneck.

Insights

Compute sparse attention accurately and efficiently in NPU-centric LLM inference ...
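To see why the estimation stage matters, here is a minimal sketch of generic top-k sparse attention for a single query. It is not the paper's method: the function names are hypothetical, and the estimation stage here simply computes exact dot products against every key, which is the naive variant whose cost the excerpt describes as the bottleneck.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(q, keys, values, indices):
    """Softmax attention for one query, restricted to the given key indices."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, keys[i]) * scale for i in indices]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][j] for w, i in zip(weights, indices)) / total
            for j in range(dim)]

def topk_sparse_attention(q, keys, values, k):
    # Estimation stage: score every key to find the important tokens.
    # This pass touches all n keys, so it does not shrink with k.
    estimates = [dot(q, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: estimates[i], reverse=True)[:k]
    # Sparse stage: exact attention over only the k selected tokens.
    return attention(q, keys, values, sorted(top))

q = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 1.0], [-3.0, 0.0], [3.5, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]

sparse = topk_sparse_attention(q, keys, values, k=2)
full = topk_sparse_attention(q, keys, values, k=len(keys))
print(sparse, full)
```

Only the second stage gets cheaper as `k` shrinks; the scoring pass still reads every key, which is why a cheaper estimation path is needed for sparse attention to pay off.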

Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

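Intensive Reading

Author Info

Dr. Ting Cao: Professor at the Institute of AI Industry Research (AIR), Tsinghua University.

Background

Challenges

Challenge 1: How to accurately identify which “active weights” the next computation step actually needs. Misidentification lowers the model’s accuracy.

Challenge 2: How to predict the required active weights early enough that the slow flash loading can run in parallel with the current computation, hiding the latency. Some existing methods rely on the ReLU activation function to predict sparsity, but this does not apply to modern LLMs such as Llama, which avoid ReLU in pursuit of higher accuracy.

Insights

The work exploits Top-K sparsity to predict and prefetch active weights in non-ReLU models. The paper makes two core observations:

Similarities in Cross-Layer Activations: The input activations of the attention and MLP blocks in LLMs exhibit high cross-layer similarity due to residuals to the input activations. Because the activations are so similar, using the top-K activation channels of the current layer to predict the top-K activation channels of the next layer is also highly accurate.

Contextual Hot Active Weights During Decoding: Contextual active weights exhibit high temporal locality across inference iterations during decoding. Within a specific conversation or task (the context level), the reuse rate of “hot weights” is far higher than the average reuse rate across all general tasks (the task level), so a cache designed around contextual activation frequency is more effective (it achieves a higher hit rate).

Approaches

Cross-Layer Active Weight Preloading: While computing layer N, ActiveFlow uses the activations of layer N to predict and preload the active weights needed by layers N+1 through N+k (a “layer group”) ...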
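The cross-layer observation can be sketched in a few lines. This is an illustration under stated assumptions, not ActiveFlow's implementation: the helper names are hypothetical, channels are ranked by absolute activation value, and the numbers are made up so that the next layer's input equals the current input plus a small residual update.

```python
def top_k_channels(activations, k):
    """Indices of the k channels with the largest magnitude."""
    order = sorted(range(len(activations)),
                   key=lambda i: abs(activations[i]), reverse=True)
    return set(order[:k])

def prediction_recall(current_layer_input, next_layer_input, k):
    """Fraction of the next layer's top-k channels predicted from the current layer."""
    predicted = top_k_channels(current_layer_input, k)
    actual = top_k_channels(next_layer_input, k)
    return len(predicted & actual) / k

# Made-up activations: the residual connection adds a small block output to
# the layer input, so the dominant channels barely move.
layer_n = [5.0, -4.0, 0.1, 0.2, 3.0, -0.3]
block_update = [0.1, 0.1, 0.2, -0.1, -0.2, 0.05]
layer_n_plus_1 = [x + u for x, u in zip(layer_n, block_update)]

recall = prediction_recall(layer_n, layer_n_plus_1, k=3)
print(recall)
```

A preloader built on this idea would fetch the weight rows or columns for the predicted channels while layer N is still computing; how well that works on a real model depends on measured recall, which this toy data does not establish.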

SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

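This paper answers the hot take I offered at the end of the EdgeMoE post: any work that requires offline profiling should be done by the LLM vendor at training time. We should not ask “How do we squeeze a cloud model onto a local device?” but rather “What would a large language model look like if it were designed for the constraints of local devices from the very beginning?”

The SmallThinker model family is a new architecture designed from scratch, natively for local deployment. It turns the three major constraints of local devices (weak compute, limited memory, and slow storage) into core principles of the architecture design.

Insights

Sparsity is all you need

Predict-and-prefetch is all you need

Model Architecture

Fine-Grained Mixture of Experts

SmallThinker introduces sparsity to greatly reduce the compute load. The architecture consists of four key components:

Base MoE architecture: The model adopts an MoE architecture. The 4B model is configured with 32 experts and the 21B model with 64. This design keeps a large total parameter count (and therefore the model’s knowledge capacity) while significantly reducing the actual computation needed per inference.

Sparse ReGLU-based FFN: To add a second level of sparsity on top of the structural sparsity of MoE, every expert uses the ReGLU activation function internally. ReLU-family activations naturally drive the outputs of many neurons to zero, which induces “neuron-level sparsity” inside each expert. Even when an expert is selected by the router, a large share of its neurons still take no part in the computation. This lays the groundwork for later compute optimizations and further reduces compute and I/O overhead.

Pre-attention router: One of the most innovative design choices. Conventionally, in most MoE models the router module sits after the attention module. SmallThinker places the router before the attention module, which lets the model predict which experts it will need next while the compute-intensive attention operation is still running. The inference engine can use this time window to prefetch the required experts’ parameters from slow SSD into memory in parallel. By the time the attention computation finishes, the expert parameters are ready, which hides the I/O latency.

DP-Groups global load-balancing loss: This addresses an inherent tension in MoE training: experts should “specialize,” yet training must not become imbalanced. SmallThinker adopts a looser strategy that balances load within data-parallel groups (DP-Groups). Different groups can “cultivate” their own specialized experts from their own data, which achieves functional specialization while keeping training stable, with almost no additional training overhead ...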
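The pre-attention routing idea (pick experts first, then overlap their loading with attention) can be sketched with a thread pool. This is a toy model under stated assumptions, not SmallThinker's inference engine: `select_experts`, `load_expert_from_ssd`, `attention`, and `moe_layer_step` are hypothetical names, the router scores are given rather than computed, and `time.sleep` stands in for both the SSD read and the attention compute.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def select_experts(router_scores, top_k):
    """Pick the top_k experts from scores computed before attention runs."""
    order = sorted(range(len(router_scores)),
                   key=lambda e: router_scores[e], reverse=True)
    return sorted(order[:top_k])

def load_expert_from_ssd(expert_id):
    time.sleep(0.05)  # stand-in for slow storage I/O
    return f"weights-of-expert-{expert_id}"

def attention(hidden):
    time.sleep(0.05)  # stand-in for compute-heavy attention
    return [h * 2.0 for h in hidden]

def moe_layer_step(hidden, router_scores, top_k, pool):
    # 1. Route first, so the needed experts are known before attention starts.
    experts = select_experts(router_scores, top_k)
    # 2. Start prefetching expert weights in the background.
    futures = {e: pool.submit(load_expert_from_ssd, e) for e in experts}
    # 3. Run attention while the loads are in flight.
    attn_out = attention(hidden)
    # 4. Collect the weights; ideally they are already resident by now.
    weights = {e: f.result() for e, f in futures.items()}
    return attn_out, weights

with ThreadPoolExecutor(max_workers=4) as pool:
    start = time.perf_counter()
    attn_out, weights = moe_layer_step([1.0, 2.0], [0.1, 0.9, 0.3, 0.7],
                                       top_k=2, pool=pool)
    elapsed = time.perf_counter() - start

print(sorted(weights), attn_out, f"{elapsed:.3f}s")
```

Because the loads run concurrently with attention, the wall-clock time of the step should sit closer to the longer of the two phases than to their sum; with a post-attention router the loads could only start after step 3.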

HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators

Extensive Reading

Author Info

Le Chen - Google Scholar
Haibo Chen [IPADS]: Director of Institute of Parallel and Distributed Systems

Background

Existing LLM inference engines typically use only one of the available accelerators (e.g., only the GPU or only the NPU), which causes two main problems:

Wasted resources: The compute power of all the processing units on the chip cannot be fully utilized.

Performance bottleneck: A single accelerator has inherent performance weaknesses and cannot achieve optimal performance in every scenario.

Challenges

Insights

Design an LLM inference engine that uses the GPU and NPU simultaneously and efficiently for cooperative computation, in order to maximize LLM inference speed on mobile devices.

The NPU serves as the primary computing unit, handling the majority of computing tasks, while the GPU acts as a secondary computing unit to enhance the lower bound of NPU performance.

GPU Characteristics

Linear Performance: The GPU’s performance scales linearly as the tensor size increases. It transitions from being memory-bound on small tensors to compute-bound on large ones, where its performance plateaus.

High-Cost Synchronization: There are two main types of synchronization overhead.

Data Copy: API calls that transfer data between CPU and GPU buffers, like clEnqueueWriteBuffer, incur a fixed latency of about 400 microseconds, irrespective of the data’s size.

Kernel Submission: Submitting a kernel to an active, non-empty GPU queue has a negligible overhead (10-20 microseconds). However, after a synchronization event empties the queue, submitting the next kernel incurs a much higher “startup” latency of 50-100 microseconds. ...
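The synchronization numbers above can be turned into a rough overhead estimate. The sketch below is a simplified cost model and not part of HeteroLLM: the function name is hypothetical, and it assumes that every queue-emptying synchronization makes exactly one following kernel submission "cold" while all other submissions are "warm".

```python
COPY_LATENCY_US = 400        # fixed cost per CPU<->GPU buffer copy
WARM_SUBMIT_US = (10, 20)    # submission to a non-empty queue (low, high)
COLD_SUBMIT_US = (50, 100)   # first submission after the queue was drained

def sync_overhead_us(num_copies, num_kernels, num_syncs):
    """Return a (low, high) estimate of synchronization overhead in microseconds."""
    cold = min(num_syncs, num_kernels)  # assumption: one cold submit per sync
    warm = num_kernels - cold
    copy_cost = num_copies * COPY_LATENCY_US
    low = copy_cost + warm * WARM_SUBMIT_US[0] + cold * COLD_SUBMIT_US[0]
    high = copy_cost + warm * WARM_SUBMIT_US[1] + cold * COLD_SUBMIT_US[1]
    return low, high

# Hypothetical workload: 2 buffer copies, 10 kernels, 3 synchronization points.
with_copies = sync_overhead_us(num_copies=2, num_kernels=10, num_syncs=3)
without_copies = sync_overhead_us(num_copies=0, num_kernels=10, num_syncs=3)
print(with_copies, without_copies)
```

In this model the two fixed-cost copies account for most of the overhead, which suggests why avoiding explicit copies and keeping the queue non-empty are natural targets when the GPU and NPU exchange data frequently.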