DDIA Chapter 3: Storage and Retrieval

This chapter digs into the lower-level internals of databases. In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood. Data Structures That Power Your Database: any kind of index usually slows down writes, because the index also needs to be updated every time data is written. This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes. ...
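The read/write trade-off shows up even in the simplest index. A minimal sketch (illustrative, not DDIA's code): an in-memory hash index over an append-only log, where every write must do extra work to keep the index current.

```python
# Minimal append-only log with an in-memory hash index (illustrative sketch).
# Reads become O(1) lookups, but every write must also update the index.

class LogWithHashIndex:
    def __init__(self):
        self.log = []      # append-only list of (key, value) records
        self.index = {}    # key -> offset of the latest record for that key

    def put(self, key, value):
        self.log.append((key, value))        # write the record
        self.index[key] = len(self.log) - 1  # extra work: maintain the index

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        return self.log[offset][1]           # jump straight to the record

db = LogWithHashIndex()
db.put("a", 1)
db.put("a", 2)        # newer record shadows the old one
print(db.get("a"))    # -> 2
```

Without the index, `get` would have to scan the whole log; with it, every `put` pays a small index-maintenance cost.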

October 20, 2023 · Last updated on August 1, 2025 · 9 min · KKKZOZ

SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

This picks up the provocative claim I made at the end of the EdgeMoE post: all work that requires offline profiling should be done by the LLM vendor at training time. Instead of asking "how do we squeeze a cloud model onto a local device?", we should ask "what would a large language model look like if it were designed from the start around the constraints of local devices?". The SmallThinker model family is a new architecture designed from scratch, natively for local deployment. It turns the three main constraints of local devices, weak compute, limited memory, and slow storage, into core principles of the architecture design. Insights: sparsity is all you need; predict-and-prefetch is all you need. Model Architecture: SmallThinker introduces sparsity via a fine-grained Mixture of Experts to drastically cut the compute load; the architecture consists of four key components:
- Base MoE architecture: the 4B model is configured with 32 experts and the 21B model with 64. This keeps a large total parameter count (and thus knowledge capacity) while sharply reducing the compute actually required per inference step.
- Sparse ReGLU-based FFN: to add a second level of sparsity on top of MoE's structural sparsity, each expert uses the ReGLU activation internally. ReLU-family activations naturally drive many neuron outputs to zero, inducing neuron-level sparsity inside each expert: even when an expert is routed to, many of its neurons take no part in the computation. This lays the groundwork for further compute optimizations and reduces both compute and I/O overhead.
- Pre-attention routing, one of the most innovative designs: in most MoE models the router sits after the attention module, but SmallThinker places it before attention. The model can therefore predict which experts will be needed while the compute-heavy attention operation is still running; the inference engine uses this window to prefetch the required expert parameters from slow SSD into memory in parallel, so the expert weights are ready by the time attention finishes, hiding the I/O latency.
- DP-Groups global load-balancing loss: this resolves an inherent tension in MoE training between making experts specialize and keeping training balanced. SmallThinker adopts a looser strategy that balances load within data-parallel groups (DP-Groups), letting different groups grow their own specialized experts from their own data, achieving functional specialization while keeping training stable, with almost no extra training overhead. ...
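The pre-attention routing idea can be sketched as follows (hypothetical names, not the actual SmallThinker engine): because routing happens before attention, the expert fetch from slow storage can overlap with the attention computation.

```python
# Illustrative sketch of pre-attention routing: the router runs *before*
# attention, so expert weights can be prefetched from slow storage while
# the compute-heavy attention op is still running.

from concurrent.futures import ThreadPoolExecutor

def router(hidden):            # toy top-1 routing over 4 experts
    return [hash(hidden) % 4]

def load_expert_from_ssd(eid): # stands in for a slow SSD read
    return f"weights-of-expert-{eid}"

def attention(hidden):         # stands in for the attention computation
    return hidden + "-attended"

def moe_layer(hidden, pool):
    expert_ids = router(hidden)                          # 1. route early
    futures = {e: pool.submit(load_expert_from_ssd, e)   # 2. prefetch in background
               for e in expert_ids}
    h = attention(hidden)                                # 3. attention overlaps the I/O
    weights = {e: f.result() for e, f in futures.items() }  # 4. usually ready by now
    return h, weights

with ThreadPoolExecutor(max_workers=2) as pool:
    out, w = moe_layer("token", pool)
```

In a post-attention-routing design, step 2 could only start after step 3, putting the SSD read on the critical path.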

August 25, 2025 · Last updated on August 26, 2025 · 2 min · KKKZOZ

STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining

Intensive Reading. Author Info: Homepage - Liwei Guo / Assistant Professor: a tenure-track Assistant Professor at UESTC. Background: the cold start of NLP models on mobile devices. NLP inference stresses mobile devices on two fronts: latency (impromptu user engagements) and model size. Existing paradigms fall short. Holding the model in memory leaves too large a footprint, making it a likely victim of mobile memory management. Loading before executing starts slowly: computation stalls while waiting for I/O. Pipelining load and execution suffers from the low arithmetic intensity of Transformer attention modules, so the pipeline fills with bubbles and computation stalls most of the time at each model layer. Insights: a model can be re-engineered from a monolithic block into a collection of resource-elastic "shards" by uniquely combining vertical partitioning with fine-grained, per-shard quantization. This transforms the I/O time of each model component into a tunable parameter. ...
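The "I/O time as a tunable parameter" insight can be illustrated with a back-of-the-envelope sketch (the sizes, bandwidth, and bitwidth candidates below are hypothetical, not STI's actual planner): pick each shard's quantization bitwidth so its load time fits inside the compute window of the pipeline stage it must hide behind.

```python
# Illustrative sketch: per-shard bitwidth selection so that shard I/O fits
# the available compute window, keeping the load/compute pipeline bubble-free.
# All numbers are made up for illustration.

def load_time_ms(size_mb_fp32, bits, bandwidth_mbps=1000):
    # quantizing from 32 bits to `bits` shrinks the bytes read proportionally
    return (size_mb_fp32 * bits / 32) / bandwidth_mbps * 1000

def pick_bitwidth(size_mb_fp32, compute_budget_ms, candidates=(32, 16, 8, 4)):
    # choose the highest-fidelity bitwidth whose I/O still fits the budget
    for bits in candidates:
        if load_time_ms(size_mb_fp32, bits) <= compute_budget_ms:
            return bits
    return candidates[-1]

# a 40 MB (fp32) shard must finish loading within an 8 ms compute window
print(pick_bitwidth(40, 8))   # -> 4
```

A larger compute window (or a smaller shard) lets the planner keep more bits, trading I/O headroom for fidelity.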

August 25, 2025 · Last updated on August 26, 2025 · 2 min · KKKZOZ

EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices

Extensive Reading. Author Info: Rongjie Yi - Google Scholar; Homepage - Liwei Guo / Assistant Professor: a tenure-track Assistant Professor at UESTC; Mengwei Xu. Background and Challenges: End-to-end latency is I/O-dominated because expert weights are loaded on demand from slow storage (tail delay inflation). Quantization trilemma: compress aggressively, preserve accuracy, and keep dequantization nearly free on low-power CPUs/NPUs. Dynamic routing obscures which experts will be needed, making prefetch hard and naive caching ineffective when activations are balanced. Tiny RAM budgets (~1.5-3 GB) constrain the expert buffer, demanding careful eviction to avoid thrashing. Hardware heterogeneity and variable storage speeds complicate a one-size-fits-all pipeline and bitwidth plan. Insights: non-expert weights are held in device memory, while expert weights are held on external storage and fetched to memory only when activated. ...
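The memory split described above can be sketched with a small expert buffer (a minimal sketch with an LRU policy for concreteness, not EdgeMoE's actual eviction algorithm): hot experts stay resident, misses pay a slow storage read, and a fixed capacity forces eviction.

```python
# Illustrative sketch: non-expert weights stay resident in RAM, while expert
# weights are fetched on demand into a small fixed-size buffer. LRU eviction
# stands in here for whatever policy the real system uses.

from collections import OrderedDict

class ExpertBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = OrderedDict()    # expert id -> weights, in LRU order

    def fetch(self, eid):
        if eid in self.buffer:         # hit: no I/O needed
            self.buffer.move_to_end(eid)
            return self.buffer[eid]
        weights = self._load_from_storage(eid)   # miss: slow flash/SSD read
        if len(self.buffer) >= self.capacity:
            self.buffer.popitem(last=False)      # evict least recently used
        self.buffer[eid] = weights
        return weights

    def _load_from_storage(self, eid):
        return f"weights-{eid}"        # stands in for external storage

buf = ExpertBuffer(capacity=2)
buf.fetch(0); buf.fetch(1); buf.fetch(0)
buf.fetch(2)                           # evicts expert 1, the LRU entry
print(list(buf.buffer))                # -> [0, 2]
```

With a ~1.5-3 GB RAM budget, the capacity would be set by how many quantized experts fit after the resident non-expert weights are accounted for.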

August 24, 2025 · Last updated on August 26, 2025 · 2 min · KKKZOZ

HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators

Extensive Reading. Author Info: Le Chen - Google Scholar; Haibo Chen [IPADS]: Director of Institute of Parallel and Distributed Systems. Background: existing LLM inference engines typically use only one accelerator (e.g. GPU only or NPU only), which causes two main problems: wasted resources, since the compute power of the other on-chip units goes unused; and performance bottlenecks, since any single accelerator has inherent weaknesses and cannot be optimal in every scenario. Challenges and Insights: design an LLM inference engine that uses the GPU and NPU cooperatively and efficiently at the same time, to maximize LLM inference speed on mobile devices. The NPU serves as the primary computing unit, handling the majority of computing tasks, while the GPU acts as a secondary computing unit to enhance the lower bound of NPU performance. GPU Characteristics: Linear performance: the GPU's performance scales linearly as the tensor size increases; it transitions from memory-bound on small tensors to compute-bound on large ones, where its performance plateaus. High-cost synchronization: there are two main types of synchronization overhead. Data copy: API calls that transfer data between CPU and GPU buffers, like clEnqueueWriteBuffer, incur a fixed latency of about 400 microseconds, irrespective of the data's size. Kernel submission: submitting a kernel to an active, non-empty GPU queue has negligible overhead (10-20 microseconds), but after a synchronization event empties the queue, submitting the next kernel incurs a much higher "startup" latency of 50-100 microseconds. ...
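The fixed synchronization costs quoted above imply a break-even point for GPU offloading, which a back-of-the-envelope model makes concrete (a sketch using the summary's latency figures as assumptions; the workload numbers are invented):

```python
# Illustrative break-even model: offloading a slice of work to the GPU only
# pays off if the compute it takes off the NPU exceeds the fixed sync costs.
# Latency constants follow the figures quoted in the summary above.

DATA_COPY_US = 400     # fixed CPU<->GPU buffer copy latency (clEnqueueWriteBuffer)
COLD_SUBMIT_US = 75    # kernel submission after a sync drained the queue (50-100 us)

def offload_worthwhile(npu_only_us, npu_part_us, gpu_part_us):
    # heterogeneous path: NPU and GPU run in parallel; the GPU side also
    # pays the data copy and cold kernel-submission overheads
    hetero_us = max(npu_part_us, gpu_part_us + DATA_COPY_US + COLD_SUBMIT_US)
    return hetero_us < npu_only_us

print(offload_worthwhile(3000, 1800, 900))  # -> True: large op, sync cost amortized
print(offload_worthwhile(500, 350, 100))    # -> False: fixed sync cost dominates
```

This matches the paper's framing: splitting helps on large tensors, while small operations are better left on a single accelerator.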

August 24, 2025 · Last updated on August 26, 2025 · 4 min · KKKZOZ

A Survey of Resource-efficient LLM and Multimodal Foundation Models

Extensive Reading Goal The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field. ...

August 21, 2025 · Last updated on August 26, 2025 · 3 min · KKKZOZ

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Skimming. Author Info: Zhenyu "Allen" Zhang: a final-year Ph.D. student in the Electrical and Computer Engineering Department at UT Austin; Ying Sheng. Insights: Inherent sparsity of attention: during inference, attention matrices are extremely sparse, with over 95% of attention values being very small. When generating the next token, the model effectively attends to only a small fraction of all past tokens. This opens the door to shrinking the KV cache, because most cached key-value pairs are rarely used. Existence of "heavy hitters": by analyzing tokens' accumulated attention scores, the authors find that these scores follow a power-law distribution, meaning a small subset of tokens (the heavy hitters) contributes the vast majority of the attention value. These H₂ tokens are critical to model performance; evicting them from the cache makes accuracy drop sharply. Effectiveness of local statistics: in theory, identifying the true heavy hitters would require attention information from all future tokens, which is unrealistic in autoregressive generation. The paper shows empirically that using only local information, i.e. computing and accumulating attention scores over the tokens generated so far at each decoding step, identifies H₂ almost as well as using global information. Note: since not all history is equally important, one can design a smart cache-management policy that keeps only the most critical entries, achieving efficient inference within a limited memory budget. ...
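The local-statistics eviction policy can be sketched in a few lines (an illustrative sketch, not the authors' implementation): keep a recent window unconditionally, and among older tokens keep those with the highest accumulated attention scores.

```python
# Illustrative sketch of H2O-style KV-cache eviction: the budget is split
# between the most recent tokens and the "heavy hitters", i.e. the older
# tokens with the highest accumulated attention scores.

def evict(acc_scores, cache, recent_window, budget):
    """acc_scores: token position -> accumulated attention score so far.
    cache: list of cached token positions, oldest first."""
    if len(cache) <= budget:
        return cache
    recent = cache[-recent_window:]       # always keep the newest tokens
    older = cache[:-recent_window]
    # keep the older tokens with the highest accumulated scores
    older.sort(key=lambda pos: acc_scores[pos], reverse=True)
    kept = older[:budget - recent_window]
    return sorted(kept) + recent

scores = {0: 9.5, 1: 0.2, 2: 4.1, 3: 0.3, 4: 0.1}
print(evict(scores, [0, 1, 2, 3, 4], recent_window=2, budget=4))
# -> [0, 2, 3, 4]: positions 0 and 2 survive as heavy hitters
```

Because the scores are accumulated only over tokens seen so far, this needs no knowledge of future attention, which is exactly the paper's local-statistics observation.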

August 21, 2025 · Last updated on August 26, 2025 · 1 min · KKKZOZ

LLM as a System Service on Mobile Devices

Intensive Reading. Author Info: Wangsong Yin - Google Scholar; Mengwei Xu. Background: the paper first proposes LLMaaS, LLM as a system service on mobile devices: the mobile OS exposes an LLM and its inference infrastructure as a system feature to mobile apps, akin to the location or notification services. LLMaaS is motivated mainly by two points. First, LLMaaS needs only one copy of LLM weights in memory: different apps should call the same model maintained by the system rather than each loading its own. Second, a system-level LLM can be better customized for on-device accelerators and enjoy the performance gain over commodity hardware: managing and serving the model at the system level sits closer to the hardware and can exploit it better. The core problem this paper tackles is how to efficiently manage LLM contexts. ...

August 18, 2025 · Last updated on August 26, 2025 · 4 min · KKKZOZ