KKKZOZ’s Blog

Three paper indexes (LLM / Transactions / Distributed Systems) are pinned.

[Pinned] LLM Inference Papers Index

My reading notes.
2026
0203-0209
  Cascade Speculative Drafting for Even Faster LLM Inference
  CAS-Spec Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
  Draft & Verify Lossless Large Language Model Acceleration via Self-Speculative Decoding
  Swift On-the-fly Self-speculative Decoding For LLM Inference Acceleration
  3-Model Speculative Decoding
  Hierarchical Speculative Decoding with Dynamic Windows for Efficient Language Model Inference
  LayerSkip Enabling Early Exit Inference and Self-Speculative Decoding
  AIConfigurator Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving
  Revati Transparent GPU-Free Time-Warp Emulation for LLM Serving
0127-0202
  FlexPrefill A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
  XAttention Block Sparse Attention with Antidiagonal Scoring
  SLED A Speculative LLM Decoding Framework for Efficient Edge Serving
  R-Stitch Dynamic Trajectory Stitching for Efficient Reasoning
  Estimating LLM Uncertainty with Evidence
  Entropy Adaptive Decoding Dynamic Model Switching for Efficient Inference
  Think Big, Generate Quick LLM-to-SLM for Fast Autoregressive Decoding
2025
Remaining
  Beyond the 80 20 Rule High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
  KVCache Cache in the Wild Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
1111-1117
  LServe Efficient Long-sequence LLM Serving with Unified Sparse Attention
  QServe W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
  Quest Query-Aware Sparsity for Efficient Long-Context LLM Inference
  Dynamic Sparse Attention on Mobile SoCs
  A dynamic parallel method for performance optimization on hybrid CPUs
  SmoothQuant Accurate and Efficient Post-Training Quantization for Large Language Models
  DuoAttention Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
  Efficient Streaming Language Models with Attention Sinks
  KTransformers Unleashing the Full Potential of CPU GPU Hybrid Inference for MoE Models
1104-1110
  EAGLE Speculative Sampling Requires Rethinking Feature Uncertainty
1028-1103
  Aegaeon Effective GPU Pooling for Concurrent LLM Serving on the Market
  DistServe Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
  Splitwise Efficient Generative LLM Inference Using Phase Splitting
  Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
0826-0901
  ELMS Elasticized Large Language Models On Mobile Devices
  Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash
0819-0825
  STI Turbocharge NLP Inference at the Edge via Elastic Pipelining
  EdgeMoE Empowering Sparse Large Language Models on Mobile Devices
  LLM as a System Service on Mobile Devices
  SmallThinker A Family of Efficient Large Language Models Natively Trained for Local Deployment
  HeteroLLM Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators
  A Survey of Resource-efficient LLM and Multimodal Foundation Models
  H2O Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
0812-0818
  KV-Runahead Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
  Striped Attention Faster Ring Attention for Causal Transformers
  Ring Attention with Blockwise Transformers for Near-Infinite Context
  TPI-LLM Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices
  LLM.int8() 8-bit Matrix Multiplication for Transformers at Scale
0729-0804
  Fast On-device LLM Inference with NPUs
  Deja Vu Contextual Sparsity for Efficient LLMs at Inference Time
  PowerInfer-2 Fast Large Language Model Inference on a Smartphone
  LLM in a flash Efficient Large Language Model Inference with Limited Memory
  PowerInfer Fast Large Language Model Serving with a Consumer-grade GPU
0722-0728
  AWQ Activation-aware Weight Quantization for LLM Compression and Acceleration
  FlexGen High-Throughput Generative Inference of Large Language Models with a Single GPU
  LoRA Low-Rank Adaptation of Large Language Models
  SpecInfer Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
  EdgeLLM Fast On-Device LLM Inference With Speculative Decoding
  Efficient Memory Management for Large Language Model Serving with PagedAttention
0715-0721
  A Survey on Efficient Inference for Large Language Models
-0714
  Orca A Distributed Serving System for Transformer-Based Generative Models
  EdgeShard Efficient LLM Inference via Collaborative Edge Computing
  ServerlessLLM Locality-Enhanced Serverless Inference for Large Language Models
Uncategorized
WIP 🚧 ...

[Pinned] Transactions Papers Index

My reading notes.
2025
0715-0721
  Concurrency Control as a Service
  Sonata Multi-Database Transactions Made Fast and Serializable
Uncategorized
WIP 🚧
  towards-transaction-as-a-service
  grit
  taking-omid-to-the-clouds
  epoxy
  ad-hoc-transactions-in-web-applications
  omid-reloaded
  data-management-in-microservices
  scalable-distributed-transactions-across-heterogeneous-stores
  cobra

[Pinned] Distributed Papers Index

My reading notes.
2025
2023 && 2024
  bigtable
  cap-twelve-years-later
  zab
  mapreduce
  chubby
  chain-replication
  time, clocks, and the ordering
  farm
  zookeeper

GGML-with-C++

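I have lost count of how many documents I have written while working with GGML; this one is definitely the last. It mainly records the problems I ran into while implementing a simple LLM Inference Engine with GGML. C++ Initialization In C++, the members of a class are initialized before the constructor body begins (default-constructed unless told otherwise); to control how they are initialized, use a member initializer list. class Member { public: Member() { std::cout << "Member default-constructed\n"; } Member(int x) { std::cout << "Member constructed with argument: " << x << "\n"; } }; class MyClass { Member m1; // member object Member m2; int value; public: // Case 1: no initializer list MyClass() { std::cout << "constructor body begins\n"; value = 10; // this is assignment, not initialization! } // Case 2: with an initializer list MyClass(int v) : m1(1), m2(2), value(v) { std::cout << "constructor body begins\n"; } }; Case 1 output: Member default-constructed // m1 is default-constructed before the constructor body Member default-constructed // m2 is default-constructed before the constructor body constructor body begins Case 2 output: Member constructed with argument: 1 // m1 is initialized before the constructor body Member constructed with argument: 2 // m2 is initialized before the constructor body constructor body begins For a pointer, value-initialization sets it to nullptr (a pointer member that is only default-initialized holds an indeterminate value); for an STL container, initialization produces an empty container ...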
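The ordering described above can be checked with a small sketch. `Tracer`, `Holder`, and `construction_order` are hypothetical names made up for this illustration, not code from the post. Instead of printing, the sketch records construction events in a log, and it assumes the pointer member is explicitly set to `nullptr` in the initializer list.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: records its tag in a shared log when constructed.
struct Tracer {
    Tracer(std::vector<std::string>& log, const std::string& tag) {
        log.push_back(tag);
    }
};

// Hypothetical class whose members are all set up in the initializer list.
struct Holder {
    Tracer first;            // declared first, so constructed first
    Tracer second;           // declared second, so constructed second
    int* ptr;                // explicitly initialized to nullptr below
    std::vector<int> items;  // value-initialized to an empty container

    explicit Holder(std::vector<std::string>& log)
        : first(log, "first"), second(log, "second"), ptr(nullptr), items() {
        // By the time the body runs, every member is already initialized.
        log.push_back("body");
    }
};

// Returns the order in which construction events were recorded.
std::vector<std::string> construction_order() {
    std::vector<std::string> log;
    Holder holder(log);
    assert(holder.ptr == nullptr);
    return log;
}
```

Calling `construction_order()` yields the tags for both members before the tag pushed by the constructor body, which matches the "Case 2" behavior in the excerpt.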

matrix-math

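Vector inner product. Physical meaning: the inner product is essentially the length of one vector's projection onto the direction of another, multiplied by the length of that reference vector. When extracting a feature or a component, let $\mathbf{u}$ be a unit vector ($\|\mathbf{u}\| = 1$); then $\mathbf{v} \cdot \mathbf{u}$ directly gives the scalar projection of $\mathbf{v}$ onto the direction of $\mathbf{u}$. Engineering applications: in signal processing (e.g. the Fourier transform), the inner product of a signal with an orthogonal basis function extracts the component of that signal at a specific frequency. In classical mechanics, the work $W = \mathbf{F} \cdot \mathbf{d}$ extracts the effective component of the force along the displacement and multiplies it by the displacement. Because the inner product formula contains $\cos(\theta)$, it measures how "consistent" or "aligned" the directions of two vectors are in a high-dimensional space. When $\mathbf{a} \cdot \mathbf{b} > 0$, the angle is acute and the two are positively correlated. When $\mathbf{a} \cdot \mathbf{b} = 0$, $\cos(\theta) = 0$ and the two vectors are orthogonal (perpendicular). In engineering terms, this means the two systems, signals, or features do not interfere with each other linearly (for mean-centered data, the covariance is zero), which is weaker than full statistical independence. When $\mathbf{a} \cdot \mathbf{b} < 0$, the angle is obtuse and the two are negatively correlated. Engineering applications: in machine learning and data mining, the inner product of normalized vectors is the cosine similarity, commonly used to measure the semantic similarity of word vectors or how well user preferences match in a recommender system. Note The vector inner product is commutative, i.e. $\mathbf{a} \cdot \mathbf{b} = \mathbf{b} \cdot \mathbf{a}$: the projection of $\mathbf{a}$ onto $\mathbf{b}$ times $\|\mathbf{b}\|$ equals the projection of $\mathbf{b}$ onto $\mathbf{a}$ times $\|\mathbf{a}\|$ ...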
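The projection and cosine-similarity readings of the inner product can be sketched in a few lines. The function names `dot`, `l2_norm`, `scalar_projection`, and `cosine_similarity` are chosen for this illustration only, and the sketch assumes equal-length, non-zero input vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Inner product of two equal-length vectors.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Euclidean length of a vector.
double l2_norm(const std::vector<double>& a) {
    return std::sqrt(dot(a, a));
}

// Scalar projection of v onto the direction of u (u must be non-zero).
double scalar_projection(const std::vector<double>& v,
                         const std::vector<double>& u) {
    return dot(v, u) / l2_norm(u);
}

// cos(theta) between a and b (both must be non-zero).
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    return dot(a, b) / (l2_norm(a) * l2_norm(b));
}
```

For example, projecting (3, 4) onto the unit vector (1, 0) gives 3, perpendicular vectors give a cosine similarity of 0, and parallel vectors give a value close to 1 up to floating-point rounding.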

probalility-and-statistics

L1 Norm The $L_1$ norm calculates the sum of the absolute values of all elements within a vector or matrix. Calculation: For a weight vector $w$ with $n$ elements, the $L_1$ norm is defined as: $$ \|w\|_1 = \sum_{i=1}^{n} |w_i| $$ Related to LLM Pruning Context: When applying the $L_1$ norm to a structural group (like a specific attention channel), you take the absolute value of every single parameter in that channel and sum them up. A low $L_1$ norm indicates that the parameters in that group are collectively very close to zero. ...
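As a minimal sketch of the definition above, the code below computes the $L_1$ norm of a weight vector and then finds the group with the smallest norm. The names `l1_norm` and `lowest_l1_group` are hypothetical, and treating the lowest-norm group as the pruning candidate is an assumption made for illustration, not a criterion stated in the post.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of absolute values of all elements of w.
double l1_norm(const std::vector<double>& w) {
    double sum = 0.0;
    for (double x : w) {
        sum += std::fabs(x);
    }
    return sum;
}

// Index of the group (e.g. one channel's parameters) with the smallest L1 norm.
// Assumption: the lowest-norm group is the one whose parameters are
// collectively closest to zero, so it is reported as the candidate.
std::size_t lowest_l1_group(const std::vector<std::vector<double>>& groups) {
    assert(!groups.empty());
    std::size_t best = 0;
    double best_norm = l1_norm(groups[0]);
    for (std::size_t g = 1; g < groups.size(); ++g) {
        const double norm = l1_norm(groups[g]);
        if (norm < best_norm) {
            best_norm = norm;
            best = g;
        }
    }
    return best;
}
```

With groups {0.5, -0.5}, {0.01, -0.02}, and {1, 1}, the norms are 1, 0.03, and 2, so the second group (index 1) is reported.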

cli-notes

A few things I have jotted down while using the CLI. Setup Docker curl -fsSL https://get.docker.com -o get-docker.sh sudo DOWNLOAD_URL=https://mirrors.ustc.edu.cn/docker-ce sh get-docker.sh sudo groupadd docker # optional sudo usermod -aG docker $USER newgrp docker sudo systemctl start docker apt Switch the mirror source: For 24.04: sed -i "s@http://.*archive.ubuntu.com@https://mirrors.aliyun.com/@g" /etc/apt/sources.list.d/ubuntu.sources sed -i "s@http://.*security.ubuntu.com@https://mirrors.aliyun.com/@g" /etc/apt/sources.list.d/ubuntu.sources sed -i "s@http://ports.ubuntu.com@https://mirrors.aliyun.com@g" /etc/apt/sources.list.d/ubuntu.sources For 22.04: sed -i "s@http://.*archive.ubuntu.com@https://mirrors.aliyun.com/@g" /etc/apt/sources.list sed -i "s@http://.*security.ubuntu.com@https://mirrors.aliyun.com/@g" /etc/apt/sources.list Network # show the gateway ip route # show IP addresses ip addr # show per-process network usage sudo apt install nethogs nethogs apt sudo resets environment variables for security reasons; add the -E flag to preserve the current environment variables ...

linear-attention

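linear_attention is a new structure formed by combining Mamba2's gating mechanism with DeltaNet's delta rule. Let's first briefly look at what linear attention is. Overview From the Transformer perspective, one layer of standard causal self-attention is roughly: $$ Q = XW_Q,\quad K = XW_K,\quad V = XW_V $$ $$ A = \text{softmax}(QK^\top + \text{mask}),\quad O = AV $$ For the $t$-th token, this essentially computes: $$ o_t = \sum_{i \le t} \alpha_{t,i} v_i $$ Each row of O(n, t) uses the corresponding row of attn_map as weights to form a linear combination (weighted sum) of the rows of the value matrix. In other words, the current token uses its query to score all historical keys, then takes a weighted sum of the historical values. This gives the Transformer a strong ability to "retrieve history by content", but the cost is that training usually has to process the full token-token interaction matrix. The core goal of work such as Mamba2, DeltaNet, and Gated DeltaNet is to preserve this ability to "pull information from history" as much as possible without explicitly looking at every historical token each time. ...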
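The per-token view $o_t = \sum_{i \le t} \alpha_{t,i} v_i$ can be sketched as a naive loop. This is an illustration under stated assumptions rather than anyone's production kernel: it handles a single head, takes already-projected `Q`, `K`, `V` as row-major matrices, implements the causal mask by only visiting $i \le t$, and omits the usual $1/\sqrt{d}$ scaling to mirror the formula as written above. `Matrix`, `dot`, and `causal_attention` are names introduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // one row per token

// Inner product of two equal-length rows.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Naive causal softmax attention: row t of the result is a weighted sum of
// value rows 0..t, with weights softmax(q_t . k_i) taken over i <= t.
Matrix causal_attention(const Matrix& Q, const Matrix& K, const Matrix& V) {
    const std::size_t n = Q.size();
    assert(K.size() == n && V.size() == n);
    const std::size_t d_v = (n == 0) ? 0 : V[0].size();
    Matrix O(n, std::vector<double>(d_v, 0.0));
    for (std::size_t t = 0; t < n; ++t) {
        // Score the current query against every key up to and including t.
        std::vector<double> weight(t + 1);
        double max_score = dot(Q[t], K[0]);
        for (std::size_t i = 0; i <= t; ++i) {
            weight[i] = dot(Q[t], K[i]);
            max_score = std::max(max_score, weight[i]);
        }
        // Softmax, shifted by the maximum score for numerical stability.
        double denom = 0.0;
        for (std::size_t i = 0; i <= t; ++i) {
            weight[i] = std::exp(weight[i] - max_score);
            denom += weight[i];
        }
        // Weighted sum of the visible value rows.
        for (std::size_t i = 0; i <= t; ++i) {
            const double alpha = weight[i] / denom;
            for (std::size_t j = 0; j < d_v; ++j) {
                O[t][j] += alpha * V[i][j];
            }
        }
    }
    return O;
}
```

The double loop over `t` and `i` makes the quadratic token-token interaction explicit, which is the cost that the linear-attention family tries to avoid.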