[Pinned] LLM Inference Papers Index

My reading notes.

2025

0826-0901
- ELMS: Elasticized Large Language Models On Mobile Devices
- Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

0819-0825
- STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining
- EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices
- LLM as a System Service on Mobile Devices
- SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
- HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators
- A Survey of Resource-efficient LLM and Multimodal Foundation Models
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

0812-0818
- KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
- Striped Attention: Faster Ring Attention for Causal Transformers
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

0729-0804
- Fast On-device LLM Inference with NPUs
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

0722-0728
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
- LoRA: Low-Rank Adaptation of Large Language Models
- SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
- EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding
- Efficient Memory Management for Large Language Model Serving with PagedAttention

0715-0721
- A Survey on Efficient Inference for Large Language Models

-0714
- Orca: A Distributed Serving System for Transformer-Based Generative Models
- EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Uncategorized WIP 🚧 ...

July 28, 2025 · Last updated on September 1, 2025 · 2 min · KKKZOZ

[Pinned] Transactions Papers Index

My reading notes.

2025

0715-0721
- Concurrency Control as a Service
- Sonata: Multi-Database Transactions Made Fast and Serializable

Uncategorized WIP 🚧
- towards-transaction-as-a-service
- grit
- taking-omid-to-the-clouds
- epoxy
- ad-hoc-transactions-in-web-applications
- omid-reloaded
- data-management-in-microservices
- scalable-distributed-transactions-across-heterogeneous-stores
- cobra

August 1, 2025 · Last updated on August 3, 2025 · 1 min · KKKZOZ

[Pinned] Distributed Papers Index

My reading notes.

2025

2023 && 2024
- bigtable
- cap-twelve-years-later
- zab
- mapreduce
- chubby
- chain-replication
- time, clocks, and the ordering
- farm
- zookeeper

August 1, 2025 · Last updated on August 3, 2025 · 1 min · KKKZOZ

Clang on Apple

While recently trying to compile llama.cpp on macOS, I ran into quite a few pitfalls. The final conclusion is actually simple: on macOS, the most reliable approach is to use the system-provided Apple Clang directly. It needs almost no extra configuration and avoids all sorts of ABI and SDK compatibility problems.

Problems Encountered

At first I used the LLVM/Clang installed via Homebrew:

brew install llvm

Then, in the CMake toolchain or preset, I pointed the compilers at:

/opt/homebrew/opt/llvm/bin/clang
/opt/homebrew/opt/llvm/bin/clang++

As soon as I ran a build, the problems came one after another:

SDK not found

Linking failed with:

ld: library 'System' not found

This happens because Homebrew's clang does not automatically locate the macOS SDK, so core libraries such as libSystem cannot be linked.

ABI incompatibility

After fixing the SDK, linking failed again:

Undefined symbols for architecture arm64:
  "std::__1::__hash_memory(void const*, unsigned long)", ...

These symbols come from the new ABI of libc++ 21 (Homebrew's LLVM), but the link step pulled in the older libc++ shipped with the Apple SDK. With headers and library from different versions, this is a classic ABI mismatch. ...
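Since the takeaway is to just use Apple Clang, here is a minimal sketch of what that looks like for a CMake project like llama.cpp (the paths are the macOS defaults and the flags are standard CMake options; this exact invocation is my addition, not from the post):

# Point CMake at the system Apple Clang explicitly; /usr/bin/clang dispatches
# to the toolchain selected by xcode-select, so the SDK and libc++ stay matched
cmake -B build \
  -DCMAKE_C_COMPILER=/usr/bin/clang \
  -DCMAKE_CXX_COMPILER=/usr/bin/clang++
cmake --build build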

September 12, 2025 · Last updated on September 12, 2025 · 1 min · KKKZOZ

Git Essentials

Essential Understandings about Git.

Basic Concept

Working Directory & Staging Area & Branch

Working Directory is the actual set of files on disk that you are currently working with.
It can be "clean", i.e. identical to some version in the repository, or "dirty", i.e. containing modified or untracked files.

Staging Area is a temporary area holding the snapshots of files that will go into the next commit.
Changes in the working directory are added to the staging area with git add.

Branch is Git's core concept for parallel development.
Each branch is an independent line of development; you can make changes on it without affecting other branches.

Remote & Upstream

Remote is a reference to a remote Git repository associated with your local repository.

# Add a remote
git remote add origin https://github.com/user/repo.git

# Add multiple remotes
git remote add upstream https://github.com/original/repo.git
git remote add fork https://github.com/your-fork/repo.git

# Remove a remote
git remote remove origin

# Rename a remote
git remote rename origin github

Upstream is the remote branch that a local branch tracks, establishing an "upstream/downstream" relationship. ...
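To make the tracking relationship concrete, here is a minimal sketch using standard git commands (the branch and remote names are illustrative, not from the post):

# Set origin/main as the upstream of the local main branch while pushing
git push -u origin main

# Or set the upstream explicitly without pushing
git branch --set-upstream-to=origin/main main

# Show which upstream each local branch tracks
git branch -vv

# With an upstream configured, these need no extra arguments
git pull
git push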

August 8, 2025 · Last updated on September 1, 2025 · 12 min · KKKZOZ

VSCode Essentials

Essential Understandings about VSCode. This post will continue to be updated to cover every major update of VSCode.

Concepts

Workspace

VSCode has two kinds of workspaces:

Single-Folder Workspace
This is the most common and simplest kind. When you open a folder via File -> Open Folder, that folder becomes your current workspace. All of VSCode's operations and configuration (such as the files under the .vscode directory) are relative to this root folder.

Multi-root Workspace
A multi-root workspace can contain multiple folders from different locations, all managed in the same VSCode window.

Scenario: imagine a complex project whose frontend code lives in one repository (say my-webapp) and whose backend code lives in a completely separate repository (say my-api-server). You want to see and edit both projects at the same time.

Steps:
1. Open one of the folders first (e.g. my-webapp).
2. Click File -> Add Folder to Workspace and select my-api-server.
3. The file explorer now shows both folders.
4. Finally, click File -> Save Workspace As...; VSCode creates a file with the .code-workspace extension.
5. From then on, simply opening this .code-workspace file restores the working environment with all project folders.

In short: a "workspace" is your current project context in VSCode. It can be a plain folder, or a collection of folders defined by a .code-workspace file. ...
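For reference, a .code-workspace file is plain JSON (VSCode also accepts comments in it). A minimal sketch for the scenario above, reusing the post's example folder names (the file contents here are my illustration, not from the post):

{
    // Folders in the multi-root workspace; paths are resolved
    // relative to the .code-workspace file itself
    "folders": [
        { "path": "my-webapp" },
        { "path": "my-api-server" }
    ],
    // Settings placed here apply to the whole workspace
    "settings": {}
}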

August 6, 2025 · Last updated on August 10, 2025 · 4 min · KKKZOZ

Deep Learning Basic

For self-reference.

Forward Pass

import torch
import torch.nn.functional as F

# Setup
learning_rate = 0.1
x = torch.randn(1, 5)           # Input data
y_true = torch.tensor([[1.0]])  # True label

# Model parameters initialized manually
# requires_grad=True tells PyTorch to calculate gradients for them
w = torch.randn(5, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

print(f"Initial weight:\n{w.data}\n")

# 1. Forward Pass
# Calculate a prediction using the current weight and bias
z = x @ w + b  # `@` is matrix multiplication
y_pred = torch.sigmoid(z)

# 2. Calculate Loss
# Compare the prediction to the true label
loss = F.binary_cross_entropy(y_pred, y_true)

# 3. Backward Pass
# Calculate the gradients of the loss with respect to w and b
loss.backward()

# 4. Update Parameters
# Manually adjust w and b in the opposite direction of their gradients
with torch.no_grad():  # Temporarily disable gradient tracking for the update
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad

    # Manually zero out the gradients for the next iteration
    w.grad.zero_()
    b.grad.zero_()

print(f"Updated weight:\n{w.data}\n")
print(f"Loss: {loss.item():.4f}")

The "forward pass" is the process in which a neural network starts from the input data and computes layer by layer until it produces the final output (the prediction). Think of it as information "flowing forward" through the network. ...
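The manual w and b above make the mechanics explicit; in practice the same forward pass is usually packaged in a module. Here is a minimal sketch of that packaging (an added illustration, not part of the original post):

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Linear owns the weight and bias that were manual tensors above
        self.linear = nn.Linear(5, 1)

    def forward(self, x):
        z = self.linear(x)       # z = x @ w + b
        return torch.sigmoid(z)  # squash to a probability in (0, 1)

model = TinyClassifier()
y_pred = model(torch.randn(1, 5))  # calling the model runs forward()
print(y_pred.shape)                # torch.Size([1, 1])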

May 27, 2025 · Last updated on August 1, 2025 · 4 min · KKKZOZ

NumPy Notes

Shape Manipulating

reshape

You can think of NumPy's reshape as a flatten-then-refill process.

Core idea: whatever shape the array starts with, and whatever new shape you want, reshape (conceptually) does two steps:

Flattening: read the elements out row by row into a one-dimensional array.
Refilling: fill them back in, row by row, according to the new shape.

import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

# Flatten
a_flat = a.reshape(1, 12)
print("Flattened array: ", a_flat)

# Refill
a_refilled = a_flat.reshape(3, 4)
print("Refilled array: ", a_refilled)

transpose

High-dimensional transposition essentially rearranges the order of the array's axes, rather than being a "matrix transpose" in the traditional sense.

concatenate & stack

concatenate: extend or join along an existing track/dimension.

concatenate joins multiple arrays along a dimension (axis) that already exists. The result usually has the same number of dimensions as the inputs.

How it works:
You specify an axis parameter telling NumPy which dimension to join along.
All dimensions other than the one being joined must match exactly.
It's like coupling two trains: their height and width must match; only the length may differ (and the lengths add up).

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6]])  # Note: B is also 2-D, so its column count matches A on axis=0
np.concatenate((A, B), axis=0)
# Result:
# [[1, 2],
#  [3, 4],
#  [5, 6]]  (more rows, same number of columns)

A = np.array([[1, 2], [3, 4]])
C = np.array([[5, 6], [7, 8]])
np.concatenate((A, C), axis=1)
# Result:
# [[1, 2, 5, 6],
#  [3, 4, 7, 8]]  (more columns, same number of rows)

stack: pile up multiple independent layers, creating a new dimension. ...
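To make the stack-vs-concatenate contrast concrete, here is a minimal sketch (an added example, not from the post) reusing the A and C arrays above:

import numpy as np

A = np.array([[1, 2], [3, 4]])
C = np.array([[5, 6], [7, 8]])

# stack introduces a NEW axis: two 2x2 layers become one 2x2x2 array
S = np.stack((A, C), axis=0)
print(S.shape)  # (2, 2, 2)

# concatenate extends an EXISTING axis: two 2x2 arrays become one 4x2 array
K = np.concatenate((A, C), axis=0)
print(K.shape)  # (4, 2)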

May 26, 2025 · Last updated on August 1, 2025 · 5 min · KKKZOZ