[Pinned] LLM Inference Papers Index

My reading notes.

2025

0826-0901
- ELMS: Elasticized Large Language Models On Mobile Devices
- Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

0819-0825
- STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining
- EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices
- LLM as a System Service on Mobile Devices
- SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
- HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators
- A Survey of Resource-efficient LLM and Multimodal Foundation Models
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

0812-0818
- KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
- Striped Attention: Faster Ring Attention for Causal Transformers
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

0729-0804
- Fast On-device LLM Inference with NPUs
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

0722-0728
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
- LoRA: Low-Rank Adaptation of Large Language Models
- SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
- EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding
- Efficient Memory Management for Large Language Model Serving with PagedAttention

0715-0721
- A Survey on Efficient Inference for Large Language Models

-0714
- Orca: A Distributed Serving System for Transformer-Based Generative Models
- EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Uncategorized WIP 🚧 ...

July 28, 2025 · Last updated on September 1, 2025 · 2 min · KKKZOZ

[Pinned] Transactions Papers Index

My reading notes.

2025

0715-0721
- Concurrency Control as a Service
- Sonata: Multi-Database Transactions Made Fast and Serializable

Uncategorized WIP 🚧
- towards-transaction-as-a-service
- grit
- taking-omid-to-the-clouds
- epoxy
- ad-hoc-transactions-in-web-applications
- omid-reloaded
- data-management-in-microservices
- scalable-distributed-transactions-across-heterogeneous-stores
- cobra

August 1, 2025 · Last updated on August 3, 2025 · 1 min · KKKZOZ

[Pinned] Distributed Papers Index

My reading notes.

2025

2023 && 2024
- bigtable
- cap-twelve-years-later
- zab
- mapreduce
- chubby
- chain-replication
- time, clocks, and the ordering
- farm
- zookeeper

August 1, 2025 · Last updated on August 3, 2025 · 1 min · KKKZOZ

Clang on Apple

While recently trying to compile llama.cpp on macOS, I ran into quite a few pitfalls. The final conclusion is actually simple: on macOS, the most reliable approach is to use the system-provided Apple Clang directly. It needs almost no extra configuration and avoids all sorts of ABI and SDK compatibility problems.

Problems Encountered

At first I used the LLVM/Clang installed via Homebrew:

brew install llvm

Then, in the CMake toolchain or preset, I pointed the compilers at:

/opt/homebrew/opt/llvm/bin/clang
/opt/homebrew/opt/llvm/bin/clang++

As soon as I ran a build, the problems came one after another:

SDK not found

Linking failed with:

ld: library 'System' not found

This happens because Homebrew's clang does not automatically locate the macOS SDK, so core libraries such as libSystem cannot be linked.

ABI incompatibility

After fixing the SDK, linking failed again:

Undefined symbols for architecture arm64:
  "std::__1::__hash_memory(void const*, unsigned long)", ...

These symbols come from the new ABI of libc++ 21 (Homebrew's LLVM), but the link step pulled in the older libc++ shipped with the Apple SDK. With headers and library from different versions, this is a classic ABI mismatch. ...
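Since the takeaway is to just use Apple Clang, here is a minimal sketch of what that looks like for a CMake project like llama.cpp (the paths are the macOS defaults and the flags are standard CMake options; this exact invocation is my addition, not from the post):

# Point CMake at the system Apple Clang explicitly; /usr/bin/clang dispatches
# to the toolchain selected by xcode-select, so the SDK and libc++ stay matched
cmake -B build \
  -DCMAKE_C_COMPILER=/usr/bin/clang \
  -DCMAKE_CXX_COMPILER=/usr/bin/clang++
cmake --build build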

September 12, 2025 · Last updated on September 12, 2025 · 1 min · KKKZOZ

Git Essentials

Essential Understandings about Git.

Basic Concept

Working Directory & Staging Area & Branch

Working Directory is the actual set of files on disk that you are currently working with.
It can be "clean", i.e. identical to some version in the repository, or "dirty", i.e. containing modified or untracked files.

Staging Area is a temporary area holding the snapshots of files that will go into the next commit.
Changes in the working directory are added to the staging area with git add.

Branch is Git's core concept for parallel development.
Each branch is an independent line of development; you can make changes on it without affecting other branches.

Remote & Upstream

Remote is a reference to a remote Git repository associated with your local repository.

# Add a remote
git remote add origin https://github.com/user/repo.git

# Add multiple remotes
git remote add upstream https://github.com/original/repo.git
git remote add fork https://github.com/your-fork/repo.git

# Remove a remote
git remote remove origin

# Rename a remote
git remote rename origin github

Upstream is the remote branch that a local branch tracks, establishing an "upstream/downstream" relationship. ...
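To make the tracking relationship concrete, here is a minimal sketch using standard git commands (the branch and remote names are illustrative, not from the post):

# Set origin/main as the upstream of the local main branch while pushing
git push -u origin main

# Or set the upstream explicitly without pushing
git branch --set-upstream-to=origin/main main

# Show which upstream each local branch tracks
git branch -vv

# With an upstream configured, these need no extra arguments
git pull
git push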

August 8, 2025 · Last updated on September 1, 2025 · 12 min · KKKZOZ

VSCode Essentials

Essential Understandings about VSCode. This post will continue to be updated to cover every major update of VSCode.

Concepts

Workspace

VSCode has two kinds of workspaces:

Single-Folder Workspace
This is the most common and simplest kind. When you open a folder via File -> Open Folder, that folder becomes your current workspace. All of VSCode's operations and configuration (such as the files under the .vscode directory) are relative to this root folder.

Multi-root Workspace
A multi-root workspace can contain multiple folders from different locations, all managed in the same VSCode window.

Scenario: imagine a complex project whose frontend code lives in one repository (say my-webapp) and whose backend code lives in a completely separate repository (say my-api-server). You want to see and edit both projects at the same time.

Steps:
1. Open one of the folders first (e.g. my-webapp).
2. Click File -> Add Folder to Workspace and select my-api-server.
3. The file explorer now shows both folders.
4. Finally, click File -> Save Workspace As...; VSCode creates a file with the .code-workspace extension.
5. From then on, simply opening this .code-workspace file restores the working environment with all project folders.

In short: a "workspace" is your current project context in VSCode. It can be a plain folder, or a collection of folders defined by a .code-workspace file. ...
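For reference, a .code-workspace file is plain JSON (VSCode also accepts comments in it). A minimal sketch for the scenario above, reusing the post's example folder names (the file contents here are my illustration, not from the post):

{
    // Folders in the multi-root workspace; paths are resolved
    // relative to the .code-workspace file itself
    "folders": [
        { "path": "my-webapp" },
        { "path": "my-api-server" }
    ],
    // Settings placed here apply to the whole workspace
    "settings": {}
}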

August 6, 2025 · Last updated on August 10, 2025 · 4 min · KKKZOZ

Deep Learning Basic

For self-reference.

Forward Pass

import torch
import torch.nn.functional as F

# Setup
learning_rate = 0.1
x = torch.randn(1, 5)           # Input data
y_true = torch.tensor([[1.0]])  # True label

# Model parameters initialized manually
# requires_grad=True tells PyTorch to calculate gradients for them
w = torch.randn(5, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

print(f"Initial weight:\n{w.data}\n")

# 1. Forward Pass
# Calculate a prediction using the current weight and bias
z = x @ w + b  # `@` is matrix multiplication
y_pred = torch.sigmoid(z)

# 2. Calculate Loss
# Compare the prediction to the true label
loss = F.binary_cross_entropy(y_pred, y_true)

# 3. Backward Pass
# Calculate the gradients of the loss with respect to w and b
loss.backward()

# 4. Update Parameters
# Manually adjust w and b in the opposite direction of their gradients
with torch.no_grad():  # Temporarily disable gradient tracking for the update
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad

    # Manually zero out the gradients for the next iteration
    w.grad.zero_()
    b.grad.zero_()

print(f"Updated weight:\n{w.data}\n")
print(f"Loss: {loss.item():.4f}")

The "forward pass" is the process in which a neural network starts from the input data and computes layer by layer until it produces the final output (the prediction). Think of it as information "flowing forward" through the network. ...
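The manual w and b above make the mechanics explicit; in practice the same forward pass is usually packaged in a module. Here is a minimal sketch of that packaging (an added illustration, not part of the original post):

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Linear owns the weight and bias that were manual tensors above
        self.linear = nn.Linear(5, 1)

    def forward(self, x):
        z = self.linear(x)       # z = x @ w + b
        return torch.sigmoid(z)  # squash to a probability in (0, 1)

model = TinyClassifier()
y_pred = model(torch.randn(1, 5))  # calling the model runs forward()
print(y_pred.shape)                # torch.Size([1, 1])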

May 27, 2025 · Last updated on August 1, 2025 · 4 min · KKKZOZ

NumPy Notes

Shape Manipulating

reshape

You can think of NumPy's reshape as a flatten-then-refill process.

Core idea: whatever shape the array starts with, and whatever new shape you want, reshape (conceptually) does two steps:

Flattening: read the elements out row by row into a one-dimensional array.
Refilling: fill them back in, row by row, according to the new shape.

import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

# Flatten
a_flat = a.reshape(1, 12)
print("Flattened array: ", a_flat)

# Refill
a_refilled = a_flat.reshape(3, 4)
print("Refilled array: ", a_refilled)

transpose

High-dimensional transposition essentially rearranges the order of the array's axes, rather than being a "matrix transpose" in the traditional sense.

concatenate & stack

concatenate: extend or join along an existing track/dimension.

concatenate joins multiple arrays along a dimension (axis) that already exists. The result usually has the same number of dimensions as the inputs.

How it works:
You specify an axis parameter telling NumPy which dimension to join along.
All dimensions other than the one being joined must match exactly.
It's like coupling two trains: their height and width must match; only the length may differ (and the lengths add up).

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6]])  # Note: B is also 2-D, so its column count matches A on axis=0
np.concatenate((A, B), axis=0)
# Result:
# [[1, 2],
#  [3, 4],
#  [5, 6]]  (more rows, same number of columns)

A = np.array([[1, 2], [3, 4]])
C = np.array([[5, 6], [7, 8]])
np.concatenate((A, C), axis=1)
# Result:
# [[1, 2, 5, 6],
#  [3, 4, 7, 8]]  (more columns, same number of rows)

stack: pile up multiple independent layers, creating a new dimension. ...
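To make the stack-vs-concatenate contrast concrete, here is a minimal sketch (an added example, not from the post) reusing the A and C arrays above:

import numpy as np

A = np.array([[1, 2], [3, 4]])
C = np.array([[5, 6], [7, 8]])

# stack introduces a NEW axis: two 2x2 layers become one 2x2x2 array
S = np.stack((A, C), axis=0)
print(S.shape)  # (2, 2, 2)

# concatenate extends an EXISTING axis: two 2x2 arrays become one 4x2 array
K = np.concatenate((A, C), axis=0)
print(K.shape)  # (4, 2)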

May 26, 2025 · Last updated on August 1, 2025 · 5 min · KKKZOZ