KKKZOZ’s Blog

Three paper indexes (LLM/Transactions/Distributed Systems) are pinned.

Other posts are sorted by date.

[Pinned] LLM Inference Papers Index

My reading notes.
2025
1103-1110: EAGLE Speculative Sampling Requires Rethinking Feature Uncertainty
1028-1103: Aegaeon Effective GPU Pooling for Concurrent LLM Serving on the Market; DistServe Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving; Splitwise Efficient Generative LLM Inference Using Phase Splitting; Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
0826-0901: ELMS Elasticized Large Language Models On Mobile Devices; Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash
0819-0825: STI Turbocharge NLP Inference at the Edge via Elastic Pipelining; EdgeMoE Empowering Sparse Large Language Models on Mobile Devices; LLM as a System Service on Mobile Devices; SmallThinker A Family of Efficient Large Language Models Natively Trained for Local Deployment; HeteroLLM Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators; A Survey of Resource-efficient LLM and Multimodal Foundation Models; H2O Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
0812-0818: KV-Runahead Scalable Causal LLM Inference by Parallel Key-Value Cache Generation; Striped Attention Faster Ring Attention for Causal Transformers; Ring Attention with Blockwise Transformers for Near-Infinite Context; TPI-LLM Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices; LLM.int8() 8-bit Matrix Multiplication for Transformers at Scale
0729-0804: Fast On-device LLM Inference with NPUs; Deja Vu Contextual Sparsity for Efficient LLMs at Inference Time; PowerInfer-2 Fast Large Language Model Inference on a Smartphone; LLM in a flash Efficient Large Language Model Inference with Limited Memory; PowerInfer Fast Large Language Model Serving with a Consumer-grade GPU
0722-0728: AWQ Activation-aware Weight Quantization for LLM Compression and Acceleration; FlexGen High-Throughput Generative Inference of Large Language Models with a Single GPU; LoRA Low-Rank Adaptation of Large Language Models; SpecInfer Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification; EdgeLLM Fast On-Device LLM Inference With Speculative Decoding; Efficient Memory Management for Large Language Model Serving with PagedAttention
0715-0721: A Survey on Efficient Inference for Large Language Models
-0714: Orca A Distributed Serving System for Transformer-Based Generative Models; EdgeShard Efficient LLM Inference via Collaborative Edge Computing; ServerlessLLM Locality-Enhanced Serverless Inference for Large Language Models
Uncategorized: WIP 🚧 ...

July 28, 2025 · Last updated on November 10, 2025 · 2 min · KKKZOZ

[Pinned] Transactions Papers Index

My reading notes.
2025
0715-0721: Concurrency Control as a Service; Sonata Multi-Database Transactions Made Fast and Serializable
Uncategorized WIP 🚧: towards-transaction-as-a-service; grit; taking-omid-to-the-clouds; epoxy; ad-hoc-transactions-in-web-applications; omid-reloaded; data-management-in-microservices; scalable-distributed-transactions-across-heterogeneous-stores; cobra

August 1, 2025 · Last updated on August 3, 2025 · 1 min · KKKZOZ

[Pinned] Distributed Papers Index

My reading notes.
2025
2023 && 2024: bigtable; cap-twelve-years-later; zab; mapreduce; chubby; chain-replication; time, clocks, and the ordering; farm; zookeeper

August 1, 2025 · Last updated on August 3, 2025 · 1 min · KKKZOZ

Sync Dot Files with Chezmoi

Keeping dotfiles consistent across multiple machines (macOS, Ubuntu, etc.) is a common challenge. Traditionally, tools like GNU Stow were used to symlink files from a repository into $HOME. Today, chezmoi provides a more powerful, template-driven approach that integrates seamlessly with Git and makes managing dotfiles across systems straightforward.

This post will walk you through:
- Installing chezmoi
- Setting it up on your first machine
- Committing to a remote repository
- Applying and syncing your configuration on other hosts
- Handling OS-specific configuration differences

1. Installing chezmoi

On macOS (via Homebrew): ...
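As a quick preview of that end-to-end flow, a minimal sketch might look like the following (the dotfile path and the repository URL are placeholders for illustration, not taken from the post):

    # Install chezmoi (macOS, via Homebrew)
    brew install chezmoi

    # Initialize the source directory and start tracking a dotfile
    chezmoi init
    chezmoi add ~/.zshrc

    # Commit and push the source state (opens a shell in ~/.local/share/chezmoi by default)
    chezmoi cd
    git remote add origin https://github.com/<user>/dotfiles.git
    git add . && git commit -m "Initial dotfiles" && git push -u origin main

    # On another machine: clone the repository and apply it in one step
    chezmoi init --apply https://github.com/<user>/dotfiles.git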

October 4, 2025 · Last updated on October 4, 2025 · 3 min · KKKZOZ

Clang on Apple

While compiling llama.cpp on macOS recently, I ran into quite a few pitfalls. The conclusion turned out to be simple: on macOS, the most reliable option is to use the system-provided Apple Clang. It needs almost no extra configuration and sidesteps all kinds of ABI and SDK compatibility issues.

Problems I ran into

At first I used the LLVM/Clang installed through Homebrew:

    brew install llvm

and pointed the compilers in my CMake toolchain/preset to:

    /opt/homebrew/opt/llvm/bin/clang
    /opt/homebrew/opt/llvm/bin/clang++

As soon as I built, the problems piled up:

SDK not found. The linker reported:

    ld: library 'System' not found

This happens because Homebrew's clang does not automatically locate the macOS SDK, so core libraries such as libSystem cannot be linked.

ABI mismatch. After fixing the SDK, another link error appeared:

    Undefined symbols for architecture arm64: "std::__1::__hash_memory(void const*, unsigned long)", ...

These symbols come from the newer libc++ 21 ABI shipped with Homebrew's LLVM, while the link step pulled in the older libc++ from the Apple SDK. Headers and library were out of sync, a classic ABI mismatch. ...
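For contrast, a minimal sketch of the working setup, building with the system toolchain (generic CMake flags for illustration, not the exact presets from the post):

    # Configure and build with the Apple Clang that ships with Xcode / Command Line Tools
    cmake -B build \
      -DCMAKE_C_COMPILER=/usr/bin/clang \
      -DCMAKE_CXX_COMPILER=/usr/bin/clang++
    cmake --build build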

September 12, 2025 · Last updated on September 12, 2025 · 1 min · KKKZOZ

Git Essentials

Essential understandings about Git.

Basic Concepts

Working Directory & Staging Area & Branch

Working Directory
The actual set of files on disk that you are currently working on.
It can be "clean", i.e. identical to some version in the repository, or "dirty", i.e. containing modified or untracked files.

Staging Area
A temporary area that holds the snapshots of files about to be committed to the repository.
Changes in the working directory are added to the staging area with git add.

Branch
The core Git concept for parallel development.
Each branch is an independent line of development; you can make changes on it without affecting other branches.

Remote & Upstream

A remote is a reference in your local repository to a remote Git repository.

    # Add a remote
    git remote add origin https://github.com/user/repo.git

    # Add multiple remotes
    git remote add upstream https://github.com/original/repo.git
    git remote add fork https://github.com/your-fork/repo.git

    # Remove a remote
    git remote remove origin

    # Rename a remote
    git remote rename origin github

An upstream is the remote branch that a local branch tracks, establishing an "upstream/downstream" relationship. ...
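To make the upstream relationship concrete, here is a small sketch (the remote name origin and the branch name main are assumptions for illustration):

    # Push the local branch and record origin/main as its upstream in one step
    git push -u origin main

    # Or set the upstream of an existing local branch explicitly
    git branch --set-upstream-to=origin/main main

    # Show which remote branch each local branch tracks
    git branch -vv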

August 8, 2025 · Last updated on September 1, 2025 · 12 min · KKKZOZ

VSCode Essentials

Essential understandings about VSCode. This post will keep being updated to cover every major VSCode release.

Concepts

Workspace

VSCode has two kinds of workspaces:

Single-Folder Workspace
The most common and simplest kind. When you open a folder via File -> Open Folder, that folder becomes your current workspace. All of VSCode's operations and configuration (such as the files under the .vscode directory) are relative to this root folder.

Multi-root Workspace
A multi-root workspace can contain several folders from different locations, all managed in the same VSCode window.
Scenario: imagine a complex project whose frontend lives in one repository (say my-webapp) and whose backend lives in a completely separate repository (say my-api-server). You want to see and edit both projects at the same time.
Steps:
1. Open one of the folders first (for example my-webapp).
2. Click File -> Add Folder to Workspace and select my-api-server.
3. Both folders now appear in the file explorer.
4. Finally, click File -> Save Workspace As...; VSCode creates a file with the .code-workspace extension.
From then on, simply open that .code-workspace file to restore the environment containing both project folders.

Summary: a "workspace" is your current project context in VSCode. It can be a single folder, or a collection of folders defined by a .code-workspace file. ...
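For reference, a minimal .code-workspace file for the two example folders above might look like the following (the relative paths are illustrative assumptions about where the repositories live on disk):

    {
        // Each "folders" entry becomes a root in the Explorer
        "folders": [
            { "path": "my-webapp" },
            { "path": "../my-api-server" }
        ],
        // Settings placed here apply to the whole multi-root workspace
        "settings": {}
    }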

August 6, 2025 · Last updated on August 10, 2025 · 4 min · KKKZOZ

Deep Learning Basic

For self reference.

Forward Pass

    import torch
    import torch.nn.functional as F

    # Setup
    learning_rate = 0.1
    x = torch.randn(1, 5)           # Input data
    y_true = torch.tensor([[1.0]])  # True label

    # Model parameters initialized manually
    # requires_grad=True tells PyTorch to calculate gradients for them
    w = torch.randn(5, 1, requires_grad=True)
    b = torch.randn(1, requires_grad=True)

    print(f"Initial weight:\n{w.data}\n")

    # 1. Forward Pass
    # Calculate a prediction using the current weight and bias
    z = x @ w + b  # `@` is matrix multiplication
    y_pred = torch.sigmoid(z)

    # 2. Calculate Loss
    # Compare the prediction to the true label
    loss = F.binary_cross_entropy(y_pred, y_true)

    # 3. Backward Pass
    # Calculate the gradients of the loss with respect to w and b
    loss.backward()

    # 4. Update Parameters
    # Manually adjust w and b in the opposite direction of their gradients
    with torch.no_grad():  # Temporarily disable gradient tracking for the update
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
        # Manually zero out the gradients for the next iteration
        w.grad.zero_()
        b.grad.zero_()

    print(f"Updated weight:\n{w.data}\n")
    print(f"Loss: {loss.item():.4f}")

The "forward pass" is the process in which a neural network starts from the input data and computes layer by layer until it produces the final output (the prediction). You can think of it as information "flowing forward" through the network. ...

May 27, 2025 · Last updated on August 1, 2025 · 4 min · KKKZOZ