Three paper indexes (LLM / Transactions / Distributed Systems) are pinned.
Other posts are sorted by date.
My reading notes.

2025

0819-0825
- STI Turbocharge NLP Inference at the Edge via Elastic Pipelining
- EdgeMoE Empowering Sparse Large Language Models on Mobile Devices
- LLM as a System Service on Mobile Devices
- SmallThinker A Family of Efficient Large Language Models Natively Trained for Local Deployment
- HeteroLLM Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators
- A Survey of Resource-efficient LLM and Multimodal Foundation Models
- H2O Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

0812-0818
- KV-Runahead Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
- Striped Attention Faster Ring Attention for Causal Transformers
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- TPI-LLM Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices
- LLM.int8() 8-bit Matrix Multiplication for Transformers at Scale

0729-0804
- Fast On-device LLM Inference with NPUs
- Deja Vu Contextual Sparsity for Efficient LLMs at Inference Time
- PowerInfer-2 Fast Large Language Model Inference on a Smartphone
- LLM in a flash Efficient Large Language Model Inference with Limited Memory
- PowerInfer Fast Large Language Model Serving with a Consumer-grade GPU

0722-0728
- AWQ Activation-aware Weight Quantization for LLM Compression and Acceleration
- FlexGen High-Throughput Generative Inference of Large Language Models with a Single GPU
- LoRA Low-Rank Adaptation of Large Language Models
- SpecInfer Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
- EdgeLLM Fast On-Device LLM Inference With Speculative Decoding
- Efficient Memory Management for Large Language Model Serving with PagedAttention

0715-0721
- A Survey on Efficient Inference for Large Language Models

-0714
- Orca A Distributed Serving System for Transformer-Based Generative Models
- EdgeShard Efficient LLM Inference via Collaborative Edge Computing
- ServerlessLLM Locality-Enhanced Serverless Inference for Large Language Models

Uncategorized WIP 🚧 ...
My reading notes.

2025

0715-0721
- Concurrency Control as a Service
- Sonata Multi-Database Transactions Made Fast and Serializable

Uncategorized WIP 🚧
- towards-transaction-as-a-service
- grit
- taking-omid-to-the-clouds
- epoxy
- ad-hoc-transactions-in-web-applications
- omid-reloaded
- data-management-in-microservices
- scalable-distributed-transactions-across-heterogeneous-stores
- cobra
My reading notes. 2025 2023 && 2024 bigtable cap-twelve-years-later zab mapreduce chubby chain-replication time, clocks, and the ordering farm zookeeper
Essential Understandings about Git.

Basic Concepts

Working Directory & Staging Area & Branch

Working Directory
- The actual set of files on disk that you are currently working on.
- It can be "clean" (identical to some version in the repository) or "dirty" (containing modified or untracked files).

Staging Area
- A temporary area that holds the snapshot of files to be included in the next commit.
- Changes from the working directory are added to the staging area with the git add command.

Branch
- The core concept behind parallel development in Git.
- Each branch is an independent line of development; changes made on it do not affect other branches.

Remote & Upstream

A remote is a reference to a remote Git repository associated with your local repository.

```bash
# Add a remote
git remote add origin https://github.com/user/repo.git

# Add multiple remotes
git remote add upstream https://github.com/original/repo.git
git remote add fork https://github.com/your-fork/repo.git

# Remove a remote
git remote remove origin

# Rename a remote
git remote rename origin github
```

Upstream is the remote branch that a local branch tracks, establishing an "upstream/downstream" relationship. ...
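To make the upstream relationship concrete, here is a small sketch of my own (not from the original post), assuming a hypothetical local branch main and a remote named origin:

```bash
# Set the upstream while pushing a new branch for the first time
git push -u origin main

# Or set the upstream explicitly for an existing local branch
git branch --set-upstream-to=origin/main main

# Show which upstream each local branch tracks
git branch -vv
```

Once the upstream is set, plain `git pull` and `git push` know which remote branch to talk to.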
Essential Understandings about VSCode. This post will be continuously updated to cover every major VSCode update.

Concepts

Workspace

VSCode has two kinds of workspaces:

Single-Folder Workspace
This is the most common and simplest kind. When you open a folder via File -> Open Folder, that folder becomes your current workspace. All VSCode operations and configuration (for example the files under the .vscode directory) are relative to this root folder.

Multi-root Workspace
A multi-root workspace can contain multiple folders from different locations, all managed in the same VSCode window.

Scenario: imagine a complex project whose frontend code lives in one repository (say my-webapp) and whose backend code lives in a completely separate repository (say my-api-server). You want to see and edit both projects at the same time.

Steps:
1. Open one of the folders first (for example my-webapp).
2. Click File -> Add Folder to Workspace and select my-api-server.
3. The file explorer now shows both folders.
4. Finally, click File -> Save Workspace As...; VSCode creates a file with the .code-workspace extension.
5. From then on, simply open this .code-workspace file to restore the working environment containing both project folders.

Summary: the "workspace" is your current project context in VSCode. It can be a single folder, or a collection of folders defined by a .code-workspace file. ...
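For reference, a minimal sketch of what the saved .code-workspace file might look like for the two hypothetical folders above (my-webapp and my-api-server); the actual paths depend on where you save the file, and the settings block can stay empty:

```json
{
  "folders": [
    { "path": "my-webapp" },
    { "path": "../my-api-server" }
  ],
  "settings": {}
}
```

The "folders" entries are resolved relative to the .code-workspace file itself, and anything placed under "settings" applies to all folders in the workspace.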
For self reference.

Forward Pass

```python
import torch
import torch.nn.functional as F

# Setup
learning_rate = 0.1
x = torch.randn(1, 5)           # Input data
y_true = torch.tensor([[1.0]])  # True label

# Model parameters initialized manually
# requires_grad=True tells PyTorch to calculate gradients for them
w = torch.randn(5, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

print(f"Initial weight:\n{w.data}\n")

# 1. Forward Pass
# Calculate a prediction using the current weight and bias
z = x @ w + b              # `@` is matrix multiplication
y_pred = torch.sigmoid(z)

# 2. Calculate Loss
# Compare the prediction to the true label
loss = F.binary_cross_entropy(y_pred, y_true)

# 3. Backward Pass
# Calculate the gradients of the loss with respect to w and b
loss.backward()

# 4. Update Parameters
# Manually adjust w and b in the opposite direction of their gradients
with torch.no_grad():  # Temporarily disable gradient tracking for the update
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad

    # Manually zero out the gradients for the next iteration
    w.grad.zero_()
    b.grad.zero_()

print(f"Updated weight:\n{w.data}\n")
print(f"Loss: {loss.item():.4f}")
```

The "forward pass" is the process in which a neural network starts from the input data and computes layer by layer until it produces the final output (the prediction). You can think of it as information "flowing forward" through the network. ...
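For comparison, a minimal sketch of my own (not part of the original post) showing the same forward/backward/update cycle written with nn.Linear, an SGD optimizer, and BCEWithLogitsLoss, which take over the manual parameter and gradient bookkeeping above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 1)                      # holds w (5x1) and b internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()             # sigmoid + binary cross entropy in one step

x = torch.randn(1, 5)
y_true = torch.tensor([[1.0]])

for step in range(3):
    optimizer.zero_grad()                    # clear old gradients
    logits = model(x)                        # forward pass
    loss = loss_fn(logits, y_true)           # compute loss
    loss.backward()                          # backward pass
    optimizer.step()                         # parameter update
    print(f"step {step}: loss = {loss.item():.4f}")
```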
Shape Manipulating

reshape

You can think of NumPy's reshape as a "flatten, then refill" process.

Core idea: no matter what shape the array had before or what new shape you want, reshape (conceptually) does two steps:
1. Flattening: read the elements out row by row into a one-dimensional array.
2. Refilling: fill the elements back in, row by row, according to the new shape.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6],
              [7, 8, 9, 10, 11, 12]])

# Flatten
a_flat = a.reshape(1, 12)
print("Flattened array: ", a_flat)

# Refill
a_refilled = a_flat.reshape(3, 4)
print("Refilled array: ", a_refilled)
```

transpose

The essence of a high-dimensional transpose is rearranging the order of the array's indices, not a "matrix transpose" in the traditional sense.

concatenate & stack

concatenate: extend or join along an existing track/dimension.

concatenate joins multiple arrays along a dimension (axis) that already exists. The result usually has the same number of dimensions as the inputs.

How it works:
- You specify an axis parameter telling NumPy which dimension to join along.
- Apart from the dimension being joined, all other dimensions must match exactly. It is like coupling two train carriages: their heights and widths must match; only their lengths may differ (and add up).

```python
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6]])  # Note: B must also be 2-D so its column count matches A along axis=0
np.concatenate((A, B), axis=0)
# Result:
# [[1, 2],
#  [3, 4],
#  [5, 6]]   (more rows, same number of columns)

A = np.array([[1, 2], [3, 4]])
C = np.array([[5, 6], [7, 8]])
np.concatenate((A, C), axis=1)
# Result:
# [[1, 2, 5, 6],
#  [3, 4, 7, 8]]   (more columns, same number of rows)
```

stack: pile up multiple independent layers, creating a new dimension. ...
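To make the contrast concrete, a small sketch of my own (not from the original post): np.stack creates a brand-new axis, while np.concatenate grows an existing one:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# stack: a new leading axis appears, so two (2, 2) arrays become one (2, 2, 2) array
stacked = np.stack((A, B), axis=0)
print(stacked.shape)   # (2, 2, 2)

# concatenate: the existing axis 0 simply grows, so the result stays 2-D
joined = np.concatenate((A, B), axis=0)
print(joined.shape)    # (4, 2)
```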
Chapter 1

GPT is an autoregressive model: decoder-style models like GPT generate text by predicting one word at a time. The model predicts the most likely next word given the sequence of words generated so far, appends that word to the sequence, and then makes the next prediction, repeating this loop. Because each step depends on the model's own previous outputs, such models are called "autoregressive models".

Chapter 2

Word2Vec trained a neural network architecture to generate word embeddings by predicting the context of a word given the target word, or vice versa. The main idea behind Word2Vec is that words that appear in similar contexts tend to have similar meanings.

LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimizing the embeddings as part of the LLM training, instead of using Word2Vec, is that the embeddings are optimized for the specific task and data at hand. ...
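As a rough illustration of the autoregressive loop described in Chapter 1, here is a toy sketch of my own (not code from the book): next_token_logits is a hypothetical stand-in for a decoder-only LM, and the loop appends each greedily chosen token before predicting the next one:

```python
import torch

vocab_size = 10  # toy vocabulary

def next_token_logits(tokens: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for a decoder-only LM: given the sequence so far,
    # return one score per token in the vocabulary.
    torch.manual_seed(int(tokens.sum()))
    return torch.randn(vocab_size)

tokens = torch.tensor([1, 4, 2])                         # the prompt
for _ in range(5):                                       # generate 5 more tokens
    logits = next_token_logits(tokens)                   # score every candidate next token
    next_id = torch.argmax(logits)                       # greedily pick the most likely one
    tokens = torch.cat([tokens, next_id.unsqueeze(0)])   # append it and repeat
print(tokens.tolist())
```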