Extensive Reading
Author Info
SEC: CCF C
Background
Insights
A straightforward application of speculative decoding to edge scenarios:
- Edge devices hold the draft models
- Edge servers hold the verifier models
Approaches
Route the request to the server when the confidence score of a token generated by the edge device falls below a given threshold (sketched below)
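
A minimal sketch of this routing rule, under my own assumptions: `draft_model.next_token`, `send_to_server`, and the threshold value are illustrative names and numbers, not APIs or settings from the paper.

```python
# Hypothetical sketch of confidence-threshold routing (all names are illustrative).
CONF_THRESHOLD = 0.8  # assumed value; the threshold is a tunable knob

def generate_on_edge(prompt_tokens, draft_model, send_to_server, max_new_tokens=128):
    """Draft tokens locally; hand the draft off to the server when confidence drops."""
    context = list(prompt_tokens)   # tokens already accepted (prompt + verified output)
    pending = []                    # draft tokens not yet verified by the server
    for _ in range(max_new_tokens):
        next_token, confidence = draft_model.next_token(context + pending)
        pending.append(next_token)
        if confidence < CONF_THRESHOLD:
            # Low confidence: route the accumulated draft to the verifier on the edge server.
            accepted = send_to_server(context, pending)
            context.extend(accepted)   # keep only what the verifier accepted
            pending = []
    return context + pending
```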

Two details:
- While the sent tokens are in flight to the server, the edge device keeps generating draft tokens, optimistically assuming the verifier will accept all of them
- When retrying after a network failure, the edge device appends the newly generated tokens to the draft sequence (see the sketch after this list)
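
A rough sketch of these two details, assuming a blocking `verify_remote` call run in a background thread so drafting can overlap with the in-flight request; `draft_model`, `verify_remote`, and the retry count are assumptions, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def draft_while_verifying(context, pending, draft_model, verify_remote, max_retries=3):
    """Keep drafting while a verification request is in flight; on a network
    failure, retry with the old draft plus the tokens drafted in the meantime.
    Illustrative sketch: `draft_model` and `verify_remote` are assumed APIs."""
    sent = list(pending)   # draft tokens included in the in-flight request
    extra = []             # tokens drafted optimistically after the request was sent
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(verify_remote, context, sent)
        for attempt in range(max_retries + 1):
            # Detail 1: keep generating, assuming the verifier accepts everything sent.
            while not future.done():
                next_token, _ = draft_model.next_token(context + sent + extra)
                extra.append(next_token)
            try:
                accepted = future.result()
                return context + accepted, extra   # `extra` still awaits verification
            except ConnectionError:
                if attempt == max_retries:
                    raise
                # Detail 2: on retry, append the newly drafted tokens to the draft sequence.
                sent, extra = sent + extra, []
                future = pool.submit(verify_remote, context, sent)
```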
Evaluation

Increasing the speculation length involves a trade-off in system performance:
- Per-device throughput drops: a longer speculation length lengthens the verification latency, which slows an individual device's final response.
- System capacity rises: a longer speculation length reduces how often each device sends requests to the server, lowering server load so that more concurrent devices can be supported (see the toy model after this list).
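
A back-of-the-envelope toy model of this trade-off; every constant (draft time, verification latency, acceptance rate, server request budget) is made up for illustration and is not a figure from the paper.

```python
# Toy model of the speculation-length trade-off (all constants are made up).
def tradeoff(gamma, t_draft=0.02, t_verify_fixed=0.10, t_verify_per_tok=0.01,
             p_accept=0.8, server_req_per_s=100):
    """gamma: speculation length (draft tokens sent per verification request)."""
    # Expected accepted tokens per round (geometric acceptance, saturates with gamma).
    accepted = p_accept * (1 - p_accept ** gamma) / (1 - p_accept)
    # Round latency: local drafting plus a verification latency that grows with gamma.
    latency = gamma * t_draft + t_verify_fixed + gamma * t_verify_per_tok
    device_tput = accepted / latency        # effective tokens/s for one device
    req_rate = 1.0 / latency                # verification requests/s per device
    capacity = server_req_per_s / req_rate  # devices the server can serve, one forward pass per request
    return device_tput, capacity

for gamma in (2, 4, 8, 16, 32):
    tput, cap = tradeoff(gamma)
    print(f"gamma={gamma:2d}  device throughput ~{tput:4.1f} tok/s  capacity ~{cap:5.0f} devices")
```

With these assumed numbers, larger gamma eventually lowers per-device throughput while the number of supportable devices keeps rising, which is the trade-off the evaluation describes.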
Thoughts
When Reading
No comparison with SOTA
Overlapping edge-device and edge-server inference is already a fairly routine engineering optimization, so there is little novelty here