STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining

Intensive Reading

**Author Info**: Liwei Guo — a tenure-track Assistant Professor at UESTC (University of Electronic Science and Technology of China).

## Background

### Challenges

Cold start of NLP models on mobile devices: NLP inference stresses a mobile device on two fronts:

- **Latency**: user engagements are impromptu, so inference must finish quickly
- **Model size**: modern NLP models are large, straining mobile memory

### Existing Paradigms

- **Hold in memory**: the memory footprint is too large, so the model is likely to become a victim of mobile memory management
- **Load before execute**: slow start; while waiting for I/O, computation resources stall
- **Pipeline load/execution**: the low arithmetic intensity of the Transformer's attention modules leaves the pipeline filled with bubbles, and computation stalls most of the time at each model layer (see the pipeline sketch below)

## Insights

A model can be re-engineered from a monolithic block into a collection of resource-elastic "shards" by uniquely combining vertical partitioning with fine-grained, per-shard quantization. This transforms the I/O time of each model component into a tunable parameter (see the planner sketch at the end). ...
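To see why the third paradigm stalls, here is a minimal timeline simulation — a sketch with illustrative numbers, not STI's actual scheduler. Layer `k+1` is fetched from storage while layer `k` computes, and any gap between I/O and compute shows up as a bubble:

```python
# A minimal sketch (not STI's implementation) of a layer-wise
# load/execute pipeline. Loads are serialized on the storage channel;
# whenever a layer's weights arrive later than the compute stream
# becomes free, the compute stream idles -- a pipeline "bubble".

def simulate_pipeline(load_times, compute_times):
    """Return (total latency, time spent in bubbles), all in ms.

    load_times[i]    -- time to read layer i's weights from storage
    compute_times[i] -- time to run layer i once its weights are resident
    """
    t_loaded = 0.0   # when the layer currently being fetched is ready
    t_compute = 0.0  # when the compute stream becomes free
    bubble = 0.0
    for ld, cp in zip(load_times, compute_times):
        t_loaded += ld                    # loads are serialized on I/O
        start = max(t_compute, t_loaded)  # compute must wait for weights
        bubble += start - t_compute       # idle time = bubble
        t_compute = start + cp
    return t_compute, bubble

# Low arithmetic intensity: per-layer compute is far cheaper than
# per-layer I/O (numbers are illustrative, for a 12-layer model).
total, bubble = simulate_pipeline(load_times=[40] * 12, compute_times=[8] * 12)
print(f"latency {total:.0f} ms, stalled {bubble:.0f} ms")
```

With these illustrative numbers the compute stream is stalled for roughly 80% of the wall-clock time, which is the bubble problem the paper targets.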

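The insight can be made concrete with a small planner sketch. The bandwidth figure, bit-widths, and names (`load_ms`, `plan_shard`) are assumptions for illustration, not STI's API: because each shard is stored at several bit-widths, its fetch time scales linearly with the chosen width, so a planner can pick the highest-fidelity version that still hides under the ongoing compute window.

```python
# A minimal sketch of the insight, with made-up numbers: once a layer is
# vertically partitioned into shards and each shard is kept at several
# bit-widths, the shard's I/O time becomes a knob the planner can turn.

DISK_MBPS = 500.0  # assumed flash read bandwidth (illustrative)

def load_ms(n_params: int, bits: int) -> float:
    """I/O time (ms) to fetch one shard stored at the given bit-width."""
    size_bytes = n_params * bits / 8
    return size_bytes / (DISK_MBPS * 1e6) * 1e3

def plan_shard(n_params: int, budget_ms: float, widths=(32, 16, 8, 4)) -> int:
    """Pick the widest (highest-fidelity) quantization whose load time
    fits the time budget left by the computation it must hide under."""
    for bits in widths:                 # widths ordered high -> low fidelity
        if load_ms(n_params, bits) <= budget_ms:
            return bits
    return widths[-1]                   # fall back to the smallest version

# Example: a 7M-parameter shard must arrive within a 50 ms compute window.
bits = plan_shard(7_000_000, budget_ms=50.0)
print(bits, f"{load_ms(7_000_000, bits):.1f} ms")  # -> 16-bit fits, fp32 would not
```

Planning per shard rather than per model is what makes the pipeline elastic: shards whose fetch would otherwise stall the pipeline can be loaded at lower precision, while the rest keep full fidelity.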