ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Background

Serverless inference can significantly reduce costs for LLM users by charging only for the duration of inference and the volume of data processed.

Key components in GPU serverless clusters (a minimal sketch of these components follows below):

- Controller:
  - Request Router: directs incoming requests to nodes that are already running LLM inference processes, or instructs the Model Loading Scheduler.
  - Model Loading Scheduler: activates LLM inference processes on unallocated GPUs.

TODO

The deployment of LLMs on serverless systems, although promising, often incurs significant latency overheads. This is largely due to the substantial proportion of cold starts in serverless clusters. ...
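
To make the division of labor concrete, here is a minimal Python sketch of the routing decision described above: a request router first looks for a node that is already serving the requested model, and only otherwise asks a model loading scheduler to activate an instance on an unallocated GPU. The class and method names (`Node`, `RequestRouter`, `ModelLoadingScheduler`, `route`, `start_instance`) are illustrative assumptions for this post, not the actual ServerlessLLM API.

```python
# Illustrative sketch only -- not the ServerlessLLM implementation.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_gpus: int = 0
    loaded_models: set[str] = field(default_factory=set)


class ModelLoadingScheduler:
    """Activates an LLM inference process on an unallocated GPU (hypothetical)."""

    def start_instance(self, model: str, nodes: list[Node]) -> Node | None:
        for node in nodes:
            if node.free_gpus > 0:
                node.free_gpus -= 1              # claim a GPU for the new instance
                node.loaded_models.add(model)    # cold start: load checkpoint, spawn process
                return node
        return None                              # no unallocated GPU available


class RequestRouter:
    """Routes a request to a warm node, falling back to the Model Loading Scheduler."""

    def __init__(self, nodes: list[Node], scheduler: ModelLoadingScheduler) -> None:
        self.nodes = nodes
        self.scheduler = scheduler

    def route(self, model: str) -> Node | None:
        for node in self.nodes:                  # prefer a node already serving the model
            if model in node.loaded_models:
                return node
        return self.scheduler.start_instance(model, self.nodes)


if __name__ == "__main__":
    nodes = [
        Node("node-0", free_gpus=1),
        Node("node-1", free_gpus=0, loaded_models={"llama-7b"}),
    ]
    router = RequestRouter(nodes, ModelLoadingScheduler())
    print(router.route("llama-7b").name)  # warm hit   -> node-1
    print(router.route("opt-13b").name)   # cold start -> node-0
```

The warm-hit path returns immediately, while the fallback path is exactly where cold-start latency is paid, which is the overhead the rest of the post is concerned with.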
