ArXiv-2601

Background
The paper identifies a critical bottleneck in deploying Large Language Models (LLMs):
The Optimization Challenge: Efficient deployment requires tuning a vast configuration space (e.g., parallelism strategies, batch sizes, caching policies).
Cost vs. Fidelity Trade-off:
Real GPU Execution: Testing on physical hardware is prohibitively expensive and slow.
Discrete-Event Simulators (DES): Although fast and cheap, traditional simulators require manually re-implementing the serving system’s complex control logic. Because frameworks such as vLLM and SGLang evolve rapidly, these simulators suffer from a perpetual “semantic gap” and a high maintenance burden.
Insights ...
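To make the DES trade-off concrete, here is a toy sketch of how a discrete-event simulator scores one configuration (the maximum batch size) by replaying request arrivals against a hand-written copy of the engine's scheduling logic. It assumes a hypothetical eager first-come-first-served batching policy and a made-up linear step-time model (`base_ms`, `per_req_ms`). Neither is taken from vLLM, SGLang, or the paper. The scheduler block inside the loop is exactly the kind of control logic that must be kept in sync with the real framework by hand, which is the source of the “semantic gap.”

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    # Events are ordered by (time, seq); seq breaks ties deterministically.
    time: float
    seq: int
    kind: str = field(compare=False)  # "arrival" or "batch_done"
    payload: object = field(compare=False, default=None)


def simulate(arrivals, max_batch_size, base_ms=5.0, per_req_ms=1.0):
    """Return the mean request latency (ms) under a toy batching policy.

    arrivals: request arrival times in ms.
    The step-time model base_ms + per_req_ms * batch_size is a hypothetical
    stand-in for a real engine's cost model, not a measured one.
    """
    events = []
    seq = 0
    for t in arrivals:
        heapq.heappush(events, Event(t, seq, "arrival", t))
        seq += 1

    queue = []  # arrival times of waiting requests
    busy = False
    latencies = []
    while events:
        ev = heapq.heappop(events)
        now = ev.time
        if ev.kind == "arrival":
            queue.append(ev.payload)
        else:  # batch_done: every request in the batch finishes now
            latencies.extend(now - arrived for arrived in ev.payload)
            busy = False

        # Hand-written copy of the scheduler (hypothetical eager FCFS):
        # whenever the engine is idle, launch up to max_batch_size requests.
        if not busy and queue:
            batch = queue[:max_batch_size]
            del queue[:max_batch_size]
            done_at = now + base_ms + per_req_ms * len(batch)
            heapq.heappush(events, Event(done_at, seq, "batch_done", batch))
            seq += 1
            busy = True

    return sum(latencies) / len(latencies) if latencies else 0.0


# Sweep one configuration knob without touching a GPU.
for b in (1, 2, 4):
    print(b, simulate([0, 0, 0, 0], b))
```

With four requests arriving at t = 0, the sweep prints mean latencies of 15.0, 12.75, and 12.0 ms for batch sizes 1, 2, and 4. Each evaluation is cheap, but the numbers are only as faithful as the re-implemented scheduler and cost model.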

ArXiv-2601

AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving

Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving