Orca: A Distributed Serving System for Transformer-Based Generative Models

Background: Current serving systems schedule the execution of the engine at the granularity of a request. Under this design, when the serving system dispatches a batch of requests to the engine, the engine returns the inference results for the entire batch at once, after processing all requests within the batch. Challenge 1: Early-finished and late-joining requests. Requests can't finish early: because different client requests may require different numbers of iterations, requests that finish earlier than others in the batch cannot be returned to the client, which increases their latency. ...
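To make the latency cost concrete, here is a minimal sketch of request-level scheduling as described above. It assumes a toy model in which latency is counted in iterations and every iteration costs the same; the function names and the sample batch are hypothetical and are not taken from Orca or any serving system.

```python
def request_level_latencies(iterations_needed):
    """Latency (in iterations) seen by each request when the engine
    returns the whole batch at once: everyone waits for the slowest."""
    batch_finish = max(iterations_needed)
    return [batch_finish for _ in iterations_needed]


def extra_wait(iterations_needed):
    """Iterations each request spends idle after it has already finished,
    waiting for the rest of the batch to complete."""
    latencies = request_level_latencies(iterations_needed)
    return [lat - need for lat, need in zip(latencies, iterations_needed)]


# Hypothetical batch: three requests needing 2, 5, and 9 iterations.
batch = [2, 5, 9]
print(request_level_latencies(batch))  # [9, 9, 9]
print(extra_wait(batch))               # [7, 4, 0]
```

In this toy batch, the request that needs only 2 iterations still waits 9, so 7 of those iterations are pure added latency.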

July 2, 2025 · Last updated on August 1, 2025 · 5 min · KKKZOZ