The standard vLLM environment includes inference engines, CUDA components, distributed runtime, and server code. A pure client-side load test only needs request generation, concurrency control, latency statistics, and reporting.
Project Overview
The project keeps a Python package, CLI entry, request generator, async HTTP client, statistics module, and mock OpenAI service tests.
It is a lightweight pressure-test client for already running OpenAI-compatible services. It fits gateway tests, migration comparisons, and latency checks across sampling settings.
Client Responsibilities
The CLI keeps the shape of vllm bench serve. It constructs prompts with a tokenizer and generator, controls concurrency, sends async HTTP requests, and records time to first token, total latency, success rate, throughput, and error classes.
Mock service tests verify request shape, scheduling, and metrics without a real model service. Real tests and mock tests share the same client entry.