环境准备
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 62C P8 3W / 78W | 2426MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 958 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3029 C /usr/local/bin/ollama 2418MiB |
+---------------------------------------------------------------------------------------+
ollama run qwen3:1.7b
pip install evalscope
运行
evalscope perf \
--model qwen3:1.7b \
--url "http://192.168.1.3:11434/v1/chat/completions" \
--parallel 5 \
--number 20 \
--api openai \
--dataset openqa \
--stream
结果
"Time taken for tests (s)": 288.8133,
"Number of concurrency": 5,
"Total requests": 10,
"Succeed requests": 10,
"Failed requests": 0,
"Output token throughput (tok/s)": 36.605,
"Total token throughput (tok/s)": 37.5848,
"Request throughput (req/s)": 0.0346,
"Average latency (s)": 112.7879,
"Average time to first token (s)": 22.1013,
"Average time per output token (s)": 0.0865,
"Average input tokens per request": 28.3,
"Average output tokens per request": 1057.2,
"Average package latency (s)": 0.0865,
"Average package per request": 1048.3
reference