# Supported Model Servers
Any model server that conforms to the model server protocol is supported by the inference extension.
## Compatible Model Server Versions
| Model Server | Version | Commit | Notes |
| --- | --- | --- | --- |
| vLLM V0 | v0.6.4 and above | commit 0ad216f | |
| vLLM V1 | v0.8.0 and above | commit bc32bc7 | |
| Triton (TensorRT-LLM) | 25.03 and above | commit 15cb989 | The LoRA affinity feature is not available, as the required LoRA metrics haven't been implemented in Triton yet. |
## vLLM
vLLM is configured as the default in the endpoint picker extension. No further configuration is required.
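Because vLLM is the default, a plain Helm install of the `inferencepool` chart needs no model-server overrides. The release name and chart reference below are illustrative assumptions; check the `inferencepool` Helm guide for the published chart location.

```bash
# Sketch only: the release name and chart reference are assumptions,
# not taken from this guide. No modelServerType override is needed for vLLM.
helm install vllm-pool \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```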
## Triton with TensorRT-LLM Backend
Triton-specific metric names must be specified when starting the EPP.
### Option 1: Use Helm
Pass `--set inferencePool.modelServerType=triton-tensorrt-llm` when installing the `inferencepool` chart via Helm. See the `inferencepool` Helm guide for more details.
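For example, a hedged sketch of such an install; only the `--set` flag comes from this guide, while the release name and chart reference are placeholder assumptions:

```bash
# Illustrative: replace the release name and chart reference per the
# inferencepool Helm guide; the --set flag below selects the Triton metric names.
helm install triton-pool \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --set inferencePool.modelServerType=triton-tensorrt-llm
```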
### Option 2: Edit EPP deployment yaml
Add the following to the `args` of the EPP deployment:

```yaml
- -totalQueuedRequestsMetric
- "nv_trt_llm_request_metrics{request_type=waiting}"
- -kvCacheUsagePercentageMetric
- "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
- -loraInfoMetric
- "" # Set an empty metric to disable LoRA metric scraping, as LoRA metrics are not supported by Triton yet.
```