# Claude Code Assistant Guidelines

## Go Code Style
- Follow the standard Go code style and conventions. Use `gofmt` for formatting and adhere to idiomatic Go practices.
- Follow best practices from the Effective Go guide:
### Naming Conventions
- Use MixedCaps or mixedCaps rather than underscores for multi-word names
- Package names should be short, lowercase, single-word names
- Getters don't use a "Get" prefix (use `obj.Name()`, not `obj.GetName()`)
- Interface names use the "-er" suffix for single-method interfaces (e.g., `Reader`, `Writer`)
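The naming conventions above can be sketched in a few lines. This is an illustrative example only; the `user` type and `Greeter` interface are hypothetical names chosen for the demonstration:

```go
package main

import "fmt"

// Greeter is a single-method interface, named with the "-er" suffix.
type Greeter interface {
	Greet() string
}

// user uses mixedCaps-style naming rather than underscores.
type user struct {
	name string
}

// Name is a getter: no "Get" prefix, per Go convention.
func (u user) Name() string { return u.name }

// Greet implements Greeter.
func (u user) Greet() string { return "hello, " + u.name }

func main() {
	u := user{name: "gopher"}
	fmt.Println(u.Name()) // u.Name(), not u.GetName()
	var g Greeter = u
	fmt.Println(g.Greet())
}
```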
### Formatting
- Use `gofmt` for consistent formatting (tabs for indentation, spaces for alignment)
- Line length: no strict limit, but keep lines reasonable
- Group related declarations together
### Error Handling
- Return errors as the last return value
- Check errors immediately after the call
- Provide context with `fmt.Errorf` and error wrapping
### Logging
- Use `ctrl.Log` for structured logging
- Keep log fields consistent and meaningful
- Avoid logging sensitive data
### Documentation
- Every exported name should have a doc comment
- Start comments with the name being described
- Use complete sentences
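These doc-comment rules look like this in practice; `MaxRetries` and `Backoff` are hypothetical names used only to illustrate the style:

```go
package main

import "fmt"

// MaxRetries is the maximum number of reconcile attempts before giving up.
const MaxRetries = 5

// Backoff returns the delay in milliseconds before the given retry
// attempt, doubling on each attempt. Note that the comment starts with
// the name it describes and uses complete sentences.
func Backoff(attempt int) int {
	return 100 << attempt
}

func main() {
	fmt.Println(Backoff(2)) // 400
}
```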
### Concurrency
- Share memory by communicating; don't communicate by sharing memory
- Use channels to orchestrate goroutines
- Always handle goroutine cleanup and cancellation properly
### Project Structure
- Keep packages focused and cohesive
- Avoid circular dependencies
- Place tests in `*_test.go` files
## Documentation
Prefer placing documentation in the docs/ directory.
There are three main types of documentation, each targeting a different audience:
- **Developer Documentation** - For contributors and maintainers of this project
  - Architecture decisions
  - Development setup and workflow
  - Contributing guidelines
  - Usually in the `docs/developer-guide/` subdirectory
- **Administrator Documentation** - For operators deploying and managing the autoscaler controller
  - Installation and configuration
  - Deployment guidelines
  - Monitoring and troubleshooting
  - Usually located under the `docs/user-guide/` directory (for example, in an admin-focused subdirectory)
- **End-User Documentation** - For application developers creating applications that use the autoscaler
  - Usage guides and examples
  - API reference
  - Best practices and common patterns
  - Usually located under the `docs/user-guide/` directory (for example, in an end-user-focused subdirectory)
## E2E Testing
- Use make targets for running e2e tests (e.g., `make test-e2e-smoke` or `make test-e2e-full`) and document the process in `docs/developer-guide/testing.md`
- Use `make test` for unit tests
- Never use images from docker.io in e2e tests. All container images must use fully-qualified registry paths (e.g., `registry.k8s.io/`, `quay.io/`, or a private registry). Do not rely on Docker Hub as a default registry.
## CLI Tools
### llm-d Inference Scheduler EPP CLI Reference
This section documents the command-line flags and environment variables supported by the llm-d inference scheduler EPP (Endpoint Picker). The EPP inherits its CLI from gateway-api-inference-extension.
#### Main Branch (Latest)
Uses gateway-api-inference-extension at commit fd30cb97714a (post-v1.3.0).
##### Command-Line Flags
| Flag | Type | Default | Description |
|---|---|---|---|
--grpc-port | int | 9002 | gRPC port used for communicating with Envoy proxy |
--ha-enable-leader-election | bool | false | Enables leader election for high availability. When enabled, readiness probes will only pass on the leader |
--pool-group | string | inference.networking.k8s.io | Kubernetes resource group of the InferencePool this Endpoint Picker is associated with |
--pool-namespace | string | "" | Namespace of the InferencePool this Endpoint Picker is associated with |
--pool-name | string | "" | Name of the InferencePool this Endpoint Picker is associated with |
--endpoint-selector | string | "" | Selector to filter model server pods on, only 'key=value' pairs are supported. Format: comma-separated list of key=value pairs (e.g., 'app=vllm-llama3-8b-instruct,env=prod') |
--endpoint-target-ports | []int | [] | Target ports of model server pods. Format: comma-separated list of numbers (e.g., '3000,3001,3002') |
--disable-endpoint-subset-filter | bool | false | Disables respecting the x-gateway-destination-endpoint-subset metadata for dispatching requests in EPP |
--model-server-metrics-scheme | string | http | Protocol scheme used in scraping metrics from endpoints |
--model-server-metrics-path | string | /metrics | URL path used in scraping metrics from endpoints |
--model-server-metrics-port | int | 0 | DEPRECATED: Port to scrape metrics from endpoints |
--model-server-metrics-https-insecure-skip-verify | bool | true | Disable certificate verification when using 'https' scheme for model-server-metrics-scheme |
--refresh-metrics-interval | duration | 50ms | Interval to refresh metrics |
--refresh-prometheus-metrics-interval | duration | 5s | Interval to flush Prometheus metrics |
--metrics-staleness-threshold | duration | 2s | Duration after which metrics are considered stale |
--total-queued-requests-metric | string | vllm:num_requests_waiting | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
--total-running-requests-metric | string | vllm:num_requests_running | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
--kv-cache-usage-percentage-metric | string | vllm:kv_cache_usage_perc | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
--lora-info-metric | string | vllm:lora_requests_info | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
--cache-info-metric | string | vllm:cache_config_info | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
-v, --v | int | 0 | Number for the log level verbosity |
--zap-log-level | string | | Zap log level (debug, info, warn, error) |
--zap-devel | bool | true | Development Mode defaults (encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn) |
--zap-encoder | string | | Zap log encoding ('json' or 'console') |
--zap-stacktrace-level | string | | Zap Level at and above which stacktraces are captured |
--tracing | bool | true | Enables emitting traces |
--health-checking | bool | false | Enables health checking |
--metrics-port | int | 9090 | The metrics port exposed by EPP |
--grpc-health-port | int | 9003 | The port used for gRPC liveness and readiness probes |
--enable-pprof | bool | true | Enables pprof handlers |
--cert-path | string | "" | The path to the certificate for secure serving. Certificate and private key files are assumed to be named tls.crt and tls.key |
--enable-cert-reload | bool | false | Enables certificate reloading of the certificates specified in --cert-path |
--secure-serving | bool | true | Enables secure serving |
--metrics-endpoint-auth | bool | true | Enables authentication and authorization of the metrics endpoint |
--config-file | string | "" | The path to the configuration file |
--config-text | string | "" | The configuration specified as text, in lieu of a file |
##### Environment Variables
| Variable | Description | Deprecation |
|---|---|---|
NAMESPACE | Used to determine pool namespace when --pool-namespace is not set | - |
POD_NAME | Used to determine EPP name when using --endpoint-selector mode | - |
ENABLE_EXPERIMENTAL_DATALAYER_V2 | Enables experimental pluggable data layer | DEPRECATED: Use FeatureGates in config file instead |
ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER | Enables experimental pluggable flow control layer | DEPRECATED: Use FeatureGates in config file instead |
SD_QUEUE_DEPTH_THRESHOLD | Saturation detector queue depth threshold | DEPRECATED: Use config file instead |
SD_KV_CACHE_UTIL_THRESHOLD | Saturation detector KV cache utilization threshold | DEPRECATED: Use config file instead |
SD_METRICS_STALENESS_THRESHOLD | Saturation detector metrics staleness threshold | DEPRECATED: Use config file instead |
#### v0.5.0
Uses gateway-api-inference-extension v1.3.0.
##### Command-Line Flags
| Flag | Type | Default | Description |
|---|---|---|---|
--grpc-port | int | 9002 | gRPC port used for communicating with Envoy proxy |
--ha-enable-leader-election | bool | false | Enables leader election for high availability. When enabled, readiness probes will only pass on the leader |
--pool-group | string | inference.networking.k8s.io | Kubernetes resource group of the InferencePool this Endpoint Picker is associated with |
--pool-namespace | string | "" | Namespace of the InferencePool this Endpoint Picker is associated with |
--pool-name | string | "" | Name of the InferencePool this Endpoint Picker is associated with |
--endpoint-selector | string | "" | Selector to filter model server pods on, only 'key=value' pairs are supported. Format: comma-separated list of key=value pairs (e.g., 'app=vllm-llama3-8b-instruct,env=prod') |
--endpoint-target-ports | []int | [] | Target ports of model server pods. Format: comma-separated list of numbers (e.g., '3000,3001,3002') |
--disable-endpoint-subset-filter | bool | false | Disables respecting the x-gateway-destination-endpoint-subset metadata for dispatching requests in EPP |
--model-server-metrics-scheme | string | http | Protocol scheme used in scraping metrics from endpoints |
--model-server-metrics-path | string | /metrics | URL path used in scraping metrics from endpoints |
--model-server-metrics-port | int | 0 | DEPRECATED: Port to scrape metrics from endpoints. Set to InferencePool.Spec.TargetPorts[0].Number if not defined |
--model-server-metrics-https-insecure-skip-verify | bool | true | Disable certificate verification when using 'https' scheme for model-server-metrics-scheme |
--refresh-metrics-interval | duration | 50ms | Interval to refresh metrics |
--refresh-prometheus-metrics-interval | duration | 5s | Interval to flush Prometheus metrics |
--metrics-staleness-threshold | duration | 2s | Duration after which metrics are considered stale |
--total-queued-requests-metric | string | vllm:num_requests_waiting | Prometheus metric for the number of queued requests |
--total-running-requests-metric | string | vllm:num_requests_running | Prometheus metric for the number of running requests |
--kv-cache-usage-percentage-metric | string | vllm:kv_cache_usage_perc | Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1) |
--lora-info-metric | string | vllm:lora_requests_info | Prometheus metric for the LoRA info metrics (must be in vLLM label format) |
--cache-info-metric | string | vllm:cache_config_info | Prometheus metric for the cache info metrics |
-v, --v | int | 0 | Number for the log level verbosity |
--zap-log-level | string | | Zap log level (debug, info, warn, error) |
--zap-devel | bool | true | Development Mode defaults (encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn) |
--zap-encoder | string | | Zap log encoding ('json' or 'console') |
--zap-stacktrace-level | string | | Zap Level at and above which stacktraces are captured |
--tracing | bool | true | Enables emitting traces |
--health-checking | bool | false | Enables health checking |
--metrics-port | int | 9090 | The metrics port exposed by EPP |
--grpc-health-port | int | 9003 | The port used for gRPC liveness and readiness probes |
--enable-pprof | bool | true | Enables pprof handlers |
--cert-path | string | "" | The path to the certificate for secure serving. Certificate and private key files are assumed to be named tls.crt and tls.key |
--enable-cert-reload | bool | false | Enables certificate reloading of the certificates specified in --cert-path |
--secure-serving | bool | true | Enables secure serving |
--metrics-endpoint-auth | bool | true | Enables authentication and authorization of the metrics endpoint |
--config-file | string | "" | The path to the configuration file |
--config-text | string | "" | The configuration specified as text, in lieu of a file |
##### Environment Variables
| Variable | Description | Deprecation |
|---|---|---|
NAMESPACE | Used to determine pool namespace when --pool-namespace is not set | - |
POD_NAME | Used to determine EPP name when using --endpoint-selector mode | - |
ENABLE_EXPERIMENTAL_DATALAYER_V2 | Enables experimental pluggable data layer | DEPRECATED: Use FeatureGates in config file instead |
ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER | Enables experimental pluggable flow control layer | DEPRECATED: Use FeatureGates in config file instead |
SD_QUEUE_DEPTH_THRESHOLD | Saturation detector queue depth threshold | DEPRECATED: Use config file instead |
SD_KV_CACHE_UTIL_THRESHOLD | Saturation detector KV cache utilization threshold | DEPRECATED: Use config file instead |
SD_METRICS_STALENESS_THRESHOLD | Saturation detector metrics staleness threshold | DEPRECATED: Use config file instead |
#### Key Differences Between Main and v0.5.0
- **Metric Flags**: In the main branch, `--total-queued-requests-metric`, `--total-running-requests-metric`, `--kv-cache-usage-percentage-metric`, `--lora-info-metric`, and `--cache-info-metric` are deprecated and will error if explicitly set. In v0.5.0, these flags are functional.
- **Configuration**: The main branch encourages using `EndpointPickerConfig` with `engineConfigs` for metrics configuration instead of CLI flags.
### llm-d Inference Simulator CLI Reference
This section documents the command-line flags and environment variables supported by the llm-d inference simulator (llm-d-inference-sim). The simulator is a vLLM server simulator supporting OpenAI API endpoints.
#### Main Branch (Latest)
##### Command-Line Flags
| Flag | Type | Default | Description |
|---|---|---|---|
--config | string | "" | Path to a YAML configuration file. Command line values overwrite config file values |
--port | int | 8000 | Port on which the simulator runs |
--model | string | "" | Currently 'loaded' model name (required) |
--served-model-name | []string | [] | Model names exposed by the API (space-separated strings). Falls back to --model if not set |
--max-num-seqs | int | 5 | Maximum number of inference requests that could be processed at the same time |
--max-waiting-queue-length | int | 1000 | Maximum length of inference requests waiting queue |
--max-loras | int | 1 | Maximum number of LoRAs in a single batch |
--max-cpu-loras | int | (same as --max-loras) | Maximum number of LoRAs to store in CPU memory |
--max-model-len | int | 1024 | Model's context window, maximum number of tokens in a single request including input and output |
--lora-modules | []string | [] | List of LoRA adapters (space-separated JSON strings) |
--mode | string | random | Simulator mode: echo returns input text; random returns random pre-defined sentences |
--seed | int64 | (current Unix nano) | Random seed for operations |
--time-to-first-token | duration | 0 | Time to first token (e.g., "100ms"). Integer format (milliseconds) is deprecated |
--time-to-first-token-std-dev | duration | 0 | Standard deviation for time to first token (max 30% of TTFT) |
--inter-token-latency | duration | 0 | Time to generate one token (e.g., "100ms"). Integer format is deprecated |
--inter-token-latency-std-dev | duration | 0 | Standard deviation for inter-token latency (max 30% of ITL) |
--prefill-overhead | duration | 0 | Time to prefill. Ignored if --time-to-first-token is set |
--prefill-time-per-token | duration | 0 | Time to prefill per token |
--prefill-time-std-dev | duration | 0 | Standard deviation for prefill time |
--kv-cache-transfer-latency | duration | 0 | Time for KV-cache transfer from a remote vLLM (P/D mode) |
--kv-cache-transfer-latency-std-dev | duration | 0 | Standard deviation for KV-cache transfer latency |
--kv-cache-transfer-time-per-token | duration | 0 | Time for KV-cache transfer per token from a remote vLLM |
--kv-cache-transfer-time-std-dev | duration | 0 | Standard deviation for KV-cache transfer time per token |
--time-factor-under-load | float64 | 1.0 | Multiplicative factor affecting request time when parallel requests are processed (must be >= 1.0) |
--enable-kvcache | bool | false | Enables KV cache feature |
--kv-cache-size | int | 1024 | Maximum number of token blocks in KV cache |
--global-cache-hit-threshold | float64 | 0 | Default cache hit threshold [0, 1] for all requests |
--block-size | int | 16 | Token block size for contiguous chunks (valid: 8, 16, 32, 64, 128) |
--tokenizers-cache-dir | string | hf_cache | Directory for caching tokenizers |
--hash-seed | string | "" | Seed for hash generation (falls back to PYTHONHASHSEED env var) |
--zmq-endpoint | string | tcp://localhost:5557 | ZMQ address to publish events |
--zmq-max-connect-attempts | int | 0 | Maximum number of times to try ZMQ connect (max 10) |
--event-batch-size | int | 16 | Maximum number of KV-cache events to be sent together |
--data-parallel-size | int | 1 | Number of ranks to run (1-8) |
--data-parallel-rank | int | -1 | The rank when running each rank in a process |
--failure-injection-rate | int | 0 | Probability (0-100) of injecting failures |
--failure-types | []string | [] | Specific failure types to inject: rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found |
--fake-metrics | string | "" | JSON metrics to report to Prometheus instead of real metrics |
--ssl-certfile | string | "" | Path to SSL certificate file for HTTPS |
--ssl-keyfile | string | "" | Path to SSL private key file for HTTPS |
--self-signed-certs | bool | false | Enable automatic generation of self-signed certificates for HTTPS |
--dataset-path | string | "" | Local path to SQLite database file for response generation from a dataset |
--dataset-url | string | "" | URL to download the SQLite database file for response generation |
--dataset-in-memory | bool | false | Load the entire dataset into memory for faster access |
--enable-sleep-mode | bool | false | Enable sleep mode |
--enable-request-id-headers | bool | false | Enable including X-Request-Id header in responses |
--latency-calculator | string | "" | Name of the latency calculator: constant or per-token |
--max-tool-call-integer-param | int | 100 | Maximum possible value of integer parameters in a tool call |
--min-tool-call-integer-param | int | 0 | Minimum possible value of integer parameters in a tool call |
--max-tool-call-number-param | float64 | 100 | Maximum possible value of number (float) parameters in a tool call |
--min-tool-call-number-param | float64 | 0 | Minimum possible value of number (float) parameters in a tool call |
--max-tool-call-array-param-length | int | 5 | Maximum possible length of array parameters in a tool call |
--min-tool-call-array-param-length | int | 1 | Minimum possible length of array parameters in a tool call |
--tool-call-not-required-param-probability | int | 50 | Probability (0-100) to add a non-required parameter in a tool call |
--object-tool-call-not-required-field-probability | int | 50 | Probability (0-100) to add a non-required field in an object in a tool call |
##### Environment Variables
| Variable | Description |
|---|---|
POD_NAME | Pod name of simulator |
POD_NAMESPACE | Namespace where simulator is running |
POD_IP | IP address on which simulator runs |
PYTHONHASHSEED | Fallback seed for hash generation if --hash-seed is not set |
VLLM_SERVER_DEV_MODE | Set to 1 to enable development mode |
#### v0.5.0
##### Command-Line Flags
| Flag | Type | Default | Description |
|---|---|---|---|
--config | string | "" | Path to a YAML configuration file. Command line values overwrite config file values |
--port | int | 8000 | Port on which the simulator runs |
--model | string | "" | Currently 'loaded' model name (required) |
--served-model-name | []string | [] | Model names exposed by the API (space-separated strings). Falls back to --model if not set |
--max-num-seqs | int | 5 | Maximum number of inference requests that could be processed at the same time (parameter to simulate requests waiting queue) |
--max-loras | int | 1 | Maximum number of LoRAs in a single batch |
--max-cpu-loras | int | (same as --max-loras) | Maximum number of LoRAs to store in CPU memory |
--max-model-len | int | 1024 | Model's context window, maximum number of tokens in a single request including input and output |
--lora-modules | []string | [] | List of LoRA adapters (space-separated JSON strings) |
--mode | string | random | Simulator mode: echo returns input text; random returns random pre-defined sentences |
--seed | int64 | (current Unix nano) | Random seed for operations |
--time-to-first-token | int | 0 | Time to first token in milliseconds |
--time-to-first-token-std-dev | int | 0 | Standard deviation for time to first token in milliseconds (max 30% of TTFT) |
--inter-token-latency | int | 0 | Time to generate one token in milliseconds |
--inter-token-latency-std-dev | int | 0 | Standard deviation for inter-token latency in milliseconds (max 30% of ITL) |
--prefill-overhead | int | 0 | Time to prefill in milliseconds. Ignored if --time-to-first-token is not 0 |
--prefill-time-per-token | int | 0 | Time to prefill per token in milliseconds |
--prefill-time-std-dev | int | 0 | Standard deviation for prefill time in milliseconds |
--kv-cache-transfer-latency | int | 0 | Time for KV-cache transfer from a remote vLLM in milliseconds (P/D mode) |
--kv-cache-transfer-latency-std-dev | int | 0 | Standard deviation for KV-cache transfer latency in milliseconds |
--kv-cache-transfer-time-per-token | int | 0 | Time for KV-cache transfer per token from a remote vLLM in milliseconds |
--kv-cache-transfer-time-std-dev | int | 0 | Standard deviation for KV-cache transfer time per token in milliseconds |
--time-factor-under-load | float64 | 1.0 | Multiplicative factor affecting request time when parallel requests are processed (must be >= 1.0) |
--enable-kvcache | bool | false | Enables KV cache feature |
--kv-cache-size | int | 1024 | Maximum number of token blocks in KV cache |
--block-size | int | 16 | Token block size for contiguous chunks (valid: 8, 16, 32, 64, 128) |
--tokenizers-cache-dir | string | "" | Directory for caching tokenizers |
--hash-seed | string | "" | Seed for hash generation (falls back to PYTHONHASHSEED env var) |
--zmq-endpoint | string | tcp://localhost:5557 | ZMQ address to publish events |
--zmq-max-connect-attempts | uint | 0 | Maximum number of times to try ZMQ connect (max 10) |
--event-batch-size | int | 16 | Maximum number of KV-cache events to be sent together |
--data-parallel-size | int | 1 | Number of ranks to run (1-8) |
--failure-injection-rate | int | 0 | Probability (0-100) of injecting failures |
--failure-types | []string | [] | Specific failure types to inject: rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found |
--fake-metrics | string | "" | JSON metrics to report to Prometheus instead of real metrics |
--max-tool-call-integer-param | int | 100 | Maximum possible value of integer parameters in a tool call |
--min-tool-call-integer-param | int | 0 | Minimum possible value of integer parameters in a tool call |
--max-tool-call-number-param | float64 | 100 | Maximum possible value of number (float) parameters in a tool call |
--min-tool-call-number-param | float64 | 0 | Minimum possible value of number (float) parameters in a tool call |
--max-tool-call-array-param-length | int | 5 | Maximum possible length of array parameters in a tool call |
--min-tool-call-array-param-length | int | 1 | Minimum possible length of array parameters in a tool call |
--tool-call-not-required-param-probability | int | 50 | Probability (0-100) to add a non-required parameter in a tool call |
--object-tool-call-not-required-field-probability | int | 50 | Probability (0-100) to add a non-required field in an object in a tool call |
##### Environment Variables
| Variable | Description |
|---|---|
POD_NAME | Pod name of simulator |
POD_NAMESPACE | Namespace where simulator is running |
PYTHONHASHSEED | Fallback seed for hash generation if --hash-seed is not set |
#### Key Differences Between Main and v0.5.0
- **Duration Parameters**: In the main branch, latency-related parameters (`--time-to-first-token`, `--inter-token-latency`, etc.) use Go duration strings (e.g., "100ms", "1.5s"). In v0.5.0, these are integers representing milliseconds.
- **New Flags in Main**: `--max-waiting-queue-length`, `--global-cache-hit-threshold`, `--data-parallel-rank`, `--ssl-certfile`, `--ssl-keyfile`, `--self-signed-certs`, `--dataset-path`, `--dataset-url`, `--dataset-in-memory`, `--enable-sleep-mode`, `--enable-request-id-headers`, `--latency-calculator`.
- **Environment Variables**: The main branch adds `POD_IP` and `VLLM_SERVER_DEV_MODE`.