# Claude Code Assistant Guidelines

## Go Code Style
- Follow the standard Go code style and conventions. Use `gofmt` for formatting and adhere to idiomatic Go practices.
- Follow best practices from the Effective Go guide:
### Naming Conventions
- Use MixedCaps or mixedCaps rather than underscores for multi-word names
- Package names should be short, lowercase, single-word names
- Getters don't use a "Get" prefix (use `obj.Name()`, not `obj.GetName()`)
- Interface names use the "-er" suffix for single-method interfaces (e.g., `Reader`, `Writer`)
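The naming conventions above can be sketched in a few lines. This is an illustrative example only; the `user` type and `Greeter` interface are hypothetical names chosen for the demonstration:

```go
package main

import "fmt"

// Greeter is a single-method interface, named with the "-er" suffix.
type Greeter interface {
	Greet() string
}

// user uses mixedCaps-style naming rather than underscores.
type user struct {
	name string
}

// Name is a getter: no "Get" prefix, per Go convention.
func (u user) Name() string { return u.name }

// Greet implements Greeter.
func (u user) Greet() string { return "hello, " + u.name }

func main() {
	u := user{name: "gopher"}
	fmt.Println(u.Name()) // u.Name(), not u.GetName()
	var g Greeter = u
	fmt.Println(g.Greet())
}
```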
### Formatting
- Use `gofmt` for consistent formatting (tabs for indentation, spaces for alignment)
- Line length: no strict limit, but keep lines reasonable
- Group related declarations together
### Error Handling
- Return errors as the last return value
- Check errors immediately after the call
- Provide context with `fmt.Errorf` and error wrapping
### Logging
- Use `ctrl.Log` for structured logging
- Keep log fields consistent and meaningful
- Avoid logging sensitive data
### Documentation
- Every exported name should have a doc comment
- Start comments with the name being described
- Use complete sentences
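These doc-comment rules look like this in practice; `MaxRetries` and `Backoff` are hypothetical names used only to illustrate the style:

```go
package main

import "fmt"

// MaxRetries is the maximum number of reconcile attempts before giving up.
const MaxRetries = 5

// Backoff returns the delay in milliseconds before the given retry
// attempt, doubling on each attempt. Note that the comment starts with
// the name it describes and uses complete sentences.
func Backoff(attempt int) int {
	return 100 << attempt
}

func main() {
	fmt.Println(Backoff(2)) // 400
}
```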
### Concurrency
- Share memory by communicating; don't communicate by sharing memory
- Use channels to orchestrate goroutines
- Always handle goroutine cleanup and cancellation properly
### Project Structure
- Keep packages focused and cohesive
- Avoid circular dependencies
- Place tests in `*_test.go` files
## Documentation
Prefer placing documentation in the docs/ directory.
There are three main types of documentation, each targeting a different audience:
- **Developer Documentation** - For contributors and maintainers of this project
  - Architecture decisions
  - Development setup and workflow
  - Contributing guidelines
  - Usually in the `docs/developer-guide/` subdirectory
- **Administrator Documentation** - For operators deploying and managing the autoscaler controller
  - Installation and configuration
  - Deployment guidelines
  - Monitoring and troubleshooting
  - Usually located under the `docs/user-guide/` directory (for example, in an admin-focused subdirectory)
- **End-User Documentation** - For application developers creating applications that use the autoscaler
  - Usage guides and examples
  - API reference
  - Best practices and common patterns
  - Usually located under the `docs/user-guide/` directory (for example, in an end-user-focused subdirectory)
## E2E Testing
- Use make targets for running e2e tests (e.g., `make test-e2e-smoke` or `make test-e2e-full`) and document the process in `docs/developer-guide/testing.md`
- Use `make test` for unit tests
- Never use images from docker.io in e2e tests. All container images must use fully-qualified registry paths (e.g., `registry.k8s.io/`, `quay.io/`, or a private registry). Do not rely on Docker Hub as a default registry.
## CLI Tools
### llm-d Inference Scheduler EPP CLI Reference
This section documents the command-line flags and environment variables supported by the llm-d inference scheduler EPP (Endpoint Picker). The EPP inherits its CLI from gateway-api-inference-extension.
#### Main Branch (Latest)
Uses gateway-api-inference-extension at commit fd30cb97714a (post-v1.3.0).
##### Command-Line Flags
| Flag | Type | Default | Description |
|---|---|---|---|
--grpc-port | int | 9002 | gRPC port used for communicating with Envoy proxy |
--ha-enable-leader-election | bool | false | Enables leader election for high availability. When enabled, readiness probes will only pass on the leader |
--pool-group | string | inference.networking.k8s.io | Kubernetes resource group of the InferencePool this Endpoint Picker is associated with |
--pool-namespace | string | "" | Namespace of the InferencePool this Endpoint Picker is associated with |
--pool-name | string | "" | Name of the InferencePool this Endpoint Picker is associated with |
--endpoint-selector | string | "" | Selector to filter model server pods on, only 'key=value' pairs are supported. Format: comma-separated list of key=value pairs (e.g., 'app=vllm-llama3-8b-instruct,env=prod') |
--endpoint-target-ports | []int | [] | Target ports of model server pods. Format: comma-separated list of numbers (e.g., '3000,3001,3002') |
--disable-endpoint-subset-filter | bool | false | Disables respecting the x-gateway-destination-endpoint-subset metadata for dispatching requests in EPP |
--model-server-metrics-scheme | string | http | Protocol scheme used in scraping metrics from endpoints |
--model-server-metrics-path | string | /metrics | URL path used in scraping metrics from endpoints |
--model-server-metrics-port | int | 0 | DEPRECATED: Port to scrape metrics from endpoints |
--model-server-metrics-https-insecure-skip-verify | bool | true | Disable certificate verification when using 'https' scheme for model-server-metrics-scheme |
--refresh-metrics-interval | duration | 50ms | Interval to refresh metrics |
--refresh-prometheus-metrics-interval | duration | 5s | Interval to flush Prometheus metrics |
--metrics-staleness-threshold | duration | 2s | Duration after which metrics are considered stale |
--total-queued-requests-metric | string | vllm:num_requests_waiting | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
--total-running-requests-metric | string | vllm:num_requests_running | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
--kv-cache-usage-percentage-metric | string | vllm:kv_cache_usage_perc | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
--lora-info-metric | string | vllm:lora_requests_info | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
--cache-info-metric | string | vllm:cache_config_info | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
-v, --v | int | 0 | Number for the log level verbosity |
--zap-log-level | string | | Zap log level (debug, info, warn, error) |
--zap-devel | bool | true | Development Mode defaults (encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn) |
--zap-encoder | string | | Zap log encoding ('json' or 'console') |
--zap-stacktrace-level | string | | Zap Level at and above which stacktraces are captured |
--tracing | bool | true | Enables emitting traces |
--health-checking | bool | false | Enables health checking |
--metrics-port | int | 9090 | The metrics port exposed by EPP |
--grpc-health-port | int | 9003 | The port used for gRPC liveness and readiness probes |
--enable-pprof | bool | true | Enables pprof handlers |
--cert-path | string | "" | The path to the certificate for secure serving. Certificate and private key files are assumed to be named tls.crt and tls.key |
--enable-cert-reload | bool | false | Enables certificate reloading of the certificates specified in --cert-path |
--secure-serving | bool | true | Enables secure serving |
--metrics-endpoint-auth | bool | true | Enables authentication and authorization of the metrics endpoint |
--config-file | string | "" | The path to the configuration file |
--config-text | string | "" | The configuration specified as text, in lieu of a file |
##### Environment Variables
| Variable | Description | Deprecation |
|---|---|---|
NAMESPACE | Used to determine pool namespace when --pool-namespace is not set | - |
POD_NAME | Used to determine EPP name when using --endpoint-selector mode | - |
ENABLE_EXPERIMENTAL_DATALAYER_V2 | Enables experimental pluggable data layer | DEPRECATED: Use FeatureGates in config file instead |
ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER | Enables experimental pluggable flow control layer | DEPRECATED: Use FeatureGates in config file instead |
SD_QUEUE_DEPTH_THRESHOLD | Saturation detector queue depth threshold | DEPRECATED: Use config file instead |
SD_KV_CACHE_UTIL_THRESHOLD | Saturation detector KV cache utilization threshold | DEPRECATED: Use config file instead |
SD_METRICS_STALENESS_THRESHOLD | Saturation detector metrics staleness threshold | DEPRECATED: Use config file instead |
#### v0.5.0
Uses gateway-api-inference-extension v1.3.0.
##### Command-Line Flags
| Flag | Type | Default | Description |
|---|---|---|---|
--grpc-port | int | 9002 | gRPC port used for communicating with Envoy proxy |
--ha-enable-leader-election | bool | false | Enables leader election for high availability. When enabled, readiness probes will only pass on the leader |
--pool-group | string | inference.networking.k8s.io | Kubernetes resource group of the InferencePool this Endpoint Picker is associated with |
--pool-namespace | string | "" | Namespace of the InferencePool this Endpoint Picker is associated with |
--pool-name | string | "" | Name of the InferencePool this Endpoint Picker is associated with |
--endpoint-selector | string | "" | Selector to filter model server pods on, only 'key=value' pairs are supported. Format: comma-separated list of key=value pairs (e.g., 'app=vllm-llama3-8b-instruct,env=prod') |
--endpoint-target-ports | []int | [] | Target ports of model server pods. Format: comma-separated list of numbers (e.g., '3000,3001,3002') |
--disable-endpoint-subset-filter | bool | false | Disables respecting the x-gateway-destination-endpoint-subset metadata for dispatching requests in EPP |
--model-server-metrics-scheme | string | http | Protocol scheme used in scraping metrics from endpoints |
--model-server-metrics-path | string | /metrics | URL path used in scraping metrics from endpoints |
--model-server-metrics-port | int | 0 | DEPRECATED: Port to scrape metrics from endpoints. Set to InferencePool.Spec.TargetPorts[0].Number if not defined |
--model-server-metrics-https-insecure-skip-verify | bool | true | Disable certificate verification when using 'https' scheme for model-server-metrics-scheme |
--refresh-metrics-interval | duration | 50ms | Interval to refresh metrics |
--refresh-prometheus-metrics-interval | duration | 5s | Interval to flush Prometheus metrics |
--metrics-staleness-threshold | duration | 2s | Duration after which metrics are considered stale |
--total-queued-requests-metric | string | vllm:num_requests_waiting | Prometheus metric for the number of queued requests |
--total-running-requests-metric | string | vllm:num_requests_running | Prometheus metric for the number of running requests |
--kv-cache-usage-percentage-metric | string | vllm:kv_cache_usage_perc | Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1) |
--lora-info-metric | string | vllm:lora_requests_info | Prometheus metric for the LoRA info metrics (must be in vLLM label format) |
--cache-info-metric | string | vllm:cache_config_info | Prometheus metric for the cache info metrics |
-v, --v | int | 0 | Number for the log level verbosity |
--zap-log-level | string | | Zap log level (debug, info, warn, error) |
--zap-devel | bool | true | Development Mode defaults (encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn) |
--zap-encoder | string | | Zap log encoding ('json' or 'console') |
--zap-stacktrace-level | string | | Zap Level at and above which stacktraces are captured |
--tracing | bool | true | Enables emitting traces |
--health-checking | bool | false | Enables health checking |
--metrics-port | int | 9090 | The metrics port exposed by EPP |
--grpc-health-port | int | 9003 | The port used for gRPC liveness and readiness probes |
--enable-pprof | bool | true | Enables pprof handlers |
--cert-path | string | "" | The path to the certificate for secure serving. Certificate and private key files are assumed to be named tls.crt and tls.key |
--enable-cert-reload | bool | false | Enables certificate reloading of the certificates specified in --cert-path |
--secure-serving | bool | true | Enables secure serving |
--metrics-endpoint-auth | bool | true | Enables authentication and authorization of the metrics endpoint |
--config-file | string | "" | The path to the configuration file |
--config-text | string | "" | The configuration specified as text, in lieu of a file |
##### Environment Variables
| Variable | Description | Deprecation |
|---|---|---|
NAMESPACE | Used to determine pool namespace when --pool-namespace is not set | - |
POD_NAME | Used to determine EPP name when using --endpoint-selector mode | - |
ENABLE_EXPERIMENTAL_DATALAYER_V2 | Enables experimental pluggable data layer | DEPRECATED: Use FeatureGates in config file instead |
ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER | Enables experimental pluggable flow control layer | DEPRECATED: Use FeatureGates in config file instead |
SD_QUEUE_DEPTH_THRESHOLD | Saturation detector queue depth threshold | DEPRECATED: Use config file instead |
SD_KV_CACHE_UTIL_THRESHOLD | Saturation detector KV cache utilization threshold | DEPRECATED: Use config file instead |
SD_METRICS_STALENESS_THRESHOLD | Saturation detector metrics staleness threshold | DEPRECATED: Use config file instead |
#### Key Differences Between Main and v0.5.0
- **Metric Flags**: In the main branch, `--total-queued-requests-metric`, `--total-running-requests-metric`, `--kv-cache-usage-percentage-metric`, `--lora-info-metric`, and `--cache-info-metric` are deprecated and will error if explicitly set. In v0.5.0, these flags are functional.
- **Configuration**: The main branch encourages using `EndpointPickerConfig` with `engineConfigs` for metrics configuration instead of CLI flags.
### llm-d Inference Simulator CLI Reference
This section documents the command-line flags and environment variables supported by the llm-d inference simulator (llm-d-inference-sim). The simulator is a vLLM server simulator supporting OpenAI API endpoints.
#### Main Branch (Latest)
##### Command-Line Flags
| Flag | Type | Default | Description |
|---|---|---|---|
--config | string | "" | Path to a YAML configuration file. Command line values overwrite config file values |
--port | int | 8000 | Port on which the simulator runs |
--model | string | "" | Currently 'loaded' model name (required) |
--served-model-name | []string | [] | Model names exposed by the API (space-separated strings). Falls back to --model if not set |
--max-num-seqs | int | 5 | Maximum number of inference requests that could be processed at the same time |
--max-waiting-queue-length | int | 1000 | Maximum length of inference requests waiting queue |
--max-loras | int | 1 | Maximum number of LoRAs in a single batch |
--max-cpu-loras | int | (same as --max-loras) | Maximum number of LoRAs to store in CPU memory |
--max-model-len | int | 1024 | Model's context window, maximum number of tokens in a single request including input and output |
--lora-modules | []string | [] | List of LoRA adapters (space-separated JSON strings) |
--mode | string | random | Simulator mode: echo returns input text; random returns random pre-defined sentences |
--seed | int64 | (current Unix nano) | Random seed for operations |
--time-to-first-token | duration | 0 | Time to first token (e.g., "100ms"). Integer format (milliseconds) is deprecated |
--time-to-first-token-std-dev | duration | 0 | Standard deviation for time to first token (max 30% of TTFT) |
--inter-token-latency | duration | 0 | Time to generate one token (e.g., "100ms"). Integer format is deprecated |
--inter-token-latency-std-dev | duration | 0 | Standard deviation for inter-token latency (max 30% of ITL) |
--prefill-overhead | duration | 0 | Time to prefill. Ignored if --time-to-first-token is set |
--prefill-time-per-token | duration | 0 | Time to prefill per token |
--prefill-time-std-dev | duration | 0 | Standard deviation for prefill time |
--kv-cache-transfer-latency | duration | 0 | Time for KV-cache transfer from a remote vLLM (P/D mode) |
--kv-cache-transfer-latency-std-dev | duration | 0 | Standard deviation for KV-cache transfer latency |
--kv-cache-transfer-time-per-token | duration | 0 | Time for KV-cache transfer per token from a remote vLLM |
--kv-cache-transfer-time-std-dev | duration | 0 | Standard deviation for KV-cache transfer time per token |
--time-factor-under-load | float64 | 1.0 | Multiplicative factor affecting request time when parallel requests are processed (must be >= 1.0) |
--enable-kvcache | bool | false | Enables KV cache feature |
--kv-cache-size | int | 1024 | Maximum number of token blocks in KV cache |
--global-cache-hit-threshold | float64 | 0 | Default cache hit threshold [0, 1] for all requests |
--block-size | int | 16 | Token block size for contiguous chunks (valid: 8, 16, 32, 64, 128) |
--tokenizers-cache-dir | string | hf_cache | Directory for caching tokenizers |
--hash-seed | string | "" | Seed for hash generation (falls back to PYTHONHASHSEED env var) |
--zmq-endpoint | string | tcp://localhost:5557 | ZMQ address to publish events |
--zmq-max-connect-attempts | int | 0 | Maximum number of times to try ZMQ connect (max 10) |
--event-batch-size | int | 16 | Maximum number of KV-cache events to be sent together |
--data-parallel-size | int | 1 | Number of ranks to run (1-8) |
--data-parallel-rank | int | -1 | The rank when running each rank in a process |
--failure-injection-rate | int | 0 | Probability (0-100) of injecting failures |
--failure-types | []string | [] | Specific failure types to inject: rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found |
--fake-metrics | string | "" | JSON metrics to report to Prometheus instead of real metrics |
--ssl-certfile | string | "" | Path to SSL certificate file for HTTPS |
--ssl-keyfile | string | "" | Path to SSL private key file for HTTPS |
--self-signed-certs | bool | false | Enable automatic generation of self-signed certificates for HTTPS |
--dataset-path | string | "" | Local path to SQLite database file for response generation from a dataset |
--dataset-url | string | "" | URL to download the SQLite database file for response generation |
--dataset-in-memory | bool | false | Load the entire dataset into memory for faster access |
--enable-sleep-mode | bool | false | Enable sleep mode |
--enable-request-id-headers | bool | false | Enable including X-Request-Id header in responses |
--latency-calculator | string | "" | Name of the latency calculator: constant or per-token |
--max-tool-call-integer-param | int | 100 | Maximum possible value of integer parameters in a tool call |
--min-tool-call-integer-param | int | 0 | Minimum possible value of integer parameters in a tool call |
--max-tool-call-number-param | float64 | 100 | Maximum possible value of number (float) parameters in a tool call |
--min-tool-call-number-param | float64 | 0 | Minimum possible value of number (float) parameters in a tool call |
--max-tool-call-array-param-length | int | 5 | Maximum possible length of array parameters in a tool call |
--min-tool-call-array-param-length | int | 1 | Minimum possible length of array parameters in a tool call |
--tool-call-not-required-param-probability | int | 50 | Probability (0-100) to add a non-required parameter in a tool call |
--object-tool-call-not-required-field-probability | int | 50 | Probability (0-100) to add a non-required field in an object in a tool call |
##### Environment Variables
| Variable | Description |
|---|---|
POD_NAME | Pod name of simulator |
POD_NAMESPACE | Namespace where simulator is running |
POD_IP | IP address on which simulator runs |
PYTHONHASHSEED | Fallback seed for hash generation if --hash-seed is not set |
VLLM_SERVER_DEV_MODE | Set to 1 to enable development mode |
#### v0.5.0
##### Command-Line Flags
| Flag | Type | Default | Description |
|---|---|---|---|
--config | string | "" | Path to a YAML configuration file. Command line values overwrite config file values |
--port | int | 8000 | Port on which the simulator runs |
--model | string | "" | Currently 'loaded' model name (required) |
--served-model-name | []string | [] | Model names exposed by the API (space-separated strings). Falls back to --model if not set |
--max-num-seqs | int | 5 | Maximum number of inference requests that could be processed at the same time (parameter to simulate requests waiting queue) |
--max-loras | int | 1 | Maximum number of LoRAs in a single batch |
--max-cpu-loras | int | (same as --max-loras) | Maximum number of LoRAs to store in CPU memory |
--max-model-len | int | 1024 | Model's context window, maximum number of tokens in a single request including input and output |
--lora-modules | []string | [] | List of LoRA adapters (space-separated JSON strings) |
--mode | string | random | Simulator mode: echo returns input text; random returns random pre-defined sentences |
--seed | int64 | (current Unix nano) | Random seed for operations |
--time-to-first-token | int | 0 | Time to first token in milliseconds |
--time-to-first-token-std-dev | int | 0 | Standard deviation for time to first token in milliseconds (max 30% of TTFT) |
--inter-token-latency | int | 0 | Time to generate one token in milliseconds |
--inter-token-latency-std-dev | int | 0 | Standard deviation for inter-token latency in milliseconds (max 30% of ITL) |
--prefill-overhead | int | 0 | Time to prefill in milliseconds. Ignored if --time-to-first-token is not 0 |
--prefill-time-per-token | int | 0 | Time to prefill per token in milliseconds |
--prefill-time-std-dev | int | 0 | Standard deviation for prefill time in milliseconds |
--kv-cache-transfer-latency | int | 0 | Time for KV-cache transfer from a remote vLLM in milliseconds (P/D mode) |
--kv-cache-transfer-latency-std-dev | int | 0 | Standard deviation for KV-cache transfer latency in milliseconds |
--kv-cache-transfer-time-per-token | int | 0 | Time for KV-cache transfer per token from a remote vLLM in milliseconds |
--kv-cache-transfer-time-std-dev | int | 0 | Standard deviation for KV-cache transfer time per token in milliseconds |
--time-factor-under-load | float64 | 1.0 | Multiplicative factor affecting request time when parallel requests are processed (must be >= 1.0) |
--enable-kvcache | bool | false | Enables KV cache feature |
--kv-cache-size | int | 1024 | Maximum number of token blocks in KV cache |
--block-size | int | 16 | Token block size for contiguous chunks (valid: 8, 16, 32, 64, 128) |
--tokenizers-cache-dir | string | "" | Directory for caching tokenizers |
--hash-seed | string | "" | Seed for hash generation (falls back to PYTHONHASHSEED env var) |
--zmq-endpoint | string | tcp://localhost:5557 | ZMQ address to publish events |
--zmq-max-connect-attempts | uint | 0 | Maximum number of times to try ZMQ connect (max 10) |
--event-batch-size | int | 16 | Maximum number of KV-cache events to be sent together |
--data-parallel-size | int | 1 | Number of ranks to run (1-8) |
--failure-injection-rate | int | 0 | Probability (0-100) of injecting failures |
--failure-types | []string | [] | Specific failure types to inject: rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found |
--fake-metrics | string | "" | JSON metrics to report to Prometheus instead of real metrics |
--max-tool-call-integer-param | int | 100 | Maximum possible value of integer parameters in a tool call |
--min-tool-call-integer-param | int | 0 | Minimum possible value of integer parameters in a tool call |
--max-tool-call-number-param | float64 | 100 | Maximum possible value of number (float) parameters in a tool call |
--min-tool-call-number-param | float64 | 0 | Minimum possible value of number (float) parameters in a tool call |
--max-tool-call-array-param-length | int | 5 | Maximum possible length of array parameters in a tool call |
--min-tool-call-array-param-length | int | 1 | Minimum possible length of array parameters in a tool call |
--tool-call-not-required-param-probability | int | 50 | Probability (0-100) to add a non-required parameter in a tool call |
--object-tool-call-not-required-field-probability | int | 50 | Probability (0-100) to add a non-required field in an object in a tool call |
##### Environment Variables
| Variable | Description |
|---|---|
POD_NAME | Pod name of simulator |
POD_NAMESPACE | Namespace where simulator is running |
PYTHONHASHSEED | Fallback seed for hash generation if --hash-seed is not set |
#### Key Differences Between Main and v0.5.0
- **Duration Parameters**: In the main branch, latency-related parameters (`--time-to-first-token`, `--inter-token-latency`, etc.) use Go duration strings (e.g., "100ms", "1.5s"). In v0.5.0, these are integers representing milliseconds.
- **New Flags in Main**: `--max-waiting-queue-length`, `--global-cache-hit-threshold`, `--data-parallel-rank`, `--ssl-certfile`, `--ssl-keyfile`, `--self-signed-certs`, `--dataset-path`, `--dataset-url`, `--dataset-in-memory`, `--enable-sleep-mode`, `--enable-request-id-headers`, `--latency-calculator`.
- **Environment Variables**: The main branch adds `POD_IP` and `VLLM_SERVER_DEV_MODE`.