---
name: hf-mem
description: CLI to estimate the required VRAM to load Safetensors models for inference from the Hugging Face Hub (Transformers, Diffusers and Sentence Transformers)
license: mit
---
# hf-mem
## What it does
Estimates the memory required to run inference with models hosted on the Hugging Face Hub, by reading Safetensors metadata over HTTP Range requests instead of downloading the weights.
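The idea above can be sketched in a few lines: a Safetensors file begins with an 8-byte little-endian header length followed by a JSON header listing every tensor's dtype and shape, so fetching just that prefix is enough to size the model. The dtype table and estimator below are an illustrative sketch, not hf-mem's actual code.

```python
import json
import struct

# Bytes per element for a subset of Safetensors dtypes (illustrative).
DTYPE_BYTES = {"F32": 4, "F16": 2, "BF16": 2, "I8": 1}

def estimate_bytes(header_json: dict) -> int:
    """Sum tensor sizes from a Safetensors header: prod(shape) x dtype width."""
    total = 0
    for name, info in header_json.items():
        if name == "__metadata__":  # reserved key, not a tensor
            continue
        n = 1
        for dim in info["shape"]:
            n *= dim
        total += n * DTYPE_BYTES[info["dtype"]]
    return total

# Simulate the first bytes of a .safetensors file: an 8-byte little-endian
# header length, then the JSON header. hf-mem fetches only such a prefix
# from the Hub via an HTTP Range request.
header = {"embed.weight": {"dtype": "F16", "shape": [32000, 4096],
                           "data_offsets": [0, 262144000]}}
blob = json.dumps(header).encode()
prefix = struct.pack("<Q", len(blob)) + blob

n = struct.unpack("<Q", prefix[:8])[0]
parsed = json.loads(prefix[8 : 8 + n])
print(estimate_bytes(parsed))  # 32000 * 4096 * 2 = 262144000 bytes
```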
## Requirements
- `uv` package manager (for the `uvx` command)
- `HF_TOKEN` environment variable (only for gated/private models)
## When to use
- User asks about model VRAM/memory needs
- User wants to check if a model fits in their GPU
- User provides a Hugging Face model URL or model ID
## Usage
```shell
uvx hf-mem --model-id <org/model-name>
```
Add `--experimental` to include KV-cache estimates for LLMs and VLMs.
Use GPU estimation flags when the user asks how many GPUs are needed:
- `--list-gpus` to print supported GPU presets (works without `--model-id`)
- `--gpu <name>` to estimate GPU count
- `--overhead <fraction>` to reserve VRAM headroom (for example `0.2`)
- `--gpu-vram-gib <value>` to override preset VRAM for cluster-specific configs
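As a rough sketch of how a GPU count falls out of these flags, assuming `--overhead` reserves that fraction of each GPU's VRAM (hf-mem's exact accounting and rounding may differ):

```python
import math

def gpus_needed(model_gib: float, gpu_vram_gib: float, overhead: float = 0.0) -> int:
    # Assumed semantics: overhead=0.2 reserves 20% of each GPU's VRAM,
    # leaving 80% usable for model weights.
    usable = gpu_vram_gib * (1 - overhead)
    return math.ceil(model_gib / usable)

# Hypothetical 140 GiB of fp16 weights on 80 GiB GPUs:
print(gpus_needed(140, 80))       # 2
print(gpus_needed(140, 80, 0.2))  # 3 -- headroom pushes it over
```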
## Examples
```shell
uvx hf-mem --model-id black-forest-labs/FLUX.1-dev
uvx hf-mem --model-id mistralai/Mistral-7B-v0.1 --experimental
uvx hf-mem --list-gpus
uvx hf-mem --model-id Qwen/Qwen3.5-397B-A17B-FP8 --gpu h100
uvx hf-mem --model-id Qwen/Qwen3.5-397B-A17B-FP8 --gpu l40s --gpu-vram-gib 32
```
## When it fails
- HTTP 401, if the model is gated/private, meaning you need to set `HF_TOKEN` with read access to it.
- HTTP 404, if the provided `--model-id` is not available on the Hugging Face Hub.
- RuntimeError, if none of `model.safetensors`, `model.safetensors.index.json`, or `model_index.json` is available.