---
name: bulk-inference
description: "Runs bulk VLM inference via vLLM, OpenAI, or Gemini. Async parallel with resume and JSONL append. Use for 'run inference', 'bulk inference', '추론 실행'."
model: sonnet
---
# Bulk Inference

## Purpose
Execute bulk VLM inference across multiple providers (vLLM local, OpenAI, Gemini) using `scripts/inference_runner.py`. Handles JSONL input/output, resume from interruption, and concurrent async requests.
## Prerequisites
- Input JSONL file with at minimum: an image path field, a question/prompt field, and one or more ID fields.
- For `vllm_local`: running vLLM server(s); use `/vllm-serve` first.
- For `openai`: `OPENAI_API_KEY` env var set.
- For `gemini`: `GOOGLE_API_KEY` env var set.
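The checks above can be sketched as a small preflight helper. This is illustrative, not part of the runner: the function name is hypothetical, only the env-var names come from the prerequisites list, and the vLLM server health check is omitted.

```python
import os
from pathlib import Path

# Env vars required per provider, per the prerequisites above.
REQUIRED_ENV = {"openai": "OPENAI_API_KEY", "gemini": "GOOGLE_API_KEY"}

def preflight(provider: str, input_path: str) -> list:
    """Return a list of prerequisite problems; an empty list means ready to run."""
    problems = []
    if not Path(input_path).is_file():
        problems.append(f"input JSONL not found: {input_path}")
    env_var = REQUIRED_ENV.get(provider)
    if env_var and not os.environ.get(env_var):
        problems.append(f"{env_var} is not set")
    return problems
```

Running this before launching the runner surfaces missing files or keys immediately instead of partway through a long job.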
## Process
1. **Gather parameters** from the user:
   - `--provider`: `vllm_local`, `openai`, or `gemini`
   - `--endpoints`: server URLs (vllm_local) or API base URL
   - `--model-id`: HF model name or API model ID
   - `--input`: path to input JSONL
   - `--output`: path for output JSONL
   - `--n-concurrent`: requests per endpoint (vllm) or total (API), default 6
   - `--max-tokens`: default 100
   - `--temperature`: default 0.0
   - Optional: `--api-key-env`, `--reasoning-effort`, `--thinking-budget`, `--rate-limit-delay`
   - Optional: `--image-field`, `--question-field`, `--id-fields`, `--prompt-template`
2. **Validate inputs**: confirm the input JSONL exists and is readable; check provider-specific requirements (API keys, server health).
3. **Run inference**:

   ```bash
   python scripts/inference_runner.py \
     --provider {provider} \
     --endpoints {urls} \
     --model-id {model_id} \
     --input {input_jsonl} \
     --output {output_jsonl} \
     --n-concurrent {n} \
     --max-tokens {max_tokens} \
     --temperature {temp} \
     [--api-key-env {env_var}] \
     [--reasoning-effort {effort}] \
     [--thinking-budget {budget}] \
     [--rate-limit-delay {delay}] \
     [--no-resume] \
     [--image-field {field}] \
     [--question-field {field}] \
     [--id-fields {f1},{f2}] \
     [--prompt-template "Answer the question..."]
   ```

4. **Monitor output**: the script prints a tqdm progress bar and a final summary with total, success, errors, and throughput.

5. **Report results**: after completion, report the output file path, total processed, success rate, and error count.
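The report in the final step can be derived directly from the output file. A minimal sketch, assuming only that each output line is a JSON object carrying the `error` field described under Output JSONL Format (the `summarize` name is hypothetical):

```python
import json
from pathlib import Path

def summarize(output_path: str) -> dict:
    """Tally per-item results from the output JSONL."""
    total = errors = 0
    for line in Path(output_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines defensively
        total += 1
        if json.loads(line).get("error") is not None:
            errors += 1
    return {
        "total": total,
        "errors": errors,
        "success": total - errors,
        "success_rate": (total - errors) / total if total else 0.0,
    }
```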
## Input JSONL Format

Each line is a JSON object. Required fields are configurable via `--image-field`, `--question-field`, `--id-fields`. Defaults:

- `image_path`: path to the image file
- `question_string`: prompt/question text
- `triplet_id`, `condition`: composite ID for resume
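An input file with the default field names can be produced like this (paths and values are purely illustrative):

```python
import json

# Two illustrative items using the default field names.
items = [
    {"image_path": "images/cat_001.png", "question_string": "What animal is shown?",
     "triplet_id": "t-001", "condition": "base"},
    {"image_path": "images/cat_002.png", "question_string": "What animal is shown?",
     "triplet_id": "t-002", "condition": "base"},
]

# One JSON object per line, no wrapping array.
with open("input.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```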
## Output JSONL Format

Each output line preserves ALL original input fields plus:

```json
{"...original fields...", "model": "...", "raw_response": "...", "parsed_answer": "...", "error": null}
```
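A minimal sketch of the field-preservation rule: the output record is the input record merged with the four added fields (the concrete values below are illustrative, not from the runner):

```python
import json

input_record = {"image_path": "images/cat_001.png", "question_string": "What animal is shown?",
                "triplet_id": "t-001", "condition": "base"}
# Fields appended by the runner; values here are made up for illustration.
added = {"model": "my-model", "raw_response": "A cat.", "parsed_answer": "A cat.", "error": None}

output_record = {**input_record, **added}  # all original input fields survive
line = json.dumps(output_record, ensure_ascii=False)
```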
## Rules
- Resume is ON by default — interrupted runs continue from where they stopped.
- Never modify the input JSONL file.
- Append mode: output JSONL is opened in append mode, one line per completed item.
- All errors are captured per-item; the runner does not abort on individual failures.
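The resume rule amounts to an ID-set diff: collect the composite IDs already present in the output file and skip matching input records. A sketch of that idea (hypothetical helper; the real runner's internals may differ):

```python
import json
from pathlib import Path

def pending_items(input_path, output_path, id_fields=("triplet_id", "condition")):
    """Return input records whose composite ID is not yet in the output JSONL."""
    done = set()
    out = Path(output_path)
    if out.exists():  # first run: no output file, nothing is done yet
        for line in out.read_text(encoding="utf-8").splitlines():
            if line.strip():
                rec = json.loads(line)
                done.add(tuple(rec.get(f) for f in id_fields))
    pending = []
    for line in Path(input_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            rec = json.loads(line)
            if tuple(rec.get(f) for f in id_fields) not in done:
                pending.append(rec)
    return pending
```

Because the output is opened in append mode and one line is written per completed item, this diff is all that is needed to continue an interrupted run.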