---
name: bulk-inference
description: "Runs bulk VLM inference via vLLM, OpenAI, or Gemini. Async parallel with resume and JSONL append. Use for 'run inference', 'bulk inference', '추론 실행'."
model: sonnet
---
# Bulk Inference

## Purpose
Execute bulk VLM inference across multiple providers (vLLM local, OpenAI, Gemini) using `scripts/inference_runner.py`. Handles JSONL input/output, resume from interruption, and concurrent async requests.
## Prerequisites
- Input JSONL file with at minimum: an image path field, a question/prompt field, and one or more ID fields.
- For `vllm_local`: running vLLM server(s); use `/vllm-serve` first.
- For `openai`: `OPENAI_API_KEY` env var set.
- For `gemini`: `GOOGLE_API_KEY` env var set.
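The checks above can be sketched as a small preflight helper. This is illustrative, not part of the runner: the function name is hypothetical, only the env-var names come from the prerequisites list, and the vLLM server health check is omitted.

```python
import os
from pathlib import Path

# Env vars required per provider, per the prerequisites above.
REQUIRED_ENV = {"openai": "OPENAI_API_KEY", "gemini": "GOOGLE_API_KEY"}

def preflight(provider: str, input_path: str) -> list:
    """Return a list of prerequisite problems; an empty list means ready to run."""
    problems = []
    if not Path(input_path).is_file():
        problems.append(f"input JSONL not found: {input_path}")
    env_var = REQUIRED_ENV.get(provider)
    if env_var and not os.environ.get(env_var):
        problems.append(f"{env_var} is not set")
    return problems
```

Running this before launching the runner surfaces missing files or keys immediately instead of partway through a long job.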
## Process
1. **Gather parameters** from the user:
   - `--provider`: `vllm_local`, `openai`, or `gemini`
   - `--endpoints`: server URLs (vllm_local) or API base URL
   - `--model-id`: HF model name or API model ID
   - `--input`: path to input JSONL
   - `--output`: path for output JSONL
   - `--n-concurrent`: requests per endpoint (vllm) or total (API), default 6
   - `--max-tokens`: default 100
   - `--temperature`: default 0.0
   - Optional: `--api-key-env`, `--reasoning-effort`, `--thinking-budget`, `--rate-limit-delay`
   - Optional: `--image-field`, `--question-field`, `--id-fields`, `--prompt-template`
2. **Validate inputs**: confirm the input JSONL exists and is readable; check provider-specific requirements (API keys, server health).
3. **Run inference**:

   ```bash
   python scripts/inference_runner.py \
     --provider {provider} \
     --endpoints {urls} \
     --model-id {model_id} \
     --input {input_jsonl} \
     --output {output_jsonl} \
     --n-concurrent {n} \
     --max-tokens {max_tokens} \
     --temperature {temp} \
     [--api-key-env {env_var}] \
     [--reasoning-effort {effort}] \
     [--thinking-budget {budget}] \
     [--rate-limit-delay {delay}] \
     [--no-resume] \
     [--image-field {field}] \
     [--question-field {field}] \
     [--id-fields {f1},{f2}] \
     [--prompt-template "Answer the question..."]
   ```

4. **Monitor output**: the script prints a tqdm progress bar and a final summary with total, success, errors, and throughput.

5. **Report results**: after completion, report the output file path, total processed, success rate, and error count.
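The report in the final step can be derived directly from the output file. A minimal sketch, assuming only that each output line is a JSON object carrying the `error` field described under Output JSONL Format (the `summarize` name is hypothetical):

```python
import json
from pathlib import Path

def summarize(output_path: str) -> dict:
    """Tally per-item results from the output JSONL."""
    total = errors = 0
    for line in Path(output_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines defensively
        total += 1
        if json.loads(line).get("error") is not None:
            errors += 1
    return {
        "total": total,
        "errors": errors,
        "success": total - errors,
        "success_rate": (total - errors) / total if total else 0.0,
    }
```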
## Input JSONL Format

Each line is a JSON object. Required fields are configurable via `--image-field`, `--question-field`, `--id-fields`. Defaults:

- `image_path`: path to the image file
- `question_string`: prompt/question text
- `triplet_id`, `condition`: composite ID for resume
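An input file with the default field names can be produced like this (paths and values are purely illustrative):

```python
import json

# Two illustrative items using the default field names.
items = [
    {"image_path": "images/cat_001.png", "question_string": "What animal is shown?",
     "triplet_id": "t-001", "condition": "base"},
    {"image_path": "images/cat_002.png", "question_string": "What animal is shown?",
     "triplet_id": "t-002", "condition": "base"},
]

# One JSON object per line, no wrapping array.
with open("input.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```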
## Output JSONL Format

Each output line preserves ALL original input fields plus:

```json
{"...original fields...", "model": "...", "raw_response": "...", "parsed_answer": "...", "error": null}
```
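A minimal sketch of the field-preservation rule: the output record is the input record merged with the four added fields (the concrete values below are illustrative, not from the runner):

```python
import json

input_record = {"image_path": "images/cat_001.png", "question_string": "What animal is shown?",
                "triplet_id": "t-001", "condition": "base"}
# Fields appended by the runner; values here are made up for illustration.
added = {"model": "my-model", "raw_response": "A cat.", "parsed_answer": "A cat.", "error": None}

output_record = {**input_record, **added}  # all original input fields survive
line = json.dumps(output_record, ensure_ascii=False)
```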
## Rules
- Resume is ON by default — interrupted runs continue from where they stopped.
- Never modify the input JSONL file.
- Append mode: output JSONL is opened in append mode, one line per completed item.
- All errors are captured per-item; the runner does not abort on individual failures.
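The resume rule amounts to an ID-set diff: collect the composite IDs already present in the output file and skip matching input records. A sketch of that idea (hypothetical helper; the real runner's internals may differ):

```python
import json
from pathlib import Path

def pending_items(input_path, output_path, id_fields=("triplet_id", "condition")):
    """Return input records whose composite ID is not yet in the output JSONL."""
    done = set()
    out = Path(output_path)
    if out.exists():  # first run: no output file, nothing is done yet
        for line in out.read_text(encoding="utf-8").splitlines():
            if line.strip():
                rec = json.loads(line)
                done.add(tuple(rec.get(f) for f in id_fields))
    pending = []
    for line in Path(input_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            rec = json.loads(line)
            if tuple(rec.get(f) for f in id_fields) not in done:
                pending.append(rec)
    return pending
```

Because the output is opened in append mode and one line is written per completed item, this diff is all that is needed to continue an interrupted run.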