---
name: zhipuai-glm-cli-and-agents
description: CLI invocation, direct API usage (curl/Python), non-interactive Claude Code with GLM backend, sub-agent model configuration, tool calling, streaming, and multi-agent orchestration patterns on the Z.ai GLM Coding Plan.
type: reference
---
Z.ai GLM — CLI Invocation & Agent Communication
API Endpoints
| Format | Endpoint |
|---|---|
| Native Z.ai | https://api.z.ai/api/paas/v4/chat/completions |
| OpenAI-compatible | https://api.z.ai/v1/chat/completions |
| Anthropic-compatible | https://api.z.ai/api/anthropic |
Authentication: Authorization: Bearer <your-z.ai-api-key> on all endpoints.
Direct curl Invocation
Basic completion (OpenAI-compatible endpoint)
curl -s https://api.z.ai/v1/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-4.7",
    "messages": [
      {"role": "system", "content": "You are a senior software engineer."},
      {"role": "user", "content": "Write a Python function to parse ISO 8601 dates."}
    ],
    "temperature": 1.0,
    "max_tokens": 4096
  }' | jq '.choices[0].message.content'
Streaming (SSE)
curl -sN https://api.z.ai/v1/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-4.7",
    "messages": [{"role": "user", "content": "Implement a binary search in Go."}],
    "stream": true,
    "temperature": 1.0
  }'
# -N disables curl buffering — required for SSE streams
# Parse: each line starting with "data:" contains a JSON chunk; stream ends at data: [DONE]
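For programmatic consumption, here is a minimal Python sketch of that SSE parsing, assuming the requests library and the standard OpenAI-compatible chunk shape (any streaming HTTP client works the same way):
import json
import os
import requests

resp = requests.post(
    "https://api.z.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['ZAI_API_KEY']}"},
    json={
        "model": "GLM-4.7",
        "messages": [{"role": "user", "content": "Implement a binary search in Go."}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    # SSE frames arrive as "data: {json}" lines; blank lines separate events
    if not line or not line.startswith(b"data:"):
        continue
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":
        break
    data = json.loads(payload)
    if not data["choices"]:
        continue
    delta = data["choices"][0]["delta"]
    print(delta.get("content") or "", end="", flush=True)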
With thinking/reasoning mode
curl -s https://api.z.ai/v1/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "messages": [{"role": "user", "content": "Design a distributed rate limiter."}],
    "thinking": {"type": "enabled"},
    "temperature": 1.0,
    "max_tokens": 8192
  }'
Official Python SDK (zai)
pip install zai-sdk  # PyPI package is zai-sdk; the import name is zai
Basic usage
from zai import ZaiClient

client = ZaiClient(api_key="your-z.ai-api-key")
response = client.chat.completions.create(
    model="GLM-4.7",
    messages=[{"role": "user", "content": "Refactor this function for clarity."}],
    temperature=1.0,
    max_tokens=4096
)
print(response.choices[0].message.content)
Streaming
stream = client.chat.completions.create(
    model="GLM-4.7",
    messages=[{"role": "user", "content": "Implement a Fibonacci generator."}],
    stream=True,
    temperature=1.0
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Thinking/reasoning mode
response = client.chat.completions.create(
    model="GLM-5.1",
    messages=[{"role": "user", "content": "Debug this race condition."}],
    thinking={"type": "enabled"},
    temperature=1.0,
    max_tokens=8192
)
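To read the thinking trace alongside the final answer, a defensive sketch; the reasoning_content field name is an assumption based on other GLM thinking-mode responses, so check your SDK's actual response object:
msg = response.choices[0].message
# NOTE (assumption): thinking output is commonly surfaced as reasoning_content;
# fall back gracefully if the field is absent.
reasoning = getattr(msg, "reasoning_content", None)
if reasoning:
    print("--- reasoning ---")
    print(reasoning)
print(msg.content)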
Via OpenAI SDK (drop-in)
from openai import OpenAI

client = OpenAI(
    api_key="your-z.ai-api-key",
    base_url="https://api.z.ai/v1"
)
response = client.chat.completions.create(
    model="GLM-4.7",
    messages=[{"role": "user", "content": "Write unit tests for this module."}],
    temperature=1.0
)
Tool Calling (Function Calling)
Standard OpenAI-compatible schema. Supported on GLM-4.6, GLM-4.7, GLM-5.
from openai import OpenAI

client = OpenAI(api_key="your-key", base_url="https://api.z.ai/v1")

tools = [
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Execute the test suite and return results",
            "parameters": {
                "type": "object",
                "properties": {
                    "test_path": {"type": "string", "description": "Path to test file or directory"},
                    "verbose": {"type": "boolean", "default": False}
                },
                "required": ["test_path"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="GLM-4.7",
    messages=[{"role": "user", "content": "Run the auth tests and summarize failures."}],
    tools=tools,
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    for call in response.choices[0].message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
Streaming tool calls
response = client.chat.completions.create(
    model="GLM-4.7",
    messages=[{"role": "user", "content": "Check the weather in Beijing."}],
    tools=tools,
    stream=True,
    extra_body={"tool_stream": True}  # Z.ai-specific flag; the OpenAI SDK rejects unknown kwargs, so pass it via extra_body
)

calls = {}  # index -> accumulated {"name", "arguments"}
for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
    if delta.tool_calls:
        for tc in delta.tool_calls:
            # tool-call arguments arrive as JSON fragments; accumulate per index
            entry = calls.setdefault(tc.index, {"name": "", "arguments": ""})
            if tc.function.name:
                entry["name"] = tc.function.name
            if tc.function.arguments:
                entry["arguments"] += tc.function.arguments
tool_stream reduces latency by streaming tool-argument JSON as it is generated rather than buffering the full call; because it is a Z.ai extension, the OpenAI SDK needs it routed through extra_body.
Non-Interactive Claude Code (CLI Automation)
Claude Code supports a non-interactive mode via the -p flag, useful for scripts, CI, and agent-to-agent pipelines.
One-shot task
# Using native claude with GLM env vars pre-set
ANTHROPIC_AUTH_TOKEN=$ZAI_API_KEY \
ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic \
API_TIMEOUT_MS=3000000 \
claude -p "Add docstrings to all public functions in src/api.py"
Using the z/glm community wrapper
Two community projects provide z and glm as drop-in wrappers:
geoh/z.ai-powered-claude-code — installs z and glm commands:
bash scripts/install.sh # Linux/macOS
# Windows: .\scripts\install.ps1
z "Implement the pagination feature described in JIRA-123"
glm --model opus "Refactor the auth module for testability"
z -p "Run the test suite and fix all failures" # non-interactive
ankurkakroo2/claude-code-glm-setup — adds glm shell function:
glm -p "what is 2+2" # non-interactive single prompt
glm --debug api # enables API request logging
Both wrappers use --settings ~/.claude/glm-settings.json to inject Z.ai credentials without touching the default Claude config.
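The exact contents of that settings file ship with each wrapper, but a plausible shape (an assumption here, based on Claude Code settings files supporting an env map of the same variables used throughout this document) is:
{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "your-z.ai-api-key",
    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
    "API_TIMEOUT_MS": "3000000"
  }
}
Check the wrapper's installer for the authoritative contents.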
Sub-Agent Model Configuration
When Claude Code spawns sub-agents (background tasks, parallel tool execution), you can control which GLM model they use independently of the main model.
.zai.json config (geoh wrapper)
{
  "apiKey": "your-z.ai-key",
  "opusModel": "GLM-5.1",
  "sonnetModel": "GLM-4.7",
  "haikuModel": "GLM-4.5-Air",
  "subagentModel": "GLM-4.7",
  "defaultModel": "opus",
  "enableThinking": "true",
  "reasoningEffort": "high"
}
subagentModel — the model used for sub-agent tasks. Recommended: GLM-4.7 (1× quota cost) to prevent sub-agents from burning through advanced model budget while the main agent uses GLM-5.1.
Per-project config
Place .zai.json in a project root to override global settings for that project:
{
  "defaultModel": "sonnet",
  "subagentModel": "GLM-4.5-Air",
  "reasoningEffort": "medium"
}
Config search order (first found wins): ./.zai.json → $ZAI_CONFIG_PATH → ~/.config/zai/config.json → ~/.zai.json
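Expressed as a resolution function, a sketch of the documented precedence (illustrative only, not the wrapper's actual code):
import os
from pathlib import Path

def find_zai_config():
    """Return the first config file found, per the documented search order."""
    env_path = os.environ.get("ZAI_CONFIG_PATH")
    candidates = [
        Path(".zai.json"),                                # project-local
        Path(env_path) if env_path else None,             # explicit override
        Path.home() / ".config" / "zai" / "config.json",  # XDG location
        Path.home() / ".zai.json",                        # global fallback
    ]
    for path in candidates:
        if path is not None and path.is_file():
            return path
    return None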
Env var override (highest priority)
export ZAI_API_KEY="your-key" # overrides apiKey in any config file
Agent-to-Agent Patterns
Pattern: Orchestrator + Worker agents
The plan's 1-concurrent-request limit means true parallel fan-out won't work. Use a serial dispatch queue:
import time

from openai import OpenAI

client = OpenAI(api_key="your-key", base_url="https://api.z.ai/v1")

def glm_call(prompt, model="GLM-4.7"):
    """Single agent call; serialize these, don't parallelize."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0
    )
    return response.choices[0].message.content

def parse_subtasks(plan_text):
    # Naive parsing: treat each non-empty line of the plan as one subtask
    return [line.strip() for line in plan_text.splitlines() if line.strip()]

# Orchestrator plans, then dispatches workers serially
plan = glm_call("Break this task into 3 subtasks: implement OAuth login", model="GLM-5.1")
results = []
for subtask in parse_subtasks(plan):
    result = glm_call(subtask, model="GLM-4.7")  # workers use cheaper model
    results.append(result)
    time.sleep(0.5)  # optional: be gentle on rate limits
Pattern: Shell script agent loop
#!/usr/bin/env bash
# Non-interactive agentic loop: each step feeds the next
set -e

CONTEXT="Fix all failing tests in ./tests/"

# Step 1: analyze
ANALYSIS=$(ANTHROPIC_AUTH_TOKEN=$ZAI_API_KEY \
  ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic \
  API_TIMEOUT_MS=3000000 \
  claude -p "Analyze: $CONTEXT. List specific failures.")

# Step 2: fix based on the analysis
ANTHROPIC_AUTH_TOKEN=$ZAI_API_KEY \
ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic \
API_TIMEOUT_MS=3000000 \
claude -p "Given this analysis: $ANALYSIS. Fix the failures."
Pattern: Claude Code as sub-agent caller
Within a Claude Code session backed by GLM, you can spawn Claude Code subprocesses pointed at the same Z.ai backend:
import os
import subprocess

result = subprocess.run(
    ["claude", "-p", "Implement the data validation layer"],
    env={
        **os.environ,
        "ANTHROPIC_AUTH_TOKEN": os.environ["ZAI_API_KEY"],  # fail loudly if unset
        "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
        "API_TIMEOUT_MS": "3000000"
    },
    capture_output=True, text=True, timeout=300
)
print(result.stdout)
LiteLLM Integration (Multi-Provider Orchestration)
LiteLLM provides a unified interface for routing across providers. Prefix Z.ai models with zai/:
import os

import litellm

# Route to Z.ai GLM
response = litellm.completion(
    model="zai/GLM-4.7",
    messages=[{"role": "user", "content": "Implement error handling."}],
    api_key=os.getenv("ZAI_API_KEY")
)

# Fallback: try GLM-5.1 first, fall back to GLM-4.7
response = litellm.completion(
    model="zai/GLM-5.1",
    messages=[{"role": "user", "content": "Solve this hard bug."}],
    api_key=os.getenv("ZAI_API_KEY"),
    fallbacks=["zai/GLM-4.7"]
)
Reasoning Effort Levels (geoh wrapper)
The reasoningEffort field in .zai.json maps to model thinking depth:
| Value | Use Case |
|---|---|
| auto | Let the model decide |
| low | Simple tasks, fast response |
| medium | Balanced (default) |
| high | Complex reasoning, architecture |
| max | Maximum depth; most quota-intensive |
Windows Notes
| Shell | Wrapper | Limitations |
|---|---|---|
| Git Bash | z / glm | Full functionality, status line works |
| PowerShell | z.ps1 / glm.ps1 | Core features, no status line |
| CMD | z.cmd / glm.cmd | Core features, no status line |
The status line requires bash and stdin piping; on Windows, use Git Bash for the full experience.
The wrapper scripts require jq installed and on PATH for JSON processing.