name: speech-vlm description: Run a multimodal Visual Language Model (VLM) with speech interaction on reComputer Jetson (AGX Orin 64G or Orin NX 16G), combining NVIDIA VLM, SenseVoice speech-to-text, and Coqui-ai TTS for voice-driven visual scene understanding.

Run VLM with Speech Interaction

Execution model

Run one phase at a time. After each phase, verify the expected result before continuing.

If a phase succeeds → print [OK] and move to the next phase.
If a phase fails → print [STOP], consult the failure decision tree, and ask the user before retrying.

Phase 1 — Verify prerequisites

Required hardware:

reComputer Jetson AGX Orin 64G or Orin NX 16G (16GB+ memory)
USB driver-free speaker microphone
IP camera with RTSP output (or use NVStreamer for local video)

# Check JetPack 6 and CUDA
cat /etc/nv_tegra_release
nvcc --version
# Check available memory
free -h

Expected: JetPack 6.x installed; CUDA available; 16GB+ RAM.

Phase 2 — Initialize system environment

# Ensure nvidia-jetpack is fully installed
sudo apt-get install nvidia-jetpack

# Install system dependencies
sudo apt-get install libportaudio2 libportaudiocpp0 portaudio19-dev

# Install Python packages
sudo pip3 install pyaudio playsound subprocess wave keyboard
sudo pip3 --upgrade setuptools
sudo pip3 install sudachipy==0.5.2

Verify audio devices are working and network is stable:

arecord -l   # List recording devices
aplay -l     # List playback devices
ping -c 2 8.8.8.8

Expected: Audio devices listed; network reachable.

Phase 3 — Install VLM

Follow the NVIDIA Jetson VLM installation guide. Ensure you can perform text-based inference with VLM before proceeding.

Reference: Run VLM on reComputer

Phase 4 — Install PyTorch and Torchaudio

Install PyTorch, Torchaudio, and Torchvision matching your JetPack version.

Reference: PyTorch installation for Jetson

# Verify PyTorch with CUDA
python3 -c "import torch; print(torch.cuda.is_available())"

Expected: True

Phase 5 — Install Speech_vlm (SenseVoice)

cd ~/
git clone https://github.com/ZhuYaoHui1998/speech_vlm.git
cd ~/speech_vlm
sudo pip3 install -r requement.txt

Expected: All SenseVoice dependencies installed.

Phase 6 — Install TTS (Coqui-ai)

cd ~/speech_vlm/TTS
sudo pip3 install .[all]

Expected: TTS package installed successfully.

Phase 7 — Start VLM service

cd ~/speech_vlm
sudo docker compose up -d

# Verify containers are running
sudo docker ps

Expected: VLM containers running.

Phase 8 — Add RTSP camera stream

Edit set_streamer_id.sh — replace 0.0.0.0 with Jetson IP and set your RTSP stream address:

cd ~/speech_vlm
# Edit the script with your Jetson IP and RTSP URL
nano set_streamer_id.sh
sudo chmod +x ./set_streamer_id.sh
./set_streamer_id.sh

Record the returned camera ID — it is needed for the next phase.

Expected: Camera ID returned in the response.

Phase 9 — Run speech VLM

Edit vlm_voice.py — replace 0.0.0.0 with Jetson IP in API_URL and fill in the camera ID in REQUEST_ID.

cd ~/speech_vlm
sudo python3 vlm_voice.py

After launch, select the audio device index when prompted. Press 1 to record, 2 to send.

Expected: Program starts, audio device selection shown, speech interaction works.

Phase 10 — View results (optional)

Edit view_rtsp.py — replace 0.0.0.0 in rtsp_url with Jetson IP.

sudo pip3 install opencv-python
cd ~/speech_vlm
sudo python3 view_rtsp.py

Expected: RTSP output stream displayed with VLM annotations.

Failure decision tree

Symptom	Likely cause	Suggested fix
`nvidia-jetpack` install fails	Incomplete JetPack flash	Reflash with full JetPack 6 image
`pyaudio` install fails	Missing portaudio dev headers	`sudo apt-get install portaudio19-dev`
No audio devices found	USB mic not recognized	Check `lsusb`; try different USB port
Docker compose fails	Docker not installed or no permissions	Install docker-ce; add user to docker group
Camera ID not returned	Wrong IP or RTSP URL	Verify Jetson IP and camera RTSP stream accessibility
VLM inference timeout	Insufficient memory	Ensure 16GB+ RAM; close other processes
TTS install fails	Missing build dependencies	`sudo apt-get install build-essential python3-dev`

Reference files

references/source.body.md — Full original wiki content (reference only)

ナビゲーション

Skillsとは？

リンク

speech-vlm