name: speech-vlm
description: Run a multimodal Visual Language Model (VLM) with speech interaction on reComputer Jetson (AGX Orin 64G or Orin NX 16G), combining NVIDIA VLM, SenseVoice speech-to-text, and Coqui-ai TTS for voice-driven visual scene understanding.
Run VLM with Speech Interaction
Execution model
Run one phase at a time. After each phase, verify the expected result before continuing.
- If a phase succeeds → print [OK] and move to the next phase.
- If a phase fails → print [STOP], consult the failure decision tree, and ask the user before retrying.
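The execution model above can be sketched as a small runner. This is a minimal illustration, not part of the speech_vlm repository: `run_phase` and `run_phases` are hypothetical helper names, and each check is assumed to be a callable that returns True on success.

```python
from typing import Callable, List, Tuple


def run_phase(name: str, check: Callable[[], bool]) -> bool:
    """Run one phase check and print [OK] or [STOP]."""
    try:
        passed = bool(check())
    except Exception as exc:  # a check that crashes counts as a failed phase
        print(f"[STOP] {name}: {exc}")
        return False
    print(f"[OK] {name}" if passed else f"[STOP] {name}")
    return passed


def run_phases(phases: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run phases in order, halt at the first failure, return the names that passed."""
    completed = []
    for name, check in phases:
        if not run_phase(name, check):
            break  # stop here; consult the failure decision tree before retrying
        completed.append(name)
    return completed
```

Calling `run_phases([("prerequisites", check_a), ("environment", check_b)])` stops at the first failing check, so later phases never run against a broken setup.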
Phase 1 — Verify prerequisites
Required hardware:
- reComputer Jetson AGX Orin 64G or Orin NX 16G (16GB+ memory)
- Driver-free USB speaker with microphone
- IP camera with RTSP output (or use NVStreamer for local video)
# Check JetPack 6 and CUDA
cat /etc/nv_tegra_release
nvcc --version
# Check available memory
free -h
Expected: JetPack 6.x installed; CUDA available; 16GB+ RAM.
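The memory check can also be scripted. The sketch below parses `/proc/meminfo`-style text; the function names and the 14 GiB threshold are assumptions chosen for illustration, since a 16GB module typically reports somewhat less than 16 GiB of usable memory.

```python
import re


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, converted from kB to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    return int(match.group(1)) / (1024 ** 2)


def has_enough_memory(meminfo_text: str, min_gib: float = 14.0) -> bool:
    """True if total memory meets the (assumed) threshold for a 16GB-class board."""
    return mem_total_gib(meminfo_text) >= min_gib


def check_local_memory() -> bool:
    """Read /proc/meminfo on the current machine and apply the threshold."""
    with open("/proc/meminfo") as handle:
        return has_enough_memory(handle.read())
```

On the Jetson, `check_local_memory()` gives a yes/no answer equivalent to reading the `free -h` output by eye.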
Phase 2 — Initialize system environment
# Ensure nvidia-jetpack is fully installed
sudo apt-get install nvidia-jetpack
# Install system dependencies
sudo apt-get install libportaudio2 libportaudiocpp0 portaudio19-dev
# Install Python packages
# subprocess and wave are part of the Python standard library and need no pip install
sudo pip3 install pyaudio playsound keyboard
sudo pip3 install --upgrade setuptools
sudo pip3 install sudachipy==0.5.2
Verify audio devices are working and network is stable:
arecord -l # List recording devices
aplay -l # List playback devices
ping -c 2 8.8.8.8
Expected: Audio devices listed; network reachable.
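To confirm the USB microphone programmatically, the `arecord -l` listing can be parsed. This is a sketch under the assumption that the listing uses the usual ALSA `card N: ... [name], device M:` line format; the helper names are hypothetical.

```python
import re
import subprocess
from typing import List, Tuple

# Matches lines such as:
# card 1: Device [USB PnP Audio Device], device 0: USB Audio [USB Audio]
CARD_LINE = re.compile(r"^card (\d+): \S+ \[(.*?)\], device (\d+):", re.MULTILINE)


def parse_alsa_devices(listing: str) -> List[Tuple[int, int, str]]:
    """Return (card, device, card name) for each device line in the listing."""
    return [(int(card), int(dev), name) for card, name, dev in CARD_LINE.findall(listing)]


def list_capture_devices() -> List[Tuple[int, int, str]]:
    """Run `arecord -l` and parse its output; requires alsa-utils to be installed."""
    result = subprocess.run(["arecord", "-l"], capture_output=True, text=True, check=False)
    return parse_alsa_devices(result.stdout)
```

An empty list from `list_capture_devices()` corresponds to the "No audio devices found" row of the failure decision tree.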
Phase 3 — Install VLM
Follow the NVIDIA Jetson VLM installation guide. Ensure you can perform text-based inference with VLM before proceeding.
Reference: Run VLM on reComputer
Phase 4 — Install PyTorch and Torchaudio
Install PyTorch, Torchaudio, and Torchvision matching your JetPack version.
Reference: PyTorch installation for Jetson
# Verify PyTorch with CUDA
python3 -c "import torch; print(torch.cuda.is_available())"
Expected: True
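If the one-liner prints False, a slightly fuller report can help narrow down why. The sketch below is illustrative; `cuda_report` is a hypothetical helper, and it only uses `torch.cuda.is_available()`, `torch.version.cuda`, and `torch.cuda.get_device_name()`.

```python
def cuda_report() -> dict:
    """Summarize the PyTorch install: version, CUDA build, and visible GPU."""
    try:
        import torch
    except (ImportError, OSError):  # missing package or missing shared library
        return {"torch_installed": False, "cuda_available": False}
    available = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for a CPU-only build
        "cuda_available": available,
        "device": torch.cuda.get_device_name(0) if available else None,
    }
```

A report with `torch_installed` True but `cuda_build` None would suggest a CPU-only wheel was installed instead of the build matching your JetPack version.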
Phase 5 — Install Speech_vlm (SenseVoice)
cd ~/
git clone https://github.com/ZhuYaoHui1998/speech_vlm.git
cd ~/speech_vlm
sudo pip3 install -r requement.txt
Expected: All SenseVoice dependencies installed.
Phase 6 — Install TTS (Coqui-ai)
cd ~/speech_vlm/TTS
sudo pip3 install .[all]
Expected: TTS package installed successfully.
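A quick way to confirm that Phases 2 to 6 left the expected packages importable is to look up each module without importing it. The module list below is an assumption based on the packages named in this guide (for example, Coqui TTS is assumed to install under the import name `TTS`); adjust it to match your setup.

```python
import importlib.util
from typing import Iterable, List

# Assumed import names for the packages installed in the earlier phases.
EXPECTED_MODULES = ["pyaudio", "playsound", "keyboard", "torch", "torchaudio", "TTS"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

`missing_modules(EXPECTED_MODULES)` returns an empty list when everything is installed for the interpreter running the check. Because the guide installs with `sudo pip3`, run the check with `sudo python3` as well so both use the same environment.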
Phase 7 — Start VLM service
cd ~/speech_vlm
sudo docker compose up -d
# Verify containers are running
sudo docker ps
Expected: VLM containers running.
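The container check can be automated by parsing `docker ps -a --format '{{.Names}}\t{{.Status}}'`, which prints one tab-separated name and status per line. This is a sketch; the helper names are hypothetical, and the actual container names depend on the compose file in the repository.

```python
import subprocess
from typing import Dict, List


def parse_docker_ps(output: str) -> Dict[str, str]:
    """Map container name to status from tab-separated `docker ps` output."""
    statuses = {}
    for line in output.splitlines():
        if "\t" not in line:
            continue  # skip blank or unexpected lines
        name, status = line.split("\t", 1)
        statuses[name.strip()] = status.strip()
    return statuses


def not_running(statuses: Dict[str, str]) -> List[str]:
    """Return containers whose status does not start with 'Up'."""
    return [name for name, status in statuses.items() if not status.startswith("Up")]


def docker_statuses() -> Dict[str, str]:
    """Query Docker; needs permission to talk to the Docker daemon (root or docker group)."""
    cmd = ["docker", "ps", "-a", "--format", "{{.Names}}\\t{{.Status}}"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_docker_ps(result.stdout)
```

Using `-a` includes exited containers, so a service that crashed right after `docker compose up -d` shows up in `not_running(...)` instead of silently disappearing from the list.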
Phase 8 — Add RTSP camera stream
In set_streamer_id.sh, replace 0.0.0.0 with the Jetson's IP address and set your RTSP stream address:
cd ~/speech_vlm
# Edit the script with your Jetson IP and RTSP URL
nano set_streamer_id.sh
sudo chmod +x ./set_streamer_id.sh
./set_streamer_id.sh
Record the returned camera ID — it is needed for the next phase.
Expected: Camera ID returned in the response.
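If you capture the script's output, the camera ID can be pulled out in code rather than copied by hand. The sketch below assumes the response is a JSON object with a string field named `id`; the guide only says an ID is returned, so confirm the field name against your actual output.

```python
import json
from typing import Optional


def extract_camera_id(response_text: str, key: str = "id") -> Optional[str]:
    """Return the camera ID from a JSON response, or None if it is absent.

    The field name is an assumption; pass a different `key` if your response differs.
    """
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return None  # not JSON, e.g. a curl error message
    if isinstance(payload, dict) and isinstance(payload.get(key), str):
        return payload[key]
    return None
```

A return value of None maps to the "Camera ID not returned" row of the failure decision tree.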
Phase 9 — Run speech VLM
In vlm_voice.py, replace 0.0.0.0 in API_URL with the Jetson's IP address and set REQUEST_ID to the camera ID recorded in Phase 8.
cd ~/speech_vlm
sudo python3 vlm_voice.py
After launch, select the audio device index when prompted. Press 1 to record, 2 to send.
Expected: Program starts, audio device selection shown, speech interaction works.
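Several phases ask for the Jetson's IP address in place of 0.0.0.0. One way to look it up from Python is to open a UDP socket toward an external address and read back the local endpoint. This is a sketch; `local_ip` and `is_placeholder` are hypothetical helpers, and the probe address reuses the 8.8.8.8 host from the Phase 2 ping.

```python
import ipaddress
import socket


def local_ip(probe_host: str = "8.8.8.8", probe_port: int = 80) -> str:
    """Return the IPv4 address of the interface the OS would use to reach probe_host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_host, probe_port))  # UDP connect sends no packets
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no usable route; fall back to loopback
    finally:
        sock.close()


def is_placeholder(address: str) -> bool:
    """True for the unspecified address 0.0.0.0 that the scripts ship with."""
    return ipaddress.ip_address(address).is_unspecified
```

`is_placeholder` is handy as a guard before launching: if the host in API_URL still reports True, the edit in this phase was missed.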
Phase 10 — View results (optional)
In view_rtsp.py, replace 0.0.0.0 in rtsp_url with the Jetson's IP address.
sudo pip3 install opencv-python
cd ~/speech_vlm
sudo python3 view_rtsp.py
Expected: RTSP output stream displayed with VLM annotations.
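The host substitution in rtsp_url can be done with the standard library instead of manual editing. The sketch below keeps the port, path, and any credentials intact; `replace_host` is a hypothetical helper, and the URL in the usage note is an illustrative placeholder rather than the value shipped in view_rtsp.py.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_host(url: str, new_host: str) -> str:
    """Return url with its host swapped for new_host, preserving everything else."""
    parts = urlsplit(url)
    netloc = new_host
    if parts.port is not None:
        netloc = f"{new_host}:{parts.port}"
    if parts.username:
        credentials = parts.username
        if parts.password:
            credentials += f":{parts.password}"
        netloc = f"{credentials}@{netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `replace_host("rtsp://0.0.0.0:5011/out", "192.168.1.20")` yields `rtsp://192.168.1.20:5011/out`.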
Failure decision tree
| Symptom | Likely cause | Suggested fix |
|---|---|---|
| nvidia-jetpack install fails | Incomplete JetPack flash | Reflash with full JetPack 6 image |
| pyaudio install fails | Missing portaudio dev headers | sudo apt-get install portaudio19-dev |
| No audio devices found | USB mic not recognized | Check lsusb; try different USB port |
| Docker compose fails | Docker not installed or no permissions | Install docker-ce; add user to docker group |
| Camera ID not returned | Wrong IP or RTSP URL | Verify Jetson IP and camera RTSP stream accessibility |
| VLM inference timeout | Insufficient memory | Ensure 16GB+ RAM; close other processes |
| TTS install fails | Missing build dependencies | sudo apt-get install build-essential python3-dev |
Reference files
references/source.body.md: Full original wiki content (reference only)