Troubleshooting Reference Directory - Agent Guide

Purpose

This directory contains symptom-based troubleshooting guides for the Home Security Intelligence system. These guides help operators and users quickly diagnose and resolve common problems.

Directory Contents

troubleshooting/
  AGENTS.md              # This file
  index.md               # Symptom quick reference table
  ai-issues.md           # AI service troubleshooting
  connection-issues.md   # Network and connectivity problems
  database-issues.md     # PostgreSQL problems
  gpu-issues.md          # GPU and CUDA issues

Key Files

index.md

Purpose: First stop when something goes wrong. Symptom-based quick reference.

Content:

Quick self-check commands before troubleshooting
Symptom quick reference table with likely causes and quick fixes
Common problems with detailed diagnosis and solutions:
- Dashboard shows no events
- Risk gauge stuck at 0
- Camera shows offline
- AI not working
- WebSocket disconnected
- High CPU/memory usage
- Disk space running out
- Slow AI inference
- CORS errors in browser
Emergency procedures (system won't start, database corruption, security breach)
Information to gather for bug reports

When to use: First stop for any problem, symptom lookup, emergency situations.

ai-issues.md

Purpose: Troubleshooting YOLO26, Nemotron, and pipeline problems.

Topics Covered:

Service not running
Degraded mode (one service up, one down)
Batch not processing
Analysis failing (null risk scores)
Detection quality issues (false positives/negatives)
Slow inference
Model loading issues
Circuit breaker open

Diagnostic Commands:

# Check AI service status
./scripts/start-ai.sh status

# Check individual services
curl http://localhost:8095/health  # YOLO26
curl http://localhost:8091/health  # Nemotron

# Check pipeline
curl http://localhost:8000/api/system/pipeline | jq

When to use: AI services failing, no detections, analysis problems.

connection-issues.md

Purpose: Network, container, and connectivity troubleshooting.

Topics Covered:

Backend connection refused
Redis connection failed
Database connection failed
File watcher issues
WebSocket connection problems
CORS errors
Container networking
Port conflicts

When to use: Services can't connect to each other, network errors.

database-issues.md

Purpose: PostgreSQL troubleshooting.

Topics Covered:

Connection refused
Authentication failed
Migration failures
Disk space issues
Performance problems
Backup and recovery
Data corruption

When to use: Database errors, migration problems, storage issues.

gpu-issues.md

Purpose: NVIDIA GPU and CUDA troubleshooting.

Topics Covered:

CUDA not available
GPU not detected
Running on CPU instead of GPU
VRAM exhaustion
Thermal throttling
Container GPU access (NVIDIA Container Toolkit)
Driver issues
Multi-GPU configuration

When to use: AI running slow, GPU not being used, CUDA errors.

triton-rootless-cuda.md

Purpose: Triton CUDA init failure in rootless Podman (cudaGetDeviceCount err=3).

Topics Covered:

Root cause: CUDA Runtime API vs Driver API
Rootful Podman workaround
Rootless CDI spec in user directory
nvidia-cap device permissions
Explicit nvidia-cap bind mounts

When to use: ai-gateway models UNAVAILABLE while ai-llm works; Triton cudaErrorInitializationError.

Troubleshooting Approach

All troubleshooting guides follow this pattern:

Structure

Symptoms - What you observe
Quick Diagnosis - Commands to identify the problem
Possible Causes - Ordered by likelihood
Solutions - Step-by-step fixes

Solution Order

Solutions are presented most-likely-first:

Quick fixes that resolve most cases
Configuration changes
Service restarts
More complex debugging
Last resort options

Example Pattern

## Problem Title

### Symptoms

- What the user observes

### Quick Diagnosis

```bash
# Commands to identify the problem
```

Possible Causes

Most common cause
Second most common
Less common cause

Solutions

1. Try this first:

# Command

2. If that doesn't work:

# Alternative command


## Diagnostic Command Reference

### System Health

```bash
# Overall health
curl http://localhost:8000/api/system/health | jq

# Detailed readiness
curl http://localhost:8000/api/system/health/ready | jq

# Container status
docker compose -f docker-compose.prod.yml ps

AI Services

# YOLO26
curl http://localhost:8095/health

# Nemotron
curl http://localhost:8091/health

# Pipeline status
curl http://localhost:8000/api/system/pipeline | jq

GPU

# GPU status
nvidia-smi

# GPU processes
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv

Queues

# Queue depths
curl http://localhost:8000/api/system/telemetry | jq .queues

# Redis directly
redis-cli llen detection_queue
redis-cli llen analysis_queue

Target Audiences

Audience	Needs	Primary Documents
Operators	Quick problem resolution	index.md, all issue guides
Support	Systematic diagnosis	All files
Users	Basic troubleshooting	index.md (quick reference)
Developers	Deep debugging	Specific issue guides

ナビゲーション

Skillsとは？

リンク

Troubleshooting Reference Directory - Agent Guide

Troubleshooting Reference Directory - Agent Guide

Purpose

Directory Contents

Key Files

index.md

ai-issues.md

connection-issues.md

database-issues.md

gpu-issues.md

triton-rootless-cuda.md

Troubleshooting Approach

Structure

Solution Order

Example Pattern

Possible Causes

Solutions

AI Services

GPU

Queues

Target Audiences

Related Documentation

関連スキル(🌐 Web開発)