---
name: audio-transcriber
description: Extracts audio from dashcam MP4 files and produces GPU-accelerated timestamped transcripts with optional speaker diarization. This skill should be used when users request audio transcription from video files, mention dashcam audio/transcribe MP4/extract speech, want to analyze conversations from video footage, need timestamped transcripts with speaker identification, or ask to process video folders with audio extraction.
---
# Audio Transcriber

**Skill Type:** Media Processing & Analysis
**Domain:** Audio Transcription, Speech Recognition, GPU Acceleration
**Version:** 2.0
**Last Updated:** 2025-10-26

## Description

Extracts audio from dashcam MP4 files and produces GPU-accelerated timestamped transcripts with optional speaker diarization. Uses faster-whisper with CUDA for efficient processing, organizing outputs by date with comprehensive metadata and quality metrics.

**When to Use This Skill:**
- User requests audio transcription from video files
- User mentions "dashcam audio", "transcribe MP4", or "extract speech"
- User wants to analyze conversations from video footage
- User needs timestamped transcripts with speaker identification
- User asks to process video folders with audio extraction
## Quick Start

### User Trigger Phrases
- "Transcribe audio from my dashcam videos"
- "Extract and transcribe speech from [folder/date]"
- "Generate transcripts for [MP4 files/date range]"
- "Process dashcam audio with speaker identification"
- "Create subtitles from video files"
### Expected Inputs

- **Video Folder Path** (required) - Path to MP4 files or date-organized folders
- **Date Range** (optional) - Single day, range, or "all available"
- **Output Directory** (optional) - Default: parallel to input with `_transcripts` suffix
- **Processing Options** (optional) - Model size, formats, diarization, GPU settings
### Expected Outputs
- Audio extracts (WAV files) organized by date
- Transcripts in multiple formats (TXT, JSON, SRT, VTT)
- Global INDEX.csv with searchable segment metadata
- Results JSON with GPU metrics and processing statistics
- Quality reports with confidence scores and coverage
## Core Capabilities

### 1. User Input Acquisition (Section 0 Protocol)

**CRITICAL:** Always follow the Section 0 protocol before processing.

**Step 1: Parse User Request**

```
User: "Transcribe my dashcam videos from September 3rd"

AI Detection:
✅ Date detected: 2025-09-03
❌ Folder path not specified → Try auto-discovery
```
**Step 2: Auto-Discovery**

Search these locations in order:

1. `C:\Users\[user]\Desktop\CARDV\Movie_F\YYYYMMDD\*.MP4`
2. `G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Movie_F\YYYYMMDD\*.MP4`
3. Current working directory + `*.mp4`
4. User-provided explicit path
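The discovery order above can be sketched as a small helper. This is an illustrative sketch, not the skill's actual implementation: the function name, the injectable root list, and the return convention are all assumptions.

```python
from pathlib import Path

def auto_discover_videos(date_str, roots):
    """Walk candidate roots in priority order and return the MP4s in the
    first YYYYMMDD folder that has any (hypothetical sketch)."""
    for root in roots:
        candidate = Path(root) / date_str
        if candidate.is_dir():
            # Match both .MP4 and .mp4 without depending on the filesystem
            mp4s = sorted(candidate.glob("*.[Mm][Pp]4"))
            if mp4s:
                return mp4s
    return []  # nothing found: fall through to prompting the user
```

Returning an empty list rather than raising lets Step 4 (prompting for missing inputs) take over cleanly.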
**Step 3: Validate Files**

```python
def validate_video_folder(folder_path):
    """Preflight checks before any processing starts."""
    # Check folder exists
    # Count MP4 files
    # Estimate total audio duration
    # Check disk space for outputs
    # Verify FFmpeg installation
    ...
```
**Step 4: Prompt for Missing Inputs**

If files are not found:

```
🎥 I need video files to transcribe. I couldn't find MP4s in the expected location.

Please provide ONE of the following:
1. **Direct folder path**: e.g., `C:\Users\yousu\Desktop\CARDV\Movie_F\20250903`
2. **Parent directory**: I'll search for date subfolders
3. **Single video file**: Process just one MP4

💡 Tip: Dashcam folders are usually organized by date (YYYYMMDD)
```
**Step 5: Configuration Summary & Confirmation**

```
📋 **Transcription Configuration Summary**

**Input:**
📁 Folder: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903
📹 Videos found: 47 MP4 files
⏱️ Estimated audio: ~7.8 hours
💾 Estimated output size: ~450 MB (transcripts only)

**Processing:**
🖥️ GPU: NVIDIA GeForce RTX 4080 (detected)
🧠 Model: faster-whisper base (FP16, CUDA)
🎯 Segmentation: fixed 30s chunks
🗣️ Diarization: disabled (opt-in)
📝 Formats: txt, json, srt

**Output:**
💾 Audio extracts: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\audio\
📄 Transcripts: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\transcripts\
📊 INDEX.csv: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\transcripts\INDEX.csv

Ready to proceed? (Yes/No)
```
**NEVER begin processing without user confirmation.**

### 2. Audio Processing Pipeline

#### A. Audio Extraction (FFmpeg with Retry Matrix)
```bash
# Primary extraction command
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

# Retry sequence on failure:
# 1. Codec fallback: pcm_s16le → flac
# 2. Add demuxer args: -fflags +genpts -rw_timeout 30000000
# 3. Extended probe: -analyzeduration 100M -probesize 100M
```
**Quality Checks:**
- Verify audio stream exists (ffprobe preflight)
- Check duration matches video duration
- Detect silent/corrupted audio
- Log extraction errors to `_FAILED.json`
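The retry matrix above amounts to trying command variants in order until one succeeds. The sketch below shows that control flow only; the variant lists themselves would be built from the FFmpeg arguments listed above, and `run_with_retries` is a hypothetical name.

```python
import subprocess

def run_with_retries(cmd_variants):
    """Try each command variant (a full argv list) in order and return the
    first one that exits cleanly; raise if every attempt fails (sketch)."""
    last_error = None
    for cmd in cmd_variants:
        try:
            subprocess.run(cmd, check=True, capture_output=True)
            return cmd
        except (subprocess.CalledProcessError, FileNotFoundError) as exc:
            last_error = exc  # remember why, then fall through to next variant
    raise RuntimeError(f"all extraction attempts failed: {last_error}")
```

Capturing output keeps FFmpeg's stderr available for the `_FAILED.json` sidecar when every variant fails.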
#### B. Segmentation (Two Modes)

**Fixed Mode (Default):**
- Split audio into 30-second chunks
- Predictable processing time
- No external VAD required
- Best for continuous speech
**VAD Mode (Advanced):**
- Use Silero VAD to detect speech regions
- Variable-length segments (2-60s)
- Skip long silences
- Best for sparse audio (parking mode)
**Mutual Exclusion:** Only one mode is active at a time.
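The fixed mode above reduces to computing chunk boundaries; a minimal sketch (the function name and tuple convention are illustrative):

```python
def segment_fixed(duration_sec, chunk_sec=30):
    """Return (start, end) boundaries for fixed-length chunks.

    The final chunk is shortened to the audio duration rather than padded,
    so boundaries always cover exactly the input.
    """
    chunks = []
    start = 0.0
    while start < duration_sec:
        chunks.append((start, min(start + chunk_sec, duration_sec)))
        start += chunk_sec
    return chunks
```

For 75 seconds of audio this yields chunks of 30, 30, and 15 seconds, which is what makes fixed-mode processing time predictable.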
#### C. GPU Transcription (faster-whisper)
```python
from faster_whisper import WhisperModel

# Load model with GPU optimization
model = WhisperModel(
    "base",
    device="cuda",
    compute_type="float16"
)

# Transcribe with word-level timestamps
segments, info = model.transcribe(
    audio_path,
    beam_size=5,
    word_timestamps=True,
    vad_filter=True
)
```
**GPU Metrics Captured:**
- Device name, VRAM, utilization
- CUDA version, driver version
- Average GPU % during run (sampled at 1-2 Hz)
- Memory usage peaks
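The running utilization average can be collected by polling at the sampling rate above. In this sketch the poll function is injected so the averaging logic is shown without hardware; a real poll on NVIDIA GPUs could use pynvml's `nvmlDeviceGetUtilizationRates`. The class name and interface are assumptions, not the skill's actual `gpu_monitor.py` API.

```python
class GpuUtilSampler:
    """Accumulate instantaneous GPU % readings and report the mean (sketch).

    poll_fn returns the current utilization percentage; the caller invokes
    sample() on a 1-2 Hz timer during the run.
    """
    def __init__(self, poll_fn):
        self.poll_fn = poll_fn
        self.samples = []

    def sample(self):
        self.samples.append(self.poll_fn())

    def average_pct(self):
        if not self.samples:
            return 0.0  # no samples yet: report idle rather than divide by zero
        return sum(self.samples) / len(self.samples)
```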
#### D. Speaker Diarization (Optional)

**Backends:**

- `pyannote`: State-of-the-art (requires HF token + VRAM)
- `speechbrain`: Good performance (no auth required)
**Label Normalization:**

- Different backends → unified `spkA`, `spkB`, etc.
- Consistent across INDEX.csv and JSON outputs
**Fallback Behavior:**
- If HF token missing → skip diarization, log warning
- If OOM error → disable diarization, continue transcription
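The fallback rules above degrade gracefully: transcription output is never lost because diarization failed. A sketch with the backend call injected as `diarize_fn` (a stand-in, not the real `diarizer.py` interface):

```python
def maybe_diarize(segments, diarize_fn, hf_token=None, needs_token=True):
    """Apply diarization with graceful degradation (hypothetical sketch).

    On a missing token, or a runtime failure such as CUDA OOM (which torch
    surfaces as RuntimeError), the unlabeled segments are returned unchanged.
    """
    if needs_token and not hf_token:
        return segments  # skip diarization; caller logs a warning
    try:
        return diarize_fn(segments)
    except RuntimeError:
        return segments  # OOM or backend failure: continue without speakers
```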
### 3. Output Generation

#### A. File Organization (Per-Day Structure)
```
C:\Users\yousu\Desktop\CARDV\Movie_F\
└── 20250903\
    ├── audio\
    │   ├── 20250903133516_059495B.wav
    │   ├── 20250903134120_059496B.wav
    │   └── ... (47 files)
    ├── transcripts\
    │   ├── 20250903133516_059495B.txt
    │   ├── 20250903133516_059495B.json
    │   ├── 20250903133516_059495B.srt
    │   └── ... (47 × 3 = 141 files)
    └── INDEX.csv
```
#### B. Format Details

**TXT (Plain Text):**

```
[00:00:15] Speaker A: Hey, where are we going?
[00:00:18] Speaker B: Just heading to the mall.
[00:00:22] Speaker A: Okay, sounds good.
```
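Rendering one line of the TXT format above is a small formatting exercise; a sketch (the function name is illustrative):

```python
def format_txt_line(start_sec, speaker, text):
    """Render one plain-text transcript line in the [HH:MM:SS] style shown above."""
    h, rem = divmod(int(start_sec), 3600)
    m, s = divmod(rem, 60)
    return f"[{h:02d}:{m:02d}:{s:02d}] {speaker}: {text}"
```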
JSON (Complete Metadata):
{
"video_file": "20250903133516_059495B.MP4",
"audio_duration_sec": 60,
"language": "en",
"language_confidence": 0.95,
"segments": [
{
"start": 15.2,
"end": 17.8,
"text": "Hey, where are we going?",
"confidence": 0.89,
"speaker": "spkA",
"words": [
{"word": "Hey", "start": 15.2, "end": 15.4, "confidence": 0.92},
{"word": "where", "start": 15.5, "end": 15.8, "confidence": 0.88}
]
}
]
}
**SRT (SubRip Subtitles):**

```
1
00:00:15,200 --> 00:00:17,800
[spkA] Hey, where are we going?

2
00:00:18,000 --> 00:00:20,500
[spkB] Just heading to the mall.
```
**VTT (WebVTT):**

```
WEBVTT

00:00:15.200 --> 00:00:17.800
<v spkA>Hey, where are we going?

00:00:18.000 --> 00:00:20.500
<v spkB>Just heading to the mall.
```
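The only difference between the two timecode formats above is the millisecond separator: SRT uses a comma, WebVTT a period. A sketch of the conversion:

```python
def srt_timestamp(seconds):
    """Seconds → SRT timecode, e.g. 15.2 → '00:00:15,200'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def vtt_timestamp(seconds):
    """Seconds → WebVTT timecode: same digits, period instead of comma."""
    return srt_timestamp(seconds).replace(",", ".")
```

Rounding to integer milliseconds first avoids float drift in the hour/minute/second split.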
#### C. INDEX.csv (Global Search Index)

Composite key: (`video_rel`, `seg_idx`)

| Column | Description |
|---|---|
| `dataset` | Movie_F / Movie_R / Park_F / Park_R |
| `date` | YYYYMMDD |
| `video_rel` | Relative path from root |
| `video_stem` | Filename without extension |
| `seg_idx` | 0-based segment index |
| `ts_start_ms` | Segment start in milliseconds |
| `ts_end_ms` | Segment end in milliseconds |
| `text` | Transcript text (truncated to 512 chars) |
| `text_len` | Full text length |
| `lang` | ISO language code |
| `lang_conf` | Language detection confidence |
| `conf_avg` | Average token confidence |
| `speaker` | Normalized speaker label |
| `format_mask` | Files generated (txt/json/srt/vtt) |
| `transcript_file` | Basename |
| `audio_file` | Basename |
| `engine` | e.g., faster-whisper:base:fp16 |
| `cuda_version` | CUDA version |
| `driver_version` | Driver version |
| `created_utc` | ISO 8601 timestamp |
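Appending to the global index can be sketched with `csv.DictWriter`; note how `text_len` records the full length before `text` is truncated to 512 characters. The column list here is a trimmed subset for illustration, and `append_to_index` is an assumed name, not necessarily the signature in `format_writers.py`.

```python
import csv
from pathlib import Path

# Illustrative subset of the INDEX.csv columns documented above.
INDEX_COLUMNS = ["dataset", "date", "video_rel", "seg_idx",
                 "ts_start_ms", "ts_end_ms", "text", "text_len", "speaker"]

def append_to_index(rows, index_path, truncate=512):
    """Append segment rows to INDEX.csv, writing the header on first use (sketch)."""
    path = Path(index_path)
    write_header = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=INDEX_COLUMNS, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        for row in rows:
            row = dict(row)                     # do not mutate the caller's dict
            full_text = row.get("text", "")
            row["text_len"] = len(full_text)    # full length survives truncation
            row["text"] = full_text[:truncate]  # CSV text capped per spec
            writer.writerow(row)
```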
#### D. Results JSON (Single Source of Truth)

```json
{
  "status": "ok",
  "summary": {
    "videos_processed": 47,
    "segments": 1847,
    "hours_audio": 7.8,
    "gpu_detected": true,
    "device_count": 1,
    "devices": [
      {
        "index": 0,
        "name": "NVIDIA GeForce RTX 4080",
        "total_mem_mb": 16384,
        "free_mem_mb": 14200
      }
    ],
    "utilization": {
      "gpu_pct": 35,
      "mem_pct": 42,
      "sampling_hz": 2
    },
    "cuda_version": "12.1",
    "driver_version": "546.01",
    "torch_version": "2.2.0+cu121",
    "errors": 0,
    "failed_files": []
  },
  "artifacts": {
    "index_csv": "C:\\Users\\yousu\\Desktop\\CARDV\\Movie_F\\20250903\\INDEX.csv",
    "output_dir": "C:\\Users\\yousu\\Desktop\\CARDV\\Movie_F\\20250903\\transcripts"
  }
}
```
### 4. Quality & Error Handling

#### A. Resume Safety

- Skip existing transcripts unless the `--force` flag is set
- Idempotent: re-running is safe
- Checkpoint support for long runs
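The skip-existing check behind resume safety can be sketched as follows; the function name and the per-format file layout it assumes are illustrative:

```python
from pathlib import Path

def needs_processing(video_stem, transcript_dir, formats=("txt", "json", "srt"),
                     force=False):
    """Resume-safety check (sketch): a video is skipped only when every
    requested transcript format already exists, unless --force is set."""
    if force:
        return True
    return not all(
        (Path(transcript_dir) / f"{video_stem}.{ext}").exists() for ext in formats
    )
```

Requiring all formats to exist makes a partially interrupted run re-process the incomplete video, which is what keeps re-running idempotent.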
#### B. Error Types & Recovery

**Per-Video Failures** (`{video_stem}_FAILED.json`):
```json
{
  "video_path": "C:\\...\\video.mp4",
  "error_type": "ffmpeg_err",
  "error_message": "Failed to decode audio stream",
  "ffprobe_metadata": {"duration": null, "codec": "h264"},
  "timestamp": "2025-09-03T14:30:00Z"
}
```
Error types:

- `ffmpeg_err`: Audio extraction failed
- `decode_err`: Whisper decode failed
- `OOM`: Out of GPU memory
- `corrupted`: Container/stream corrupted
- `no_audio`: No audio stream detected
#### C. SRT/VTT Validation

- Strictly monotonic timestamps
- No overlapping segments
- Clamp gaps smaller than 50 ms
- Proper timecode formatting (comma for SRT, period for VTT)
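The monotonicity and gap-clamping rules above can be sketched as a single pass over (start, end) pairs; the function name and 1 ms minimum duration are illustrative choices:

```python
def enforce_monotonic(segments, min_gap_sec=0.05):
    """Validation sketch: force (start, end) pairs to be monotonic.

    Starts that overlap the previous end are pushed forward, gaps smaller
    than min_gap_sec are snapped shut, and zero or negative durations are
    given a minimal positive length.
    """
    fixed = []
    prev_end = 0.0
    for start, end in segments:
        if start < prev_end or (start - prev_end) < min_gap_sec:
            start = prev_end      # remove overlap / clamp tiny gap
        if end <= start:
            end = start + 0.001   # avoid zero or negative durations
        fixed.append((start, end))
        prev_end = end
    return fixed
```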
## Implementation Guide

### Phase 1: Input Acquisition

```python
# 1. Parse user request
inputs = parse_user_request(user_message)

# 2. Auto-discover video files
if not inputs['video_folder']:
    inputs['video_folder'] = auto_discover_videos()

# 3. Validate inputs
validate_video_folder(inputs['video_folder'])
check_ffmpeg_available()
check_gpu_available()

# 4. Estimate resource requirements
estimate_processing_time(inputs)
estimate_disk_space(inputs)

# 5. Present configuration summary
show_configuration_summary(inputs)

# 6. Wait for confirmation
if not user_confirms():
    return  # Do not proceed
```
### Phase 2: Audio Extraction

```python
for video_file in video_files:
    # FFprobe preflight check
    metadata = ffprobe(video_file)
    if not has_audio_stream(metadata):
        log_failed(video_file, "no_audio")
        continue

    # Extract audio with retry
    try:
        audio_path = extract_audio_ffmpeg(
            video_file,
            output_dir=audio_output_dir,
            sample_rate=16000,
            channels=1
        )
    except FFmpegError as e:
        # Retry with fallback codec
        audio_path = extract_audio_ffmpeg_retry(video_file)
```
### Phase 3: Transcription

```python
# Load model once (reuse for all files)
model = load_whisper_model(
    model_size="base",
    device="cuda",
    compute_type="float16"
)

for audio_file in audio_files:
    # Segment audio
    if segmentation_mode == "fixed":
        chunks = segment_fixed(audio_file, chunk_size=30)
    else:
        chunks = segment_vad(audio_file, vad_model)

    # Transcribe each chunk
    all_segments = []
    for chunk in chunks:
        segments = model.transcribe(chunk)
        all_segments.extend(segments)

    # Optional: Diarization
    if diarization_enabled:
        all_segments = apply_diarization(audio_file, all_segments)
```
### Phase 4: Output Generation

```python
# Generate all formats
for video_file, segments in results.items():
    # TXT
    write_txt(segments, output_dir)

    # JSON
    write_json(segments, metadata, output_dir)

    # SRT
    srt_content = generate_srt(segments)
    validate_srt_monotonic(srt_content)
    write_srt(srt_content, output_dir)

    # VTT (optional)
    write_vtt(segments, output_dir)

    # Update INDEX.csv
    append_to_index(segments, index_csv_path)
```
### Phase 5: Completion Report

```python
# Generate results JSON
results_json = {
    "status": "ok",
    "summary": collect_statistics(),
    "artifacts": list_output_files(),
    "gpu_metrics": get_gpu_metrics()
}

# Save to file
save_results_json(results_json, output_dir)

# Report to user
print(f"✅ Complete! Processed {video_count} videos")
print(f"   Transcripts: {output_dir}")
print(f"   INDEX: {index_csv_path}")
print(f"   GPU Util: {avg_gpu_pct}%")
```
## Reference Materials

### In This Skill

- `SKILL_MANIFEST.md` - Complete technical specification (v2.0)
- `references/TECHNICAL_SPECIFICATION.md` - Detailed implementation rules
- `scripts/batch_transcriber.py` - Main batch processing script
- `scripts/audio_extractor.py` - FFmpeg wrapper with retry logic
- `scripts/transcriber.py` - Whisper transcription engine
- `scripts/diarizer.py` - Speaker diarization integration
- `scripts/format_writers.py` - TXT/JSON/SRT/VTT generators
- `scripts/gpu_monitor.py` - GPU metrics collection
- `scripts/validation.py` - Input validation and checks
- `assets/config_template.json` - Default configuration
- `assets/params.json` - Tunable parameters
### External Documentation
- faster-whisper documentation
- FFmpeg audio processing guide
- pyannote.audio diarization guide
- SRT/VTT subtitle format specifications
## Tunable Parameters

```json
{
  "whisper": {
    "model_size": "base",
    "device": "cuda",
    "compute_type": "float16",
    "batch_size": 8,
    "beam_size": 5,
    "language": "en",
    "detect_language": false
  },
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "format": "wav",
    "keep_intermediate": false
  },
  "segmentation": {
    "mode": "fixed",
    "chunk_length_sec": 30,
    "vad_min_len_sec": 2,
    "vad_max_len_sec": 60
  },
  "diarization": {
    "enabled": false,
    "backend": "pyannote",
    "min_speakers": 1,
    "max_speakers": 10
  },
  "output": {
    "formats": ["txt", "json", "srt"],
    "text_truncate_csv": 512
  },
  "parallel": {
    "max_workers": 3
  }
}
```
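Loading these tunables typically means overlaying a user `params.json` onto built-in defaults, section by section. A sketch with a trimmed defaults dict for illustration; the real defaults live in `assets/config_template.json` and the function name is an assumption:

```python
import copy
import json

# Trimmed to a few sections for illustration; see the full table above.
DEFAULTS = {
    "whisper": {"model_size": "base", "device": "cuda", "compute_type": "float16"},
    "segmentation": {"mode": "fixed", "chunk_length_sec": 30},
    "output": {"formats": ["txt", "json", "srt"]},
}

def load_params(path=None):
    """Merge a user params.json over the defaults, one section at a time (sketch)."""
    params = copy.deepcopy(DEFAULTS)  # never mutate the shared defaults
    if path is not None:
        with open(path, encoding="utf-8") as f:
            overrides = json.load(f)
        for section, values in overrides.items():
            params.setdefault(section, {}).update(values)
    return params
```

A per-section `update` lets a user override one key (say `model_size`) without restating the rest of that section.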
## Common Issues & Solutions

### Issue 1: "GPU not detected"

**Cause:** CUDA not installed or incompatible driver

**Solution:**
- Check: `python -c "import torch; print(torch.cuda.is_available())"`
- Install or update the CUDA toolkit
- Fall back to CPU: `--device cpu`
### Issue 2: "FFmpeg command failed"

**Cause:** FFmpeg not in PATH or unsupported codec

**Solution:**
- Verify: `ffmpeg -version`
- Install from ffmpeg.org
- Use the retry matrix with codec fallback
### Issue 3: "Out of memory (OOM)"

**Cause:** GPU VRAM insufficient for model + batch size

**Solution:**
- Use a smaller model: `tiny` or `small`
- Reduce batch size: `--batch 4`
- Process fewer files in parallel
### Issue 4: "Diarization failed"

**Cause:** HF token missing or network error

**Solution:**
- Set token: `export HF_TOKEN=hf_...`
- Accept the pyannote license on HuggingFace
- Disable diarization: `--no-diarize`
### Issue 5: "SRT validation errors"

**Cause:** Overlapping timestamps or malformed timecodes

**Solution:**
- Enable timestamp clamping: `--clamp-gaps`
- Check for negative durations
- Validate with a subtitle validator tool
## Security & Privacy Notes

### Data Sensitivity
Dashcam audio may contain:
- Personal conversations
- Addresses and locations
- Phone numbers and names
- Private information
### Processing Guidelines

- **Local Processing Only** - Never upload audio to external services
- **Secure Storage** - Encrypt transcripts if sharing devices
- **Redaction** - Use the `--redact` flag for PII patterns (phone, email)
- **Retention** - Delete audio extracts after transcription if not needed
### Investigation Use
- Designed for legitimate personal data analysis
- NOT an anti-forensics tool
- All conclusions require independent corroboration
## Skill Invocation
This skill is invoked when the model detects:
- User mentions "transcribe audio", "dashcam transcription", or "extract speech"
- User requests processing of video/MP4 files for audio content
- User provides paths to video folders
- User asks for subtitles or timestamped transcripts
## Success Criteria

A successful audio transcription must:

- ✅ Obtain all required inputs from the user (video folder, output preferences)
- ✅ Validate all inputs before processing (files exist, FFmpeg available, GPU detected)
- ✅ Present a configuration summary and get confirmation
- ✅ Extract audio successfully (or log failures)
- ✅ Transcribe with GPU acceleration (or CPU fallback)
- ✅ Generate all requested formats (TXT, JSON, SRT, VTT)
- ✅ Create INDEX.csv with searchable metadata
- ✅ Include GPU metrics in the results JSON
- ✅ Report output locations to the user
**Key Principle:** Never guess critical inputs. Always validate, confirm, and provide clear feedback.
## Version History

- **v2.0** (2025-10-26) - Production-ready skill with GPU-first architecture
- **v1.5** (2025-10-25) - Added diarization support and retry matrix
- **v1.0** (2025-10-20) - Initial release with basic transcription

**Last Updated:** 2025-10-26
**Status:** Production Ready
**Maintained By:** Audio Transcription Pipeline Project