---
name: video-clipper
description: Repurposes long-form video (podcasts, interviews, talks) into short-form vertical clips for Instagram Reels, TikTok, and YouTube Shorts. Handles transcription, moment selection, clip extraction, speaker-tracked reframing (16:9 to 9:16), and animated captions.
user-invocable: true
allowed-tools: Bash, Read, Write, Edit, Grep, Glob, WebSearch, WebFetch
argument-hint: [video-file-path-or-url]
---
# Video Clipper
Takes a long-form video and produces ready-to-post short-form vertical clips with speaker-tracked framing and professional animated captions. Works with podcasts, interviews, talks, and any talking-head content.
## Requirements
- **FFmpeg** installed and available in PATH (`brew install ffmpeg` on macOS, `apt install ffmpeg` on Linux)
- **Python 3** with the `openai-whisper` and `requests` packages (`pip install openai-whisper requests`). Note: `openai-whisper` installs PyTorch (~2 GB download). This skill uses `openai-whisper` instead of the lighter `whisper-cpp` because it provides the word-level timestamps needed for accurate viral-moment scoring.
- **yt-dlp** installed (for YouTube/URL downloads) — `brew install yt-dlp` on macOS, `pip install yt-dlp` on Linux
- **API keys** in a `.env` file (project root or any parent directory):
  - `KLAP_API_KEY` — from klap.app (reframing with speaker tracking)
  - `CAPTIONS_AI_API_KEY` — from captions.ai / platform.mirage.app (animated captions)
**Before starting:** Verify that FFmpeg, yt-dlp, and the Python packages are installed. If any are missing, instruct the user to install them before proceeding.
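A minimal preflight check (a sketch; adapt to the shell in use):

```bash
# Fail early if any required tool or package is missing
for cmd in ffmpeg ffprobe yt-dlp; do
  command -v "$cmd" >/dev/null || echo "MISSING: $cmd"
done
python3 -c "import whisper, requests" 2>/dev/null || echo "MISSING: openai-whisper and/or requests"
```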
## Cost Per Clip
| Step | Cost |
|---|---|
| Whisper (transcription) | Free (local) |
| FFmpeg (clip extraction) | Free (local) |
| Klap (reframing) | ~$1.50-2.50/clip depending on plan |
| Captions.ai (captions) | ~$0.15/min of output |
| Total per clip | ~$2-3 |
## Input
The user provides:
- **Video source** (required) — one of:
  - Local file path — e.g. `/path/to/podcast.mp4`
  - YouTube URL — e.g. `https://www.youtube.com/watch?v=...`
  - Any public video URL — direct link to an MP4
- **Moment selection mode** (ask the user):
  - Automatic — Claude picks the best moments
  - Manual — user provides specific timestamps
  - Hybrid — Claude proposes moments, user approves/adjusts before processing
- **Number of clips** (optional) — default 3-5, depending on video length and content density
- **Caption template** (optional) — Captions.ai template ID. Default: `ctpl_DxflLOnuKkb198FNdI9E` (Heat). List available templates via the API if the user wants to browse.
- **Target clip duration** (optional) — default 15-60 seconds. The user can specify a range.
## Pipeline

### Step 1: Get the Video
Based on input type:
**Local file:**

```bash
# Verify the file exists and read its metadata (including duration)
ffprobe -v quiet -print_format json -show_format "video.mp4"
```

**YouTube URL:**

```bash
yt-dlp -f "bestvideo[height<=720]+bestaudio/best[height<=720]" --merge-output-format mp4 -o "<workdir>/source.mp4" "<URL>"
```

**Other URL:**

```bash
curl -L -o "<workdir>/source.mp4" "<URL>"
```
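If only the duration is needed, it can be pulled straight out of the ffprobe JSON (this assumes `jq` is installed; otherwise parse the JSON in Python):

```bash
# Duration in seconds, as reported by the container
ffprobe -v quiet -print_format json -show_format source.mp4 | jq -r '.format.duration'
```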
### Step 2: Transcribe with Whisper

```python
import whisper

# "base" is fast; see Known Limitations for the accuracy trade-off
model = whisper.load_model("base")
result = model.transcribe("source.mp4", language="en", word_timestamps=True)
```
Save both:
- `transcript.json` — full result with word-level timestamps (needed for Step 3)
- `transcript.txt` — readable version with timestamps per segment (for Claude to analyze)
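A minimal sketch of saving both files, continuing from the `result` object above (openai-whisper's result dict exposes `segments`, each with `start`, `end`, and `text`):

```python
import json

# Full Whisper result, including word-level timestamps (used in Step 3)
with open("transcript.json", "w") as f:
    json.dump(result, f)

# Readable per-segment view for Claude to scan
with open("transcript.txt", "w") as f:
    for seg in result["segments"]:
        f.write(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text'].strip()}\n")
```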
### Step 3: Identify Best Moments (Viral Scoring)
This is the key intelligence step. Claude reads the full transcript and identifies potential clip moments.
#### Step 3a: Segment the transcript into candidate moments
Scan the transcript for self-contained 15-60 second windows. Look for natural start/end points (topic changes, pauses, complete thoughts).
#### Step 3b: Score each candidate moment on this rubric
For each candidate, score 1-10 on these five criteria:
| Criterion | What to look for | Score guide |
|---|---|---|
| Hook Strength | Does the first sentence grab attention? Is it a surprising claim, provocative question, or bold statement? | 10 = "wait, what?" reaction. 1 = generic setup |
| Quotability | Contains a memorable one-liner that people would screenshot or share? | 10 = tweet-worthy standalone quote. 1 = no standalone phrases |
| Emotional Intensity | Does the speaker show passion, humor, anger, vulnerability, or conviction? | 10 = genuine emotion. 1 = monotone/flat delivery |
| Self-Containedness | Does it make complete sense without watching the rest of the video? | 10 = fully standalone. 1 = needs prior context |
| Surprise/Controversy | Does it challenge conventional wisdom, reveal something unexpected, or offer a hot take? | 10 = counterintuitive insight. 1 = commonly known information |
Total score = sum of all five (max 50).
#### Step 3c: Rank and select top N moments
- Sort by total score descending
- Select top N (user-specified or default 3-5)
- Ensure selected moments don't overlap
- Prefer variety in topics/angles — don't pick 3 clips about the same point
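One way to implement this selection is a greedy pass over the scored candidates. The sketch below assumes each candidate is a dict with `start`, `end`, and `score` keys (names are illustrative, not part of any API):

```python
def select_top_moments(candidates, n=5):
    """Pick the n highest-scoring moments, greedily skipping overlaps."""
    selected = []
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        overlaps = any(cand["start"] < s["end"] and s["start"] < cand["end"]
                       for s in selected)
        if not overlaps:
            selected.append(cand)
        if len(selected) == n:
            break
    # Return in chronological order for presentation to the user
    return sorted(selected, key=lambda c: c["start"])
```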
#### Step 3d: Present to user for approval
For each selected moment, show:
- Timestamp range (start - end)
- Duration
- Transcript excerpt (first 2-3 lines)
- Score breakdown (hook/quotability/emotion/self-contained/surprise)
- Total score
- Suggested hook text for the clip
Wait for user approval. The user can:
- Approve all
- Remove specific clips
- Add their own timestamps
- Adjust start/end times
- Request more options
Do NOT proceed to Step 4 until the user approves.
### Step 4: Extract Raw Clips

For each approved moment, extract with FFmpeg:

```bash
ffmpeg -y -ss <start> -to <end> -i source.mp4 -c copy clip<N>-raw.mp4
```

Note: `-c copy` is fast and lossless, but cuts snap to the nearest keyframe, so clip boundaries can shift by a second or two. If exact boundaries matter, re-encode instead (e.g. `-c:v libx264 -c:a aac`) at the cost of speed.
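A sketch of running the extraction over all approved moments; the `MOMENTS` timestamp pairs are hypothetical placeholders:

```bash
# start-end pairs for each approved moment (hypothetical values)
MOMENTS="00:02:14-00:02:51 00:14:32-00:15:05 00:41:10-00:41:48"
i=1
for m in $MOMENTS; do
  ffmpeg -y -ss "${m%-*}" -to "${m#*-}" -i source.mp4 -c copy "clip${i}-raw.mp4"
  i=$((i+1))
done
```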
### Step 5: Reframe with Klap
Upload each raw clip to Klap for AI-powered speaker-tracked reframing to 9:16.
**API: Klap**

- Endpoint: `POST https://api.klap.app/v2/tasks/video-to-video`
- Auth: `Authorization: Bearer <KLAP_API_KEY>`
Submit each clip:

```python
import requests

headers = {"Authorization": f"Bearer {klap_key}"}

# Direct file upload
with open("clip-raw.mp4", "rb") as f:
    r = requests.post(
        "https://api.klap.app/v2/tasks/video-to-video",
        headers=headers,
        files={"video": f},
        data={
            "language": "en",
            "editing_options": '{"captions":false,"reframe":true,"emojis":false,"intro_title":false}',
            "dimensions": '{"width":1080,"height":1920}',
        },
    )
task_id = r.json()["id"]
output_id = r.json().get("output_id")
```
Poll until ready:

```python
# Poll every 30 seconds
r = requests.get(f"https://api.klap.app/v2/tasks/{task_id}", headers=headers)
status = r.json()["status"]        # "processing" or "ready"
output_id = r.json()["output_id"]  # project ID when ready
```
Export the reframed video:

```python
# Request an export
r = requests.post(
    f"https://api.klap.app/v2/projects/{output_id}/exports",
    headers=headers,
    json={},
)
export_id = r.json()["id"]

# Poll the export every 15 seconds
r = requests.get(
    f"https://api.klap.app/v2/projects/{output_id}/exports/{export_id}",
    headers=headers,
)
# When status != "processing", download from src_url
download_url = r.json()["src_url"]
```
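Tying these calls together, here is one possible polling helper. It reuses `headers` and `task_id` from the snippets above, and the status fields match those snippets; the helper itself is a sketch, not part of Klap's SDK:

```python
import time
import requests

def poll_until(url, headers, done, interval=30, timeout=1800):
    """GET url every `interval` seconds until done(body) is truthy."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        body = requests.get(url, headers=headers).json()
        if done(body):
            return body
        time.sleep(interval)
    raise TimeoutError(f"gave up waiting on {url}")

# Wait for reframing to finish, then request and wait for the export
task = poll_until(f"https://api.klap.app/v2/tasks/{task_id}", headers,
                  lambda b: b["status"] == "ready")
output_id = task["output_id"]
export_id = requests.post(f"https://api.klap.app/v2/projects/{output_id}/exports",
                          headers=headers, json={}).json()["id"]
export = poll_until(
    f"https://api.klap.app/v2/projects/{output_id}/exports/{export_id}",
    headers, lambda b: b["status"] != "processing", interval=15)

# Download the finished 9:16 clip
with open("clip-reframed.mp4", "wb") as f:
    f.write(requests.get(export["src_url"]).content)
```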
Klap handles:
- Face detection and tracking
- Active speaker detection (for multi-person videos)
- Smooth 16:9 → 9:16 reframing
- Dynamic cropping that follows the speaker
### Step 6: Add Animated Captions with Captions.ai
Upload each reframed clip to Captions.ai for professional animated captions.
**API: Captions.ai (Mirage)**

- Endpoint: `POST https://api.mirage.app/v1/videos/captions`
- Auth: `x-api-key: <CAPTIONS_AI_API_KEY>`
Submit each clip:

```python
headers = {"x-api-key": captions_key}

with open("clip-reframed.mp4", "rb") as f:
    r = requests.post(
        "https://api.mirage.app/v1/videos/captions",
        headers=headers,
        files={"video": f},
        data={"caption_template_id": "ctpl_DxflLOnuKkb198FNdI9E"},
    )
video_id = r.json()["video_id"]
```
Poll until complete:

```python
# Poll every 10 seconds
r = requests.get(f"https://api.mirage.app/v1/videos/{video_id}", headers=headers)
status = r.json()["status"]  # QUEUED → PROCESSING → COMPLETE or FAILED
```
Download the captioned video:

```python
r = requests.get(
    f"https://api.mirage.app/v1/videos/{video_id}/content",
    headers=headers,
    allow_redirects=True,
)
with open("clip-FINAL.mp4", "wb") as f:
    f.write(r.content)
```
Video requirements for Captions.ai:
- Aspect ratio: 9:16 (Klap's output satisfies this)
- Max file size: 50 MB
- Max duration: 5 minutes
- Formats: MP4, MOV
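A quick pre-upload validation against these limits might look like this (a sketch; the constants mirror the limits listed above):

```python
import json
import os
import subprocess

MAX_BYTES = 50 * 1024 * 1024  # Captions.ai 50 MB limit
MAX_SECONDS = 5 * 60          # Captions.ai 5-minute limit

def check_clip(path):
    """Raise if a clip would be rejected by Captions.ai."""
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError(f"{path}: exceeds 50 MB")
    probe = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", path],
        capture_output=True, text=True, check=True)
    duration = float(json.loads(probe.stdout)["format"]["duration"])
    if duration > MAX_SECONDS:
        raise ValueError(f"{path}: exceeds 5 minutes")
```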
Some popular templates (fetch the full list via `GET https://api.mirage.app/v1/videos/captions/templates`):
| Template | ID |
|---|---|
| Heat (default) | `ctpl_DxflLOnuKkb198FNdI9E` |
| Buzz | `ctpl_yvE0ZnYzEj6ClCD2ee1f` |
| Medusa | `ctpl_yNnJyDLSH5oIouKdjQx2` |
| Drive | `ctpl_wR9PXfmxW1DFxEUuATFg` |
| Magazine | `ctpl_vrs1M2VrxvzQWNRypRvh` |
| Energy | `ctpl_oofP3mxbx8CaEPNYqnKD` |
| Sirius | `ctpl_miZu2nLWyP7X8oEAAHcM` |
| Milky Way | `ctpl_jcTmJGX77Uwz2AqLOX4S` |
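To browse beyond this list, fetch the templates endpoint and inspect the response (the response schema isn't documented here, so this simply pretty-prints it):

```python
import json
import requests

r = requests.get("https://api.mirage.app/v1/videos/captions/templates",
                 headers={"x-api-key": captions_key})
print(json.dumps(r.json(), indent=2))
```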
### Step 7: Generate Platform Captions
For each final clip, Claude writes platform-specific captions:
**Instagram Reel:**
- Hook line (first sentence people see)
- 2-3 sentences of context
- CTA (save, share, follow)
- 20-30 relevant hashtags
- Tone: professional but conversational
**TikTok:**
- Short, punchy caption (1-2 lines max)
- 5-8 hashtags
- Tone: casual, direct
**YouTube Short:**
- Title (under 60 characters, curiosity-driven)
- Description (2-3 sentences)
- Tags
**LinkedIn (if applicable):**
- Longer caption (3-5 sentences with a takeaway)
- Tone: professional, insight-driven
### Step 8: Output

Save everything to the output directory:

```
<output-dir>/
  clip1-FINAL.mp4   # Ready-to-post clip
  clip2-FINAL.mp4
  clip3-FINAL.mp4
  captions.md       # All platform captions for each clip
  summary.md        # Overview: source video, clips made, scores, costs
```
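One possible skeleton for `captions.md` (the layout is a suggestion, not a fixed format; timestamps and scores are placeholders):

```markdown
## Clip 1 — "<hook text>" (00:14:32-00:15:05, score 41/50)

### Instagram Reel
<hook line>
<2-3 sentences of context>
<CTA>
#hashtag1 #hashtag2 ...

### TikTok
<punchy caption> #tag1 #tag2

### YouTube Short
Title: <under 60 characters>
Description: <2-3 sentences>
Tags: tag1, tag2
```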
Output specs:
- Format: MP4 (H.264)
- Resolution: 1080×1920 (9:16)
- Duration: 15-60 seconds per clip
- Audio: AAC
## Workflow Summary

```
User provides video
  ↓
[ASK] "Do you want me to pick the best moments, or do you have specific timestamps?"
  ↓
Whisper transcribes locally (free)
  ↓
Claude scores moments on viral rubric (hook, quotability, emotion, self-contained, surprise)
  ↓
[ASK] "Here are the top N moments with scores. Approve, adjust, or add your own?"
  ↓
FFmpeg extracts raw clips (free)
  ↓
Klap reframes to 9:16 with speaker tracking (~$2/clip)
  ↓
Captions.ai adds animated captions (~$0.15/min)
  ↓
Claude writes platform-specific captions
  ↓
Output: final clips + captions, ready to post
```
## Known Limitations

- **yt-dlp may fail on some YouTube videos** due to YouTube's evolving download restrictions. Install via `brew install yt-dlp` and keep it updated. If a download fails, the user should download the video manually and provide the local file path.
- **Klap credit costs can add up at scale.** Each clip costs ~76 credits (44 processing + 32 generation). Monitor the credit balance before batch processing.
- **Captions.ai requires 9:16 input** — always run Klap before Captions.ai, never the other way around.
- **The Whisper base model is fast but may mis-transcribe** technical terms, accents, or overlapping speech. Use `whisper.load_model("medium")` for better accuracy at the cost of slower transcription.
- **Viral scoring is heuristic** — Claude's scoring is based on content patterns, not engagement data. Scores indicate relative quality within a video, not absolute viral potential.
- **Size and length limits:** Captions.ai allows max 5 minutes per clip and a 50 MB file; Klap has plan-based limits on video length (45 minutes to 3 hours depending on plan).
- **Processing time** — Klap takes 2-5 minutes per clip, Captions.ai 1-2 minutes. A batch of 5 clips takes roughly 15-25 minutes total.
## Environment Variables

Add these to your `.env` file:

```
KLAP_API_KEY=kak_xxxxx
CAPTIONS_AI_API_KEY=sk-xxxxx
```
No other API keys or local dependencies required. Whisper model downloads automatically on first run.
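A minimal loader matching the "project root or any parent directory" rule (a sketch; if `python-dotenv` is available, its `find_dotenv`/`load_dotenv` perform the same walk):

```python
from pathlib import Path

def load_env(start="."):
    """Walk up from `start` until a .env file is found; return its key=value pairs."""
    for directory in [Path(start).resolve(), *Path(start).resolve().parents]:
        env_file = directory / ".env"
        if env_file.exists():
            return dict(
                line.split("=", 1)
                for line in env_file.read_text().splitlines()
                if "=" in line and not line.lstrip().startswith("#")
            )
    raise FileNotFoundError("no .env found in this directory or any parent")
```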