name: agentic-vision
description: |
  Gemini 3 Flash Agentic Vision - The Sandwich Architecture for pixel-perfect UI generation.
  Phase 1: SURVEYOR measures layout BEFORE generation (grids, spacing, colors).
  Phase 2: QA TESTER verifies AFTER render (SSIM, diff regions, auto-fix).
  "Measure twice, cut once" - generator gets hard data, not guesses.
  Use when: video-to-code, image-to-code, UI verification, layout measurement, pixel-perfect generation, SSIM comparison, auto-fix suggestions.
user-invocable: true
Agentic Vision - The Sandwich Architecture
Version: 1.0.0
Last Updated: 2026-01-30
What is Agentic Vision?
Agentic Vision in Gemini 3 Flash converts image understanding from a static act into an agentic process. It combines visual reasoning with Code Execution.
Think → Act → Observe loop:
1. THINK: Analyze image, formulate plan
2. ACT: Generate and execute Python code (crop, measure, annotate)
3. OBSERVE: Process results, refine understanding
Key capability: instead of guessing that the padding is p-4 (16px in Tailwind's default scale), the model measures the pixels and returns the actual value, for example 24px.
The Sandwich Architecture
REPLAY "SANDWICH" ARCHITECTURE
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  ┌──────────┐                                                       │
│  │  Video   │────────────────────┐                                  │
│  │  Input   │                    │                                  │
│  └────┬─────┘                    │                                  │
│       │                          ▼                                  │
│       │             ┌─────────────────────────┐                     │
│       │             │ PHASE 1: THE SURVEYOR   │                     │
│       │             │ (Agentic Vision Flash)  │                     │
│       │             ├─────────────────────────┤                     │
│       │             │ 1. Measure Grids (px)   │                     │
│       │             │ 2. Extract Colors (hex) │                     │
│       │             │ 3. Map Layout (JSON)    │ ◄─── KEY            │
│       │             └────────────┬────────────┘                     │
│       │                          │                                  │
│       ▼                          ▼                                  │
│  ┌──────────────┐   ┌─────────────────────────┐                     │
│  │ Gemini 3 Pro │◄──│ Architecture Specs      │                     │
│  │ (Code Gen)   │   │ (Hard Data JSON)        │                     │
│  └──────┬───────┘   └─────────────────────────┘                     │
│         │                                                           │
│         ▼                                                           │
│  ┌──────────────┐   ┌──────────────────────────────────┐            │
│  │ Render View  │──▶│ PHASE 2: THE QA TESTER           │            │
│  └──────────────┘   │ (Agentic Vision Flash)           │            │
│                     ├──────────────────────────────────┤            │
│                     │ 1. Compare Original vs Render    │            │
│                     │ 2. "Spot the difference" (SSIM)  │            │
│                     │ 3. Auto-fix suggestions          │            │
│                     └─────────────────┬────────────────┘            │
│                                       │                             │
│                                       ▼                             │
│                           ┌───────────────────────┐                 │
│                           │  FINAL PIXEL-PERFECT  │                 │
│                           │       COMPONENT       │                 │
│                           └───────────────────────┘                 │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Phase 1: THE SURVEYOR
Measures layout BEFORE code generation.
API Endpoint
POST /api/survey/measure
{
  imageBase64: string,           // Base64-encoded frame
  mimeType?: string,             // default: 'image/png'
  useParallel?: boolean,         // default: true (faster)
  includePromptFormat?: boolean  // include the formatted prompt for the generator
}
Response
{
  success: true,
  measurements: {
    imageDimensions: { width: 1920, height: 1080 },
    grid: { columns: 12, gap: "24px" },
    spacing: {
      sidebarWidth: "256px",
      navHeight: "64px",
      cardPadding: "24px",
      sectionGap: "48px",
      containerPadding: "32px"
    },
    colors: {
      background: "#0f172a",
      surface: "#1e293b",
      primary: "#6366f1",
      text: "#ffffff",
      textMuted: "#94a3b8",
      border: "#334155"
    },
    typography: {
      h1: "48px",
      h2: "32px",
      body: "16px",
      small: "14px"
    },
    components: [
      { type: "sidebar", bbox: {...}, confidence: 0.95 }
    ],
    confidence: 0.91
  },
  promptFormat: "... formatted for code generator ..."
}
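The spacing and typography values in this response are CSS pixel strings, so a caller that wants to do arithmetic on them has to parse them first. A minimal sketch of that step follows; `parsePx` and `spacingToNumbers` are hypothetical helper names used only for illustration and are not part of the documented API.

```typescript
// Hypothetical helpers for illustration; not exports of '@/lib/agentic-vision'.
type SpacingMap = Record<string, string>;

/** Parse a CSS pixel string such as "24px" into a number; returns null if malformed. */
function parsePx(value: string): number | null {
  const match = /^(\d+(?:\.\d+)?)px$/.exec(value.trim());
  return match ? Number(match[1]) : null;
}

/** Convert every entry of a spacing map to numeric pixels, skipping malformed values. */
function spacingToNumbers(spacing: SpacingMap): Record<string, number> {
  const result: Record<string, number> = {};
  for (const [key, value] of Object.entries(spacing)) {
    const px = parsePx(value);
    if (px !== null) result[key] = px;
  }
  return result;
}

// Values taken from the sample response above.
console.log(spacingToNumbers({ sidebarWidth: "256px", navHeight: "64px", cardPadding: "24px" }));
```

Skipping malformed entries instead of throwing is a design choice for this sketch; a stricter pipeline might prefer to reject the whole survey result.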
Code Usage
import { runParallelSurveyor, formatSurveyorDataForPrompt } from '@/lib/agentic-vision';
// 1. Run Surveyor on video frame
const { measurements } = await runParallelSurveyor(frameBase64, 'image/png');
// 2. Inject into code generator prompt
const prompt = `
${SYSTEM_PROMPT}
${formatSurveyorDataForPrompt(measurements)}
Generate code based on the video above.
`;
// 3. Generator uses EXACT values: p-[24px] not p-4
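To illustrate what "exact values" means on the generator side, the sketch below turns measured tokens into Tailwind arbitrary-value classes. The helper names (`arbitraryClass`, `toTailwindClasses`) and the `MeasuredTokens` shape are assumptions made for this example; the real `formatSurveyorDataForPrompt` output may look different.

```typescript
// Hypothetical helpers for illustration; they are not exports of '@/lib/agentic-vision'.

/** Build a Tailwind arbitrary-value class, e.g. ("p", "24px") -> "p-[24px]". */
function arbitraryClass(utility: string, value: string): string {
  // Tailwind arbitrary values cannot contain spaces; underscores stand in for them.
  return `${utility}-[${value.trim().replace(/\s+/g, "_")}]`;
}

// Assumed subset of the Surveyor measurements used by this sketch.
interface MeasuredTokens {
  cardPadding: string;
  sidebarWidth: string;
  navHeight: string;
  background: string;
}

/** Map measured tokens to the classes a generator could emit verbatim. */
function toTailwindClasses(tokens: MeasuredTokens): Record<string, string> {
  return {
    card: arbitraryClass("p", tokens.cardPadding),
    sidebar: arbitraryClass("w", tokens.sidebarWidth),
    nav: arbitraryClass("h", tokens.navHeight),
    page: arbitraryClass("bg", tokens.background),
  };
}

console.log(
  toTailwindClasses({
    cardPadding: "24px",
    sidebarWidth: "256px",
    navHeight: "64px",
    background: "#0f172a",
  })
);
// card -> p-[24px], sidebar -> w-[256px], nav -> h-[64px], page -> bg-[#0f172a]
```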
Phase 2: THE QA TESTER
Verifies generated UI AFTER render.
API Endpoint
POST /api/verify/diff
{
  originalImageBase64: string,   // Original frame from video
  generatedImageBase64: string,  // Screenshot of generated code
  mimeType?: string,             // default: 'image/png'
  quickCheck?: boolean,          // Only SSIM, skip full analysis
  includeReport?: boolean        // Include formatted text report
}
Response
{
  success: true,
  verification: {
    ssimScore: 0.94,
    overallAccuracy: "94%",
    verdict: "needs_fixes", // "pass" | "needs_fixes" | "major_issues"
    issues: [
      {
        type: "spacing",
        severity: "medium",
        location: "card padding",
        description: "Card padding is 16px, should be 24px",
        expected: "24px",
        actual: "16px"
      }
    ],
    autoFixSuggestions: [
      {
        selector: ".card",
        property: "padding",
        suggestedValue: "24px",
        confidence: 0.85
      }
    ]
  },
  report: "✅ QA VERIFICATION REPORT..."
}
Verdict Rules
| Verdict | Condition |
|---|---|
| pass | SSIM >= 0.95 AND no high severity issues |
| needs_fixes | SSIM >= 0.85 AND <= 3 high severity issues |
| major_issues | SSIM < 0.85 OR > 3 high severity issues |
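The verdict rules can be read as a small decision function. The sketch below is one way to encode them, checking `major_issues` first so the overlapping ranges resolve unambiguously; the function name `decideVerdict` and the `"low"` severity level are assumptions (the document only shows `"medium"` and refers to high severity).

```typescript
// Assumed severity levels; only "medium" and "high" appear in the document.
type Severity = "low" | "medium" | "high";
type Verdict = "pass" | "needs_fixes" | "major_issues";

interface Issue {
  severity: Severity;
}

/** Apply the verdict table: major_issues first, then pass, otherwise needs_fixes. */
function decideVerdict(ssimScore: number, issues: Issue[]): Verdict {
  const highCount = issues.filter((issue) => issue.severity === "high").length;
  if (ssimScore < 0.85 || highCount > 3) return "major_issues";
  if (ssimScore >= 0.95 && highCount === 0) return "pass";
  return "needs_fixes";
}

// The sample response above: SSIM 0.94 with one medium issue.
console.log(decideVerdict(0.94, [{ severity: "medium" }])); // needs_fixes
```

Note that a render with SSIM >= 0.95 but one to three high severity issues falls through to `needs_fixes`, which is what the table implies.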
Enabling Code Execution
Agentic Vision requires the codeExecution tool to be enabled in the Gemini API request:
import { GoogleGenAI } from '@google/genai';
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
  model: 'gemini-3-flash',
  contents: [
    { text: prompt },
    { inlineData: { data: imageBase64, mimeType: 'image/png' } }
  ],
  config: {
    tools: [{ codeExecution: {} }] // <-- CRITICAL
  }
});
// Response contains:
// - executableCode: { code: "Python code..." }
// - codeExecutionResult: { outcome: "OUTCOME_OK", output: "JSON result" }
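The `executableCode` and `codeExecutionResult` entries arrive as parts of the first candidate's content, interleaved with ordinary text parts. The following sketch collects them and parses the last output as JSON. The local `ResponseLike` types are a simplified stand-in for the SDK's response type, and `collectExecution` and `parseLastJson` are hypothetical names, not SDK functions.

```typescript
// Simplified local types mirroring the fields named above; the SDK's own types are richer.
interface ResponsePart {
  text?: string;
  executableCode?: { code?: string };
  codeExecutionResult?: { outcome?: string; output?: string };
}

interface ResponseLike {
  candidates?: { content?: { parts?: ResponsePart[] } }[];
}

interface ExecutionTrace {
  code: string[];    // Python snippets the model ran, in order
  outputs: string[]; // stdout of each executed snippet
  failed: boolean;   // true if any snippet did not finish with OUTCOME_OK
}

/** Walk the parts of the first candidate and gather code plus execution results. */
function collectExecution(response: ResponseLike): ExecutionTrace {
  const parts = response.candidates?.[0]?.content?.parts ?? [];
  const trace: ExecutionTrace = { code: [], outputs: [], failed: false };
  for (const part of parts) {
    if (part.executableCode?.code) trace.code.push(part.executableCode.code);
    const result = part.codeExecutionResult;
    if (result) {
      if (result.outcome !== "OUTCOME_OK") trace.failed = true;
      if (result.output) trace.outputs.push(result.output);
    }
  }
  return trace;
}

/** Parse the last output as JSON; returns null when absent or not valid JSON. */
function parseLastJson(trace: ExecutionTrace): unknown {
  const last = trace.outputs[trace.outputs.length - 1];
  if (last === undefined) return null;
  try {
    return JSON.parse(last);
  } catch {
    return null;
  }
}
```

Treating only the last output as the result assumes the prompt asks the model to print its final JSON at the end of the run; intermediate outputs are kept in the trace for debugging.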
Available Python Libraries in Sandbox
# Data Science
import numpy as np
import pandas as pd
from scipy import ndimage
from sklearn import preprocessing
# Image Processing
from PIL import Image
from skimage import filters, measure, transform
from skimage.metrics import structural_similarity as ssim
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Utilities
import io
import json
Technical Considerations
1. Coordinate Normalization
Gemini may rescale images internally. Always request BOTH:
- Normalized coordinates (0.0-1.0)
- Image dimensions for backend rescaling
def normalize_bbox(x, y, w, h, img_width, img_height):
    return {
        "x": x / img_width,
        "y": y / img_height,
        "width": w / img_width,
        "height": h / img_height
    }
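On the backend, the normalized box has to be scaled back to the pixel size of the frame that will actually be rendered or cropped. A minimal sketch of that inverse step follows; `denormalizeBbox` and the `BBox` shape are hypothetical names chosen to mirror the Python helper above.

```typescript
// Box in either normalized (0.0-1.0) or pixel units, depending on context.
interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

/** Scale a normalized box to whole pixels for an image of the given size. */
function denormalizeBbox(norm: BBox, imgWidth: number, imgHeight: number): BBox {
  return {
    x: Math.round(norm.x * imgWidth),
    y: Math.round(norm.y * imgHeight),
    width: Math.round(norm.width * imgWidth),
    height: Math.round(norm.height * imgHeight),
  };
}

// A box covering the right half of the top quarter of a 1920x1080 frame.
console.log(denormalizeBbox({ x: 0.5, y: 0, width: 0.5, height: 0.25 }, 1920, 1080));
// x: 960, y: 0, width: 960, height: 270
```

Rounding to whole pixels is an assumption that suits cropping and CSS output; keep the fractional values if sub-pixel precision matters downstream.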
2. Parallel Execution for Speed
Run color sampling and spacing measurement in parallel:
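const [colors, spacing] = await Promise.all([
  surveyColors(frame),  // Fast
  surveySpacing(frame)  // Heavier CV
]);
// Wall-clock time is roughly that of the slower call instead of the sum of both
// (close to a 50% reduction when the two calls take similar time)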
3. SSIM with scikit-image
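Use the SSIM implementation from scikit-image:
from skimage.metrics import structural_similarity as ssim
# img1, img2: same-shape grayscale uint8 arrays (for RGB input, pass channel_axis=-1)
score, diff_image = ssim(img1, img2, full=True)
# score: mean SSIM; 1.0 means identical, lower means less similar
# diff_image: per-pixel SSIM map (low values mark the regions that differ)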
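For intuition about what the score measures, the SSIM formula can be evaluated once over a whole image. The sketch below does exactly that in TypeScript: it is a single-window (global) SSIM over flat grayscale buffers, not the sliding-window mean that scikit-image computes, so its numbers will not match `structural_similarity` exactly. The function name `globalSsim` is hypothetical; the constants follow the standard definition, C1 = (0.01 L)^2 and C2 = (0.03 L)^2 with L the dynamic range.

```typescript
/** Arithmetic mean of a non-empty array. */
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

/**
 * Single-window SSIM over two equal-length grayscale buffers (values 0..dynamicRange).
 * Simplified illustration only: no sliding window, no Gaussian weighting.
 */
function globalSsim(a: number[], b: number[], dynamicRange = 255): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Images must be non-empty and have the same number of pixels");
  }
  const meanA = mean(a);
  const meanB = mean(b);
  let varA = 0;
  let varB = 0;
  let cov = 0;
  a.forEach((valueA, i) => {
    const da = valueA - meanA;
    const db = (b[i] ?? 0) - meanB;
    varA += da * da;
    varB += db * db;
    cov += da * db;
  });
  const n = a.length;
  varA /= n;
  varB /= n;
  cov /= n;
  const c1 = (0.01 * dynamicRange) * (0.01 * dynamicRange);
  const c2 = (0.03 * dynamicRange) * (0.03 * dynamicRange);
  return (
    ((2 * meanA * meanB + c1) * (2 * cov + c2)) /
    ((meanA * meanA + meanB * meanB + c1) * (varA + varB + c2))
  );
}

const frame = [10, 50, 200, 255, 0, 128];
console.log(globalSsim(frame, frame)); // 1 for identical inputs
```

A global score hides where the mismatch is, which is why the QA Tester relies on the per-pixel map from scikit-image to locate diff regions.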
Integration with Replay Pipeline
Before (Without Surveyor)
Video → Gemini Pro "guesses" → p-4 or p-6? → 3-5 iterations
After (With Sandwich Architecture)
Video → Surveyor MEASURES → padding: 24px → Generator EXECUTES → 1-2 iterations
Result: the first generation starts from measured values instead of guesses, so it needs 1-2 correction rounds rather than 3-5.
File Structure
lib/agentic-vision/
├── index.ts # Main exports
├── types.ts # TypeScript interfaces
├── prompts.ts # Surveyor & QA prompts
├── surveyor.ts # Phase 1 implementation
└── qa-tester.ts # Phase 2 implementation
app/api/
├── survey/measure/route.ts # Surveyor endpoint
└── verify/diff/route.ts # QA Tester endpoint
Quick Start
// Full pipeline with Agentic Vision
// (generateWithConstraints and renderAndCapture stand in for application-specific helpers)
// 1. PHASE 1: Measure before generation
const surveyResult = await fetch('/api/survey/measure', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    imageBase64: videoFrame,
    includePromptFormat: true
  })
});
const { measurements, promptFormat } = await surveyResult.json();
// 2. Generate code with HARD DATA
const codeResult = await generateWithConstraints(video, promptFormat);
// 3. Render and screenshot
const screenshot = await renderAndCapture(codeResult.code);
// 4. PHASE 2: Verify
const qaResult = await fetch('/api/verify/diff', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    originalImageBase64: videoFrame,
    generatedImageBase64: screenshot
  })
});
const { verification } = await qaResult.json();
// 5. Check result
if (verification.verdict === 'pass') {
  console.log('✅ Pixel-perfect!');
} else {
  console.log('⚠️ Apply fixes:', verification.autoFixSuggestions);
}
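One way to act on `autoFixSuggestions` is to turn them into CSS override rules that can be appended to the generated stylesheet before the next render. The sketch below assumes the suggestion shape shown in the QA Tester response; `suggestionsToCss` and the 0.8 confidence cutoff are assumptions for illustration, not part of the documented pipeline.

```typescript
// Shape copied from the autoFixSuggestions entries in the QA Tester response.
interface AutoFixSuggestion {
  selector: string;
  property: string;
  suggestedValue: string;
  confidence: number;
}

/** Emit one CSS rule per suggestion at or above the confidence cutoff (assumed default 0.8). */
function suggestionsToCss(suggestions: AutoFixSuggestion[], minConfidence = 0.8): string {
  return suggestions
    .filter((s) => s.confidence >= minConfidence)
    .map((s) => `${s.selector} { ${s.property}: ${s.suggestedValue}; }`)
    .join("\n");
}

console.log(
  suggestionsToCss([
    { selector: ".card", property: "padding", suggestedValue: "24px", confidence: 0.85 },
    { selector: ".nav", property: "height", suggestedValue: "60px", confidence: 0.4 },
  ])
);
// .card { padding: 24px; }
```

Dropping low-confidence suggestions keeps a wrong guess from making the next render worse; after applying the overrides, the render and verify steps would be repeated until the verdict is `pass` or an iteration limit is reached.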