name: "rag-engineer" description: "RAG workflow skill. Use this skill when a user needs retrieval pipelines, chunking, ranking, citations, and evaluation for an AI application." version: "0.0.1" category: "ai-agents" tags:
- "rag"
- "retrieval"
- "embeddings"
- "evaluation"
- "citations"
- "indexing"
- "omni-enhanced" complexity: "advanced" risk: "safe" tools:
- "claude-code"
- "cursor"
- "gemini-cli"
- "codex-cli"
- "opencode" source: "omni-team" author: "Omni Skills Team" date_added: "2026-03-27" date_updated: "2026-04-19" source_type: "omni-curated" maintainer: "Omni Skills Team" family_id: "rag-engineer" family_name: "RAG Engineer" variant_id: "omni" variant_label: "Omni Curated" is_default_variant: true derived_from: "skills/rag-engineer" upstream_skill: "skills/rag-engineer" upstream_author: "sickn33" upstream_source: "community" upstream_pr: "79" upstream_head_repo: "diegosouzapw/awesome-omni-skills" upstream_head_sha: "6bf093920a93e68fa8263cf6ee767d7407989d56" curation_surface: "skills_omni" enhanced_origin: "omni-skills-private" source_repo: "diegosouzapw/awesome-omni-skills" replaces:
- "rag-engineer"
# RAG Engineer
## Overview
Use this skill when the user needs a Retrieval-Augmented Generation workflow that is measurable, debuggable, and grounded in evidence.
This skill is for designing or improving:
- corpus preparation and ingestion
- chunking and metadata strategy
- embedding and indexing choices
- semantic, keyword, or hybrid retrieval
- reranking and context assembly
- citation and provenance behavior
- retrieval evaluation and troubleshooting
The operating principle is simple: fix retrieval before tuning generation. If the right evidence is not found, ranked, filtered, and assembled correctly, prompt changes will mostly mask the problem.
Use the companion files when needed:
- references/domain-notes.md for chunking decisions, hybrid retrieval rules, metrics, and failure lookup
- examples/worked-example.md for a concrete end-to-end RAG tuning example
## When to Use This Skill
Use this skill when:
- the user is building or repairing a knowledge-grounded assistant, search-backed chat system, or internal question-answering workflow
- the user needs help with embeddings, vector search, chunking, indexing, hybrid retrieval, reranking, or citations
- the user needs a retrieval eval plan instead of prompt-only iteration
- the corpus contains documents where provenance, freshness, filtering, or permissions matter
- the system must explain which retrieved evidence supports an answer
Do not use this skill by itself when:
- the task is mainly model fine-tuning without a retrieval component
- the task is generic web search UX rather than document-grounded retrieval engineering
- document permissions, tenant boundaries, or provenance cannot be enforced
- the user only wants a one-off prompt and there is no retrieval pipeline to design or debug
## Operating Table
| Situation | Start here | Why it matters | Minimum acceptable outcome |
|---|---|---|---|
| New RAG system | Define corpus slices, users, and query classes | Prevents building retrieval with no target behavior | Named corpus scope and at least 3 realistic query classes |
| Existing system gives weak answers | Check retrieval metrics before prompts | Bad retrieval often looks like bad generation | A small eval set with expected supporting passages |
| Chunking design | Use references/domain-notes.md chunking matrix | Chunking should follow document structure and query behavior | Chunks preserve boundaries and carry useful metadata |
| Identifier-heavy corpus | Test hybrid retrieval, not semantic-only | Error codes, version strings, SKUs, and policy numbers are easy to miss semantically | Keyword or metadata path validated on identifier queries |
| Security-sensitive corpus | Design ACL and tenant filtering first | Retrieval can leak data even if generation is safe | Authorization filters applied before or during retrieval |
| Production tuning | Set stage budgets for retrieve, rerank, assemble, answer | Latency and cost failures often come from over-retrieving | Budget recorded per stage with at least one trimming plan |
| Debugging failures | Use troubleshooting section plus references/domain-notes.md | Fast diagnosis depends on mapping symptoms to pipeline stages | A suspected failure mode tied to evidence from logs or evals |
| Team handoff | Record corpus version, metadata schema, eval set, and known limits | Makes retrieval behavior reproducible | Another operator can rerun the same checks |
## Workflow
### 1. Define the retrieval job before selecting tools
Document:
- what corpus or corpora are in scope
- who is allowed to retrieve which data
- what query classes matter most
- what a good retrieval result looks like
- what downstream answer behavior is required
At minimum, identify query classes such as:
- factual lookup
- semantic paraphrase
- identifier lookup
- policy or compliance lookup
- recent or freshness-sensitive lookup
- multi-hop or comparison queries
Do not start with embedding model or vector database debates. Start with expected retrieval behavior.
### 2. Build a small retrieval eval set first
Before tuning chunk size, prompts, or ranking:
- collect representative queries
- record the expected supporting document or passage for each
- separate retrieval success from answer quality
- keep the set small but realistic so it can be rerun often
Useful eval artifacts:
- query text
- query class
- expected document IDs or passage IDs
- any required metadata filters
- notes on ambiguity or acceptable alternatives
If the team cannot agree on expected evidence for a query, the requirement is probably underspecified.
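To keep the set runnable, store each case as a small structured record. A minimal sketch in plain Python; the field names mirror the artifact list above, and the example query and IDs are hypothetical:
```python
from dataclasses import dataclass, field

@dataclass
class RetrievalEvalCase:
    """One query in the retrieval eval set; scores retrieval success only.
    Answer quality is graded separately."""
    query: str
    query_class: str                 # e.g. "identifier", "paraphrase", "freshness"
    expected_ids: list[str]          # document or passage IDs that count as hits
    required_filters: dict[str, str] = field(default_factory=dict)
    notes: str = ""                  # ambiguity, acceptable alternatives

cases = [
    RetrievalEvalCase(
        query="What does error E4021 mean?",      # hypothetical example
        query_class="identifier",
        expected_ids=["kb-errors#E4021"],
        notes="Exact error-code match required; a paraphrase is not a hit.",
    ),
]
```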
### 3. Define the ingestion contract
Specify how documents become retrievable records:
- normalization rules
- deduplication rules
- document identifiers
- section extraction rules
- freshness fields
- ACL or tenant metadata
- source URL, file path, title, timestamp, version, and provenance fields
Good ingestion contracts make debugging possible later. Every chunk should be traceable back to a source document and section.
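A sketch of that contract as a typed record, independent of any particular store; the field names are illustrative, not a required schema:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkRecord:
    """Every retrievable chunk carries enough provenance to be traced
    back to its source document, section, and corpus version."""
    chunk_id: str
    doc_id: str                 # stable document identifier
    section_path: str           # e.g. "Billing > Refunds > Partial refunds"
    text: str
    source_url: str
    title: str
    effective_date: str         # freshness field, ISO 8601
    version: str                # document or corpus version, for reindex tracking
    acl_tags: tuple[str, ...]   # tenant / permission tags, filterable at query time

def provenance(chunk: ChunkRecord) -> str:
    """Human-readable trace line for logs and citations."""
    return f"{chunk.doc_id} :: {chunk.section_path} (v{chunk.version})"
```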
### 4. Choose chunking based on content structure and queries
Chunk by document-aware boundaries where possible, not by arbitrary length alone.
Preserve metadata that supports retrieval and citations:
- document ID
- section heading or path
- timestamp or effective date
- source type
- ACL or tenant tags
- version or freshness markers
Use references/domain-notes.md for a content-type chunking matrix and common failure patterns.
Avoid assuming one universal chunk size, overlap, or top-k value. Treat these as testable starting points, not truths.
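As one testable starting point, a heading-aware splitter in plain Python; the 1,500-character fallback is a placeholder to tune against the eval set, not a recommendation:
```python
import re

def chunk_by_headings(doc_id: str, text: str, max_chars: int = 1500) -> list[dict]:
    """Split markdown-like text on headings so chunks follow document
    structure; fall back to length only inside an oversized section."""
    parts = re.split(r"(?m)^(#{1,4} .+)$", text)
    chunks, heading = [], "intro"
    for part in parts:
        if re.match(r"^#{1,4} ", part):
            heading = part.lstrip("# ").strip()   # carry the section path as metadata
            continue
        body = part.strip()
        if not body:
            continue
        for i in range(0, len(body), max_chars):
            chunks.append({
                "doc_id": doc_id,
                "section": heading,
                "text": body[i : i + max_chars],
            })
    return chunks
```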
### 5. Choose a retrieval strategy that fits the corpus
Select retrieval behavior that matches the corpus:
- semantic retrieval for concept-heavy natural language content
- keyword or lexical retrieval for identifiers, exact phrases, version strings, and error codes
- metadata filtering for access control, tenant isolation, time ranges, product families, or content type
- hybrid retrieval when both semantic similarity and exact matching matter
- reranking when first-pass retrieval has adequate recall but poor ordering
A practical default is to test:
- semantic-only
- keyword-only or lexical fallback for exact terms
- hybrid retrieval (a fusion sketch follows this list)
- hybrid plus reranking if latency and cost allow
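For the hybrid variants, reciprocal rank fusion is one way to combine semantic and lexical rankings without calibrating their raw scores against each other. A sketch; `semantic_search` and `keyword_search` are stand-ins for whatever retrievers are in play:
```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Reciprocal rank fusion: each ranking contributes 1/(k + rank) per
    chunk, so only rank positions matter, never raw score scales."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage with any two retrievers that return ranked chunk IDs:
# hybrid = rrf_fuse([semantic_search(query), keyword_search(query)])
```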
### 6. Instrument the pipeline
Make the system observable enough to answer:
- which chunks were retrieved
- with what scores or rank positions
- under which filters
- from which corpus version
- which chunks were passed to the model
- which citations appeared in the answer
Log safely. Do not leak restricted content in debug traces. If needed, log chunk IDs and metadata instead of full text.
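A sketch of a trace record that answers those questions without logging chunk text; the schema is illustrative:
```python
import json, logging, time

log = logging.getLogger("rag.retrieval")

def log_retrieval_trace(query_id: str, corpus_version: str, filters: dict,
                        retrieved: list[tuple[str, float]],
                        packed_ids: list[str]) -> None:
    """Log IDs, ranks, scores, and filters -- never raw chunk text -- so
    traces stay debuggable without leaking restricted content."""
    log.info(json.dumps({
        "ts": time.time(),
        "query_id": query_id,            # opaque ID or hash, not raw user text
        "corpus_version": corpus_version,
        "filters": filters,
        "retrieved": [{"rank": i + 1, "chunk_id": cid, "score": round(s, 4)}
                      for i, (cid, s) in enumerate(retrieved)],
        "packed": packed_ids,            # chunks actually passed to the model
    }))
```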
### 7. Evaluate retrieval separately from answer generation
Run retrieval checks before changing prompts.
For each query, ask:
- Was the relevant document retrieved at all?
- Was it retrieved high enough to survive truncation or reranking?
- Did filters wrongly exclude it?
- Did duplicate or near-duplicate chunks crowd out diversity?
- Did context assembly omit the best evidence?
Then evaluate answer behavior separately:
- correct use of retrieved evidence
- citation correctness
- abstention when evidence is weak
- handling of ambiguity or missing context
Do not conflate retrieval failures with answer-synthesis failures.
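On the retrieval side, two per-query checks keep that separation honest: recall@k answers whether the evidence was found at all, and MRR shows whether ranking, rather than retrieval, is the problem. A sketch:
```python
def recall_at_k(retrieved: list[str], expected: set[str], k: int = 10) -> float:
    """Fraction of expected passages that survive into the top k."""
    hits = expected & set(retrieved[:k])
    return len(hits) / len(expected) if expected else 0.0

def mrr(retrieved: list[str], expected: set[str]) -> float:
    """Reciprocal rank of the first expected passage; 0.0 if absent.
    Low MRR with high recall points at ranking, not retrieval."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in expected:
            return 1.0 / rank
    return 0.0
```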
### 8. Tune in the right order
Preferred tuning order:
- corpus scope and data quality
- ingestion and deduplication
- metadata schema and filters
- chunking strategy
- retrieval method
- reranking
- context assembly
- answer prompt and response policy
This order prevents prompt work from hiding broken retrieval.
### 9. Budget latency and cost by stage
Track major stages such as:
- ingest and embedding generation
- first-pass retrieval
- reranking
- context assembly
- final answer generation
If the system is slow or expensive, trim in this order first:
- remove unnecessary retrieved candidates
- improve filtering before increasing top-k
- reduce duplicated or low-value context
- rerank fewer but better candidates
- shorten context payloads before weakening grounding requirements
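Stage budgets only help if stages are actually measured. A minimal timing sketch; the budget numbers are placeholders, and retrieve() stands in for the real retriever:
```python
import time
from contextlib import contextmanager

BUDGET_MS = {"retrieve": 150, "rerank": 200, "assemble": 50, "answer": 2000}

@contextmanager
def stage(name: str, timings: dict[str, float]):
    """Record wall-clock time for one pipeline stage and flag overruns."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = (time.perf_counter() - start) * 1000
        timings[name] = elapsed
        if elapsed > BUDGET_MS.get(name, float("inf")):
            print(f"over budget: {name} took {elapsed:.0f} ms")

# Usage:
# timings: dict[str, float] = {}
# with stage("retrieve", timings):
#     candidates = retrieve(query)
```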
### 10. Define production acceptance criteria
A RAG system is ready for wider use only when it has:
- a versioned eval set
- retrieval metrics on representative query classes
- a documented metadata schema
- source traceability and citation behavior
- ACL or tenant controls where needed
- a known refresh or reindex policy
- a troubleshooting path for common failures
## Troubleshooting
### Symptom: The answer sounds fluent but cites weak or irrelevant evidence
Check:
- whether the expected passage appears in retrieved results at all
- whether low-quality chunks outrank better ones
- whether context packing includes too many marginal chunks
- whether prompt instructions are causing overconfident synthesis
Likely fixes:
- improve retrieval recall first
- tighten chunk boundaries
- add reranking
- reduce noisy context
- require the system to narrow claims when evidence is weak
### Symptom: Exact identifiers are missed
Common causes:
- semantic-only retrieval on identifier-heavy data
- normalization that strips meaningful tokens
- missing lexical path
- poor metadata filtering
Likely fixes:
- add keyword or hybrid retrieval
- preserve exact identifiers in chunks and metadata
- test identifier queries as their own eval class
### Symptom: Relevant documents are found, but wrong sections are used
Common causes:
- chunks too large or too mixed
- sections not preserved during ingestion
- reranker or context assembly preferring broad summaries
Likely fixes:
- chunk on section boundaries
- retain headings and local path metadata
- rerank for passage relevance, not only document relevance
### Symptom: Retrieval returns many near-duplicates
Common causes:
- duplicated source documents
- overlapping chunks dominating top results
- repeated boilerplate text
Likely fixes:
- deduplicate during ingestion
- collapse near-duplicate neighbors in ranking (see the sketch below)
- downweight boilerplate-heavy chunks
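One way to implement the ranking-side collapse, using token-set Jaccard similarity as an illustrative measure; at corpus scale, swap in MinHash or embedding similarity:
```python
def collapse_near_duplicates(ranked: list[dict], threshold: float = 0.85) -> list[dict]:
    """Drop chunks whose token overlap with an already-kept chunk exceeds
    the threshold, so duplicates stop crowding out diversity."""
    kept: list[dict] = []
    for chunk in ranked:
        tokens = set(chunk["text"].lower().split())
        is_duplicate = False
        for other in kept:
            other_tokens = set(other["text"].lower().split())
            union = tokens | other_tokens
            if union and len(tokens & other_tokens) / len(union) >= threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(chunk)
    return kept
```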
### Symptom: Good retrieval offline, poor answers online
Common causes:
- online filters differ from eval conditions
- context truncation removes the best evidence
- answer stage ignores or misuses evidence
- stale index or stale metadata in production
Likely fixes:
- compare offline and online traces
- verify final packed context
- check citation-to-source mapping (see the sketch below)
- confirm refresh and reindex behavior
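A sketch of the citation-to-source check, assuming a bracketed [chunk-id] citation style; adapt the regex to the actual marker format:
```python
import re

def unmapped_citations(answer: str, packed_chunk_ids: set[str]) -> list[str]:
    """Return citations in the answer that do not correspond to any chunk
    actually packed into the context -- a common offline/online gap."""
    cited = set(re.findall(r"\[([\w#.-]+)\]", answer))
    return sorted(cited - packed_chunk_ids)
```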
### Symptom: Cross-tenant or restricted content leakage risk
Common causes:
- filters applied after retrieval instead of before or during it
- missing ACL metadata at chunk level
- unsafe logs that expose retrieved text
Likely fixes:
- enforce authorization in retrieval (see the sketch below)
- carry ACL metadata into every chunk
- sanitize traces and debug output
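The enforcement predicate itself is simple; what matters is where it runs. A sketch; search_fn stands in for a store-side filtered query, which is where the predicate belongs in production so restricted chunks never enter ranking or logs:
```python
def authorized(chunk_acl_tags: set[str], user_tags: set[str]) -> bool:
    """True only if the user shares at least one ACL tag with the chunk."""
    return bool(chunk_acl_tags & user_tags)

def retrieve_for_user(query: str, user_tags: set[str], search_fn) -> list[dict]:
    """Pass the authorization predicate into retrieval itself, rather
    than filtering a retrieved list afterwards."""
    return search_fn(query,
                     predicate=lambda c: authorized(set(c["acl_tags"]), user_tags))
```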
### Symptom: The system is slow or too expensive
Common causes:
- over-retrieval
- expensive reranking depth
- oversized context assembly
- unnecessary second-pass calls
Likely fixes:
- reduce candidate count with better filtering
- rerank only where recall already looks acceptable
- pass fewer, better chunks to the model
- set explicit per-stage budgets
For a more detailed symptom-to-fix matrix, use references/domain-notes.md.
## Examples
Open examples/worked-example.md for a concrete mini-corpus showing:
- corpus preparation
- metadata fields
- document-aware chunking
- retrieval eval queries
- expected retrieval behavior
- failure analysis
- before/after tuning decisions
## Additional Resources
- OpenAI Embeddings guide: https://platform.openai.com/docs/guides/embeddings
- OpenAI Evals guide: https://platform.openai.com/docs/guides/evals
- OpenAI Latency optimization guide: https://platform.openai.com/docs/guides/latency-optimization
- OpenAI Building agents track: https://developers.openai.com/tracks/building-agents/
- OpenAI Responses guide: https://platform.openai.com/docs/guides/responses
- pgvector project documentation: https://github.com/pgvector/pgvector
## Related Skills
Consider a different or additional skill when the center of gravity changes:
- use a database or search-infrastructure skill when the main work is storage engine administration or production database operations
- use an eval-focused skill when the main task is dataset design, scoring, and regression automation across many systems
- use an application security skill when the main issue is data isolation, authorization architecture, or compliance review beyond retrieval boundaries
## Execution Notes
During execution, keep outputs concrete:
- name the corpus scope
- list the metadata fields
- describe the retrieval path
- identify the eval queries used
- state what changed and why
- distinguish retrieval fixes from answer-generation fixes
A strong final answer from this skill should leave the operator with a retrieval plan that can be tested, traced, and improved without guesswork.