name: ingest-pdf-to-normalized
description: Ingest KUKA PDFs (manuals, application notes, training material, error code refs) from kuka_dataset/raw_sources/ into normalized knowledge entries under kuka_dataset/normalized/. Use when new PDFs are added or when re-ingesting after schema updates.
Ingest PDF to Normalized
Turn a directory of raw KUKA PDFs into typed, schema-validated knowledge entries the agent cell can reliably cite.
When to Use
- New PDFs added to kuka_dataset/raw_sources/.
- kuka_dataset/INGESTION_SCHEMA.md changed (re-ingest to conform).
- kuka_knowledge MCP search quality is poor (gap indicates missing normalization).
Prerequisites
- Python 3.11+ with pdfplumber or pypdf installed (for text extraction).
- Optional: ocrmypdf if any PDF is image-only.
- kuka_dataset/INGESTION_SCHEMA.md read in context.
- cowork/schemas/dataset_entry.schema.json read in context.
- cowork/templates/INGESTION_ENTRY_TEMPLATE.md read in context.
Steps
1. Inventory
List every PDF in kuka_dataset/raw_sources/ with its current file name and size. For each PDF, determine:
- Document kind: vendor_manual, application_note, training, error_code_ref, white_paper, third_party_integrator, community.
- Platform: KR C4, KR C5, KR C2 (legacy), iiQKA, Sunrise.
- KSS version(s): from title page or metadata.
- Primary topic(s): motion, safety, fieldbus, KRL programming, KAREL-equivalent, etc.
Output this inventory to kuka_dataset/_ingestion_log.md as a table.
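The listing half of this step is mechanical and can be scripted; the classification columns still need a human or agent to read each PDF. A minimal sketch, assuming PDFs sit directly under the raw directory (not in subfolders) and that the table columns mirror the four attributes above. The function names and column headings are illustrative, not part of the schema.

```python
from pathlib import Path


def inventory_rows(raw_dir: Path) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for every PDF directly under raw_dir."""
    pdfs = sorted(
        p for p in raw_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [(p.name, p.stat().st_size) for p in pdfs]


def inventory_table(rows: list[tuple[str, int]]) -> str:
    """Render the inventory as a markdown table.

    The Kind / Platform / KSS / Topic cells are left empty on purpose:
    they are filled in after reading each document.
    """
    lines = [
        "| File | Size (bytes) | Kind | Platform | KSS version(s) | Topic(s) |",
        "|---|---|---|---|---|---|",
    ]
    for name, size in rows:
        lines.append(f"| {name} | {size} |  |  |  |  |")
    return "\n".join(lines) + "\n"
```

Calling `inventory_table(inventory_rows(Path("kuka_dataset/raw_sources")))` yields text that can be pasted or appended into kuka_dataset/_ingestion_log.md. Because the scan is non-recursive, it does not descend into the scratch directory.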
2. Extract Text
For each PDF, extract text per page. Preserve page numbers — they are needed for citation. If a PDF is image-only, OCR it first:
ocrmypdf input.pdf output.pdf
Keep extracted text in a scratch directory (kuka_dataset/raw_sources/_scratch/, gitignored).
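One way to keep page numbers attached to the text is to write one file per page. The sketch below uses pdfplumber (one of the two libraries named in the prerequisites); the scratch layout (one folder per PDF, files named `page_NNNN.txt`) and the `needs_ocr` heuristic are assumptions, not something the schema prescribes.

```python
from pathlib import Path


def extract_pages(pdf_path: Path) -> list[tuple[int, str]]:
    """Return (1-based page number, text) for each page of the PDF."""
    import pdfplumber  # third-party; imported lazily so the helpers below stay stdlib-only

    with pdfplumber.open(str(pdf_path)) as pdf:
        # extract_text() can return None for a page with no text layer
        return [(page.page_number, page.extract_text() or "") for page in pdf.pages]


def needs_ocr(pages: list[tuple[int, str]]) -> bool:
    """Heuristic: treat the PDF as image-only when no page yields any text."""
    return all(not text.strip() for _, text in pages)


def write_pages(pages: list[tuple[int, str]], scratch_dir: Path, stem: str) -> list[Path]:
    """Write one text file per page so the page number survives in the file name."""
    out_dir = scratch_dir / stem
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in pages:
        target = out_dir / f"page_{number:04d}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

If `needs_ocr` returns true for a PDF, run the `ocrmypdf` command shown above and extract again from its output. A PDF that mixes scanned and text pages will not trip this check, so spot-check long documents for empty pages.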
3. Chunk
Per kuka_dataset/INGESTION_SCHEMA.md:
- One concept per file. Chunks should be coherent units (e.g., "PTP Motion Instruction," not the entire motion chapter).
- Max ~400 lines per normalized file. Split longer chunks.
- Preserve section hierarchy in markdown headings.
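The size and heading rules can be applied mechanically as a first pass. This is a rough sketch that assumes the extracted text has already been given markdown headings; it splits at every heading and then caps each piece at 400 lines. Whether a piece is really "one concept" remains a judgment call, and the function names are illustrative.

```python
import re

HEADING = re.compile(r"^#{1,6} ")
MAX_LINES = 400  # the "~400 lines" ceiling from the rules above


def split_on_headings(lines: list[str]) -> list[list[str]]:
    """Start a new chunk at every markdown heading line."""
    chunks: list[list[str]] = []
    for line in lines:
        if HEADING.match(line) or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    return chunks


def cap_length(chunk: list[str], max_lines: int = MAX_LINES) -> list[list[str]]:
    """Split an over-long chunk into pieces of at most max_lines lines."""
    return [chunk[i:i + max_lines] for i in range(0, len(chunk), max_lines)]


def chunk_document(text: str, max_lines: int = MAX_LINES) -> list[str]:
    """Split text at headings, then enforce the line ceiling on each piece."""
    pieces = []
    for chunk in split_on_headings(text.splitlines()):
        for part in cap_length(chunk, max_lines):
            pieces.append("\n".join(part))
    return pieces
```

The length cap cuts at a fixed line count, so a piece produced by it may begin mid-paragraph and will lack its parent heading; treat those pieces as candidates for manual re-splitting. The heading pattern also matches `#` lines inside code listings, which is another reason to review the output.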
4. Categorize
For each chunk, decide the target subdirectory:
- articles/: conceptual explanations, how-to. Prefix ONE_<topic>_<slug>.md.
- reference/: syntax, instruction reference, parameter tables. Prefix KUKA_REF_<topic>.md.
- examples/: code examples with context. Prefix EG_<scenario>.md.
- protocols/: fieldbus, EKI, RSI, mxAutomation. Prefix KUKA_<protocol>_<aspect>.md.
- safety/: safety content only. Prefix KUKA_SAFETY_<topic>.md.
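Once the category is decided, the target path follows from the naming patterns above. A small sketch of that mapping, assuming the five subdirectories live directly under kuka_dataset/normalized/; `target_path` is a hypothetical helper, not an existing tool.

```python
from pathlib import PurePosixPath

NORMALIZED_ROOT = PurePosixPath("kuka_dataset/normalized")

# Category -> file name pattern, copied from the list above.
PATTERNS = {
    "articles": "ONE_{topic}_{slug}.md",
    "reference": "KUKA_REF_{topic}.md",
    "examples": "EG_{scenario}.md",
    "protocols": "KUKA_{protocol}_{aspect}.md",
    "safety": "KUKA_SAFETY_{topic}.md",
}


def target_path(category: str, **parts: str) -> PurePosixPath:
    """Build the path of a normalized entry from its category and name parts."""
    if category not in PATTERNS:
        raise ValueError(f"unknown category: {category!r}")
    try:
        name = PATTERNS[category].format(**parts)
    except KeyError as missing:
        raise ValueError(f"{category} needs a value for {missing}") from None
    return NORMALIZED_ROOT / category / name
```

For example, `target_path("reference", topic="PTP_Motion")` gives `kuka_dataset/normalized/reference/KUKA_REF_PTP_Motion.md`. Raising on an unknown category keeps a typo from silently creating a sixth subdirectory.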
5. Emit Normalized Entries
For each chunk, write a file under the chosen subdirectory with:
YAML frontmatter (required; see INGESTION_SCHEMA.md for full spec):
---
id: KUKA_REF_PTP_Motion
title: KRL PTP Motion Instruction
topic: motion
kuka_platform: [KR C4, KR C5]
controller: [KSS 8.3, KSS 8.6]
language: KRL
source:
  type: vendor_manual
  title: "KUKA System Software 8.x Operating and Programming Instructions"
  tier: T1
  pages: [412, 418]
  access_date: 2026-04-21
  license: reference-only
revision_date: 2026-04-21
related: [KUKA_REF_LIN_Motion, ONE_motion_termination]
difficulty: intermediate
tags: [motion, ptp, asynchronous]
---
Body — summary first, syntax/details next, examples last. Cite by page range. Do NOT reproduce more than a short quote verbatim — summarize in your own words.
6. Validate
For each file, validate the frontmatter block against cowork/schemas/dataset_entry.schema.json. If a field is missing or typed wrong, fix and re-validate.
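Validation can be scripted along these lines. This is a sketch, not the project's validator: it assumes PyYAML and jsonschema are installed (neither is listed in the prerequisites) and that the schema types date fields as strings. PyYAML's `safe_load` turns bare dates such as `2026-04-21` into `datetime.date` objects, so they are converted back to text before checking.

```python
import datetime
import json
from pathlib import Path


def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (frontmatter YAML, body) for text that opens with a '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with a '---' frontmatter fence")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "\n".join(lines[1:index]), "\n".join(lines[index + 1:])
    raise ValueError("frontmatter block is never closed")


def stringify_dates(value):
    """Recursively turn date objects back into ISO strings."""
    if isinstance(value, datetime.date):  # also covers datetime.datetime
        return value.isoformat()
    if isinstance(value, dict):
        return {key: stringify_dates(item) for key, item in value.items()}
    if isinstance(value, list):
        return [stringify_dates(item) for item in value]
    return value


def frontmatter_errors(entry_path: Path, schema_path: Path) -> list[str]:
    """Return human-readable schema violations; an empty list means valid."""
    import jsonschema  # third-party
    import yaml  # third-party (PyYAML)

    frontmatter, _body = split_frontmatter(entry_path.read_text(encoding="utf-8"))
    data = stringify_dates(yaml.safe_load(frontmatter))
    schema = json.loads(schema_path.read_text(encoding="utf-8"))
    validator = jsonschema.validators.validator_for(schema)(schema)
    return [
        f"{'/'.join(map(str, error.path)) or '<root>'}: {error.message}"
        for error in validator.iter_errors(data)
    ]
```

Collecting every error with `iter_errors`, rather than stopping at the first, lets one fix pass address all missing or mistyped fields before re-validating.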
7. Update Manifests
- Append an entry to kuka_dataset/_manifest.json with the file's id, path, frontmatter summary.
- Add a row to kuka_dataset/DATASET_INDEX.md under the appropriate topic section.
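The manifest update can be made safe to repeat during re-ingestion. The sketch below assumes `_manifest.json` is a top-level JSON list of objects that each carry an `id` key; the actual layout is not specified here, so adjust it to match the real file. It replaces an existing entry with the same `id` instead of appending a duplicate, which is a design choice for re-ingestion rather than something the steps above require.

```python
import json
from pathlib import Path


def upsert_manifest_entry(manifest_path: Path, entry: dict) -> list[dict]:
    """Add entry to the manifest, replacing any existing entry with the same id."""
    if manifest_path.exists():
        entries = json.loads(manifest_path.read_text(encoding="utf-8"))
    else:
        entries = []  # assumption: a missing manifest starts as an empty list
    entries = [existing for existing in entries if existing.get("id") != entry["id"]]
    entries.append(entry)
    manifest_path.write_text(json.dumps(entries, indent=2) + "\n", encoding="utf-8")
    return entries
```

A call might pass `{"id": "KUKA_REF_PTP_Motion", "path": "reference/KUKA_REF_PTP_Motion.md", "topic": "motion"}`, where the keys beyond `id` are illustrative stand-ins for the "frontmatter summary".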
8. Reindex
Trigger kuka_knowledge.reindex() via the MCP tool so new entries are searchable.
9. QA Gate
Hand every new entry to the QA agent for validation:
- Frontmatter schema-valid?
- Citations present and correct?
- No verbatim copyright violation?
- Topic / category correct?
QA issues a REVIEW_ingestion_<date>.md. Fix any findings before declaring the ingestion done.
10. Log
Update kuka_dataset/_ingestion_log.md with which PDFs produced which normalized entries, date, and agent.
Notes
- Raw PDFs stay in raw_sources/ and remain git-lfs tracked. Normalized entries are the citable product.
- When a PDF is updated (new KSS version), produce a new normalized entry with a newer revision_date; keep the old entry for provenance if it is still relevant.
- The Architect and Motion agents will cite normalized entries via kuka_knowledge.search; a good normalization schema means they find the right entry on the first search.