name: document-conversion
description: Convert DOC/DOCX/PDF/PPT/PPTX documents to Markdown. Automatically detects the PDF type (electronic or scanned) and extracts images to a separate directory. Use this Skill when an administrator onboards non-Markdown documents. Trigger condition: onboarding files in DOC/DOCX/PDF/PPT/PPTX format.
Document Format Conversion
Convert various document formats to Markdown for knowledge base onboarding.
Supported Formats
| Format | Processing Method |
|---|---|
| DOCX | Pandoc conversion, preserve formatting and images |
| DOC | LibreOffice → DOCX → Pandoc |
| PDF Electronic | PyMuPDF4LLM fast conversion |
| PDF Scanned | PaddleOCR-VL online OCR |
| PPTX | pptx2md professional conversion |
| PPT | LibreOffice → PPTX → pptx2md |
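The table above amounts to a dispatch on file extension, with PDF needing one extra decision. The following is a minimal sketch of that mapping, not the actual logic of smart_convert.py: the names `PIPELINES` and `pipeline_for` are hypothetical, and the `pdf_is_scanned` flag stands in for whatever detection the script really performs.

```python
from __future__ import annotations

from pathlib import Path

# Conversion steps per extension, transcribed from the table above.
# PDF is handled separately because it depends on electronic vs scanned.
PIPELINES = {
    ".docx": ["Pandoc"],
    ".doc": ["LibreOffice (to DOCX)", "Pandoc"],
    ".pptx": ["pptx2md"],
    ".ppt": ["LibreOffice (to PPTX)", "pptx2md"],
}


def pipeline_for(filename: str, pdf_is_scanned: bool = False) -> list[str]:
    """Return the ordered tool chain for a file (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # Assumption: the caller already knows whether the PDF is scanned.
        return ["PaddleOCR-VL"] if pdf_is_scanned else ["PyMuPDF4LLM"]
    if ext not in PIPELINES:
        raise ValueError(f"Unsupported format: {ext or filename}")
    return PIPELINES[ext]
```

For example, `pipeline_for("slides.ppt")` returns the two-step LibreOffice then pptx2md chain, matching the PPT row.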
Usage
python .claude/skills/document-conversion/scripts/smart_convert.py \
<temp_path> \
--original-name "<original_filename>" \
--json-output
Parameters:
- <temp_path>: Temporary file path (e.g. /tmp/kb_upload_xxx.pptx)
- --original-name: Required. The original filename, used to generate the correct image directory name
- --json-output: Output the result as JSON
Output Format
{
  "success": true,
  "markdown_file": "/path/to/output.md",
  "images_dir": "original_filename_images",
  "image_count": 5,
  "input_file": "/path/to/input.pptx"
}
Processing Flow
- Execute the conversion command (must use --original-name and --json-output)
- Parse the JSON output and check the success field
- If success: false, report the error and stop
- If success: true, record the generated file path and image directory
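The flow above can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the Skill itself: it assumes the script prints only the JSON object to stdout, and the helper names `parse_result` and `convert` are hypothetical. The fields of a failure payload are not documented here, so the sketch reports the whole object on failure.

```python
import json
import subprocess
import sys

# Script path as given in the Usage section.
SCRIPT = ".claude/skills/document-conversion/scripts/smart_convert.py"


def parse_result(stdout: str) -> dict:
    """Parse the JSON result and return the fields worth recording."""
    result = json.loads(stdout)  # raises json.JSONDecodeError on non-JSON output
    if not result.get("success"):
        # Failure payload shape is an assumption; include it verbatim.
        raise RuntimeError(f"Conversion failed: {result}")
    return {
        "markdown_file": result["markdown_file"],
        "images_dir": result["images_dir"],
    }


def convert(temp_path: str, original_name: str) -> dict:
    """Run the conversion with both required flags and parse its output."""
    proc = subprocess.run(
        [sys.executable, SCRIPT, temp_path,
         "--original-name", original_name, "--json-output"],
        capture_output=True,
        text=True,
    )
    # Assumption: the JSON object is the only thing written to stdout.
    return parse_result(proc.stdout)
```

Keeping the parsing separate from the subprocess call makes the success check easy to exercise without running a real conversion.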
Important Notes
- The image directory is named after the original filename (e.g. 培训资料_images/)
- Omitting --original-name will cause incorrect image reference paths
- PDF type is detected automatically; scanned PDFs are processed more slowly (tens of seconds to minutes)
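To sanity-check the images_dir value returned by a conversion, the naming rule in the notes above can be approximated as below. This is a guess at the convention based only on the examples in this document (original_filename_images, 培训资料_images/); it assumes the file extension is dropped before the _images suffix is appended, and `expected_images_dir` is a hypothetical helper.

```python
from pathlib import Path


def expected_images_dir(original_name: str) -> str:
    """Guess the image directory name for an original filename.

    Assumption: the directory is the filename without its extension,
    followed by "_images". The actual script may differ.
    """
    return f"{Path(original_name).stem}_images"
```

Comparing this value against images_dir in the JSON output is one way to notice a conversion that was run without --original-name.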
Format Details
For detailed processing instructions for each format, see FORMATS.md.