name: oma-hwp
description: >
Convert HWP / HWPX / HWPML files to Markdown using kordoc. Extracts text, headings, tables,
lists, images, footnotes, and hyperlinks. Use for Korean word processor files (Hangul),
government documents, and AI-ready data preparation.
HWP Skill - HWP / HWPX / HWPML to Markdown Conversion
Scheduling
Goal
Convert Korean HWP-family documents into readable Markdown or structured JSON while preserving document structure for LLM context, RAG, government-document review, or enterprise document processing.
Intent signature
- User asks to convert, parse, read, extract, or transform
.hwp, .hwpx, or .hwpml.
- User mentions Korean word processor files, Hangul documents, government forms, or "한글 파일".
- User needs headings, tables, nested tables, lists, images, footnotes, or hyperlinks extracted from HWP-family files.
When to use
- Converting Korean HWP documents (
.hwp, .hwpx, .hwpml) to Markdown
- Preparing Korean government/enterprise documents for LLM context or RAG
- Extracting structured content (tables, headings, lists, images) from HWP
- User says "convert this HWP", "parse hwpx", "HWP to markdown", "한글 파일"
When NOT to use
- PDF files -> use
oma-pdf (OCR + Tagged PDF specialization)
- XLSX / DOCX files -> currently out of scope (may be covered by a future
oma-docs)
- Generating or editing HWP documents -> out of scope
- Already-text files -> use Read tool directly
Expected inputs
input_path: .hwp, .hwpx, or .hwpml file path
output_path or output_dir: optional explicit output target
format: optional output format, default markdown
page_range: optional page or section range
kordoc_version: optional pinned kordoc version
Expected outputs
- Markdown output next to the input file or in the requested directory
- Optional JSON output when requested
- Post-processed Markdown with flattened GFM tables and stripped Private Use Area glyphs by default
- A short report with output path, detected source format, and conversion issues
Dependencies
bun and bunx
bunx kordoc@latest or configured pinned kordoc version
resources/flatten-tables.ts for Markdown cleanup
- Local filesystem access to input and output paths
Control-flow features
- Branches by file extension, output target, format, page range, encryption/DRM state, and post-processing requirements
- Calls external CLI tools through
bunx and bun run
- Reads local HWP-family files and writes local Markdown or JSON output
- Routes non-HWP inputs to other skills instead of stretching this skill's scope
Structural Flow
Entry
- Confirm the input path exists.
- Confirm the extension is
.hwp, .hwpx, or .hwpml.
- Resolve output path or directory and default filename.
- Check that
bun is available.
Scenes
- PREPARE: Validate path, extension, size, output target, and requested format.
- ACQUIRE: Detect source format and runtime availability.
- ACT: Run
kordoc with explicit output target and requested options.
- VERIFY: Post-process Markdown and inspect structure for headings, tables, lists, images, and footnotes.
- FINALIZE: Report output path, source format, and any conversion limitations.
Transitions
- If the input is
.pdf, stop and route to oma-pdf.
- If the input is
.xlsx or .docx, explain that this skill does not advertise those formats.
- If
bun is unavailable, stop and ask the user to install Bun.
- If Markdown is produced, run
resources/flatten-tables.ts unless the caller explicitly needs HTML tables or PUA glyphs preserved.
- If output is empty or garbled, consult
resources/troubleshooting.md.
Failure and recovery
| Failure | Recovery |
|---|
bun or bunx unavailable | Ask user to install Bun |
| Unsupported or mismatched format | Check extension and magic bytes, then route or stop |
| Encrypted or DRM-locked document | Report limitation and request an accessible copy when needed |
| Empty Markdown output | Treat as possible scanned-image content and recommend OCR outside this skill |
| Complex merged tables | Accept flattened Markdown or HTML fallback as best effort |
| Stale kordoc cache | Use bunx kordoc@latest or configured pinned version |
Exit
- Success: output file exists and structure is readable after post-processing.
- Partial success: output exists with explicitly reported table, glyph, encryption, or fidelity limitations.
- Failure: no reliable output is produced and the blocking cause is reported.
Logical Operations
Actions
| Action | SSL primitive | Evidence |
|---|
| Validate file path and extension | VALIDATE | Input preflight in execution protocol |
| Check runtime availability | VALIDATE | bun --version |
| Select output target and format | SELECT | Output behavior and config |
| Run converter | CALL_TOOL | bunx kordoc@latest |
| Write output artifact | WRITE | Markdown or JSON output |
| Flatten tables and strip PUA glyphs | CALL_TOOL | resources/flatten-tables.ts |
| Inspect extraction quality | VALIDATE | Verification step |
| Report result | NOTIFY | Final user-facing summary |
Tools and instruments
kordoc: primary HWP-family conversion CLI
flatten-tables.ts: post-processing for GFM tables and Hancom PUA cleanup
bun / bunx: runtime and CLI executor
Canonical command path
bunx kordoc@latest "{input_path}" -o "{output_path}"
bun run ".agents/skills/oma-hwp/resources/flatten-tables.ts" "{output_path}"
For batch conversion, use an explicit output directory:
bunx kordoc@latest "{input_pattern}" -d "{output_dir}"
Resource scope
| Scope | Resource target |
|---|
LOCAL_FS | Input HWP-family files and generated outputs |
PROCESS | bunx kordoc and bun run subprocesses |
MEMORY | Format decisions, validation notes, and final report |
Preconditions
- Input file exists and is readable.
- Output location is writable or can be created.
bun is installed.
kordoc can parse the document or fail with a reportable error.
Effects and side effects
- Creates Markdown or JSON output files.
- May flatten merged-cell tables, trading cell fidelity for Markdown compatibility.
- Strips Private Use Area characters by default because they render as blanks without Hancom fonts.
- Does not intentionally modify the source HWP-family document.
Guardrails
- Always pass
@latest or an explicit pinned version to avoid stale bunx cache.
- Always pass an explicit output target when the user expects a file.
- Do not add custom security defenses around kordoc's ZIP, XML, SSRF, or XSS defenses.
- Report missing tables, garbled text, empty output, encrypted segments, and best-effort DRM extraction.
- Keep full CLI details in
resources/execution-protocol.md and troubleshooting branches in resources/troubleshooting.md.
Supported Formats
| Format | Extension | Notes |
|---|
| HWP 5.x binary | .hwp | Full support (incl. DRM-locked via kordoc's rhwp-algorithm port) |
| HWPX | .hwpx | Full support incl. nested tables, merged cells |
| HWPML | .hwp (XML variant) | Auto-detected by signature |
kordoc also parses PDF / XLSX / DOCX. Those are intentionally outside this skill's scope — see "When NOT to use".
References
- Execution protocol:
resources/execution-protocol.md
- Troubleshooting:
resources/troubleshooting.md
- Configuration:
config/hwp-config.yaml
- Upstream: https://github.com/chrisryugj/kordoc
- Related:
../oma-pdf/SKILL.md (use for .pdf inputs)