name: doc-reader
description: Read any common document/data file — PDF, Word (.docx), Excel (.xlsx/.xls), PowerPoint (.pptx), images (OCR), CSV/TSV, plain text, JSON/YAML/TOML, HTML/XML, and most source-code files. Use the read_document tool.
category: tool
Universal Document Reader
Purpose
Return extracted text from any supported file in a single unified JSON envelope. The tool dispatches by file extension — you always call the same tool regardless of format.
Supported formats
| Category | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Text pages extracted in ms; scanned/image pages fall back to OCR |
| Word | .docx | Paragraphs + table cells |
| Excel | .xlsx, .xls | All sheets, first 100 rows per sheet as preview |
| PowerPoint | .pptx | Slide text content |
| Images | .png/.jpg/.jpeg/.gif/.bmp/.webp/.tiff | OCR only |
| CSV / TSV | .csv, .tsv | Raw text with encoding fallback |
| Plain text | .txt/.md/.log/.rst | Encoding fallback |
| Config | .json/.yaml/.yml/.toml/.ini/.cfg/.env | Raw text |
| Markup | .html/.htm/.xml | Raw text (no HTML stripping) |
| Source code | .py/.js/.ts/.tsx/.go/.rs/.java/.cpp/.c/.sql/.sh/... | Raw text |
| Unknown extension | anything else | Best-effort read as UTF-8/GBK text |
Blocked (rejected at /upload): executables (.exe/.dll/.so/...) and
archives (.zip/.tar/...). Ask the user to unpack archives locally first.
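To illustrate extension-based dispatch, here is a minimal sketch. It is not the tool's actual implementation: `CATEGORY_BY_EXT`, `BLOCKED_EXTS`, and `classify` are hypothetical names, and the mapping is only a subset of the table above.

```python
from pathlib import Path

# Hypothetical subset of the extension table; the real tool's internal
# mapping is not shown in this document.
CATEGORY_BY_EXT = {
    ".pdf": "pdf",
    ".docx": "word",
    ".xlsx": "excel", ".xls": "excel",
    ".pptx": "powerpoint",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".csv": "csv", ".tsv": "csv",
    ".txt": "text", ".md": "text",
    ".json": "config", ".yaml": "config",
    ".html": "markup", ".xml": "markup",
    ".py": "source", ".js": "source",
}

# Assumed examples of blocked extensions, taken from the list above.
BLOCKED_EXTS = {".exe", ".dll", ".so", ".zip", ".tar"}


def classify(file_path: str) -> str:
    """Map a path to a category using only its (case-insensitive) extension."""
    ext = Path(file_path).suffix.lower()
    if ext in BLOCKED_EXTS:
        return "blocked"
    # Anything not listed is treated as the best-effort "unknown" category.
    return CATEGORY_BY_EXT.get(ext, "unknown")
```

In this sketch only the last suffix is inspected, so a name like `data.backup.csv` is classified as `csv`.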
Usage
Always call the tool directly — do not run Python from bash.
read_document(file_path="uploads/paper.pdf")
read_document(file_path="uploads/annual_report.pdf", pages="1-10")
read_document(file_path="uploads/contract.docx")
read_document(file_path="uploads/sales.xlsx")
read_document(file_path="uploads/deck.pptx")
read_document(file_path="uploads/chart.png") # image → OCR
read_document(file_path="uploads/config.yaml")
read_document(file_path="uploads/notes.md")
The pages parameter only applies to PDF; other formats ignore it.
Return envelope
All formats share this shape:
{
  "status": "ok",
  "file": "paper.pdf",
  "format": "pdf",
  "char_count": 52000,
  "truncated": true,
  "text": "..."
}
Format-specific extra fields:
| Format | Extra keys |
|---|---|
| pdf | total_pages, pages_read, ocr_pages |
| docx | paragraphs, tables |
| excel | sheets (array of {name, rows, cols}) |
| pptx | slides |
| text | encoding, size |
Content longer than 15000 chars is truncated; for PDFs use the pages
parameter to read slices.
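When a PDF envelope comes back with `truncated` set, one way to plan the follow-up calls is to split `total_pages` into consecutive windows and pass each as `pages`. The sketch below only builds the range strings. `page_windows` is a hypothetical helper, the default window size of 10 is arbitrary, and it assumes `pages` takes 1-based inclusive "start-end" ranges as in the `pages="1-10"` example.

```python
def page_windows(total_pages: int, window: int = 10) -> list[str]:
    """Return "start-end" strings that cover pages 1..total_pages in order."""
    if total_pages < 1 or window < 1:
        raise ValueError("total_pages and window must both be at least 1")
    return [
        # Clamp the end of the last window to the final page.
        f"{start}-{min(start + window - 1, total_pages)}"
        for start in range(1, total_pages + 1, window)
    ]
```

For a 25-page PDF this yields `"1-10"`, `"11-20"`, and `"21-25"`. If a slice still comes back truncated, a smaller window can be tried.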
Workflows
Paper / report summary
1. read_document(file_path="paper.pdf") → extracted text (truncated past 15000 chars; use pages to read a long PDF in slices)
2. Extract abstract, methodology, conclusion → summarize
Contract review
1. read_document(file_path="contract.docx") → paragraphs + tables
2. Flag key clauses (termination, liability, payment, IP)
Spreadsheet quick-look
1. read_document(file_path="sales.xlsx") → all sheet previews
2. If the user specifically wants trade journal analysis, switch to the
`analyze_trade_journal` tool instead (see trade-journal skill).
Chart / screenshot / scanned PDF
1. read_document(file_path="scan.png") → OCR text
2. If OCR returns empty, tell the user; don't fabricate.
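The empty-OCR rule can be sketched as a small check on the returned envelope. This is an illustration only: `ocr_reply` is a hypothetical helper, and it assumes the envelope has already been parsed into a Python dict with the `text` field and the optional `note` field that accompanies empty OCR results.

```python
def ocr_reply(envelope: dict) -> str:
    """Return the OCR text, or an explicit 'nothing found' message."""
    text = (envelope.get("text") or "").strip()
    if text:
        return text
    # Never invent content: report the empty result and pass on any tool note.
    message = "OCR returned no text for this file."
    note = envelope.get("note")
    if note:
        message += f" Tool note: {note}"
    return message
```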
Notes
- Encoding fallback order for text: utf-8 → utf-8-sig → gbk → gb2312 → big5 → latin-1.
- OCR uses RapidOCR; if the package is missing, image/scanned files
return empty `text` with a `note` field; tell the user to install `rapidocr-onnxruntime`.
- Excel previews are limited to 100 rows per sheet to stay in budget.
If the user needs full data (e.g. trade journals), call
`analyze_trade_journal` instead.
- Source-code files are returned raw; do not re-format or re-indent.
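The encoding fallback order listed in the notes can be sketched as a loop over candidate codecs. This is an illustration rather than the tool's actual code, and `FALLBACK_ENCODINGS` and `decode_with_fallback` are hypothetical names.

```python
# Order taken from the note above.
FALLBACK_ENCODINGS = ("utf-8", "utf-8-sig", "gbk", "gb2312", "big5", "latin-1")


def decode_with_fallback(data: bytes) -> tuple[str, str]:
    """Return (text, encoding) for the first codec that decodes without error."""
    for encoding in FALLBACK_ENCODINGS:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so the loop always returns.
    raise AssertionError("unreachable: latin-1 accepts every byte")
```

Because latin-1 never fails, it works as a catch-all at the end of the list and must stay last. In this sketch a file that starts with a UTF-8 byte-order mark decodes under plain `utf-8` first, so the returned text keeps a leading U+FEFF.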