Repository Guidelines

Quick reference for developers working on pandoc-embedz.

Quick Reference

Common Commands

# Development setup
uv sync --all-extras              # Install dependencies

# Testing
uv run pytest tests/              # Run all tests (REQUIRED before commit)
uv run pytest tests/test_*.py     # Run specific test file
PANDOC_EMBEDZ_DEBUG=1 pandoc ...  # Enable debug output

# Build and release
uv build                          # Build package
make release-n                    # Preview release (dry run)
make release                      # Execute release

# Manual testing
pandoc report.md --filter pandoc-embedz -o output.pdf

Project Structure

pandoc_embedz/
├── filter.py           # Pandoc filter entry point
├── data_loader.py      # Data format loaders (CSV, JSON, SQLite, Excel, etc.)
├── config.py           # Configuration parsing and validation
└── main.py             # CLI entry point (standalone mode)

tests/
├── test_attributes.py  # Attribute parsing
├── test_load_data.py   # Data loading
├── test_template.py    # Template functionality
├── test_variables.py   # Variable scoping
├── test_standalone.py  # Standalone mode
└── fixtures/           # Test data files

examples/               # Runnable examples
CODE_ANALYSIS.md        # Code quality analysis

Development Workflow

Setup

With uv (Recommended):

uv sync --all-extras        # Install dependencies + dev tools
uv add <package>            # Add dependency
uv remove <package>         # Remove dependency

With pip (Alternative):

pip install -e .[dev]       # Editable install with dev dependencies
python -m build             # Build package

Testing

CRITICAL: Always run full test suite before committing:

uv run pytest tests/        # Run ALL tests (not individual files)

Running individual test files may miss failures in other suites.

Debug Mode:

PANDOC_EMBEDZ_DEBUG=1 pandoc report.md --filter pandoc-embedz -o output.pdf

Debug output includes:

Configuration parsing (attributes and YAML)
Template operations (save/load)
Variable preparation (global/local)
Data loading and SQL queries
Template rendering context

All debug messages are prefixed with [DEBUG] and written to stderr.

Code Guidelines

Style & Conventions

PEP 8 compliant: 4-space indentation, snake_case functions, CONSTANT_CASE for module constants
Type hints: Use explicit typing for function signatures
Docstrings: Document public functions and complex logic
Security: Always use validate_file_path() for file operations
Comments: Use English for all comments (international collaboration)

Testing

Tests in tests/ with Test* classes and test_* functions
Store fixtures in tests/fixtures/
Add tests for new features and bug fixes

Verify actual content, not just types:

# Good
assert isinstance(result, list)
markdown = pf.convert_text(result, input_format='panflute', output_format='markdown')
assert 'expected content' in markdown

# Insufficient
assert isinstance(result, list)  # Doesn't verify rendering

Security

Path traversal protection: Use validate_file_path() (in config.py)
SQL injection protection: Parameterized queries via pandas/sqlite3
Template security: Jinja2 templates can execute Python code - only process trusted templates

Commit & Release

Commit Guidelines

Run uv run pytest tests/ before committing
Follow current git log style: capitalized verb phrase, no terminal periods
PRs should target main with summary and test results
Before modifying files: summarize plan, wait for confirmation, then proceed
Tag commits with model info: (via Claude Code / Opus 4.5) - use actual model name
Do NOT include emoji line or Co-Authored-By line in commit messages

IMPORTANT: Before executing commits, always show the commit message to user and wait for confirmation.

Release Process (Minilla-style)

IMPORTANT: Before executing releases, always show the user:

CHANGELOG entry and version number for review
Wait for user confirmation before proceeding

Workflow:

Edit code and add CHANGELOG entry:

## [0.9.0] - 2025-11-26

### Added
- New feature description

Show CHANGELOG entry to user and confirm version number
Preview release: make release-n (dry run, shows commands)
Execute release: make release (only after user confirmation)
- Extracts version from CHANGELOG.md (single source of truth)
- Auto-updates pyproject.toml and __init__.py
- Runs all tests
- Commits with git add -u (includes all tracked changes)
- Creates annotated tag with release notes
- Pushes to GitHub
- GitHub Actions publishes to PyPI

Key Features:

CHANGELOG.md is single source of truth for version
Automatic version number updates
Dirty check prevents releases with uncommitted changes
Dry run support for safe preview

Post-Release:

pip index versions pandoc-embedz  # Verify PyPI
pip install --no-cache-dir --upgrade pandoc-embedz

Technical Details

Filter Chaining

Generate code blocks for subsequent filters (e.g., pantable):

````{.embedz format=csv}
---
---
```{.table}
{% for row in data -%}
{{ row.product }},{{ row.sales }}
{% endfor -%}
```
---
product,sales
Widget,100


**Key points:**
- Use 4 backticks for outer fence to write 3 backticks inside
- Empty YAML header `---\n---` is required
- Blocks without data return empty list `[]` (by design)

### Raw Format Output (e.g. LaTeX)

Templates can output raw format blocks using Jinja2 string expressions to
avoid interfering with the embedz code block parser:

````markdown
```embedz
{{ "```" }}{=latex}
\expandafter\gdef\csname embedz@key\endcsname{value}
{{ "```" }}
```

Key points:

Use {{ " ``` " }} to emit backticks — the embedz parser won't see them as fenced code delimiters
Pandoc parses ```{=latex} as a RawBlock, which the validator accepts alongside CodeBlock
This pattern is useful for generating LaTeX macro definitions from embedz data

Data and Template Sections

Three-section structure:

```embedz
---
YAML configuration
---
Jinja2 template
---
Inline data (optional)
```

Critical: Template usage with inline data requires THREE --- separators:

```embedz
---
template: my-template
format: json
---
(empty template section)
---
[{"data": "here"}]
```

Structure: YAML header → --- → template section → --- → data section

Template Variables and Queries

Global variables are available throughout the document:

```{.embedz}
---
global:
  year: 2024
  start_date: "{{ year }}-01-01"  # Variables can reference other globals
  query: "SELECT * FROM data WHERE date >= '{{ start_date }}'"
---
```

```{.embedz data=sales.csv}
---
query: "{{ query }}"  # Use global variable in query
---
{% for row in data %}
- {{ row.product }}: {{ row.amount }}
{% endfor %}
```

Key points:

Global variable values are expanded if they contain {{ or {%
Variables processed in definition order (later can reference earlier)
Prefix global. is optional: {{ year }} = {{ global.year }}
Works with CSV, TSV, SSV, and SQLite

Preamble for Shared Macros

Preamble defines document-wide control structures:

```{.embedz}
---
preamble: |
  {% set fiscal_year = 2024 %}
  {% macro BETWEEN(start, end) -%}
  SELECT * FROM data WHERE date BETWEEN '{{ start }}' AND '{{ end }}'
  {%- endmacro %}
global:
  yearly_query: "{{ BETWEEN(fiscal_year ~ '-01-01', fiscal_year ~ '-12-31') }}"
---
```

```{.embedz data=sales.csv}
---
query: "{{ yearly_query }}"
---
Template here
```

Alternative: Define macros in named templates (auto-shared):

```{.embedz define=sql-macros}
{%- macro BETWEEN(start, end) -%}
SELECT * FROM data WHERE date BETWEEN '{{ start }}' AND '{{ end }}'
{%- endmacro -%}
```

Implementation:

All templates share single Jinja2 environment (GLOBAL_ENV)
Preamble prepended to all templates via CONTROL_STRUCTURES_STR
Macros from named templates auto-added to control structures

Multi-Document YAML Config Files

Config files support multiple YAML documents separated by ---:

# config/settings.yaml
---
global:
  fiscal_year: 2024
---
bind:
  prev_year: fiscal_year - 1
---
preamble: |
  {% macro format_yen(n) %}{{ "{:,}".format(n) }}円{% endmacro %}
---

Key points:

Documents are merged in order; later documents override earlier ones
Sections can be written in any order; processing order is fixed (preamble → with → query → data → bind → global → alias → render)
Enables logical grouping of settings within a single file
Implementation: load_config_file() in config.py uses yaml.safe_load_all()

Parameter Aliasing

User-facing parameters:

Purpose	Recommended	Alternative	Deprecated
Template definition	`define`	-	`name` (shows warning)
Template usage (YAML)	`template`	`as`	-
Template usage (attributes)	`as`	`template`	-

Example:

```{.embedz define=my-template}
...
```

```{.embedz data=file.csv template=my-template}
```

```{.embedz data=file.csv as=my-template}
```

Implementation: pandoc_embedz/config.py

PARAMETER_PREFERRED_ALIASES: Maps external aliases to internal names
DEPRECATED_DIRECT_USE: Parameters that show warnings
normalize_config(): Converts aliases and warns about deprecated usage

Design rationale:

define vs template/as clarifies definition vs usage
template is declarative (YAML), as is concise (attributes)
Both template and as work without warnings (context-dependent choice)
name is deprecated but still works (backward compatibility)

Adding new aliases:

Update PARAMETER_PREFERRED_ALIASES in config.py
Add to DEPRECATED_DIRECT_USE if old name should warn
Add tests in tests/test_attributes.py
Update documentation

Known Issues

See CODE_ANALYSIS.md for:

Test structure issues (template usage with inline data)
Code quality improvements
Test coverage gaps
Prioritized action plan

High-priority items:

Fix test_use_saved_template to use 3-separator structure
Replace Japanese comments with English
Enhance test assertions beyond isinstance checks

Global Variable Expansion from Query Results (Implemented)

Status: ✅ Implemented in v0.9.3

Global variables can now reference loaded data. The processing order is:

Prepare preamble and with variables
Expand query (using existing global variables from previous blocks)
Load data
Expand bind expressions (type-preserving)
Expand global variables (with access to loaded data and bind results)
Render template

Example:

---
format: csv
query: SELECT * FROM data WHERE value > 80
global:
  filtered_count: "{{ data | length }}"
  total_value: "{{ data | sum(attribute='value') }}"
---
Filtered count: {{ filtered_count }}, Total: {{ total_value }}
---
name,value
Alice,100
Bob,80
Charlie,120

Important: Variables used in query: can come from:

with: in the same block (input parameters)
global: from a previous block

Variables in the same block's global: cannot be used in query: (they are expanded after data loading).

# Using with: in the same block
---
format: csv
with:
  min_value: 80
query: SELECT * FROM data WHERE value >= {{ min_value }}
global:
  filtered_total: "{{ data | sum(attribute='value') }}"
---

# Or using global: from a previous block
---
global:
  min_value: 80
---

Type-Preserving Bindings with bind:

Status: ✅ Implemented

The bind: section evaluates Jinja2 expressions while preserving their result type (dict, list, int, bool, None). Unlike global:, values are not converted to strings.

Processing order: with → query → data load → bind → global → render

Example:

---
format: csv
bind:
  first: data | first           # dict type preserved
  total: data | sum(attribute='value')  # int type preserved
  has_data: data | length > 0   # bool type preserved
---
Name: {{ first.name }}, Total: {{ total }}
---
name,value
Alice,100
Bob,200

Nested structures:

bind:
  first: data | first
  stats:
    name: first.name
    value: first.value
    doubled: first.value * 2

Dot Notation for Nested Values

Status: ✅ Implemented

Both bind: and global: support dot notation in keys to set nested values:

bind:
  record: data | first
  record.note: "'Added note'"    # Adds 'note' key to record dict
global:
  record.label: Description text  # Adds 'label' key (no quotes needed)

Key differences:

bind: values are Jinja2 expressions (string literals need quotes)
global: values are plain strings (unless containing {{ or {%)

Creates intermediate dicts:

global:
  config.settings.name: My App   # Creates config → settings → name

Error handling:

Raises ValueError if parent exists but is not a dictionary

Attribute syntax:

Dot notation also works in code block attributes:

{.embedz data=file.csv as=template with.threshold=100}
{.embedz global.author="John Doe"}

This is useful for passing parameters to templates without writing a YAML header.

Custom Filter: to_dict

Status: ✅ Implemented

Converts a list of dictionaries to a dictionary keyed by a specified field:

bind:
  by_year: data | to_dict(key='year')
  current: by_year[2024]
  previous: by_year[2023]

Example:

# Input: [{'year': 2023, 'value': 100}, {'year': 2024, 'value': 200}]
# Result: {2023: {'year': 2023, 'value': 100}, 2024: {'year': 2024, 'value': 200}}

Strict mode (default): Raises ValueError on duplicate keys:

bind:
  by_id: data | to_dict(key='id')                # error if duplicate ids
  by_id: data | to_dict(key='id', strict=False)  # allows duplicates (last wins)

Transpose mode: Adds column-keyed dictionaries for dual access patterns:

bind:
  by_year: data | to_dict(key='year', transpose=True)
  # Result: {2023: {...}, 2024: {...}, 'value': {2023: 100, 2024: 200}}
  # Access: by_year[2023].value OR by_year.value[2023]

Use case: Access data by a specific key (e.g., year, ID) instead of iterating through a list with selectattr.

Custom Filter: raise

Status: ✅ Implemented

Raises an error with a custom message. Useful for validating required parameters in templates:

{%- if label is not defined -%}
{{ "Template error: label is required" | raise }}
{%- endif -%}

Use case: Enforce required parameters and provide clear error messages when templates are misused.

Template validation patterns:

Validate data structure expectations using Jinja2 type tests:

{# Ensure header=false was used (data is list of lists, not list of dicts) #}
{%- if data and data[0] is mapping -%}
{{ "This template requires header=false (use row[0], not row.name)" | raise }}
{%- endif -%}

{# Check for string vs list #}
{%- if row is string -%}
  {{ row }}
{%- else -%}
  {{ row[0] }}: {{ row[1] }}
{%- endif -%}

Common type tests: is string, is mapping (dict), is sequence (list), is number, is defined.

Custom Filter: regex_replace

Status: ✅ Implemented

Replaces substring using regular expression. Compatible with Ansible's regex_replace filter.

{{ "Hello World" | regex_replace("World", "Universe") }}
{{ "Hello（World）" | regex_replace("[（）]", "") }}
{{ value | regex_replace("^prefix_", "", ignorecase=true) }}

Parameters:

pattern: Regular expression pattern to match
replacement: Replacement string (default: empty string)
ignorecase: Case-insensitive matching (default: False)
multiline: Multiline mode (default: False)
count: Maximum replacements, 0 = unlimited (default: 0)

Use case: Clean up data by removing unwanted characters, normalizing formats, or transforming strings with pattern matching.

Variable Reference in data=

Status: ✅ Implemented

The data= attribute can reference variables from GLOBAL_VARS (defined via bind:):

```embedz
---
format: csv
bind:
  by_year: data | to_dict(key='year')
---
---
year,value
2023,100
2024,200
```

```{.embedz data=by_year}
2024: {{ data[2024].value }}
```

Dot notation for nested access:

bind:
  by_year: data | to_dict(key='year', transpose=True)

```{.embedz data=by_year.value}
{# data is now {2023: 100, 2024: 200} #}
2024: {{ data[2024] }}


**Resolution rules:**
1. Starts with `./` or `/` → file path (explicit)
2. Try to resolve as variable (supports dot notation like `var.subkey`)
3. If resolved to dict or list → use variable
4. Otherwise → attempt file loading

**Implementation:** `_resolve_data_variable()` and `_resolve_nested_variable()` in `filter.py`

**Key points:**
- Variable reference and inline data cannot be combined (raises `ValueError`)
- Only `dict` and `list` types are resolved; strings fall back to file loading
- Use `./filename` to force file loading when variable name conflicts
- Dot notation allows accessing nested dictionary values (e.g., `data=parent.child`)
- **Query support:** `query:` can be applied to variable data, enabling data pipelines:
  ```yaml
  # Load raw data and bind to variable
  ```{.embedz data=source.csv}
  bind:
    raw_data: data

Apply different queries to the same data

query: SELECT ... GROUP BY year
bind:
  by_year: data | to_dict(key='year')

- If the variable is a dict (from `to_dict`), it's automatically converted to list before query execution
- **YAML `data:` section also supports variable references:**
```yaml
# Single variable reference
data: my_variable

# Multi-table with mixed variables and file paths
data:
  t1: my_variable      # Variable reference
  t2: data.csv         # File path
query: SELECT * FROM t1 JOIN t2 ON ...

CSV/TSV/SSV Comment Handling

Status: ✅ Implemented

Lines starting with # can be treated as comments in CSV/TSV/SSV files.

Parameter	Type	Default	Description
`comment`	string or bool	`line` (CSV/TSV/SSV), `none` (others)	`line`: skip all `#` lines, `head`: leading `#` lines only, `none`/`false`: disable, `inline`: unquoted `#` to end of line

Mode details:

head: Only strips consecutive # lines at the start of the file
line: Strips all # lines anywhere in the file (default for CSV/TSV/SSV)
inline: Uses pandas comment='#' — skips from unquoted # to end of line (quoted # is preserved)
none / false: No comment handling

Implementation: _read_source() in data_loader.py handles head and line modes as a pre-processing step before CSV parsing.

Excel File Support

Status: ✅ Implemented

Load .xlsx/.xls files directly as data sources. Requires openpyxl package (optional dependency).

Excel-specific parameters:

Parameter	Type	Description
`table`	string	Sheet name (default: first sheet)
`transpose`	bool	Swap rows and columns
`startrow`	int, string, or list	Row to start from: `3` (1-indexed row number), `"氏名"` (find cell), `"1:氏名"` (find in column 1), `[年, 月]` (all must match). `skiprows` is deprecated.

Key behaviors:

Blank rows and all-blank columns are automatically removed
Empty cells in header row get auto-generated names (column_0, column_2, etc., based on position)
Duplicate header names get a suffix: score, score_1, score_2, ...
Empty cells in data rows become empty strings
Empty sheets return [] with a warning to stderr
startrow/skiprows string pattern uses exact match (not substring); list requires all patterns to match in the same row
query parameter works via _apply_sql_query (same as CSV)
Inline data not supported (raises ValueError)
Multi-table SQL supported via file: dict syntax (see below)

Implementation: _load_excel(), _clean_column_names(), and _skip_to_matching_row() in data_loader.py

Processing order:

pd.read_excel (with integer startrow/skiprows if given)
Pattern-based row skip (if startrow/skiprows is string or list)
dropna(how='all') on rows and columns
Transpose (if transpose=True)
NaN → empty string
Header extraction + column name cleanup (if has_header=True)
SQL query (if query given)

`file:` Dict Syntax for Multi-Table Data

Status: ✅ Implemented

In multi-table data: sections, use file: dict to pass per-table parameters:

data:
  sheet1:
    file: data/report.xlsx
    table: Sheet1
  sheet2:
    file: data/report.xlsx
    table: Sheet2
    startrow: 年
  csv_table: data/other.csv
query: |
  SELECT ... FROM sheet1 JOIN sheet2 ON ...

Key points:

file: dict, plain string paths, inline data, and variable references can be mixed freely
All formats (CSV, TSV, SSV, Excel, JSON, YAML, SQLite, etc.) work in multi-table SQL mode
Parameters in file: dict (except file and format) are passed as load_data() kwargs
_normalize_data_source() returns a 3-tuple: (source, format, load_kwargs)
_query_tables() uses load_data() for all formats (no longer restricted to SEP_MAP)

Alias Feature

Status: ✅ Implemented

The alias: section adds alternative keys to all dictionaries in GLOBAL_VARS:

bind:
  item:
    label: |-
      "Item description"
    value: 100
alias:
  description: label  # 'description' becomes an alias for 'label'

Processing:

After bind and global processing completes
Recursively walks all dicts in GLOBAL_VARS
For each alias mapping (e.g., description: label):
- If source key (label) exists and alias key (description) doesn't exist
- Adds alias key with same value

Use cases:

Providing user-friendly names for data access
Supporting multiple naming conventions (e.g., Japanese and English keys)
Separating data definition (formal names) from usage (friendly names)

Real-World Usage Patterns

Standalone Mode for Data Processing

Standalone mode (-s) is particularly useful for CSV transformation pipelines:

# Data normalization pipeline
extract_tool database.db table --columns 1-11 | \
  pandoc-embedz -s normalize.emz | \
  other_tool > output.csv

Key characteristics:

Output is plain text (not processed as Markdown)
Safe for CSV, JSON, configuration files
No Pandoc API calls - pure template rendering
Can be chained with other Unix tools

CSV Output Best Practices

When generating CSV output, proper escaping is critical:

{%- macro csv_escape(value) -%}
  {%- set v = value | string -%}
  {%- if ',' in v or '"' in v or '\n' in v -%}
    "{{ v | replace('"', '""') }}"
  {%- else -%}
    {{ v }}
  {%- endif -%}
{%- endmacro -%}

Why this matters:

CSV fields containing ,, ", or newlines must be quoted
Double quotes inside fields must be escaped as ""
Without proper escaping, CSV parsers will fail

File Naming Conventions

Recommended extensions for standalone templates:

.emz - Short, memorable (3 characters)
.embedz - Descriptive alternative
.md - Only for templates that generate Markdown

Example structure:

templates/
├── normalize_data.emz       # CSV transformation
├── format_report.emz        # Data formatting
└── report_body.md           # Markdown generation

Filter Mode vs Standalone Mode

Critical distinction:

Aspect	Filter Mode	Standalone Mode
Invocation	`--filter pandoc-embedz`	`-s` flag
Output format	Markdown (to Pandoc AST)	Plain text
Pandoc API	Called	Not called
Use case	Document embedding	Data transformation

Implementation notes:

Filter mode: filter.py processes code blocks, returns Markdown
Standalone mode: main.py renders template, outputs text
Output format note in README.md (line 976-978) clarifies this

Additional Resources

README.md: User documentation and examples (includes Best Practices section)
- Basic usage and advanced features
- Best Practices: CSV escaping, file extensions, pipeline patterns
- Filter mode vs standalone mode clarification
README.JA.md: Japanese translation of README.md
MULTI_TABLE.md: Advanced multi-table SQL features
COMPARISON.md: Comparison with alternative tools
CHANGELOG.md: Version history and release notes
CODE_ANALYSIS.md: Code quality analysis and improvement recommendations

ナビゲーション

Skillsとは？

リンク

Repository Guidelines

Repository Guidelines

Quick Reference

Common Commands

Project Structure

Development Workflow

Setup

Testing

Code Guidelines

Style & Conventions

Testing

Security

Commit & Release

Commit Guidelines

Release Process (Minilla-style)

Technical Details

Filter Chaining

Data and Template Sections

Template Variables and Queries

Preamble for Shared Macros

Multi-Document YAML Config Files

Parameter Aliasing

Known Issues

Global Variable Expansion from Query Results (Implemented)

Type-Preserving Bindings with bind:

Dot Notation for Nested Values

Custom Filter: to_dict

Custom Filter: raise

Custom Filter: regex_replace

Variable Reference in data=

Apply different queries to the same data

CSV/TSV/SSV Comment Handling

Excel File Support

file: Dict Syntax for Multi-Table Data

Alias Feature

Real-World Usage Patterns

Standalone Mode for Data Processing

CSV Output Best Practices

File Naming Conventions

Filter Mode vs Standalone Mode

Additional Resources

関連スキル(🔧 開発ツール)

`file:` Dict Syntax for Multi-Table Data