Himotoki AI Agent Context Guide
This document provides comprehensive context for AI agents working with the Himotoki codebase. It serves as a detailed technical reference for understanding the architecture, conventions, and implementation details.
Table of Contents
- Project Overview
- Architecture Overview
- Core Concepts
- Module Deep Dive
- Data Flow
- Database Schema
- Scoring System
- Conjugation System
- Suffix and Compound Word Handling
- Synergies and Segfilters
- Counter Word Recognition
- Testing Strategy
- Development Commands
- Code Conventions
- Key Constants and SEQ Numbers
- Common Patterns
- Troubleshooting Guide
Issue Tracking
This project uses bd (beads) for issue tracking.
Run bd prime for workflow context, or install hooks (bd hooks install) for auto-injection.
Quick reference:
- bd ready - Find unblocked work
- bd create "Title" --type task --priority 2 - Create issue
- bd close <id> - Complete work
- bd sync - Sync with git (run at session end)
For full workflow details: bd prime
For GitHub Copilot users: Add the same content to .github/copilot-instructions.md
How it works:
- bd prime provides dynamic workflow context (~80 lines)
- bd hooks install auto-injects bd prime at session start
- AGENTS.md only needs this minimal pointer, not full instructions
Project Overview
Himotoki (紐解き, "unraveling") is a Python remake of ichiran, a comprehensive Japanese morphological analyzer written in Common Lisp. It segments Japanese text into words, provides dictionary definitions, romanization, and conjugation analysis.
Key Characteristics
| Aspect | Details |
|---|---|
| Language | Python 3.10+ |
| Database | SQLite (portable, ~3GB) |
| Dictionary | JMdict (EDRDG) |
| Algorithm | Viterbi-style dynamic programming |
| Package Manager | pip/uv with pyproject.toml |
| Code Style | Black (100 char line length), isort |
| Type Checking | mypy (optional) |
| Testing | pytest with hypothesis for property-based testing |
Project Structure
himotoki/
├── himotoki/ # Main package
│ ├── __init__.py # Public API: analyze(), analyze_async(), warm_up()
│ ├── __main__.py # Entry point for `python -m himotoki`
│ ├── cli.py # Command-line interface
│ ├── segment.py # Core segmentation algorithm (Viterbi DP)
│ ├── lookup.py # Dictionary lookup and scoring engine
│ ├── output.py # WordInfo dataclass and output formatting
│ ├── models.py # Pydantic models for API responses
│ ├── characters.py # Character utilities (kana/kanji detection, romanization)
│ ├── constants.py # Consolidated constants (conjugation types, SEQ numbers)
│ ├── synergies.py # Synergy bonuses and segfilter constraints
│ ├── suffixes.py # Suffix compound word handling (〜たい, 〜ている, etc.)
│ ├── counters.py # Counter word recognition (三匹, 五冊)
│ ├── splits.py # Word split definitions for compound scoring
│ ├── setup.py # First-time database setup and JMdict download
│ ├── db/ # Database layer
│ │ ├── __init__.py
│ │ ├── connection.py # SQLAlchemy engine, session management, caching
│ │ └── models.py # ORM models (Entry, KanjiText, KanaText, etc.)
│ └── loading/ # Data loading utilities
│ ├── __init__.py
│ ├── jmdict.py # JMdict XML parser and loader
│ ├── conjugations.py # Conjugation rule generation
│ └── errata.py # Manual dictionary corrections
├── tests/ # Test suite
│ ├── conftest.py # pytest fixtures (db_session)
│ ├── test_*.py # Unit and property-based tests
│ └── data/ # Test data files
├── scripts/ # Developer utilities
│ ├── compare.py # Compare output with ichiran
│ ├── init_db.py # Database initialization helper
│ └── report.py # HTML report generator
├── data/ # Dictionary data (CSV files for conjugations)
├── docs/ # Documentation
└── pyproject.toml # Project configuration
Architecture Overview
Himotoki follows a layered architecture with clear separation of concerns:
┌─────────────────────────────────────────────────────────────────┐
│ PUBLIC API (__init__.py) │
│ analyze(), analyze_async(), warm_up(), shutdown() │
│ Models: WordResult, AnalysisResult, VocabularyResult │
└────────────────────────────────────────────────────────────────┬┘
│
┌────────────────────────────────────────────────────────────────▼┐
│ CLI LAYER (cli.py) │
│ Command-line interface with multiple output formats │
│ Subcommands: analyze (default), setup, init-db │
└────────────────────────────────────────────────────────────────┬┘
│
┌────────────────────────────────────────────────────────────────▼┐
│ OUTPUT LAYER (output.py) │
│ - WordInfo dataclass: canonical word representation │
│ - dict_segment(): main entry point for segmentation │
│ - fill_segment_path(): converts segments to WordInfo │
│ - JSON/text formatting functions │
└────────────────────────────────────────────────────────────────┬┘
│
┌────────────────────────────────────────────────────────────────▼┐
│ SEGMENTATION ENGINE (segment.py) │
│ - find_sticky_positions(): detect forbidden word boundaries │
│ - join_substring_words(): find all candidate words │
│ - find_best_path(): Viterbi-style dynamic programming │
│ - TopArray: priority queue for tracking top N paths │
└────────────────────────────────────────────────────────────────┬┘
│
┌────────────────────────────────────────────────────────────────▼┐
│ LOOKUP & SCORING (lookup.py) │
│ - find_word_full(): database lookup with conjugation support │
│ - calc_score(): complex scoring algorithm │
│ - Segment, SegmentList: word match containers │
│ - WordMatch, CompoundWord, ConjData: data structures │
└────────────────────────────────────────────────────────────────┬┘
│
┌─────────────────────────────────────────────────────────────────┤
│ GRAMMAR SUBSYSTEMS │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ synergies.py │ suffixes.py │ counters.py │
│ - Synergy bonus │ - Suffix cache │ - Number parsing │
│ - Segfilters │ - ~たい, ~ている │ - Counter cache │
│ - Penalties │ - Abbreviations │ - Phonetic rules │
└─────────────────┴─────────────────┴─────────────────────────────┤
│
┌────────────────────────────────────────────────────────────────▼┐
│ CHARACTER UTILITIES (characters.py) │
│ - is_kana(), is_kanji(), has_kanji() │
│ - as_hiragana(), as_katakana() │
│ - mora_length(), romanize_word() │
│ - get_char_class(): character classification │
└────────────────────────────────────────────────────────────────┬┘
│
┌────────────────────────────────────────────────────────────────▼┐
│ DATABASE LAYER (db/) │
│ - connection.py: SQLAlchemy engine, StaticPool, caching │
│ - models.py: ORM models matching ichiran schema │
│ - SQLite with WAL mode, memory-mapped I/O │
└────────────────────────────────────────────────────────────────┬┘
│
┌────────────────────────────────────────────────────────────────▼┐
│ SQLITE DATABASE │
│ himotoki.db (~3GB) stored in ~/.himotoki/ │
│ Tables: entry, kanji_text, kana_text, sense, gloss, │
│ sense_prop, conjugation, conj_prop, conj_source_reading│
└─────────────────────────────────────────────────────────────────┘
Core Concepts
1. Segmentation Problem
Given Japanese text (e.g., "学校で勉強しています"), find the optimal way to split it into words:
- 学校 (がっこう) - school
- で - at/in (particle)
- 勉強 (べんきょう) - study
- して - doing (te-form of する)
- います - is (polite いる; して + います together form the progressive ～ています)
2. The Algorithm: Viterbi-Style Dynamic Programming
The segmentation uses a classic pathfinding approach:
- Generate Candidates: For each position, find all dictionary words that could start there
- Score Candidates: Assign scores based on commonness, length, character type, context
- Find Best Path: Use DP to find the highest-scoring non-overlapping word sequence
- Apply Grammar Rules: Synergies boost valid patterns; segfilters block invalid ones
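The shape of the dynamic program can be sketched independently of the real implementation. Below is a minimal, self-contained version using a hypothetical Candidate type and toy scores rather than the actual Segment/TopArray machinery:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Candidate:
    start: int    # index where the word begins
    end: int      # index one past the last character
    score: float  # would come from calc_score()

def best_path(candidates: List[Candidate], length: int) -> List[Candidate]:
    """Toy Viterbi: highest-scoring set of non-overlapping candidates."""
    GAP_PENALTY = -500.0  # per uncovered character, mirroring lookup.py
    # best[i] = (total score, chosen path) covering text[:i]
    best: List[Tuple[float, List[Candidate]]] = [(0.0, [])]
    for i in range(1, length + 1):
        # Option 1: leave character i-1 uncovered and pay the gap penalty
        top = (best[i - 1][0] + GAP_PENALTY, best[i - 1][1])
        # Option 2: end a candidate word exactly at position i
        for c in candidates:
            if c.end == i and best[c.start][0] + c.score > top[0]:
                top = (best[c.start][0] + c.score, best[c.start][1] + [c])
        best.append(top)
    return best[length][1]

The real find_best_path() additionally keeps the top N paths per position (TopArray) and adjusts scores with synergies and penalties between adjacent segments.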
3. Key Data Structures
# WordMatch: Raw database hit
@dataclass
class WordMatch:
reading: Union[KanjiText, KanaText] # Database record
conjugations: Optional[List[int]] # Conjugation chain
# Segment: Scored word candidate
@dataclass
class Segment:
word: Union[WordMatch, CompoundWord, CounterText]
score: float
info: Dict[str, Any] # Scoring metadata
# SegmentList: All segments at a position
@dataclass
class SegmentList:
segments: List[Segment]
start: int
end: int
# WordInfo: Final output representation
@dataclass
class WordInfo:
type: WordType # kanji, kana, gap
text: str
kana: Union[str, List[str]]
seq: Optional[Union[int, List[int]]]
conjugations: Optional[Union[List[int], str]]
score: int
meanings: List[str]
pos: Optional[str]
# ... conjugation info, compound info, etc.
Module Deep Dive
himotoki/__init__.py - Public API
The main entry points for using Himotoki:
# Primary analysis function
def analyze(
text: str,
limit: int = 1,
session: Optional[Session] = None,
max_length: Optional[int] = None,
) -> List[Tuple[List[WordInfo], int]]:
"""Analyze Japanese text and return segmentation results."""
# Async version for FastAPI/asyncio
async def analyze_async(
text: str,
limit: int = 1,
timeout: Optional[float] = None,
) -> List[Tuple[List[WordInfo], int]]:
"""Async version using thread pool (SQLite isn't truly async)."""
# Cache initialization
def warm_up(verbose: bool = False) -> Tuple[float, dict]:
"""Pre-initialize caches: archaic words, suffixes, counters."""
# Cleanup
def shutdown():
"""Cleanup thread pool and database connections."""
himotoki/segment.py - Core Algorithm
Key Functions:
def segment_text(session, text, limit=5) -> List[Tuple[List[Segment], float]]:
"""Main entry: segment text into words."""
def find_sticky_positions(text) -> List[int]:
"""Find positions where words can't start/end (after sokuon, before modifiers)."""
def join_substring_words(session, text) -> List[SegmentList]:
"""Find all possible word matches, score them, return as SegmentLists."""
def find_best_path(segment_lists, text_length, limit=5) -> List[Tuple[List, float]]:
"""Viterbi DP to find optimal paths through segment lists."""
TopArray Class: Priority queue keeping top N paths by score:
class TopArray:
def __init__(self, limit: int = 5): ...
def register(self, score: float, payload: Any): ...
def get_items(self) -> List[TopArrayItem]: ...
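A small usage sketch based on the signatures above (the score/payload field names on TopArrayItem are assumptions):

from himotoki.segment import TopArray

top = TopArray(limit=2)
for score, path in [(120.0, "a"), (95.0, "b"), (130.0, "c")]:
    top.register(score, path)
for item in top.get_items():         # at most `limit` items kept
    print(item.score, item.payload)  # assumed field names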
himotoki/lookup.py - Dictionary & Scoring
Key Functions:
def find_word(session, word: str) -> List[WordMatch]:
"""Basic dictionary lookup by text."""
def find_word_full(session, word: str) -> List[WordMatch]:
"""Full lookup including conjugation tracing."""
def calc_score(session, word: WordMatch, final=False, kanji_break=None) -> Tuple[float, dict]:
"""The scoring algorithm - see Scoring System section."""
def get_conj_data(session, seq, from_seq=None) -> List[ConjData]:
"""Get conjugation chain data for a word."""
Important Constants:
MAX_WORD_LENGTH = 50 # Maximum substring length to search
SCORE_CUTOFF = 5 # Minimum score to keep a candidate
GAP_PENALTY = -500 # Penalty per character of uncovered text
IDENTICAL_WORD_SCORE_CUTOFF = 0.5 # Cull threshold
# Length coefficient sequences
LENGTH_COEFF_SEQUENCES = {
'strong': [0, 1, 8, 24, 40, 60], # Kanji, katakana
'weak': [0, 1, 4, 9, 16, 25, 36], # Hiragana
'tail': [0, 4, 9, 16, 24], # Suffix context
'ltail': [0, 4, 12, 18, 24], # Long suffix
}
himotoki/output.py - Output Formatting
Key Functions:
def dict_segment(session, text, limit=5) -> List[Tuple[List[WordInfo], int]]:
"""Segment and convert to WordInfo list."""
def fill_segment_path(session, text, path) -> List[WordInfo]:
"""Convert segment path to WordInfo, filling gaps."""
def word_info_gloss_json(session, wi) -> Dict:
"""Convert WordInfo to JSON-compatible dict."""
def segment_to_json(session, text, limit=5) -> List:
"""ichiran-compatible JSON output."""
himotoki/constants.py - Centralized Constants
All shared constants are defined here to avoid duplication:
- Conjugation type IDs: CONJ_NON_PAST, CONJ_PAST, CONJ_TE, etc.
- SEQ numbers: SEQ_WA, SEQ_SURU, SEQ_IRU, etc.
- Interned POS tags: memory-efficient string interning
- Weak/skip conjugation forms: forms that reduce or skip scoring
himotoki/characters.py - Character Utilities
Character Classification:
def get_char_class(char: str) -> Optional[str]:
"""Get kana class name: 'ka', 'shi', 'n', 'sokuon', etc."""
def is_kana(word: str) -> bool: ...
def is_hiragana(word: str) -> bool: ...
def is_katakana(word: str) -> bool: ...
def has_kanji(word: str) -> bool: ...
Conversion:
def as_hiragana(text: str) -> str: ...
def as_katakana(text: str) -> str: ...
def mora_length(text: str) -> int:
"""Count mora (doesn't count small kana/long vowels)."""
Romanization:
def romanize_word(text: str) -> str:
"""Convert kana to romaji."""
himotoki/synergies.py - Grammar Patterns
Synergies give bonuses to valid grammatical patterns:
| Pattern | Example | Bonus |
|---|---|---|
| Noun + Particle | 学校 + で | 10 + 4×len |
| Na-adjective + な/に | 静か + な | 15 |
| No-adjective + の | ... | 15 |
| To-adverb + と | ゆっくり + と | 10-50 |
Segfilters block invalid combinations:
- Auxiliary verbs must follow continuative form
- ん/んだ can't follow simple particles
- いる can't follow 終わる (つ + いる conflict)
himotoki/suffixes.py - Suffix Compounds
Handles suffix patterns like:
- 〜たい: want to (verb continuative + たい)
- 〜ている: ongoing action (verb て-form + いる)
- 〜そう: looks like (stem + そう)
- 〜ない: negative (stem + ない)
The suffix cache maps suffix strings to handler functions:
SUFFIX_HANDLERS = {
'tai': _handler_tai,
'teiru': _handler_teiru,
'sou': _handler_sou,
'nai': _handler_abbr_nai,
# ...
}
himotoki/counters.py - Counter Words
Recognizes patterns like 三匹 (sanbiki), 五冊 (gosatsu):
@dataclass
class CounterText:
text: str # "三匹"
kana: str # "さんびき"
number_value: int # 3
counter_text: str # "匹"
# ...
def find_counter_in_text(session, text) -> List[Tuple[int, int, CounterText]]:
"""Find all counter expressions in text."""
Handles phonetic rules (rendaku, gemination):
- 三 + 匹 → さんびき (b-voiced)
- 六 + 匹 → ろっぴき (geminated + p-voiced)
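A usage sketch (positions are character offsets into the input; the example values are illustrative):

from himotoki.db.connection import get_session
from himotoki.counters import find_counter_in_text

session = get_session()
for start, end, ct in find_counter_in_text(session, "猫が三匹いる"):
    # e.g. start=2, end=4, text=三匹, kana=さんびき, number_value=3
    print(start, end, ct.text, ct.kana, ct.number_value)
session.close()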
Data Flow
Analysis Pipeline
Input Text: "学校で勉強しています"
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 1. find_sticky_positions() │
│ Result: [positions where boundaries are forbidden] │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 2. find_substring_words() │
│ For each valid start position: │
│ - Extract substrings up to MAX_WORD_LENGTH │
│ - Query database for matches (kanji_text, kana_text) │
│ - Check suffix cache for compound patterns │
│ - Check counter cache for number expressions │
│ Result: Dict[substring -> List[WordMatch]] │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 3. join_substring_words() │
│ For each position with matches: │
│ - Convert WordMatch to Segment with calc_score() │
│ - Apply SCORE_CUTOFF filter │
│ - Group into SegmentList by (start, end) │
│ Result: List[SegmentList] sorted by position │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 4. find_best_path() │
│ Dynamic programming over SegmentLists: │
│ - Initialize TopArray for each position │
│ - For each segment: register with accumulated score │
│ - Apply gap_penalty() for uncovered regions │
│ - Apply synergies/penalties between adjacent segments │
│ - Track top N paths │
│ Result: List[(path, score)] sorted by score │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 5. fill_segment_path() │
│ For each path: │
│ - Convert Segments to WordInfo objects │
│ - Fill gaps with GAP WordInfo │
│ - Populate meanings, POS from database │
│ Result: List[WordInfo] │
└──────────────────────────────────────────────────────────────┘
│
▼
Output: [(words=[WordInfo, ...], score=1234), ...]
Database Schema
Core Tables
-- Main entry table (one per JMdict entry)
entry (
seq INTEGER PRIMARY KEY, -- JMdict sequence number
content TEXT, -- Original XML
root_p BOOLEAN, -- True if root entry (not synthetic)
n_kanji INTEGER, -- Count of kanji readings
n_kana INTEGER, -- Count of kana readings
primary_nokanji BOOLEAN -- True if primary is kana-only
)
-- Kanji readings
kanji_text (
id INTEGER PRIMARY KEY,
seq INTEGER REFERENCES entry(seq),
text TEXT, -- Kanji text (e.g., "学校")
ord INTEGER, -- Order (0 = primary)
common INTEGER, -- Commonness (lower = more common)
common_tags TEXT, -- Priority tags "[news1][ichi1]"
conjugate_p BOOLEAN, -- Generate conjugations?
nokanji BOOLEAN,
best_kana TEXT
)
-- Kana readings
kana_text (
id INTEGER PRIMARY KEY,
seq INTEGER REFERENCES entry(seq),
text TEXT, -- Kana text (e.g., "がっこう")
ord INTEGER,
common INTEGER,
common_tags TEXT,
conjugate_p BOOLEAN,
nokanji BOOLEAN,
best_kanji TEXT
)
-- Senses (meaning groups)
sense (
id INTEGER PRIMARY KEY,
seq INTEGER REFERENCES entry(seq),
ord INTEGER
)
-- English glosses
gloss (
id INTEGER PRIMARY KEY,
sense_id INTEGER REFERENCES sense(id),
text TEXT, -- English meaning
ord INTEGER
)
-- Sense properties (POS, usage notes)
sense_prop (
id INTEGER PRIMARY KEY,
sense_id INTEGER REFERENCES sense(id),
seq INTEGER,
tag TEXT, -- "pos", "misc", "dial", "field"
text TEXT, -- Property value
ord INTEGER
)
-- Conjugation links
conjugation (
id INTEGER PRIMARY KEY,
seq INTEGER REFERENCES entry(seq), -- Conjugated form
from_seq INTEGER REFERENCES entry(seq), -- Root form
via INTEGER -- Intermediate (secondary conj)
)
-- Conjugation properties
conj_prop (
id INTEGER PRIMARY KEY,
conj_id INTEGER REFERENCES conjugation(id),
conj_type INTEGER, -- Type ID (see constants.py)
pos TEXT, -- Part of speech
neg BOOLEAN, -- Negative form?
fml BOOLEAN -- Formal/polite?
)
-- Source text mappings
conj_source_reading (
id INTEGER PRIMARY KEY,
conj_id INTEGER REFERENCES conjugation(id),
text TEXT, -- Conjugated text
source_text TEXT -- Original/root text
)
Key Indices
-- Fast lookups by text
ix_kanji_text_text, ix_kana_text_text
-- Composite for find_word
ix_kanji_text_text_seq, ix_kana_text_text_seq
-- Conjugation chain traversal
ix_conjugation_seq, ix_conjugation_from
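A sketch of a raw lookup against this schema, joining a surface form to its entry and glosses (the ORM models in db/models.py mirror these tables):

from sqlalchemy import text
from himotoki.db.connection import get_session

session = get_session()
rows = session.execute(text("""
    SELECT e.seq, g.text AS gloss
    FROM kanji_text kt
    JOIN entry e ON e.seq = kt.seq
    JOIN sense s ON s.seq = e.seq
    JOIN gloss g ON g.sense_id = s.id
    WHERE kt.text = :word
    ORDER BY s.ord, g.ord
"""), {"word": "学校"})
for seq, gloss in rows:
    print(seq, gloss)
session.close()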
Scoring System
The calc_score() function in lookup.py implements a complex scoring algorithm. Here's how it works:
Score Components
1. Base Score (5-30 points)
   - kanji_p (has kanji): +5
   - common_p (has commonness): +2 to +20 based on rank
   - primary_p (is primary reading): +2 to +10
   - particle_p (is particle): +2
   - pronoun_p: special handling
2. Length Multiplier
   - Uses coefficient sequences based on character type
   - strong for kanji/katakana: [0, 1, 8, 24, 40, 60]
   - weak for hiragana: [0, 1, 4, 9, 16, 25, 36]
   - Final score = base × coefficient[mora_length]
3. Score Modifiers
   - Archaic words: negative modifier
   - Weak conjugations: reduced score
   - Skip conjugations: excluded entirely
   - Kanji break penalty: splitting kanji hurts score
4. Context Modifiers
   - final: bonus for sentence-final particles
   - kanji_break: penalty for splitting kanji sequences
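As a rough worked illustration of the base × coefficient formula (hypothetical numbers, not actual calc_score() output):

STRONG = [0, 1, 8, 24, 40, 60]  # kanji/katakana length coefficients

base = 5 + 20 + 10           # kanji_p + common_p + primary_p (illustrative values)
mora = 3                     # hypothetical mora length
score = base * STRONG[mora]  # 35 * 24
print(score)                 # 840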
Scoring Flags (KPCL)
The info dict contains a kpcl tuple: [kanji_p, primary_p, common_p, long_p]
- kanji_p: Contains kanji characters
- primary_p: Is the primary reading for entry
- common_p: Has commonness priority tags
- long_p: Length exceeds threshold
Conjugation System
Conjugation Types
| ID | Constant | Name |
|---|---|---|
| 1 | CONJ_NON_PAST | Non-past |
| 2 | CONJ_PAST | Past (~ta) |
| 3 | CONJ_TE | Conjunctive (~te) |
| 4 | CONJ_PROVISIONAL | Provisional (~eba) |
| 5 | CONJ_POTENTIAL | Potential |
| 6 | CONJ_PASSIVE | Passive |
| 7 | CONJ_CAUSATIVE | Causative |
| 8 | CONJ_CAUSATIVE_PASSIVE | Causative-Passive |
| 9 | CONJ_VOLITIONAL | Volitional |
| 10 | CONJ_IMPERATIVE | Imperative |
| 11 | CONJ_CONDITIONAL | Conditional (~tara) |
| 12 | CONJ_ALTERNATIVE | Alternative (~tari) |
| 13 | CONJ_CONTINUATIVE | Continuative (~i) |
| 50 | CONJ_ADVERBIAL | Adverbial (custom) |
| 51 | CONJ_ADJECTIVE_STEM | Adjective Stem |
| 52 | CONJ_NEGATIVE_STEM | Negative Stem |
| 53 | CONJ_CAUSATIVE_SU | Causative (~su) |
| 54 | CONJ_ADJECTIVE_LITERARY | Old/Literary |
Conjugation Data Structure
@dataclass
class ConjData:
seq: int # Conjugated entry seq
from_seq: int # Root entry seq
via: Optional[int] # Intermediate for secondary conjugations
prop: ConjProp # Type, neg, fml info
src_map: List[Tuple[str, str]] # (conjugated_text, source_text)
Secondary Conjugations
Some conjugations are "secondary" - they go through an intermediate form:
食べさせられる (causative-passive)
└─ via: 食べさせる (causative)
└─ from: 食べる (root)
Suffix and Compound Word Handling
Suffix Cache Structure
The suffix cache (suffixes.py) maps suffix text to handlers:
_suffix_cache = {
'たい': [('tai', KanaText)], # want to
'ている': [('teiru', KanaText)], # ongoing
'ていた': [('teiru', KanaText)], # was doing
'ねえ': [('nai', None)], # abbreviation of ない
# ...
}
Handler System
Each suffix key has a handler function:
def _handler_tai(session, root, suffix, kf):
"""Handle たい suffix - want to."""
return find_word_with_conj_type(session, root, CONJ_CONTINUATIVE)
Compound Words
@dataclass
class CompoundWord:
primary: WordMatch # Main word
words: List[WordMatch] # All parts
text: str # Full text
kana: str # Combined reading
score_mod: int # Score adjustment
Synergies and Segfilters
Synergy System
Synergies give score bonuses to valid grammatical patterns:
def def_generic_synergy(
name: str,
filter_left: Callable, # Filter for left word
filter_right: Callable, # Filter for right word
description: str,
score: Union[int, Callable],
connector: str = " ",
):
"""Define a synergy between adjacent segments."""
Key Synergies:
| Name | Left Filter | Right Filter | Score |
|---|---|---|---|
| noun-particle | is_noun | in NOUN_PARTICLES | 10 + 4×len |
| noun-da | is_noun | seq=2089020 (だ) | 10 |
| na-adj | adj-na POS | な/に | 15 |
| no-adj | adj-no POS | の | 15 |
| to-adv | adv-to POS | と | 10-50 |
Segfilter System
Segfilters enforce hard constraints:
def def_segfilter_must_follow(
name: str,
filter_left: Callable, # What must precede
filter_right: Callable, # What requires the precedence
allow_first: bool = False, # Allow at sentence start?
):
"""Define constraint: filter_right must follow filter_left."""
Key Segfilters:
- Auxiliary verbs must follow continuative form
- ん/んだ can't follow simple particles
- だ + する blocked (だし false match prevention)
Counter Word Recognition
Counter Cache
_counter_cache = {
'匹': [{'counter_text': '匹', 'counter_kana': 'ひき', ...}],
'冊': [{'counter_text': '冊', 'counter_kana': 'さつ', ...}],
# ...
}
Phonetic Rules
Counter words undergo sound changes:
def counter_join(digit, number_kana, counter_kana, digit_opts=None):
"""Apply phonetic rules when joining number + counter."""
- Rendaku (sequential voicing): ひき → びき after certain numbers
- Gemination: さん → さっ before certain sounds
- Handakuten: ひき → ぴき after 1, 6, 8, 10
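A hypothetical call illustrating these rules (the actual return shape of counter_join may differ):

from himotoki.counters import counter_join

print(counter_join(3, "さん", "ひき"))  # expected さんびき (rendaku)
print(counter_join(6, "ろく", "ひき"))  # expected ろっぴき (gemination + handakuten)
print(counter_join(1, "いち", "ひき"))  # expected いっぴき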
Special Counters
| Seq | Counter | Special Handling |
|---|---|---|
| 2083110 | 日 (ka) | Days 1-10, 14, 20, 24, 30 use kun readings |
| 2083100 | 日 (nichi) | Other day numbers |
| 2149890 | 人 (nin) | 1人=ひとり, 2人=ふたり |
| 1255430 | 月 (gatsu) | Months use がつ not つき |
Testing Strategy
Test Files
| File | Purpose |
|---|---|
| test_characters.py | Character utilities |
| test_lookup.py | Dictionary lookup |
| test_segment.py | Segmentation algorithm |
| test_output.py | Output formatting |
| test_cli.py | CLI interface |
| test_ichiran_comparison.py | Comparison with ichiran |
| test_*_properties.py | Property-based tests (hypothesis) |
Fixtures
@pytest.fixture(scope="module")
def db_session():
"""Module-scoped database session."""
@pytest.fixture(scope="function")
def fresh_session():
"""Function-scoped session for tests that modify state."""
Running Tests
# Run all tests
pytest
# With coverage
pytest --cov=himotoki --cov-report=term-missing
# Run specific test file
pytest tests/test_segment.py
# Run with verbose output
pytest -v
# Run property-based tests with more examples
pytest --hypothesis-show-statistics
Development Commands
Installation
# Install from source with dev dependencies
pip install -e ".[dev]"
# Or using uv
uv pip install -e ".[dev]"
Database Setup
# Interactive setup (downloads JMdict, builds DB)
himotoki setup
# Non-interactive
himotoki setup --yes
# Force rebuild
himotoki setup --force
CLI Usage
# Default output (dictionary info)
himotoki "学校で勉強しています"
# Romanization only
himotoki -r "学校で勉強しています"
# Full output (romanization + dictionary)
himotoki -f "学校で勉強しています"
# Kana with spaces
himotoki -k "学校で勉強しています"
# JSON output
himotoki -j "学校で勉強しています"
Development Tasks
# Run tests
pytest
# Type checking
mypy himotoki
# Linting
ruff check .
# Formatting
black .
isort .
# Compare with ichiran
python scripts/compare.py "test sentence"
# Generate HTML report
python scripts/report.py
Code Conventions
Style
- Black formatter with 100 character line length
- isort for import sorting (black profile)
- Type hints throughout (mypy compatible)
Naming
- snake_case for functions and variables
- PascalCase for classes
- UPPER_CASE for constants
- Prefix private functions with underscore
Documentation
- Docstrings for all public functions (Google style)
- Module-level docstrings explaining purpose
- Comments for complex logic (especially ported from ichiran)
Import Order
# Standard library
from typing import Optional, List, Dict
# Third-party
from sqlalchemy import select, and_
from sqlalchemy.orm import Session
# Local
from himotoki.db.models import Entry, KanjiText
from himotoki.constants import SEQ_SURU, CONJ_TE
Common Patterns
Session Management:
# For single operations
def my_function(session: Optional[Session] = None):
created_session = session is None
if created_session:
session = get_session()
try:
# ... work
finally:
if created_session:
session.close()
# Using context manager
with session_scope() as session:
# ... work (auto-commit/rollback)
Caching:
# Module-level cache with lazy init
_MY_CACHE: Optional[Dict] = None
def ensure_cache(session):
global _MY_CACHE
if _MY_CACHE is None:
_MY_CACHE = build_cache(session)
return _MY_CACHE
Key Constants and SEQ Numbers
Particles
| Constant | SEQ | Text |
|---|---|---|
| SEQ_WA | 2028920 | は |
| SEQ_GA | 2028930 | が |
| SEQ_NI | 2028990 | に |
| SEQ_DE | 2028980 | で |
| SEQ_WO | 2029010 | を |
| SEQ_NO | 1469800 | の |
| SEQ_TO | 1008490 | と |
| SEQ_MO | 2028940 | も |
| SEQ_KA | 2028970 | か |
Common Verbs
| Constant | SEQ | Text |
|---|---|---|
| SEQ_SURU | 1157170 | する |
| SEQ_IRU | 1577980 | いる |
| SEQ_KURU | 1547720 | 来る |
| SEQ_ARU | 1296400 | ある |
| SEQ_NARU | 1375610 | なる |
Skip/Block Lists
# Words that aren't really standalone words
SKIP_WORDS = {2458040, 2822120, 2013800, ...}
# Final particles (only valid at sentence end)
FINAL_PRT = {2017770, 2425930, 2130430, ...}
# Blocked from specific suffix handlers
BLOCKED_NAI_SEQS = {SEQ_IRU, SEQ_KURU}
BLOCKED_NAI_X_SEQS = {SEQ_SURU, SEQ_TOMU}
Common Patterns
Adding a New Suffix Handler
- Add to suffix cache initialization in suffixes.py:
_load_conjs(session, 'myhandler', MY_SEQ)
- Create handler function:
def _handler_myhandler(session, root, suffix, kf):
"""Handle my suffix."""
return find_word_with_conj_type(session, root, CONJ_TYPE)
- Register in SUFFIX_HANDLERS:
SUFFIX_HANDLERS['myhandler'] = _handler_myhandler
Adding a New Synergy
# In synergies.py, during module initialization
def_generic_synergy(
name="my-synergy",
filter_left=filter_is_pos('adj-i'),
filter_right=filter_in_seq_set(MY_SEQ),
description="i-adjective + something",
score=15,
)
Adding a New Segfilter
def_segfilter_must_follow(
name="my-segfilter",
filter_left=my_left_condition,
filter_right=my_right_condition,
allow_first=False,
)
Adding Database Entries (Errata)
In loading/errata.py:
def add_errata(session):
# Add synthetic entry
entry = Entry(seq=900001, root_p=True, n_kanji=0, n_kana=1)
session.add(entry)
# Add readings, senses, etc.
Troubleshooting Guide
Common Issues
1. Database Not Found
Error: Database not found at /path/to/himotoki.db
Solution: Run himotoki setup to download and build the database.
2. Slow First Analysis
The first analysis is slow due to cache building. Call himotoki.warm_up() at application startup.
3. Segmentation Mismatch with ichiran
Check these in order:
- Sticky positions (find_sticky_positions)
- Candidate words (find_substring_words)
- Individual scores (calc_score)
- Synergy/penalty application
- Path selection (TopArray)
4. Missing Conjugation
Check if:
- Entry has conjugate_p = True
- Conjugation CSV files are loaded
- ConjData is being retrieved correctly
5. Counter Not Recognized
Verify:
- Counter cache is initialized
- Counter SEQ exists in database
- Number format is valid (kanji or arabic)
Debugging Tips
# Enable SQL echo
from himotoki.db.connection import get_engine
engine = get_engine(echo=True)
# Print scoring details
score, info = calc_score(session, word, final=True)
print(f"Score: {score}, Info: {info}")
# Check suffix cache
from himotoki.suffixes import _suffix_cache
print(_suffix_cache.keys())
# Trace conjugation chain
conj_data = get_conj_data(session, seq)
for cd in conj_data:
print(f"{cd.seq} <- {cd.from_seq} via {cd.via}: {cd.prop}")
Quick Reference
Essential Imports
# Public API
import himotoki
results = himotoki.analyze("日本語")
# Internal (for development)
from himotoki.db.connection import get_session
from himotoki.segment import segment_text
from himotoki.lookup import calc_score, find_word_full
from himotoki.output import dict_segment, WordInfo
from himotoki.constants import SEQ_SURU, CONJ_TE
Key Entry Points
| Purpose | Function | Module |
|---|---|---|
| Analyze text | analyze() | __init__ |
| Segment only | segment_text() | segment |
| Get WordInfo | dict_segment() | output |
| Database lookup | find_word_full() | lookup |
| Score word | calc_score() | lookup |
| Warm caches | warm_up() | __init__ |
Files to Modify By Task
| Task | Primary Files |
|---|---|
| Change scoring | lookup.py |
| Add grammar pattern | synergies.py |
| Add suffix | suffixes.py |
| Fix segmentation | segment.py |
| Change output format | output.py, models.py |
| Add database data | loading/errata.py |
| Add CLI option | cli.py |
Last updated: Generated by AI agent analysis