name: regex-engine description: > Design guide for the Trieste regex engine (include/trieste/regex_engine.h). Describes the architecture, data structures, and conventions of the header-only NFA-based regex engine used for pattern matching in the Trieste library. Use this skill when modifying, extending, or debugging the regex engine. user-invocable: false
Regex Engine Design
The Trieste regex engine is a header-only C++20 NFA-based matcher in
include/trieste/regex_engine.h. It compiles UTF-8 regexes with a
shunting-yard + Thompson pipeline and simulates them with epoch-based state
deduplication.
The engine started as an iregexp-focused implementation, but now includes compatibility extensions used by Trieste parsers and rewrite helpers.
Architecture Overview
The engine processes a regex in four phases:
regex string ──► postfix runes ──► NFA ──► finalization ──► simulation
(shunting-yard) (Thompson) (closures+bitmaps) (parallel)
- Shunting-yard (
regexp_to_postfix_runes): Converts regex text to a postfix rune stream with explicit operators. - Thompson's construction (
postfix_to_nfa): Walks the postfix sequence and builds an NFA fromStateDefnodes using a fragment stack. - Finalization (
precompute_epsilon_closures+finalize_states): Precomputes epsilon closures (with trivial-closure detection) and populates per-state ASCII acceptance bitmaps. - Parallel simulation (
match/find_prefix): Tracks active NFA states using precomputed epsilon closures and per-callMatchContextstate.
Headers
| Header | Purpose |
|---|---|
regex_engine.h | Engine implementation (all phases, RuneClass, sentinels) |
unicode_data.h | Unicode 15.1 General Category range tables + precomputed ASCII bitmaps |
utf8.h | UTF-8 encoding/decoding, rune_t type definition |
regex_engine.h includes both unicode_data.h and utf8.h.
Public API
RegexEngine(const std::string_view& utf8_regexp);
RegexEngine(const std::string_view& utf8_regexp, SyntaxMode mode);
bool ok() const;
size_t num_captures() const;
SyntaxMode syntax_mode() const;
bool match(const std::string_view& utf8_str) const;
bool match(
const std::string_view& utf8_str,
MatchContext& ctx) const;
size_t find_prefix(
const std::string_view& utf8_str,
bool at_start = true) const;
size_t find_prefix(
const std::string_view& utf8_str,
MatchContext& ctx,
bool at_start = true) const;
size_t find_prefix(
const std::string_view& utf8_str,
std::vector<Capture>& captures,
bool at_start = true) const;
size_t find_prefix(
const std::string_view& utf8_str,
std::vector<Capture>& captures,
MatchContext& ctx,
bool at_start = true) const;
SearchResult search(
const std::string_view& utf8_str,
size_t start_pos = 0) const;
SearchResult search(
const std::string_view& utf8_str,
std::vector<Capture>& captures,
MatchContext& ctx,
size_t start_pos = 0) const;
SyntaxMode::Extended(default) enables the full regex syntax.SyntaxMode::IregexpStrictrestricts to RFC 9485 iregexp.ok()returns true when the regex compiled successfully (no errors).match()checks full-string match for the given input.find_prefix()returns the longest matching prefix length, ornpos.search()finds the first match anywhere in the string starting from a byte offset. Returns aSearchResultwithmatch_start,match_len, andfound(). The capture-aware overload fills group spans and reuses aMatchContextacross calls. Used byTRegex::GlobalReplace.- Capture-aware
find_prefix()fills group spans as byte offsets. at_startcontrols whether^can match in this invocation. This is used by wrapper-level scanning logic (for example global replacement probe loops).MatchContextis the reusable per-caller/per-thread scratch state for simulation and stats.
Rune Constants
All constants live in namespace trieste::regex as inline constexpr.
ASCII character constants
Named aliases for ASCII characters that have syntactic meaning in regexes:
| Constant | Value | Purpose |
|---|---|---|
Backslash | '\\' | Escape character |
LParen | '(' | Open group |
RParen | ')' | Close group |
Pipe | `' | '` |
Question | '?' | Zero-or-one syntax |
Asterisk | '*' | Zero-or-more syntax |
Plus | '+' | One-or-more syntax |
Operator sentinel runes
Internal operator tokens placed in the postfix stream. Their values are in the
range 0xAFFF00–0xAFFFFF, well outside Unicode, so they cannot collide with
literal runes (including escaped operators like \*):
| Sentinel | Value | Arity | Precedence |
|---|---|---|---|
Catenation | 0xAFFF00 | Binary | 2 (middle) |
Alternation | 0xAFFF01 | Binary | 1 (lowest) |
ZeroOrOne | 0xAFFF02 | Unary | 3 (highest, emitted immediately) |
ZeroOrMore | 0xAFFF03 | Unary | 3 (highest, emitted immediately) |
OneOrMore | 0xAFFF04 | Unary | 3 (highest, emitted immediately) |
Split | 0xAFFF05 | — | NFA-only (epsilon state) |
WordBoundary | 0xAFFF06 | Unary | Conditional epsilon (\b) |
LazyZeroOrOne | 0xAFFF07 | Unary | Lazy quantifier ?? |
LazyZeroOrMore | 0xAFFF08 | Unary | Lazy quantifier *? |
LazyOneOrMore | 0xAFFF09 | Unary | Lazy quantifier +? |
StartAnchor | 0xAFFF0A | Unary | Conditional epsilon (^) |
EndAnchor | 0xAFFF0B | Unary | Conditional epsilon ($) |
Match | 0xAFFFFF | — | NFA-only (accept state) |
Class-ref sentinels
A label in the range [RuneClassBase, RuneClassMax] (0xBF0000–0xBFFFFF)
is not a literal rune but an index into RegexEngine::rune_classes_. This
allows StateDef to remain unchanged — a class-ref is just a label value
that step() dispatches differently.
| Constant | Value | Purpose |
|---|---|---|
RuneClassBase | 0xBF0000 | Start of class-ref index space |
RuneClassMax | 0xBFFFFF | End of class-ref index space (64K classes) |
Helper functions: is_class_ref(label), class_ref_index(label).
Resource limits (RFC 9485 §8)
| Constant | Value | Enforced in |
|---|---|---|
MaxRepetition | 1000 | Range quantifier parsing |
MaxPostfixSize | 100000 | End of shunting-yard + range expansion |
MaxStates | 100000 | Start of postfix_to_nfa |
MaxCaptures | 64 | Capturing group indexing |
MaxGroupNesting | 256 | Nested group depth during parsing |
MaxClosureCacheEntries | 1000000 | precompute_epsilon_closures |
Utility Functions
decode_rune(utf8_str, pos)
inline std::pair<rune_t, size_t>
decode_rune(const std::string_view& utf8_str, size_t pos);
Decodes one rune from utf8_str at byte offset pos. Returns
{rune_value, bytes_consumed}. ASCII fast path: if the byte at pos is
< 128, returns immediately without calling utf8_to_rune. Used by all
simulation loops — callers must ensure pos < utf8_str.size().
Key Data Structures
RuneClass
A sorted, non-overlapping set of [lo, hi] inclusive rune_t intervals.
Used to represent dot (.), character classes ([a-z]), and Unicode
categories (\p{L}).
Fields:
ranges—std::vector<std::pair<rune_t, rune_t>>of sorted intervals.ascii_bitmap[2]—uint64_t[2]bit-per-codepoint bitmap for runes 0–127.ascii_bitmap[r >> 6] >> (r & 63) & 1tests membership in O(1). Rebuilt automatically bynormalize(),complement(), anddot().
Methods:
contains(rune_t)— ASCII bitmap check forr < 128, else O(log k) binary search onranges.add_range(lo, hi)— push a range without normalizing. Callfinalize()when done.merge(other)— append another class's ranges without normalizing. Callfinalize()when done.finalize()— sort, merge overlapping ranges, and rebuild the ASCII bitmap. Must be called after alladd_range/mergecalls are complete, before the class is used for matching orcomplement().complement()— complement over Unicode scalar values (0–0xD7FF, 0xE000–0x10FFFF), excluding surrogates. Requires finalized input.dot()— static factory returning all Unicode scalar values.
Private:
rebuild_ascii_bitmap()— scansrangesand sets bits for runes 0–127.
RuneClass objects are stored in a std::vector<RuneClass> rune_classes_
member on RegexEngine. make_class_ref(rc) appends a class and returns
the corresponding class-ref sentinel for use in the postfix stream.
StateDef
Each NFA state is a StateDef, stored contiguously in a pre-reserved
std::vector<StateDef> owned_states_. Referenced internally via raw pointer
alias State = StateDef*.
Fields:
label— the rune this state matches, a class-ref sentinel (dispatched viaRuneClass::contains()), orSplit/Matchfor control states.next— primary successor state.next_alt— secondary successor (used only bySplitstates).closure_index— dense state index for closure cache and dedup.ascii_accept[2]—uint64_t[2]per-state bitmap. Populated byfinalize_states(): class-ref states copy their RuneClass'sascii_bitmap, literal states with label < 128 set the single corresponding bit, all others remain zero. Enablesstep()to bypassis_class_ref/class_ref_index/containsfor ASCII runes.trivial_closure—bool, set byprecompute_epsilon_closures(). True when the state's epsilon closure is{self}for all 8 flag combinations (always true for literal/class-ref/Match states). Enablesadd_state()to skip the closure cache lookup.
Ownership/lifetime model:
- The constructor computes the postfix first, then reserves
owned_states_with capacity2 * postfix.size() + 1to guarantee no reallocation during Thompson construction. This keeps rawStatepointers (includingFrag::danglingtargets) stable. create_state()appends aStateDeftoowned_states_and returns a raw pointer to it.- All construction temporaries (the fragment stack, closure-computation epoch vectors) are function-local — they exist only during the constructor call and are destroyed automatically.
RegexEngineis explicitly non-copyable and movable to keep raw-pointer ownership semantics safe and obvious.
MatchContext
MatchContext is public API with private internals (friend to RegexEngine).
It holds per-call mutable state:
noncapturing_current_states,noncapturing_next_states— scratch vectors for non-capturing simulation.visited_states— per-state epoch vector for dedup (used when > 128 states).visited_bits_[2]—uint64_t[2]bitset for dedup when ≤ 128 states (BitsetMaxStates). Faster than epoch-based dedup for small NFAs.use_bitset_— chosen atbind_engine()time based onstate_count.capture_frames— contiguous arena for capture slot storage.capture_traversal_stack_—std::vector<std::pair<State, size_t>>reusable stack for iterativeadd_state_capturing.epoch_counter— monotonic counter for epoch-based dedup.- Optional
MatchStats(gated byTRIESTE_REGEX_ENGINE_ENABLE_STATS).
Reuse model:
- Reuse a context across calls in one caller/thread.
- Do not use the same context concurrently across threads.
- Each match call binds the context to the engine, allowing safe context reuse across different compiled engine instances.
MatchStats
Optional counters (gated by TRIESTE_REGEX_ENGINE_ENABLE_STATS):
rune_steps— total input positions processed.active_states_total— cumulative active states across steps.max_active_states— peak active state count.class_ref_checks— class-ref dispatch count (non-ASCII path only).literal_checks— literal label comparisons (non-ASCII path only).
Capture and Thread
Capturestores[start, end)byte offsets and amatched()helper.- Capture-aware simulation uses
Thread { state, caps_frame }wherecaps_frameindexes a contiguous capture-frame arena. - Capture slot storage is held in MatchContext's frame arena with fixed frame
width
2 * num_captures_. - Capture open/close epsilon transitions clone frame contents, update one slot, and continue traversal so sibling branches keep independent capture state.
Frag
A fragment of an NFA under construction:
start— the entry state of the fragment.dangling— astd::list<State*>of raw pointers into thenextornext_altmembers of states within the fragment.patch()sets them all to point to a given target state.
ClosureSpan
A lightweight view into the flat closure cache:
data— pointer intoclosure_cache_flat_.size— number of states in this closure.
Returned by epsilon_closure_cached().
Phase Details
Phase 1: Shunting-yard (regexp_to_postfix_runes)
Converts infix regex syntax to postfix tokens with:
- implicit concatenation (
Catenation) - alternation (
Alternation) - quantifiers (
?,*,+,{n},{n,m},{n,}) - lazy quantifiers (
??,*?,+?) - assertions (
\b,^,$) - groups (capturing and non-capturing
(?:...)) - class refs (dot, bracket classes, unicode categories, shorthands)
Escape Handling
The parser accepts these escapes:
| Pattern | Result |
|---|---|
\n, \r, \t | Literal runes 0x0A, 0x0D, 0x09 |
\p{Xx} | Unicode General Category lookup via unicode_data.h; emits class-ref |
\P{Xx} | Same but complemented |
\d, \D, \s, \S, \w, \W | Shorthand class refs |
\xNN | Byte-valued hex escape |
\b | Word-boundary assertion |
\(, \), \*, \+, \-, \., \?, \[, \\, \], \^, \$, \{, |, \}, \/, \, | Literal rune |
| Any other escape | malformed |
Trailing \ | malformed |
Inside bracket classes, unicode categories and shorthand classes are merged into the bracket class.
Range Quantifiers
Range quantifiers ({n}, {n,m}, {n,}) are handled by postfix
expansion — the atom's postfix tokens are extracted and replicated. This
keeps postfix_to_nfa unchanged. Tracking variables:
| Variable | Purpose |
|---|---|
atom_start | Index in postfix where the current atom begins |
group_start_stack | Saves (postfix_index, need_concat) on ( |
need_concat_before_atom | Whether a Cat was pushed before this atom |
atom_is_quantified | Detects nested quantifiers (→ malformed) |
Expansion formula (postfix notation, Cat is the Catenation operator):
| Pattern | Postfix expansion |
|---|---|
a{n} | a × n chained with Cat |
a{n,m} | a × n + (a ZeroOrOne Cat) × (m−n) |
a{n,} | a × n + a ZeroOrMore Cat |
a{0} | Atom removed, preceding Cat undone |
Nested quantifiers (e.g. a*{2}) are rejected as malformed. Expansion size
is checked upfront against MaxPostfixSize.
Rejected Syntax
The following patterns are rejected as malformed:
- Bare
]or}outside their context. - Empty character class
[]or[^]. - Inverted ranges
[z-a]. - Range quantifiers exceeding
MaxRepetition. - Unterminated constructs (
[abc,a{,a\). - Unknown/unsupported escapes.
Phase 2: Thompson's construction (postfix_to_nfa)
Walks the postfix runes left-to-right, maintaining a stack of Frag objects:
| Postfix token | Action |
|---|---|
| Literal rune or class-ref | Create a new state, push fragment |
Catenation | Pop two, patch left's dangling to right's start |
Alternation | Pop two, create Split pointing to both starts |
ZeroOrOne | Pop one, create Split with one branch to fragment |
ZeroOrMore | Pop one, create Split looping back |
OneOrMore | Pop one, create Split looping back, entry is original start |
The postfix size is checked against MaxStates before construction begins.
Stack underflow sets error_code_ and returns the accept state as a safe
sentinel.
Phase 3: Finalization
Two functions run after Thompson construction:
precompute_epsilon_closures(): For every state and for all 8 flag
combinations (boundary × at_start × at_end), computes the full epsilon
closure and stores it in a flat array (closure_cache_flat_) indexed via
an offset table (closure_cache_offsets_). Also sets trivial_closure = true on any state whose closure is {self} for all 8 combinations. Total
closure entries are bounded by MaxClosureCacheEntries. The epoch vector
and counter used for cycle detection are local to this function.
finalize_states(): Populates ascii_accept[2] on each state:
class-ref states copy the RuneClass's ascii_bitmap, literal states with
label < 128 set their single bit, all others stay zero. Since
owned_states_ is already contiguous (pre-reserved vector<StateDef>),
no copy or pointer remapping is needed.
Phase 4: Simulation (match, find_prefix)
Uses the standard Thompson NFA simulation with several fast paths:
start_listseeds the initial active set via epsilon closure.- For each input rune,
stepbuilds the next state set. The loop has two branches:- ASCII fast path (
rune < 128): for each active state, test the per-stateascii_acceptbitmap. This single bit-test replaces the entire class-ref dispatch chain (is_class_ref→class_ref_index→rune_classes_[]→RuneClass::contains()). - Non-ASCII fallback: for each active state, dispatch via
is_class_ref(label)+contains(rune)or literal equality.
- ASCII fast path (
- Conditional epsilon states use booleans threaded through simulation:
boundary_matchfor\bat_startfor^at_endfor$
add_state()has a trivial-closure shortcut: ifstate->trivial_closureis true, the state is added directly with dedup, bypassing the closure cache lookup entirely. Otherwise the flat closure cache is consulted.- Input decoding uses
decode_rune()which has an ASCII fast path avoiding the fullutf8_to_runecall for bytes < 128.
Conditional detection: detect_conditional_states() scans the NFA for
WordBoundary, StartAnchor, or EndAnchor. When absent, simulation
skips is_word_char() calls and passes constant false for all
boundary/anchor parameters, providing separate code paths for conditional
and non-conditional patterns.
Capture-aware simulation (add_state_capturing): Iterative (stack-based)
epsilon traversal using capture_traversal_stack_ in MatchContext. Pushes
next_alt before next so next is popped first (greedy DFS ordering).
Capture open/close transitions clone the capture frame and update one slot
before continuing. The stack is reused across calls via MatchContext.
Epoch-based deduplication: For NFAs with > 128 states, MatchContext
tracks per-state epochs in a visited-state vector. A per-context epoch
counter increments before each step, and add_state skips states already
marked in the current epoch. For NFAs with ≤ 128 states, a 128-bit bitset
(visited_bits_[2]) is used instead for faster clear and test.
Error Handling
The engine uses an error_code_ field (type ErrorCode) rather than
exceptions or error tokens:
- Set during
regexp_to_postfix_runes(invalid syntax, invalid escapes, resource limits) orpostfix_to_nfa(stack underflow, state limit). - Once set,
match()returns false immediately. - Callers should check
ok()after construction, or inspecterror_code()anderror()for a human-readable message.
Performance Notes
Compile+match advantage
Trieste's NFA compile is much cheaper than RE2's DFA. On benchmarks, the
compile+match ratio (cm_ratio) is ~0.31x (Trieste ~3x faster than RE2
overall). This makes the engine well-suited for single-use patterns.
Match-only gap
The match-only ratio (m_ratio) is ~2.6x. This is structural: RE2's DFA
does O(1) work per rune after warmup, while NFA simulation is O(active_states)
per rune. The gap is largest on patterns with many steps and high active
state counts.
Key optimizations
- Per-state ASCII bitmap:
ascii_accept[2]on StateDef unifies class-ref and literal dispatch into a single bit-test for runes < 128. Eliminates theis_class_ref→class_ref_index→rune_classes_[]→contains()chain for the common case. - RuneClass ASCII bitmap:
ascii_bitmap[2]on RuneClass provides O(1) membership test for runes < 128, avoiding binary search onranges. Unicode category bitmaps are precomputed at compile time inunicode_data.h. - Trivial-closure shortcut: States whose epsilon closure is always
{self}(literal, class-ref, Match states) skip the closure cache lookup inadd_state(). - ASCII
decode_runebypass:decode_rune()returns immediately for bytes < 128 without callingutf8_to_rune. - Bitset dedup: NFAs with ≤ 128 states use a 128-bit bitset for visited-state tracking instead of epoch-based vector dedup.
- Flat closure cache:
closure_cache_flat_+closure_cache_offsets_store precomputed epsilon closures in contiguous memory, indexed by(closure_index << 3) | flags. - Pre-reserved contiguous storage:
owned_states_is avector<StateDef>reserved upfront based on postfix size, so all states are contiguous in memory for cache-friendly simulation without a separate compaction pass. - Iterative capturing traversal:
add_state_capturinguses an explicit stack (capture_traversal_stack_in MatchContext) instead of recursion, eliminating per-epsilon function-call overhead. - Capture-frame arena: capture-aware simulation avoids per-thread vector copies by storing capture slots in a contiguous arena.
- Non-capturing scratch reuse:
match()and non-capturingfind_prefix()reuse MatchContext vectors to avoid per-call allocations. - Optional stats:
TRIESTE_REGEX_ENGINE_ENABLE_STATSgates counters stored in MatchContext and exposed viaMatchContext::stats().
unicode_data.h Extensions
Beyond the auto-generated Unicode 15.1 General Category range tables,
unicode_data.h contains compile-time ASCII bitmap infrastructure:
AsciiBitmap— struct withuint64_t words[2].compute_ascii_bitmap<N>(ranges)—constexprtemplate that scans a range table and produces anAsciiBitmapfor runes 0–127.- 36
inline constexpr AsciiBitmapconstants (one per Unicode General Category), precomputed at compile time. CategoryInfostruct extended withAsciiBitmap ascii_bitmapfield.find_category()returns precomputed bitmaps alongside range data.
These precomputed bitmaps are used when constructing a RuneClass from a
Unicode category, avoiding runtime bitmap computation.
Behavior Notes
- Empty pattern compiles to an accept-only NFA.
find_prefix()can produce zero-length matches.- Anchors are implemented in-engine (
^,$) and do not require wrapper-only emulation. - The engine operates on UTF-8 codepoint decoding, but capture offsets are byte offsets (matching existing parser/rewrite expectations).
Additional Syntax Coverage
Beyond the core iregexp-compatible pieces, parser support includes:
- Capturing and non-capturing groups:
( ... ),(?: ... ). - Capture operators in postfix/NFA:
CaptureGroup + iwraps fragments with epsilonCaptureOpen + iandCaptureClose + istates. - Lazy quantifiers:
??,*?,+?(implemented by swappedSplitbranch preference in Thompson construction). - POSIX classes inside brackets:
[:alpha:],[:digit:],[:alnum:],[:blank:],[:space:],[:xdigit:],[:upper:],[:lower:],[:print:],[:graph:],[:cntrl:],[:punct:],[:ascii:].
Conventions
- Namespace:
trieste::regexfor the engine;trieste::unicodefor category data. inline constexprfor namespace-scope rune constants and resource limits.StateDef*aliased asState; owned byowned_states_(a pre-reservedvector<StateDef>).- Thread-safety model: Engine matching methods are
constand engine state is read-only after construction. Thread safety depends on context usage: sharing oneMatchContextacross threads is unsafe; using separate contexts per thread is safe. - Header-only: The engine is
regex_engine.h; category data is inunicode_data.h(auto-generated from Unicode 15.1UnicodeData.txt).
Modification Guidelines
When extending the regex engine:
- New set-based matchers (e.g. new Unicode properties): Create a
RuneClass, store it viamake_class_ref(), and emit the class-ref in the postfix stream. No changes toStateDeforpostfix_to_nfaneeded. The ASCII bitmap will be computed automatically bynormalize(). - New operators: Assign a sentinel rune in the
0xAFFF00range and update parsing plus NFA construction. - New conditional epsilon states: Update both
add_stateandadd_state_capturing, and thread required context booleans throughstart_list/step/find_prefixcall paths. Update the 3-bit flag encoding inclosure_flags()and increase the slot multiplier from 8. - Testing: Add test cases to
test/regex_engine_test.cc. Test positive matches, negative matches, and malformed pattern rejection for any new error conditions. For assertions, include both full match and prefix/probe scenarios. - Benchmarking: Run
test/regex_engine_benchmark.ccto verify performance impact. Comparecm_ratioandm_ratioagainst baseline. To produce a chart and markdown report from multiple benchmark passes, run:
This script runs the benchmark N times (default 5, with warmup), computes medians, and outputs a PNG chart + markdown report. Pass.venv/bin/python .github/skills/regex-engine/benchmark_chart.py--input <json>to re-render from previously saved data without re-running benchmarks. - Keep phases separate: Parsing (shunting-yard), construction (Thompson), compaction (closures + arena), and simulation are cleanly separated. Changes should respect these boundaries.
- Unicode data: If upgrading the Unicode version, regenerate
unicode_data.hfrom the newUnicodeData.txt. Remember to also update thecompute_ascii_bitmapconstants andCategoryInfoentries.