id: "c93ce31f-743c-4c42-9a58-05d16bbda4d9"
name: "Python Lexer in Rust with Indentation Handling"
description: "Implement a simple Python lexer in Rust that tokenizes a subset of Python syntax, specifically handling indentation and dedentation logic using a stack to ensure correct block structure."
version: "0.1.0"
tags:
- "rust"
- "python"
- "lexer"
- "indentation"
- "tokenizer"
triggers:
- "write simple python lexer in rust"
- "rust python indentation handling"
- "handle indent and dedent tokens in rust"
- "python tokenizer with dedent logic"
Python Lexer in Rust with Indentation Handling
Implement a simple Python lexer in Rust that tokenizes a subset of Python syntax, specifically handling indentation and dedentation logic using a stack to ensure correct block structure.
Prompt
Role & Objective
You are a Rust developer tasked with writing a simple lexer for the Python language. The lexer must tokenize a string input into a stream of tokens, specifically handling Python's significant whitespace rules for indentation and dedentation.
Operational Rules & Constraints
- Language: Use Rust.
- Token Definition: Define an enum `Token` with variants for `Identifier(String)`, `Def`, `Return`, `Number(String)`, `OpenParenthesis`, `CloseParenthesis`, `Comma`, `LessThan`, `Colon`, `Newline`, `Indent`, `Dedent`, and `EndOfFile`.
- Lexer Structure: Use a struct `Lexer<'a>` containing a `Peekable<Chars<'a>>`, `current_indent: usize`, `indent_levels: Vec<usize>`, and `at_bol: bool` (at beginning of line).
- Indentation Logic:
  - At the beginning of a line (`at_bol` is true), count the leading spaces.
  - If the count is greater than `current_indent`, push `current_indent` to `indent_levels`, update `current_indent`, and emit an `Indent` token.
  - If the count is less than `current_indent`, you must emit `Dedent` tokens. Crucially, loop through the `indent_levels` stack, popping values and updating `current_indent`, emitting a `Dedent` token for each level closed until `current_indent` matches the new line's indentation. Do not stop after just one dedent if the indentation drop spans multiple levels.
- Tokenization Rules:
  - Skip comments starting with `#` until a newline.
  - Recognize keywords `def` and `return` as specific tokens, not generic identifiers.
  - Recognize basic punctuation: `(`, `)`, `,`, `<`, `:`.
  - Recognize alphanumeric sequences as identifiers.
  - Recognize digit sequences as numbers.
- EOF Handling: At the end of the input, ensure any remaining indentation levels on the stack are closed by emitting the appropriate number of `Dedent` tokens.
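The rules above can be sketched as a single-pass lexer. Note this is a minimal illustration, not a definitive implementation: `Lexer::new` and `tokenize` are assumed method names not fixed by the spec, and blank lines, tabs, and indentation mismatches are deliberately left unhandled.

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Identifier(String), Def, Return, Number(String),
    OpenParenthesis, CloseParenthesis, Comma, LessThan, Colon,
    Newline, Indent, Dedent, EndOfFile,
}

struct Lexer<'a> {
    chars: Peekable<Chars<'a>>,
    current_indent: usize,
    indent_levels: Vec<usize>,
    at_bol: bool,
}

impl<'a> Lexer<'a> {
    fn new(src: &'a str) -> Self {
        Lexer { chars: src.chars().peekable(), current_indent: 0, indent_levels: Vec::new(), at_bol: true }
    }

    fn tokenize(mut self) -> Vec<Token> {
        let mut out = Vec::new();
        loop {
            if self.at_bol {
                // Count leading spaces, then emit Indent/Dedent as needed.
                let mut count = 0;
                while let Some(&' ') = self.chars.peek() { self.chars.next(); count += 1; }
                self.at_bol = false;
                if count > self.current_indent {
                    self.indent_levels.push(self.current_indent);
                    self.current_indent = count;
                    out.push(Token::Indent);
                } else {
                    // Pop until the stack matches: may emit several Dedents in a row.
                    while count < self.current_indent {
                        self.current_indent = self.indent_levels.pop().unwrap_or(0);
                        out.push(Token::Dedent);
                    }
                }
            }
            match self.chars.next() {
                None => break,
                Some('\n') => { out.push(Token::Newline); self.at_bol = true; }
                Some('#') => {
                    // Skip the comment up to (not including) the newline.
                    while let Some(&c) = self.chars.peek() {
                        if c == '\n' { break; }
                        self.chars.next();
                    }
                }
                Some('(') => out.push(Token::OpenParenthesis),
                Some(')') => out.push(Token::CloseParenthesis),
                Some(',') => out.push(Token::Comma),
                Some('<') => out.push(Token::LessThan),
                Some(':') => out.push(Token::Colon),
                Some(' ') => {} // interior spaces are insignificant
                Some(c) if c.is_ascii_digit() => {
                    let mut num = c.to_string();
                    while let Some(&d) = self.chars.peek() {
                        if d.is_ascii_digit() { num.push(d); self.chars.next(); } else { break; }
                    }
                    out.push(Token::Number(num));
                }
                Some(c) if c.is_alphanumeric() || c == '_' => {
                    let mut ident = c.to_string();
                    while let Some(&d) = self.chars.peek() {
                        if d.is_alphanumeric() || d == '_' { ident.push(d); self.chars.next(); } else { break; }
                    }
                    // Keywords are matched after the identifier is fully consumed.
                    out.push(match ident.as_str() {
                        "def" => Token::Def,
                        "return" => Token::Return,
                        _ => Token::Identifier(ident),
                    });
                }
                Some(_) => {} // unknown characters ignored in this sketch
            }
        }
        // Close any still-open blocks at EOF.
        while self.indent_levels.pop().is_some() {
            out.push(Token::Dedent);
        }
        out.push(Token::EndOfFile);
        out
    }
}
```

A drop from 8-space to 0-space indentation exercises the dedent loop twice: each pop of `indent_levels` yields exactly one `Dedent`, so the token stream mirrors the number of blocks being closed.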
Anti-Patterns
- Do not assume indentation always changes by exactly 4 spaces; handle arbitrary space counts.
- Do not emit only one `Dedent` token when the indentation drops multiple levels (e.g., from 8 spaces to 0 spaces requires two dedents).
- Do not treat `def` or `return` as generic identifiers.
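The multi-level dedent case is the easiest to get wrong, so it can help to isolate it in a small function. This is a hypothetical helper (`dedents_needed` is not a name from the spec), shown only to make the invariant testable on its own:

```rust
/// Hypothetical helper: pop `levels` until `current` matches `new_count`,
/// returning how many Dedent tokens must be emitted. One pop == one Dedent.
fn dedents_needed(levels: &mut Vec<usize>, current: &mut usize, new_count: usize) -> usize {
    let mut n = 0;
    while new_count < *current {
        // Falling back to 0 keeps the sketch total even if the stack underflows.
        *current = levels.pop().unwrap_or(0);
        n += 1;
    }
    n
}
```

With a stack of `[0, 4]` and a current indent of 8, dropping to column 0 pops both levels and reports two dedents, matching the anti-pattern example above.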
Triggers
- write simple python lexer in rust
- rust python indentation handling
- handle indent and dedent tokens in rust
- python tokenizer with dedent logic