Add compiler theory foundation pages (pipeline, tokens, lexer/FA) #261
# The Lexer: Regex to Finite Automata

## What Does a Lexer Do?
A lexer reads a character stream from left to right and emits a token stream. Classical lexer
generators apply the *maximal munch* rule: each scan step takes the longest prefix of the
remaining input that matches a token class pattern. Alpaca's lexer instead tries the patterns in
declaration order and takes the first one that matches at the current position; the first match
wins. When no pattern matches, the lexer reports an error. The result is a flat list of lexemes
that the parser consumes next.
## Regular Languages

> **Definition — Regular language:**
> A language L ⊆ Σ* is *regular* if it is recognized by a finite automaton (FA). Equivalently,
> L can be described by a regular expression over alphabet Σ.
> Each token class defines a regular language: `NUMBER` defines the set
> { "0", "1", ..., "3.14", "100", ... }.
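This membership view can be checked directly with Scala's standard regex support, using the
`NUMBER` pattern from the calculator example:

```scala
import scala.util.matching.Regex

// The NUMBER token class: an integer part, optionally followed by
// a decimal point and a fractional part.
val NUMBER: Regex = "[0-9]+(\\.[0-9]+)?".r

// Regex#matches (Scala 2.13+) tests whole-string membership in the language.
val accepted = List("0", "1", "3.14", "100").map(NUMBER.matches)
val rejected = List("", "3.", ".14", "x1").map(NUMBER.matches)
```

Every string in the definition's example set is in L(NUMBER); strings like `"3."` (a decimal
point without fractional digits) are not.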
Regex notation is a concise way to specify regular languages. This is why regex is the right
tool for token class definitions: token shapes can be recognized with only bounded lookahead, a
structure that regular languages capture exactly. More complex patterns such as balanced
parentheses require a more powerful formalism (context-free grammars, which the parser handles),
but for token recognition, regular expressions are exactly as expressive as needed.
## NFA and DFA: The Conceptual Picture

Any regular expression can be translated into a finite automaton that accepts the same strings.
The standard construction proceeds in two steps.

**Step 1 — NFA (nondeterministic finite automaton).** A regex is converted into an NFA via
Thompson's construction. An NFA can have multiple possible transitions from a state on the same
input, or transitions on the empty string (ε-transitions). For simple patterns this is easy to
visualize. The `PLUS` token pattern `\+` produces a two-state NFA:
| State | Input `+` | Accept? |
|-------|-----------|---------|
| q₀    | q₁        | No      |
| q₁    | —         | Yes     |

The machine starts at q₀, consumes a `+`, and moves to q₁ — an accepting state. Any other
input from q₀ leads nowhere, meaning the string does not match.
**Step 2 — DFA (deterministic finite automaton).** The NFA is then converted to a DFA via the
subset construction. A DFA has exactly one transition per (state, input-character) pair, with no
ambiguity. This matters for performance: a DFA can be executed in O(n) time by reading the input
left to right, one character at a time, following the single applicable transition at each step.
A DFA is therefore the right runtime data structure for a lexer — no backtracking, no branching.

> **Definition — Deterministic Finite Automaton (DFA):**
> A DFA is a 5-tuple (Q, Σ, δ, q₀, F) where:
> - Q is a finite set of states
> - Σ is the input alphabet (here: Unicode characters)
> - δ : Q × Σ → Q is the transition function
> - q₀ ∈ Q is the start state
> - F ⊆ Q is the set of accepting states
>
> A DFA accepts a string w if δ*(q₀, w) ∈ F, where δ* is the iterated transition function.
> In Alpaca's combined lexer DFA, each accepting state also carries a *token label* indicating
> which token class was matched.
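The 5-tuple transcribes directly into code. The sketch below is purely illustrative (a
hand-rolled DFA for the language `[0-9]+`, one or more digits, not Alpaca's internal
representation) and shows the single-pass δ* execution:

```scala
// A DFA (Q, Σ, δ, q0, F) for the language [0-9]+.
// Q = {0, 1, 2}: 0 = start, 1 = digits seen (accepting), 2 = dead state.
val q0 = 0
val accepting = Set(1)

def delta(q: Int, c: Char): Int = (q, c) match
  case (0, d) if d.isDigit => 1
  case (1, d) if d.isDigit => 1
  case _                   => 2 // any other (state, char) pair is a dead end

// δ*(q0, w): fold δ over the input. One pass, one transition per character.
def accepts(w: String): Boolean = accepting(w.foldLeft(q0)(delta))
```

Running `accepts` reads each character exactly once, which is the O(n) behavior the text
describes.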
## Combining Token Patterns into One Automaton

To lex a language with multiple token classes, the standard approach builds one combined DFA. In
theory: construct an NFA for each token pattern, connect them all to a new start state with
epsilon transitions, then convert the combined NFA to a single DFA.

Alpaca follows the same principle but implements it with Java's regex engine, which is
internally a backtracking engine rather than a pre-compiled DFA:
- All token patterns are combined into a single Java regex alternation at compile time:

  ```
  // Conceptual: how Alpaca combines patterns internally
  (?<NUMBER>[0-9]+(\.[0-9]+)?)|(?<PLUS>\+)|(?<MINUS>-)|(?<TIMES>\*)|...
  ```

- `java.util.regex.Pattern.compile(...)` is called inside the `lexerImpl` macro at compile
  time. An invalid regex pattern therefore causes a compile error, not a runtime crash.
- At runtime, `Tokenization.tokenize()` uses `matcher.lookingAt()` on the combined pattern at
  the current input position. It then checks which named group matched using
  `matcher.start(i)` to determine the token class.
In practice, this combined-pattern approach lets Alpaca's lexer scan the input from left to
right in a single pass, much like a hand-built DFA-based lexer. However, it still relies on
Java's backtracking regex engine internally, so Alpaca does not claim a strict worst-case O(n)
time guarantee or the complete absence of backtracking for arbitrary token patterns.
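The runtime loop can be sketched with plain `java.util.regex`. This is a conceptual
reconstruction under stated assumptions: the group names, the whitespace handling, and the error
message are illustrative, not Alpaca's generated code.

```scala
import java.util.regex.Pattern

// A combined alternation with named groups (illustrative patterns).
val combined = Pattern.compile(
  "(?<NUMBER>[0-9]+(\\.[0-9]+)?)|(?<PLUS>\\+)|(?<WS>\\s+)")
val groupNames = List("NUMBER", "PLUS", "WS")

def scan(input: String): List[(String, String)] =
  val m   = combined.matcher(input)
  val out = List.newBuilder[(String, String)]
  var pos = 0
  while pos < input.length do
    m.region(pos, input.length)        // anchor matching at the current offset
    if !m.lookingAt() then sys.error(s"no token matches at offset $pos")
    // The named group with a non-negative start is the one that matched.
    val name = groupNames.find(g => m.start(g) >= 0).get
    if name != "WS" then out += name -> m.group()  // drop whitespace tokens
    pos = m.end()
  out.result()
```

`scan("3 + 4")` yields `NUMBER("3")`, `PLUS("+")`, `NUMBER("4")` in a single left-to-right pass
over the input.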
A practical issue with ordered alternation is *shadowing*. Intuitively, pattern A shadows pattern
B if, whenever B could match starting at some input position, A can also match some (possibly
shorter) prefix there, and A appears before B in the lexer definition. In that situation B will
never be the pattern that the lexer chooses — it is effectively dead code.

Formally, because Alpaca's lexer uses `matcher.lookingAt()`, matching is prefix-based rather than
whole-string based. Alpaca's `RegexChecker` therefore uses the `dregex` library (a Scala/JVM
library for decidable regex operations) to check at compile time whether the *prefix-extended*
language of a later pattern (conceptually its language with `".*"` appended) is a subset of the
prefix-extended language of an earlier pattern. If such a subset relation holds, the macro throws
a `ShadowException` with a compile error pointing to the offending patterns.

**Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
`"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
`"3.14"` is also matched by `"[0-9]+"` on its initial digits, and the integer pattern can match
the prefix `"3"` and would consume it first. The prefix-based `dregex` check catches this at
compile time.
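The effect is easy to reproduce with a plain ordered alternation (an illustrative pattern, not
Alpaca's generated one). With `INT` listed first, the decimal alternative can never win under
first-match semantics:

```scala
import java.util.regex.Pattern

// INT appears before DECIMAL, so wherever DECIMAL could match,
// INT already matches a (shorter) prefix and is chosen first.
val shadowed = Pattern.compile("(?<INT>[0-9]+)|(?<DECIMAL>[0-9]+\\.[0-9]+)")

val m = shadowed.matcher("3.14")
val found  = m.lookingAt()
val winner = if m.start("INT") >= 0 then "INT" else "DECIMAL"
val text   = m.group()  // only "3"; ".14" is left stranded for the next step
```

This is exactly the dead-code situation the compile-time check is designed to reject.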
## Cross-links

- See [Lexer](../lexer.html) for the complete `lexer` DSL reference.
- See [Tokens and Lexemes](tokens.html) for what the lexer produces — the lexeme stream.
- Next: [Context-Free Grammars](cfg.md) for how token streams are parsed.
# The Compilation Pipeline
Source text is just a string. A compiler pipeline is a sequence of transformations that turns
that string into something structured and meaningful. Each stage takes the output of the
previous one, narrowing the representation from raw text to a typed result.

Understanding the pipeline gives you a mental model that applies to every Alpaca program you
write — not just calculator expressions, but any language you define with the library.

## The Four Stages

Most compilers share the same four-stage structure:
1. **Source text** — the raw input string, e.g., `"3 + 4 * 2"`
2. **Lexical analysis** — groups characters into tokens: `NUMBER(3.0)`, `PLUS`, `NUMBER(4.0)`, `TIMES`, `NUMBER(2.0)`
3. **Syntactic analysis** — arranges tokens into a parse tree (concrete syntax tree) that encodes grammatical structure
4. **Semantic analysis / evaluation** — extracts meaning from the tree, producing a typed result (in a calculator: `Double`)

Some compilers add a fifth stage — code generation — that emits machine code or bytecode. Alpaca
stops at stage 4: its pipeline produces a typed Scala value, not machine code.
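To make the four stages concrete before looking at Alpaca's version, here is a tiny hand-rolled
pipeline for the running example, written from scratch for illustration with none of Alpaca's
machinery:

```scala
// Stage 2: lexical analysis, characters to tokens.
enum Tok:
  case Num(v: Double)
  case Plus, Times

def tokenize(src: String): List[Tok] =
  src.split("\\s+").toList.filter(_.nonEmpty).map {
    case "+" => Tok.Plus
    case "*" => Tok.Times
    case n   => Tok.Num(n.toDouble)
  }

// Stages 3-4: a recursive-descent parser for the grammar
//   expr := term ('+' expr)?   term := NUM ('*' term)?
// It evaluates while it parses, so `*` binds tighter than `+`.
def parse(toks: List[Tok]): Double =
  def term(ts: List[Tok]): (Double, List[Tok]) = ts match
    case Tok.Num(v) :: Tok.Times :: rest =>
      val (rhs, rem) = term(rest)
      (v * rhs, rem)
    case Tok.Num(v) :: rest => (v, rest)
    case _                  => sys.error("expected a number")
  def expr(ts: List[Tok]): (Double, List[Tok]) =
    val (lhs, rest) = term(ts)
    rest match
      case Tok.Plus :: more =>
        val (rhs, rem) = expr(more)
        (lhs + rhs, rem)
      case _ => (lhs, rest)
  expr(toks)._1

// Stage 1 in, typed result out: "3 + 4 * 2" evaluates to 11.0.
val result = parse(tokenize("3 + 4 * 2"))
```

Alpaca replaces both hand-written functions with generated code, but the shape of the pipeline
is the same: a string goes in, a typed value comes out.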
## Alpaca's Pipeline

With Alpaca, running the full pipeline takes two calls:
```scala sc:nocompile
// Full pipeline: source text → typed result
val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2")
// lexemes: List[Lexeme] — NUMBER(3.0), PLUS, NUMBER(4.0), TIMES, NUMBER(2.0)

val (_, result) = CalcParser.parse(lexemes)
// result: Double | Null = 11.0
```

`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a
`List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it takes those lexemes, builds the parse
tree internally, and returns the typed result.
Both `CalcLexer` and `CalcParser` are objects generated by Alpaca's macros. Their definitions
live in separate files (see the cross-links at the bottom of this page).
## Compile-time vs Runtime Boundary

Alpaca draws a sharp line between what happens at compile time and what happens at runtime.
This is the most important thing to understand about the library.

> **Compile-time processing:** When you write a `lexer` definition, the Scala 3 macro validates
> your regex patterns, checks for shadowing, and generates the `Tokenization` object. When you
> write a `Parser` definition, the macro reads your grammar, builds the LR(1) parse table, and
> detects any shift/reduce conflicts — all at compile time. At runtime, `tokenize(input)` and
> `parse(lexemes)` execute the pre-generated code.

In concrete terms:

**Compile time:**
- The `lexer` macro validates regex patterns, detects shadowing (where one pattern makes another unreachable), and emits a `Tokenization` object
- The `Parser` macro reads every `Rule` declaration, constructs the LR(1) parse table, and reports any shift/reduce or reduce/reduce conflicts as compile errors

**Runtime:**
- `tokenize(input)` executes the pre-generated code and returns `List[Lexeme]`
- `parse(lexemes)` executes the pre-built parse table and returns the typed result

The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile
error — not a runtime crash. The pipeline is safe by construction before it ever runs on real
input.
Alpaca covers stages 1–4 of the classical pipeline. The "code generation" stage is not part of
the library — your Scala semantic actions in the parser rules produce the final typed value
directly.

> Alpaca's pipeline (projecting away the threaded context and the `Null` failure case) can be
> viewed as `parse ∘ tokenize : String → R`, where `R` is the root non-terminal's result type.
## What Comes Next

- Next: [Tokens & Lexemes](tokens.html) — what the lexer produces: token classes, token
  instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](lexer-fa.html) — how regular expressions define token
  classes and how Alpaca compiles them

For the full API, see the reference pages:

- See [Lexer](../lexer.html) for how `CalcLexer` is defined.
- See [Parser](../parser.html) for how `CalcParser` is defined and how grammar rules produce a
  typed result.
# Tokens and Lexemes

A lexer transforms raw source text into a sequence of structured tokens — the first stage of
compilation. Before writing a lexer, it helps to understand the formal vocabulary: what a token
class is, how individual matches relate to it, and what a lexeme carries.
## Terminal Symbols

In formal grammar, a *terminal symbol* is an atomic element that cannot be broken down further.
It is the end of the line for derivation — terminals represent the actual characters or strings
that appear in source text. In a lexer, each token class acts as a terminal: it names a category
of strings, and no lexer-level expansion applies below it.
See the discussion of context-free grammars for how terminals fit into production rules.
In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries five
core pieces of information:

- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
  for PLUS
- `text` — the matched source substring (also available as `lexeme.text`)
- `line` — the line on which the match occurred
- `position` — the 1-based column within the current line at which the match occurred

More generally, a lexeme exposes the lexer context fields via dynamic selection; `line` and
`position` above come from the default lexer context.
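As a mental model, a lexeme can be pictured as a small record. The definition below is a
simplified, hypothetical sketch: Alpaca's real `Lexeme[Name, Value]` exposes its context fields
dynamically rather than as fixed case-class fields.

```scala
// Simplified, illustrative model. Not Alpaca's actual Lexeme definition.
final case class Lexeme[Value](
  name: String,   // token class, e.g. "NUMBER"
  value: Value,   // typed extracted value, e.g. 3.14: Double
  text: String,   // the matched source substring
  line: Int,      // line of the match
  position: Int   // 1-based column within that line
)

val num  = Lexeme("NUMBER", 3.14, "3.14", line = 1, position = 1)
val plus = Lexeme("PLUS", (), "+", line = 1, position = 6)
```

The key point survives the simplification: `value` is typed per token class, so a `NUMBER`
lexeme carries a `Double` while a `PLUS` lexeme carries only `Unit`.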
The `lexer` macro compiles these definitions into a `Tokenization` object. See
[The Lexer: Regex to Finite Automata](lexer-fa.html) for what the macro does internally.

## Cross-links

- See [Lexer](../lexer.html) for the full `lexer` DSL reference and all token forms.
- See [The Lexer: Regex to Finite Automata](lexer-fa.html) for how regex patterns define token
  classes.