Add compiler theory foundation pages (pipeline, tokens, lexer/FA) #261
# The Lexer: Regex to Finite Automata

## What Does a Lexer Do?
A lexer reads a character stream from left to right and emits a token stream. Classical lexer
generators apply the *maximal munch* rule: each scan step takes the longest prefix of the
remaining input that matches a token class pattern. Alpaca's lexer instead tries the patterns in
declaration order and takes the first one that matches at the current position; the first match
wins. When no pattern matches, the lexer reports an error. The result is a flat list of lexemes
that the parser consumes next.
## Regular Languages

> **Definition — Regular language:**
> A language L ⊆ Σ* is *regular* if it is recognized by a finite automaton (FA). Equivalently,
> L can be described by a regular expression over alphabet Σ.
> Each token class defines a regular language: `NUMBER` defines the set
> { "0", "1", ..., "3.14", "100", ... }.
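This membership view can be checked directly with Scala's standard regex support, using the
`NUMBER` pattern from the calculator example:

```scala
import scala.util.matching.Regex

// The NUMBER token class: an integer part, optionally followed by
// a decimal point and a fractional part.
val NUMBER: Regex = "[0-9]+(\\.[0-9]+)?".r

// Regex#matches (Scala 2.13+) tests whole-string membership in the language.
val accepted = List("0", "1", "3.14", "100").map(NUMBER.matches)
val rejected = List("", "3.", ".14", "x1").map(NUMBER.matches)
```

Every string in the definition's example set is in L(NUMBER); strings like `"3."` (a decimal
point without fractional digits) are not.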
Regex notation is a concise way to specify regular languages. This is why regex is the right
tool for token class definitions: token shapes can be recognized with only bounded lookahead, a
structure that regular languages capture exactly. More complex patterns such as balanced
parentheses require a more powerful formalism (context-free grammars, which the parser handles),
but for token recognition, regular expressions are exactly as expressive as needed.
## NFA and DFA: The Conceptual Picture

Any regular expression can be translated into a finite automaton that accepts the same strings.
The standard construction proceeds in two steps.

**Step 1 — NFA (nondeterministic finite automaton).** A regex is converted into an NFA via
Thompson's construction. An NFA can have multiple possible transitions from a state on the same
input, or transitions on the empty string (ε-transitions). For simple patterns this is easy to
visualize. The `PLUS` token pattern `\+` produces a two-state NFA:
| State | Input `+` | Accept? |
|-------|-----------|---------|
| q₀    | q₁        | No      |
| q₁    | —         | Yes     |

The machine starts at q₀, consumes a `+`, and moves to q₁ — an accepting state. Any other
input from q₀ leads nowhere, meaning the string does not match.
**Step 2 — DFA (deterministic finite automaton).** The NFA is then converted to a DFA via the
subset construction. A DFA has exactly one transition per (state, input-character) pair, with no
ambiguity. This matters for performance: a DFA can be executed in O(n) time by reading the input
left to right, one character at a time, following the single applicable transition at each step.
A DFA is therefore the right runtime data structure for a lexer — no backtracking, no branching.

> **Definition — Deterministic Finite Automaton (DFA):**
> A DFA is a 5-tuple (Q, Σ, δ, q₀, F) where:
> - Q is a finite set of states
> - Σ is the input alphabet (here: Unicode characters)
> - δ : Q × Σ → Q is the transition function
> - q₀ ∈ Q is the start state
> - F ⊆ Q is the set of accepting states
>
> A DFA accepts a string w if δ*(q₀, w) ∈ F, where δ* is the iterated transition function.
> In Alpaca's combined lexer DFA, each accepting state also carries a *token label* indicating
> which token class was matched.
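The 5-tuple transcribes directly into code. The sketch below is purely illustrative (a
hand-rolled DFA for the language `[0-9]+`, one or more digits, not Alpaca's internal
representation) and shows the single-pass δ* execution:

```scala
// A DFA (Q, Σ, δ, q0, F) for the language [0-9]+.
// Q = {0, 1, 2}: 0 = start, 1 = digits seen (accepting), 2 = dead state.
val q0 = 0
val accepting = Set(1)

def delta(q: Int, c: Char): Int = (q, c) match
  case (0, d) if d.isDigit => 1
  case (1, d) if d.isDigit => 1
  case _                   => 2 // any other (state, char) pair is a dead end

// δ*(q0, w): fold δ over the input. One pass, one transition per character.
def accepts(w: String): Boolean = accepting(w.foldLeft(q0)(delta))
```

Running `accepts` reads each character exactly once, which is the O(n) behavior the text
describes.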
## Combining Token Patterns into One Automaton

To lex a language with multiple token classes, the standard approach builds one combined DFA. In
theory: construct an NFA for each token pattern, connect them all to a new start state with
epsilon transitions, then convert the combined NFA to a single DFA.

Alpaca follows the same principle but implements it with Java's regex engine, which is
internally a backtracking engine rather than a pre-compiled DFA:
- All token patterns are combined into a single Java regex alternation at compile time:

  ```
  // Conceptual: how Alpaca combines patterns internally
  (?<NUMBER>[0-9]+(\.[0-9]+)?)|(?<PLUS>\+)|(?<MINUS>-)|(?<TIMES>\*)|...
  ```

- `java.util.regex.Pattern.compile(...)` is called inside the `lexerImpl` macro at compile
  time. An invalid regex pattern therefore causes a compile error, not a runtime crash.
- At runtime, `Tokenization.tokenize()` uses `matcher.lookingAt()` on the combined pattern at
  the current input position. It then checks which named group matched using
  `matcher.start(i)` to determine the token class.
In practice, this combined-pattern approach lets Alpaca's lexer scan the input from left to
right in a single pass, much like a hand-built DFA-based lexer. However, it still relies on
Java's backtracking regex engine internally, so Alpaca does not claim a strict worst-case O(n)
time guarantee or the complete absence of backtracking for arbitrary token patterns.
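The runtime loop can be sketched with plain `java.util.regex`. This is a conceptual
reconstruction under stated assumptions: the group names, the whitespace handling, and the error
message are illustrative, not Alpaca's generated code.

```scala
import java.util.regex.Pattern

// A combined alternation with named groups (illustrative patterns).
val combined = Pattern.compile(
  "(?<NUMBER>[0-9]+(\\.[0-9]+)?)|(?<PLUS>\\+)|(?<WS>\\s+)")
val groupNames = List("NUMBER", "PLUS", "WS")

def scan(input: String): List[(String, String)] =
  val m   = combined.matcher(input)
  val out = List.newBuilder[(String, String)]
  var pos = 0
  while pos < input.length do
    m.region(pos, input.length)        // anchor matching at the current offset
    if !m.lookingAt() then sys.error(s"no token matches at offset $pos")
    // The named group with a non-negative start is the one that matched.
    val name = groupNames.find(g => m.start(g) >= 0).get
    if name != "WS" then out += name -> m.group()  // drop whitespace tokens
    pos = m.end()
  out.result()
```

`scan("3 + 4")` yields `NUMBER("3")`, `PLUS("+")`, `NUMBER("4")` in a single left-to-right pass
over the input.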
A practical issue with ordered alternation is *shadowing*. Intuitively, pattern A shadows pattern
B if, whenever B could match starting at some input position, A can also match some (possibly
shorter) prefix there, and A appears before B in the lexer definition. In that situation B will
never be the pattern that the lexer chooses — it is effectively dead code.

Formally, because Alpaca's lexer uses `matcher.lookingAt()`, matching is prefix-based rather than
whole-string based. Alpaca's `RegexChecker` therefore uses the `dregex` library (a Scala/JVM
library for decidable regex operations) to check at compile time whether the *prefix-extended*
language of a later pattern (conceptually its language with `".*"` appended) is a subset of the
prefix-extended language of an earlier pattern. If such a subset relation holds, the macro throws
a `ShadowException` with a compile error pointing to the offending patterns.

**Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
`"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
`"3.14"` is also matched by `"[0-9]+"` on its initial digits, and the integer pattern can match
the prefix `"3"` and would consume it first. The prefix-based `dregex` check catches this at
compile time.
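The effect is easy to reproduce with a plain ordered alternation (an illustrative pattern, not
Alpaca's generated one). With `INT` listed first, the decimal alternative can never win under
first-match semantics:

```scala
import java.util.regex.Pattern

// INT appears before DECIMAL, so wherever DECIMAL could match,
// INT already matches a (shorter) prefix and is chosen first.
val shadowed = Pattern.compile("(?<INT>[0-9]+)|(?<DECIMAL>[0-9]+\\.[0-9]+)")

val m = shadowed.matcher("3.14")
val found  = m.lookingAt()
val winner = if m.start("INT") >= 0 then "INT" else "DECIMAL"
val text   = m.group()  // only "3"; ".14" is left stranded for the next step
```

This is exactly the dead-code situation the compile-time check is designed to reject.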
## Cross-links

- See [Lexer](../lexer.html) for the complete `lexer` DSL reference.
- See [Tokens and Lexemes](tokens.html) for what the lexer produces — the lexeme stream.
- Next: [Context-Free Grammars](cfg.md) for how token streams are parsed.
# The Compilation Pipeline
Source text is just a string. A compiler pipeline is a sequence of transformations that turns
that string into something structured and meaningful. Each stage takes the output of the
previous one, narrowing the representation from raw text to a typed result.

Understanding the pipeline gives you a mental model that applies to every Alpaca program you
write — not just calculator expressions, but any language you define with the library.

## The Four Stages

Most compilers share the same four-stage structure:
1. **Source text** — the raw input string, e.g., `"3 + 4 * 2"`
2. **Lexical analysis** — groups characters into tokens: `NUMBER(3.0)`, `PLUS`, `NUMBER(4.0)`, `TIMES`, `NUMBER(2.0)`
3. **Syntactic analysis** — arranges tokens into a parse tree (concrete syntax tree) that encodes grammatical structure
4. **Semantic analysis / evaluation** — extracts meaning from the tree, producing a typed result (in a calculator: `Double`)

Some compilers add a fifth stage — code generation — that emits machine code or bytecode. Alpaca
stops at stage 4: its pipeline produces a typed Scala value, not machine code.
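To make the four stages concrete before looking at Alpaca's version, here is a tiny hand-rolled
pipeline for the running example, written from scratch for illustration with none of Alpaca's
machinery:

```scala
// Stage 2: lexical analysis, characters to tokens.
enum Tok:
  case Num(v: Double)
  case Plus, Times

def tokenize(src: String): List[Tok] =
  src.split("\\s+").toList.filter(_.nonEmpty).map {
    case "+" => Tok.Plus
    case "*" => Tok.Times
    case n   => Tok.Num(n.toDouble)
  }

// Stages 3-4: a recursive-descent parser for the grammar
//   expr := term ('+' expr)?   term := NUM ('*' term)?
// It evaluates while it parses, so `*` binds tighter than `+`.
def parse(toks: List[Tok]): Double =
  def term(ts: List[Tok]): (Double, List[Tok]) = ts match
    case Tok.Num(v) :: Tok.Times :: rest =>
      val (rhs, rem) = term(rest)
      (v * rhs, rem)
    case Tok.Num(v) :: rest => (v, rest)
    case _                  => sys.error("expected a number")
  def expr(ts: List[Tok]): (Double, List[Tok]) =
    val (lhs, rest) = term(ts)
    rest match
      case Tok.Plus :: more =>
        val (rhs, rem) = expr(more)
        (lhs + rhs, rem)
      case _ => (lhs, rest)
  expr(toks)._1

// Stage 1 in, typed result out: "3 + 4 * 2" evaluates to 11.0.
val result = parse(tokenize("3 + 4 * 2"))
```

Alpaca replaces both hand-written functions with generated code, but the shape of the pipeline
is the same: a string goes in, a typed value comes out.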
## Alpaca's Pipeline

With Alpaca, running the full pipeline takes two calls:
```scala sc:nocompile
// Full pipeline: source text → typed result
val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2")
// lexemes: List[Lexeme] — NUMBER(3.0), PLUS, NUMBER(4.0), TIMES, NUMBER(2.0)

val (_, result) = CalcParser.parse(lexemes)
// result: Double | Null = 11.0
```

`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a
`List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it takes those lexemes, builds the parse
tree internally, and returns the typed result.
Both `CalcLexer` and `CalcParser` are objects generated by Alpaca's macros. Their definitions
live in separate files (see the cross-links at the bottom of this page).
## Compile-time vs Runtime Boundary

Alpaca draws a sharp line between what happens at compile time and what happens at runtime.
This is the most important thing to understand about the library.

> **Compile-time processing:** When you write a `lexer` definition, the Scala 3 macro validates
> your regex patterns, checks for shadowing, and generates the `Tokenization` object. When you
> write a `Parser` definition, the macro reads your grammar, builds the LR(1) parse table, and
> detects any shift/reduce conflicts — all at compile time. At runtime, `tokenize(input)` and
> `parse(lexemes)` execute the pre-generated code.

In concrete terms:

**Compile time:**
- The `lexer` macro validates regex patterns, detects shadowing (where one pattern makes another unreachable), and emits a `Tokenization` object
- The `Parser` macro reads every `Rule` declaration, constructs the LR(1) parse table, and reports any shift/reduce or reduce/reduce conflicts as compile errors

**Runtime:**
- `tokenize(input)` executes the pre-generated code and returns `List[Lexeme]`
- `parse(lexemes)` executes the pre-built parse table and returns the typed result

The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile
error — not a runtime crash. The pipeline is safe by construction before it ever runs on real
input.
Alpaca covers stages 1–4 of the classical pipeline. The "code generation" stage is not part of
the library — your Scala semantic actions in the parser rules produce the final typed value
directly.

> Alpaca's pipeline (projecting away the threaded context and the `Null` failure case) can be
> viewed as `parse ∘ tokenize : String → R`, where `R` is the root non-terminal's result type.
## What Comes Next

- Next: [Tokens & Lexemes](tokens.html) — what the lexer produces: token classes, token
  instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](lexer-fa.html) — how regular expressions define token
  classes and how Alpaca compiles them

For the full API, see the reference pages:

- See [Lexer](../lexer.html) for how `CalcLexer` is defined.
- See [Parser](../parser.html) for how `CalcParser` is defined and how grammar rules produce a
  typed result.
# Tokens and Lexemes

A lexer transforms raw source text into a sequence of structured tokens — the first stage of
compilation. Before writing a lexer, it helps to understand the formal vocabulary: what a token
class is, how individual matches relate to it, and what a lexeme carries.
## Terminal Symbols

In formal grammar, a *terminal symbol* is an atomic element that cannot be broken down further.
It is the end of the line for derivation — terminals represent the actual characters or strings
that appear in source text. In a lexer, each token class acts as a terminal: it names a category
of strings, and no lexer-level expansion applies below it.
See the discussion of context-free grammars for how terminals fit into production rules.
In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries five
core pieces of information:

- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
  for PLUS
- `text` — the matched source substring (also available as `lexeme.text`)
- `line` — the line on which the match occurred
- `position` — the 1-based column within the current line at which the match occurred

More generally, a lexeme exposes the lexer context fields via dynamic selection; `line` and
`position` above come from the default lexer context.
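As a mental model, a lexeme can be pictured as a small record. The definition below is a
simplified, hypothetical sketch: Alpaca's real `Lexeme[Name, Value]` exposes its context fields
dynamically rather than as fixed case-class fields.

```scala
// Simplified, illustrative model. Not Alpaca's actual Lexeme definition.
final case class Lexeme[Value](
  name: String,   // token class, e.g. "NUMBER"
  value: Value,   // typed extracted value, e.g. 3.14: Double
  text: String,   // the matched source substring
  line: Int,      // line of the match
  position: Int   // 1-based column within that line
)

val num  = Lexeme("NUMBER", 3.14, "3.14", line = 1, position = 1)
val plus = Lexeme("PLUS", (), "+", line = 1, position = 6)
```

The key point survives the simplification: `value` is typed per token class, so a `NUMBER`
lexeme carries a `Double` while a `PLUS` lexeme carries only `Unit`.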
The `lexer` macro compiles these definitions into a `Tokenization` object. See
[The Lexer: Regex to Finite Automata](lexer-fa.html) for what the macro does internally.

## Cross-links

- See [Lexer](../lexer.html) for the full `lexer` DSL reference and all token forms.
- See [The Lexer: Regex to Finite Automata](lexer-fa.html) for how regex patterns define token
  classes.