diff --git a/docs/_docs/theory/lexer-fa.md b/docs/_docs/theory/lexer-fa.md
new file mode 100644
index 00000000..8cf73e67
--- /dev/null
+++ b/docs/_docs/theory/lexer-fa.md
@@ -0,0 +1,112 @@

# The Lexer: Regex to Finite Automata

## What Does a Lexer Do?

A lexer reads a character stream from left to right and emits a token stream. Each scan step finds the longest prefix of the remaining input that matches one of the token class patterns — this is the *maximal munch* rule. When no pattern matches at the current position, the lexer throws an error. The result is a flat list of lexemes that the parser consumes next.

## Regular Languages

> **Definition — Regular language:**
> A language L ⊆ Σ* is *regular* if it is recognized by a finite automaton (FA). Equivalently, L can be described by a regular expression over alphabet Σ.
> Each token class defines a regular language: `NUMBER` defines the set { "0", "1", ..., "3.14", "100", ... }.

Regex notation is a concise way to specify regular languages, which is why regex is the right tool for token class definitions — token classes have a "look ahead a bounded amount" structure that regular languages capture exactly. More complex patterns such as balanced parentheses require a more powerful formalism (context-free grammars, which the parser handles), but for token recognition, regular expressions are exactly as powerful as the job requires.

## NFA and DFA: The Conceptual Picture

Any regular expression can be translated into a finite automaton that accepts the same strings. The standard construction proceeds in two steps.

**Step 1 — NFA (nondeterministic finite automaton).** A regex is converted into an NFA via Thompson's construction. An NFA can have multiple possible transitions from a state on the same input, or transitions on the empty string. For simple patterns this is easy to visualize. The `PLUS` token pattern `\+` produces a two-state NFA:

| State | Input `+` | Accept?
| +|-------|-----------|---------| +| q₀ | q₁ | No | +| q₁ | — | Yes | + +The machine starts at q₀, consumes a `+`, and moves to q₁ — an accepting state. Any other +input from q₀ leads nowhere, meaning the string does not match. + +**Step 2 — DFA (deterministic finite automaton).** An NFA is then converted to a DFA. A DFA +has exactly one transition per (state, input-character) pair, with no ambiguity. This matters +for performance: a DFA can be executed in O(n) time by reading the input left to right, one +character at a time, following the single applicable transition at each step. A DFA is therefore +the right runtime data structure for a lexer — no backtracking, no branching. + +> **Definition — Deterministic Finite Automaton (DFA):** +> A DFA is a 5-tuple (Q, Σ, δ, q₀, F) where: +> - Q is a finite set of states +> - Σ is the input alphabet (here: Unicode characters) +> - δ : Q × Σ → Q is the transition function +> - q₀ ∈ Q is the start state +> - F ⊆ Q is the set of accepting states +> +> A DFA accepts a string w if δ*(q₀, w) ∈ F, where δ* is the iterated transition function. +> In Alpaca's combined lexer DFA, each accepting state also carries a *token label* indicating +> which token class was matched. + +## Combining Token Patterns into One Automaton + +To lex a language with multiple token classes, the standard approach builds one combined DFA. In +theory: construct an NFA for each token pattern, connect them all to a new start state with +epsilon transitions, then convert the combined NFA to a single DFA. + +Alpaca follows the same principle but implements it using Java's regex engine, which is itself +backed by NFA/DFA machinery: + +- All token patterns are combined into a single Java regex alternation at compile time: + +``` +// Conceptual: how Alpaca combines patterns internally +(?[0-9]+(\.[0-9]+)?)|(?\+)|(?-)|(?\*)|... +``` + +- `java.util.regex.Pattern.compile(...)` is called inside the `lexerImpl` macro at compile + time. 
An invalid regex pattern therefore causes a compile error, not a runtime crash.
- At runtime, `Tokenization.tokenize()` uses `matcher.lookingAt()` on the combined pattern at the current input position. It then checks which named group matched using `matcher.start(i)` to determine the token class.

This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass through the input, no backtracking across tokens.

## Shadowing Detection

A practical issue with ordered alternation is *shadowing*: pattern A shadows pattern B if A appears before B in the lexer definition and every string matched by B is also matched by A — that is, L(B) ⊆ L(A). If this occurs, B will never match — it is dead code.

Alpaca's `RegexChecker` uses the `dregex` library (a Scala/JVM library for decidable regex operations) to check at compile time whether any pattern's language is a subset of an earlier pattern's language. If shadowing is detected, the macro throws a `ShadowException` with a compile error pointing to the offending patterns.

**Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern `"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one: the integer pattern matches every decimal like `"3.14"` up to the decimal point, so it grabs the prefix `"3"` and consumes it first, and the decimal pattern never gets to produce the longer match. The `dregex` check catches this ordering mistake at compile time rather than silently producing wrong output at runtime.

In `CalcLexer`, the decimal pattern `"[0-9]+(\\.[0-9]+)?"` is listed first, before any simpler integer-only pattern, so no shadowing occurs.

> **Compile-time processing:** The `lexer` macro validates all regex patterns, combines them into a single alternation pattern, and checks for shadowing using `dregex` — all at compile time.
> If a regex is invalid or one pattern shadows another, you get a compile error. At runtime, the generated `Tokenization` object runs the pre-compiled combined regex against your input string.

## Cross-links

- See [Lexer](../lexer.md) for the complete `lexer` DSL reference.
- See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.
- Next: [Context-Free Grammars](cfg.md) for how token streams are parsed.

diff --git a/docs/_docs/theory/pipeline.md b/docs/_docs/theory/pipeline.md
new file mode 100644
index 00000000..ee051733
--- /dev/null
+++ b/docs/_docs/theory/pipeline.md
@@ -0,0 +1,88 @@

# The Compilation Pipeline

Source text is just a string. A compiler pipeline is a sequence of transformations that turns that string into something structured and meaningful. Each stage takes the output of the previous one, narrowing the representation from raw text to a typed result.

Understanding the pipeline gives you a mental model that applies to every Alpaca program you write — not just calculator expressions, but any language you define with the library.

## The Four Stages

Most compilers share the same four-stage structure:

1. **Source text** — the raw input string, e.g., `"3 + 4 * 2"`
2. **Lexical analysis** — groups characters into tokens: `NUMBER(3.0)`, `PLUS`, `NUMBER(4.0)`, `TIMES`, `NUMBER(2.0)`
3. **Syntactic analysis** — arranges tokens into a parse tree (concrete syntax tree) that encodes grammatical structure
4. **Semantic analysis / evaluation** — extracts meaning from the tree, producing a typed result (in a calculator: `Double`)

Some compilers add a fifth stage — code generation — that emits machine code or bytecode. Alpaca stops at stage 4: its pipeline produces a typed Scala value, not machine code.
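To make stage 2 concrete, the lexical-analysis step can be sketched as a small hand-rolled scanner over a combined regex with named groups. Everything here (`combined`, `groupNames`, `lex`) is illustrative scaffolding, not Alpaca's API; Alpaca generates the equivalent machinery for you at compile time.

```scala
import java.util.regex.Pattern

// Toy combined alternation: one named group per token class.
// Illustrative only — NOT Alpaca's implementation.
val combined = Pattern.compile(
  "(?<NUMBER>[0-9]+(\\.[0-9]+)?)|(?<PLUS>\\+)|(?<TIMES>\\*)|(?<WS>\\s+)")

val groupNames = List("NUMBER", "PLUS", "TIMES", "WS")

def lex(src: String): List[(String, String)] =
  val m   = combined.matcher(src)
  val out = List.newBuilder[(String, String)]
  var pos = 0
  while pos < src.length do
    m.region(pos, src.length)            // anchor matching at the current offset
    if !m.lookingAt() then sys.error(s"no token matches at offset $pos")
    // Exactly one named group participates in a successful match;
    // start(name) is -1 for the groups that did not match.
    val name = groupNames.find(g => m.start(g) >= 0).get
    if name != "WS" then out += name -> m.group(name)  // drop whitespace
    pos = m.end()                        // advance past the match
  out.result()
```

For `"3 + 4 * 2"` this produces the pairs `NUMBER/3`, `PLUS/+`, `NUMBER/4`, `TIMES/*`, `NUMBER/2`; whitespace matches the `WS` group but is dropped, mirroring the role of `Token.Ignored`.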
+ +## Alpaca's Pipeline + +With Alpaca, running the full pipeline takes two calls: + +```scala sc:nocompile +// Full pipeline: source text → typed result +val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2") +// lexemes: List[Lexeme] — NUMBER(3.0), PLUS, NUMBER(4.0), TIMES, NUMBER(2.0) + +val (_, result) = CalcParser.parse(lexemes) +// result: Double | Null = 11.0 +``` + +`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it takes those lexemes, builds the parse tree internally, and returns the typed result. + +Both `CalcLexer` and `CalcParser` are objects generated by Alpaca's macros. Their definitions live in separate files (see the cross-links at the bottom of this page). + +## Compile-time vs Runtime Boundary + +Alpaca draws a sharp line between what happens at compile time and what happens at runtime. This is the most important thing to understand about the library. + +> **Compile-time processing:** When you write a `lexer` definition, the Scala 3 macro validates your regex patterns, checks for shadowing, and generates the `Tokenization` object. When you write a `Parser` definition, the macro reads your grammar, builds the LR(1) parse table, and detects any shift/reduce conflicts — all at compile time. At runtime, `tokenize(input)` and `parse(lexemes)` execute the pre-generated code. 
In concrete terms:

**Compile time:**
- The `lexer` macro validates regex patterns, detects shadowing (where one pattern makes another unreachable), and emits a `Tokenization` object
- The `Parser` macro reads every `Rule` declaration, constructs the LR(1) parse table, and reports any shift/reduce or reduce/reduce conflicts as compile errors

**Runtime:**
- `tokenize(input)` executes the pre-generated code and returns `List[Lexeme]`
- `parse(lexemes)` executes the pre-built parse table and returns the typed result

The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile error — not a runtime crash. The pipeline is safe by construction before it ever runs on real input.

Alpaca covers stages 1–4 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

## Formal Definition

> **Definition — Compilation pipeline:**
> A compiler pipeline is a composition of transformations fₙ ∘ ... ∘ f₂ ∘ f₁ where each fᵢ maps the output of fᵢ₋₁ to a more structured representation.
> Alpaca's pipeline: `parse ∘ tokenize : String → R` where R is the root non-terminal's result type.

For the calculator example, `R` is `Double`. For a JSON parser, `R` might be `Any` or a custom AST type. The pipeline shape is always the same; only the result type changes.

The parser internally appends a special `Lexeme.EOF` marker to the lexeme list before running the shift/reduce loop. This is an implementation detail — you do not need to add it yourself.
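Read as functions, `parse ∘ tokenize` applies `tokenize` first and feeds its output to `parse`. In Scala that order is written with `andThen`. A minimal sketch with stand-in stages (toy functions, not Alpaca's API):

```scala
// Stand-in stages for illustration only — not Alpaca's API.
val tokenize: String => List[String] = _.split("\\s+").toList  // f₁: text → tokens
val parse: List[String] => Int = _.length                      // f₂: tokens → "typed result"

// The composition parse ∘ tokenize, written left to right with andThen.
val pipeline: String => Int = tokenize.andThen(parse)
```

`pipeline("3 + 4 * 2")` returns `5`, the token count. The shape `String => R` matches Alpaca's pipeline; only the stages here are toys.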
## Mapping the Stages to Alpaca Types

Each pipeline stage corresponds to a concrete Alpaca type:

| Stage | Input | Output | Alpaca Type |
|-------|-------|--------|-------------|
| Source text | — | `String` | `String` (plain Scala) |
| Lexical analysis | `String` | token stream | `List[Lexeme]` |
| Syntactic analysis | `List[Lexeme]` | parse tree (internal) | LR(1) stack (internal) |
| Semantic analysis | parse tree | typed result | `R \| Null` (your root type) |

The parse tree is never exposed directly — Alpaca builds it internally and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions). What you get back from `parse` is the final typed value, not an intermediate tree.

## What Comes Next

The rest of the Compiler Theory Tutorial builds on this mental model:

- Next: [Tokens & Lexemes](tokens.md) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](lexer-fa.md) — how regular expressions define token classes and how Alpaca compiles them

For the full API, see the reference pages:

- See [Lexer](../lexer.md) for how `CalcLexer` is defined.
- See [Parser](../parser.md) for how `CalcParser` is defined and how grammar rules produce a typed result.

diff --git a/docs/_docs/theory/tokens.md b/docs/_docs/theory/tokens.md
new file mode 100644
index 00000000..fe725007
--- /dev/null
+++ b/docs/_docs/theory/tokens.md
@@ -0,0 +1,123 @@

# Tokens and Lexemes

A lexer transforms raw source text into a sequence of structured tokens — the first stage of compilation. Before writing a lexer, it helps to understand the formal vocabulary: what a token class is, how individual matches relate to it, and what a lexeme carries.

## Terminal Symbols

In a formal grammar, a *terminal symbol* is an atomic element that cannot be broken down further.
It is the end of the line for derivation — terminals represent the actual characters or strings that appear in source text. In a lexer, each token class acts as a terminal: it names a category of strings, and no lexer-level expansion applies below it.

See [Context-Free Grammars](cfg.md) for how terminals fit into production rules.

## Token Classes vs Token Instances

It is useful to distinguish three levels:

**Token class** — defines a category of strings by a regular expression. For example, the `NUMBER` class matches any string of the form `[0-9]+(\.[0-9]+)?`: the integers `"3"`, `"42"`, and the decimals `"3.14"`, `"0.5"`.

**Token instance** — a specific string found in the input that belongs to a token class. When the lexer scans `"3 + 4"`, it finds three token instances: the string `"3"` (a NUMBER), the string `"+"` (a PLUS), and the string `"4"` (another NUMBER).

**Lexeme** — the full record of a token instance: the token class, the matched text, and its position in the source. Tokenizing `"3 + 4"` produces three lexemes:

- `NUMBER("3", pos=0)`
- `PLUS("+", pos=2)`
- `NUMBER("4", pos=4)`

The word *lexeme* is used throughout this documentation to mean this complete record.

> **Definition — Lexeme:**
> A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string (a member of the language defined by T's regex), and pos is the position of the end of the match in the source text.
> In Alpaca: `Lexeme[Name, Value]` where `Name` is the token class name (a string literal type) and `Value` is the Scala type of the extracted value.

## Alpaca's Lexeme Type

In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`.
A lexeme carries four +pieces of information: + +- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"` +- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit` + for PLUS +- `position` — the character offset at the end of the match +- `line` — the line number at the end of the match + +The tokenization output for a simple expression illustrates this: + +```scala sc:nocompile +val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2") +// lexemes: List[Lexeme] = +// NUMBER(3.0), PLUS, NUMBER(4.0), TIMES, NUMBER(2.0) +// +// Each Lexeme carries: +// .name — token class name (e.g., "NUMBER") +// .value — extracted value (e.g., 3.0: Double) +// .position — character offset at end of match +// .line — line number at end of match +``` + +Whitespace matches `Token.Ignored` and does not produce a lexeme — it disappears from the stream. + +## CalcLexer Token Class Table + +The `CalcLexer` running example defines seven token classes: + +| Token Class | Regex Pattern | Value Type | Example Match | +|-------------|--------------|------------|---------------| +| `NUMBER` | `[0-9]+(\.[0-9]+)?` | `Double` | `"3.14"` → `3.14` | +| `PLUS` | `\+` | `Unit` | `"+"` | +| `MINUS` | `-` | `Unit` | `"-"` | +| `TIMES` | `\*` | `Unit` | `"*"` | +| `DIVIDE` | `/` | `Unit` | `"/"` | +| `LPAREN` | `\(` | `Unit` | `"("` | +| `RPAREN` | `\)` | `Unit` | `")"` | + +Whitespace is ignored (`Token.Ignored`) and does not appear in the lexeme stream. + +`NUMBER` is the only value-bearing token: the macro uses the `@` binding to convert the matched +string to a `Double`. The remaining six tokens carry `Unit` — their presence in the stream is +enough; no value needs to be extracted. + +## Full CalcLexer Definition + +The canonical CalcLexer definition, which appears throughout the documentation as a running +example: + +```scala sc:nocompile +import alpaca.* + +val CalcLexer = lexer: + case num @ "[0-9]+(\\.[0-9]+)?" 
=> Token["NUMBER"](num.toDouble) + case "\\+" => Token["PLUS"] + case "-" => Token["MINUS"] + case "\\*" => Token["TIMES"] + case "/" => Token["DIVIDE"] + case "\\(" => Token["LPAREN"] + case "\\)" => Token["RPAREN"] + case "\\s+" => Token.Ignored +``` + +Each `case` arm maps a Java regex pattern to a token constructor. Patterns are tested in order; +the first match wins. The `num @` binding in the first arm captures the matched text as a +`String`, which `num.toDouble` converts to a `Double` before it is stored in the lexeme. + +This definition uses `sc:nocompile` because `lexer` is a Scala 3 macro: the macro runs at +compile time, validates all regex patterns, checks for shadowing, and generates the +`Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.md) for what the +macro does internally. + +## Cross-links + +- See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms. +- See [The Lexer: Regex to Finite Automata](lexer-fa.md) for how regex patterns define token + classes formally.