# Theory application #288
# Context-Free Grammars

Context-free grammars are the backbone of syntactic analysis. A grammar defines a language by specifying how symbols can be combined and rewritten — "context-free" means each rule applies regardless of surrounding context. If the lexer is the vocabulary of a language, the grammar is its syntax.

## What is a Context-Free Grammar?

A grammar consists of a set of non-terminal symbols (grammar variables that can be expanded), a set of terminal symbols (the tokens the lexer produces), a set of production rules (rewrite rules), and a start symbol. A derivation starts from the start symbol and repeatedly replaces non-terminals with production right-hand sides until only terminals remain. The language of a grammar G is the set of all terminal strings reachable from the start symbol.

> **Definition — Context-Free Grammar:**
> A CFG is a 4-tuple G = (V, Σ, R, S) where:
> - V is a finite set of non-terminal symbols (grammar variables)
> - Σ is a finite set of terminal symbols (tokens), V ∩ Σ = ∅
> - R ⊆ V × (V ∪ Σ)* is a finite set of production rules
> - S ∈ V is the start symbol
>
> A production rule A → α means the non-terminal A can be replaced by the symbol string α.
> A grammar generates the language L(G) = { w ∈ Σ* | S ⇒* w } — all terminal strings
> derivable from S in zero or more steps.
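To make the definition concrete, the 4-tuple can be written down as plain data. The sketch below is illustrative only (ordinary Scala collections, not Alpaca's API), using the classic balanced-parentheses grammar S → ( S ) | ε:

```scala
// A CFG as plain data. Symbols are strings; each rule maps a non-terminal
// to its alternatives, where an alternative is a sequence of symbols.
case class CFG(
  nonTerminals: Set[String],               // V
  terminals: Set[String],                  // Σ, disjoint from V
  rules: Map[String, List[List[String]]],  // R: A → α1 | α2 | ...
  start: String                            // S ∈ V
)

// The balanced-parentheses grammar: S → ( S ) | ε
val balanced = CFG(
  nonTerminals = Set("S"),
  terminals = Set("(", ")"),
  rules = Map("S" -> List(List("(", "S", ")"), Nil)), // Nil encodes ε
  start = "S"
)

// One derivation step: replace the leftmost non-terminal in a sentential
// form with the chosen alternative of its rule.
def step(g: CFG, form: List[String], alt: Int): List[String] = {
  val i = form.indexWhere(g.nonTerminals.contains)
  if (i < 0) form // only terminals remain: the derivation is finished
  else form.take(i) ++ g.rules(form(i))(alt) ++ form.drop(i + 1)
}
```

Starting from `List("S")` and applying `step` with alternative 0 twice and alternative 1 once derives `( ( ) )`, a terminal string in L(G).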
## BNF Notation

Production rules are written in Backus-Naur Form (BNF): `A → α` means A can be rewritten as α. The vertical bar `|` separates alternatives, so `A → α | β` is shorthand for two rules. Non-terminals are written in CamelCase; terminals are UPPERCASE (matching Alpaca's token name conventions).

EBNF (Extended BNF) adds optional elements `[...]`, repetition `{...}`, and grouping `(...)`. These shorthands can always be translated into plain BNF, but are useful for compact notation. This page uses BNF throughout for clarity; Alpaca's DSL maps directly to BNF productions.
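As an illustration (the rule names here are invented for the example, not part of the calculator grammar), an EBNF repetition can be expanded into plain BNF by introducing a helper non-terminal:

```
EBNF:   Args → Expr { COMMA Expr }

BNF:    Args     → Expr ArgsTail
        ArgsTail → COMMA Expr ArgsTail
                 | ε
```

The `{...}` repetition becomes a right-recursive helper rule with an ε (empty) alternative; `[...]` optionality expands the same way, with and without the bracketed symbols.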
## The Calculator Grammar

The calculator grammar is the running example for the entire Compiler Theory Tutorial. It defines arithmetic expressions with four operators and parentheses:

```
Expr → Expr PLUS Expr
     | Expr MINUS Expr
     | Expr TIMES Expr
     | Expr DIVIDE Expr
     | LPAREN Expr RPAREN
     | NUMBER

root → Expr
```

Identifying the 4-tuple components:

- V = {Expr, root} — two non-terminals
- Σ = {NUMBER, PLUS, MINUS, TIMES, DIVIDE, LPAREN, RPAREN} — seven terminal symbols, produced by CalcLexer
- R = the 7 production rules above
- S = root — the start symbol
Note: this grammar is **ambiguous** — the expression `1 + 2 * 3` can be parsed in two ways depending on which `Expr` is expanded first. We will see how Alpaca resolves ambiguities on the [Conflict Resolution](../conflict-resolution.md) page.
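The two readings correspond to two different groupings, and for a calculator they yield two different answers:

```
1 + (2 * 3)   TIMES grouped below PLUS   → 7
(1 + 2) * 3   PLUS grouped below TIMES   → 9
```

A parser must commit to exactly one of these structures, which is why ambiguity has to be resolved before the grammar is usable.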
## Derivation

A *derivation* is a sequence of rewriting steps from the start symbol to a terminal string. Each step replaces the leftmost non-terminal with one of its production alternatives (leftmost derivation).

Leftmost derivation for `1 + 2`:

```
root ⇒ Expr
     ⇒ Expr PLUS Expr      (apply: Expr → Expr PLUS Expr)
     ⇒ NUMBER PLUS Expr    (apply: Expr → NUMBER, leftmost)
     ⇒ NUMBER PLUS NUMBER  (apply: Expr → NUMBER, leftmost)
```

The first step applies `root → Expr`; the second expands the leftmost `Expr` using the `Expr PLUS Expr` production; the third and fourth substitute the literal `NUMBER` terminal for each remaining `Expr`.
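For contrast, a *rightmost* derivation expands the rightmost non-terminal at each step. It is worth seeing once, because LR parsers (the subject of the following pages) trace a rightmost derivation in reverse:

```
root ⇒ Expr
     ⇒ Expr PLUS Expr      (apply: Expr → Expr PLUS Expr)
     ⇒ Expr PLUS NUMBER    (apply: Expr → NUMBER, rightmost)
     ⇒ NUMBER PLUS NUMBER  (apply: Expr → NUMBER)
```

Both derivations apply the same productions and describe the same grammatical structure for `1 + 2`; only the order of expansion differs.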
## Parse Trees

A parse tree captures the grammatical structure of a derivation as a tree. Each internal node is a non-terminal; each leaf is a terminal. The parse tree for `1 + 2`:

```
        root
          |
        Expr
      /   |   \
   Expr  PLUS  Expr
    |            |
 NUMBER       NUMBER
  (1.0)        (2.0)
```
Note: In Alpaca, no parse tree object is ever constructed or exposed to user code. The `Parser` macro generates a shift-reduce parser that evaluates your semantic actions (the `=>` expressions in `rule` definitions) immediately as each reduction is applied, so the tree above exists only conceptually. What `parse()` returns is the typed result — a `Double` in the calculator case — not an intermediate tree object. (See [The Compilation Pipeline](pipeline.md) for the full picture.)
## Alpaca DSL Mapping

The calculator grammar maps directly to an Alpaca `Parser` definition. Each production rule becomes a case clause in a `rule(...)` call; the pattern on the left of `=>` matches the grammatical structure, and the expression on the right computes the result.

```scala sc:nocompile
import alpaca.*

object CalcParser extends Parser:
  val Expr: Rule[Double] = rule(
    { case (Expr(a), CalcLexer.PLUS(_), Expr(b)) => a + b },
    { case (Expr(a), CalcLexer.MINUS(_), Expr(b)) => a - b },
    { case (Expr(a), CalcLexer.TIMES(_), Expr(b)) => a * b },
    { case (Expr(a), CalcLexer.DIVIDE(_), Expr(b)) => a / b },
    { case (CalcLexer.LPAREN(_), Expr(e), CalcLexer.RPAREN(_)) => e },
    { case CalcLexer.NUMBER(n) => n.value },
  )
  val root: Rule[Double] = rule:
    case Expr(v) => v
```

Each `case` clause corresponds to one production rule. `Expr(a)` matches a reduced `Expr` non-terminal with value `a`. `CalcLexer.PLUS(_)` matches the PLUS terminal (the `_` discards the lexeme value since PLUS carries `Unit`). `CalcLexer.NUMBER(n)` matches a NUMBER terminal; `n.value` accesses the `Double` extracted by the lexer. The grammar's non-terminals (`Expr`, `root`) become `Rule[Double]` values; the type parameter is the result type of each reduction.
> **Compile-time processing:** When you define `object CalcParser extends Parser`, the Alpaca macro reads every `rule` declaration and constructs the LR(1) parse table at compile time.
## Cross-links

- See [Tokens and Lexemes](tokens.md) for how the terminal symbols (NUMBER, PLUS, etc.) are produced by the lexer.
- Next: [Why LR?](why-lr.md) — why LR parsing was chosen over top-down alternatives.
- See [Parser](../parser.md) for the complete `rule` DSL reference and all extractor forms.
- See [Conflict Resolution](../conflict-resolution.md) for how Alpaca resolves ambiguity in the calculator grammar.
# The Lexer: Regex to Finite Automata

## What Does a Lexer Do?

A lexer reads a character stream from left to right and emits a token stream. At each position, it tries the token class patterns in their specified order and picks the first one whose regex matches a prefix of the remaining input: first match wins, so the order in which token classes are declared matters. When no pattern matches the current position, the lexer throws an error. The result is a flat list of lexemes that the parser consumes next.
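The ordered, first-match-wins scan can be sketched in a few lines of plain Scala. This is an illustrative toy, not Alpaca's implementation; the token classes, their regexes, and the whitespace-skipping convention here are invented for the example:

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// Toy tokenizer: token classes are tried in declaration order, and the
// first pattern whose regex matches a prefix of the remaining input wins.
val tokenClasses: List[(String, Pattern)] = List(
  "NUMBER" -> Pattern.compile("\\d+(\\.\\d+)?"),
  "PLUS"   -> Pattern.compile("\\+"),
  "WS"     -> Pattern.compile("\\s+") // matched but not emitted
)

def tokenize(input: String): List[(String, String)] = {
  val out = ListBuffer.empty[(String, String)]
  var pos = 0
  while (pos < input.length) {
    val rest = input.substring(pos)
    // lookingAt() anchors the match attempt at the start of `rest`
    val hit = tokenClasses.iterator
      .map { case (name, p) => (name, p.matcher(rest)) }
      .find { case (_, m) => m.lookingAt() }
    hit match {
      case Some((name, m)) =>
        if (name != "WS") out += name -> rest.substring(0, m.end)
        pos += m.end
      case None =>
        throw new RuntimeException(s"no token class matches at offset $pos")
    }
  }
  out.toList
}
```

With these toy patterns, `tokenize("1 + 2")` yields `List(("NUMBER", "1"), ("PLUS", "+"), ("NUMBER", "2"))`, with the whitespace matches consumed but dropped.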
# The Compilation Pipeline

Source text is just a string. A compiler pipeline is a sequence of transformations that turns that string into something structured and meaningful. Each stage takes the output of the previous one, narrowing the representation from raw text to a typed result.

Understanding the pipeline gives you a mental model that applies to every Alpaca program you write — not just calculator expressions, but any language you define with the library.

## The Four Stages

Most compilers share the same four-stage structure:

1. **Source text** — the raw input string, e.g., `"3 + 4 * 2"`
2. **Lexical analysis** — groups characters into tokens: `NUMBER(3.0)`, `PLUS`, `NUMBER(4.0)`, `TIMES`, `NUMBER(2.0)`
3. **Syntactic analysis** — arranges tokens into a parse tree (concrete syntax tree) that encodes grammatical structure
4. **Semantic analysis / evaluation** — extracts meaning from the tree, producing a typed result (in a calculator: `Double`)

Some compilers add a fifth stage — code generation — that emits machine code or bytecode. Alpaca stops at stage 4: its pipeline produces a typed Scala value, not machine code.
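Laid out as a data flow, the four stages look like this for the running example:

```
"3 + 4 * 2"                                      source text
      |  lexical analysis (CalcLexer.tokenize)
      v
NUMBER(3.0) PLUS NUMBER(4.0) TIMES NUMBER(2.0)   token stream
      |  syntactic analysis (LR(1) shift-reduce)
      v
grammatical structure, applied as reductions
      |  semantic analysis (rule actions)
      v
11.0 : Double                                    typed result
```

Note that the middle representation is conceptual: as described below, Alpaca applies the grammatical structure as reductions rather than materializing a tree.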
## Alpaca's Pipeline

With Alpaca, running the full pipeline takes two calls:

```scala sc:nocompile
// Full pipeline: source text → typed result
val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2")
// lexemes: List[Lexeme] — NUMBER(3.0), PLUS, NUMBER(4.0), TIMES, NUMBER(2.0)

val (_, result) = CalcParser.parse(lexemes)
// result: Double | Null = 11.0
```

`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it consumes those lexemes, driving the generated LR(1) parse table and your semantic actions to compute the typed result, without constructing an explicit parse tree data structure.