Complete v1.2: Cookbook pages and tech debt cleanup #266

# Error Messages

Alpaca surfaces errors at three distinct points -- compile time, lex time, and parse time -- each with different behavior and handling strategies.

> **Compile-time processing:** The `lexer` block is a Scala 3 macro. `ShadowException`, invalid regex patterns, and unsupported guards are all detected at compile time and reported as compiler errors, not runtime exceptions. These errors cannot be caught with `try`/`catch` -- they prevent compilation entirely.

## Compile-Time Errors

Compile-time errors are emitted by the Alpaca macro when it processes your `lexer` or `parser` definition. They appear as ordinary compiler errors in your IDE or build output. Because they occur at compile time, there is no way to handle them at runtime -- you must fix the definition and recompile.

### ShadowException

A `ShadowException` occurs when an earlier pattern always matches everything a later pattern would match, making the later pattern unreachable. The macro performs pairwise regex inclusion checks and fails compilation if any pattern is shadowed.

```scala sc:nocompile
import alpaca.*

// This does NOT compile -- ShadowException
val BadLexer = lexer:
  case "[a-zA-Z_][a-zA-Z0-9_]*" => Token["IDENTIFIER"] // general pattern
  case "[a-zA-Z]+" => Token["ALPHABETIC"] // ERROR: shadowed by IDENTIFIER

// Fix: more-specific patterns before more-general ones
val GoodLexer = lexer:
  case "if" => Token["IF"] // keyword first
  case "[a-zA-Z_][a-zA-Z0-9_]*" => Token["IDENTIFIER"] // general pattern last
  case "\\s+" => Token.Ignored
```

The compile error reads: `Pattern [a-zA-Z]+ is shadowed by [a-zA-Z_][a-zA-Z0-9_]*`. The fix is always the same: move the more specific pattern before the more general one.

### Guards Are Not Supported

Scala pattern guards (`case "regex" if condition =>`) are not supported in lexer rule definitions. Using one produces a compile-time error:

```scala sc:nocompile
import alpaca.*

// WRONG -- compile error: "Guards are not supported yet"
case class MyCtx(var text: CharSequence = "", var flag: Boolean = false) extends LexerCtx

val GuardedLexer = lexer[MyCtx]:
  case "token" if ctx.flag => Token["A"]

// Fix: move the condition inside the rule body
val CorrectLexer = lexer[MyCtx]:
  case "token" =>
    if ctx.flag then Token["A"] else Token["B"]
```

## Runtime Lexer Errors

If `tokenize()` encounters a character that does not match any pattern, it throws a `RuntimeException` immediately. There is no skip-and-continue behavior -- lexing stops at the first unrecognized character and the exception propagates to the caller.

```scala sc:nocompile
import alpaca.*

val NumLexer = lexer:
  case num @ "[0-9]+" => Token["NUM"](num.toInt)
  case "\\s+" => Token.Ignored

try
  val (_, lexemes) = NumLexer.tokenize("42 abc")
catch
  case e: RuntimeException =>
    println(e.getMessage) // "Unexpected character: 'a'"
```

The exception message contains the unexpected character but not its position in the input. For position information, use a context that tracks position -- see [Lexer Context](../lexer-context.html).

There is no custom error handler API yet ([GitHub issue #21](https://github.com/bkozak-scancode/alpaca/issues/21) is open).

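Until a handler API exists, callers can wrap `tokenize()` themselves. The sketch below is illustrative only -- the wrapper name and message format are not Alpaca API:

```scala sc:nocompile
import alpaca.*

// Illustrative wrapper (not Alpaca API): re-throw the lexer's
// RuntimeException with the offending input attached, since the
// message itself carries only the unexpected character.
def tokenizeOrExplain(input: String) =
  try NumLexer.tokenize(input)
  catch
    case e: RuntimeException =>
      throw RuntimeException(s"${e.getMessage} while lexing: $input", e)
```
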
## Parser Failure

`parse()` returns `T | Null`. A `null` result means the input token sequence did not match the grammar. This is not an exception -- it is a normal return value. Always check for `null` before using the result.

```scala sc:nocompile
import alpaca.*

val (_, lexemes) = CalcLexer.tokenize("1 + + 2")
val (_, result) = CalcParser.parse(lexemes)

if result == null then
  println("Parse failed: input did not match the grammar")
else
  println(s"Result: $result")
```

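Because `null` is an ordinary value here, standard Scala helpers apply. `Option.apply` maps `null` to `None`; this is plain Scala, not anything Alpaca-specific:

```scala
// Pure Scala: Option.apply maps null to None, so a `T | Null` parse
// result becomes an Option you can handle safely.
val failed: String | Null = null
val ok: String | Null = "sum"

assert(Option(failed).isEmpty)
assert(Option(ok).contains("sum"))
```
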
There is no structured parser error reporting yet -- `null` is the only signal that parsing failed ([GitHub issue #51](https://github.com/bkozak-scancode/alpaca/issues/51) and [#65](https://github.com/bkozak-scancode/alpaca/issues/65) are open).

## See Also

- [Lexer Error Recovery](../lexer-error-recovery.html) -- full reference: `ShadowException`, runtime errors, pattern ordering
- [Lexer Context](../lexer-context.html) -- `PositionTracking` and `LineTracking` for position-aware error reporting
- [Parser](../parser.html) -- `parse()` return type, `T | Null` contract

# Expression Evaluator

Alpaca's `before`/`after` DSL resolves operator precedence conflicts in the LR parse table at compile time, letting you build a fully evaluated expression parser with correct precedence and associativity.

> **Compile-time processing:** When you declare `override val resolutions = Set(...)`, the Alpaca macro bakes your precedence rules directly into the LR(1) parse table during compilation. No precedence checks happen at runtime -- the parser executes deterministically from a pre-resolved table.

## The Problem

Arithmetic grammars are ambiguous without explicit precedence declarations. The expression `1 + 2 * 3` can parse as `(1 + 2) * 3 = 9` or `1 + (2 * 3) = 7`, and the LR algorithm cannot choose between them on its own. Alpaca reports these as shift/reduce conflicts at compile time and gives you the `before`/`after` DSL to resolve them by declaring which productions take priority.

## Define the Lexer

```scala sc:nocompile
import alpaca.*

val CalcLexer = lexer:
  case num @ "[0-9]+(\\.[0-9]+)?" => Token["NUMBER"](num.toDouble)
  case "\\+" => Token["PLUS"]
  case "-" => Token["MINUS"]
  case "\\*" => Token["TIMES"]
  case "/" => Token["DIVIDE"]
  case "\\(" => Token["LPAREN"]
  case "\\)" => Token["RPAREN"]
  case "\\s+" => Token.Ignored
```

The regex `[0-9]+(\.[0-9]+)?` matches both integers and decimals. `num.toDouble` converts the matched string to a `Double`, so `Token["NUMBER"]` carries a `Double` value -- this is what makes `Rule[Double]` the right type for the parser.

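The number pattern can be sanity-checked in plain Scala, independent of Alpaca:

```scala
// Plain Scala regex check of the NUMBER pattern used above.
val numPattern = "[0-9]+(\\.[0-9]+)?".r

assert(numPattern.matches("42"))
assert(numPattern.matches("3.14"))
assert(!numPattern.matches("abc"))
assert("3.14".toDouble == 3.14)
```
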
## Define the Parser

```scala sc:nocompile
import alpaca.*

object CalcParser extends Parser:
  val Expr: Rule[Double] = rule(
    "plus" { case (Expr(a), CalcLexer.PLUS(_), Expr(b)) => a + b },
    "minus" { case (Expr(a), CalcLexer.MINUS(_), Expr(b)) => a - b },
    "times" { case (Expr(a), CalcLexer.TIMES(_), Expr(b)) => a * b },
    "div" { case (Expr(a), CalcLexer.DIVIDE(_), Expr(b)) => a / b },
    { case (CalcLexer.LPAREN(_), Expr(e), CalcLexer.RPAREN(_)) => e },
    { case CalcLexer.NUMBER(n) => n.value },
  )

  val root: Rule[Double] = rule:
    case Expr(e) => e

  override val resolutions = Set(
    production.plus.before(CalcLexer.PLUS, CalcLexer.MINUS),
    production.plus.after(CalcLexer.TIMES, CalcLexer.DIVIDE),
    production.minus.before(CalcLexer.PLUS, CalcLexer.MINUS),
    production.minus.after(CalcLexer.TIMES, CalcLexer.DIVIDE),
    production.times.before(CalcLexer.TIMES, CalcLexer.DIVIDE, CalcLexer.PLUS, CalcLexer.MINUS),
    production.div.before(CalcLexer.TIMES, CalcLexer.DIVIDE, CalcLexer.PLUS, CalcLexer.MINUS),
  )
```

The parenthesized-expression rule matches on `CalcLexer.LPAREN` and `CalcLexer.RPAREN`, the token names declared in the lexer.

Reading `production.plus.before(CalcLexer.PLUS, CalcLexer.MINUS)`: when the parser has just recognized the `plus` production and the next token is `+` or `-`, prefer the reduction. This gives `+` left associativity and equal precedence with `-`.

Reading `production.plus.after(CalcLexer.TIMES, CalcLexer.DIVIDE)`: when the conflict is between reducing `plus` and shifting `*` or `/`, prefer shifting. This makes `*` and `/` bind tighter.

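Associativity falls out of the same rules. Repeated subtraction makes the grouping visible -- a sketch whose expected value assumes the resolutions above behave as described:

```scala sc:nocompile
import alpaca.*

// (10 - 3) - 2 = 5.0 under left associativity,
// not 10 - (3 - 2) = 9.0 under right associativity.
val (_, lexemes) = CalcLexer.tokenize("10 - 3 - 2")
val (_, result) = CalcParser.parse(lexemes)
// result: Double | Null -- expected 5.0 if `minus` reduces before another '-' is shifted
```
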
## Run It

```scala sc:nocompile
import alpaca.*

val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2")
val (_, result) = CalcParser.parse(lexemes)
// result: Double | Null -- 11.0 (not 14.0, because * binds tighter than +)
```

Always check for `null` before using the result -- `null` means the input did not match the grammar.

## Key Points

- `Rule[Double]` because `NUMBER` yields `Double` (`num.toDouble` in the lexer).
- `n.value` extracts the `Double` from the matched lexeme -- `n` is a `Lexeme`, not a `Double` directly.
- `resolutions` must be the **last `val`** in the parser object -- the macro reads top-to-bottom and must have seen all rule declarations before processing `resolutions`.
- Use `before`/`after` (not `alwaysBefore`/`alwaysAfter` -- the compiler error message suggests those names, but they do not exist in the API).
- `production` is a `@compileTimeOnly` construct: valid only inside the `resolutions` value.

## See Also

- [Conflict Resolution](../conflict-resolution.html) -- `before`/`after` DSL reference, `Production(symbols*)` selector, token-side resolution
- [Parser](../parser.html) -- rule syntax, `root` requirement, `Rule[T]` types
- [Lexer](../lexer.html) -- token definition, `Token["NAME"](value)` constructor

# Multi-Pass Processing

Alpaca has no dedicated multi-pass API; multi-pass is a composition pattern -- tokenize the input with a first lexer, transform the resulting `List[Lexeme]` in plain Scala, then parse or re-lex as needed.

> **Compile-time processing:** The lexer and parser macros are compiled independently; the `List[Lexeme]` boundary between them is an ordinary runtime value you can inspect and transform with any Scala collection operations.

## The Pattern

`tokenize()` returns a named tuple `(ctx, lexemes: List[Lexeme])`; `lexemes` is an ordinary `List` you can `filter`, `map`, or chain to a second stage. `parse()` accepts any `List[Lexeme]` directly -- the type refinement is widened at the call site, so filtered or re-ordered lists are compatible without any casting. Each `Lexeme` has a `name: String` field (the token name) and a `value: Any` field (the extracted value) that you can inspect during transformation.

Important constraint: the `Lexeme` constructor is private to the `alpaca` package, so you cannot create new `Lexeme` instances. Multi-pass works by transforming the list of existing lexemes -- filter, reorder, or re-lex string values with a second lexer call.

## Example: Comment Stripping

The most common multi-pass pattern: lex input that contains comments, strip the comment tokens from the list, then parse the clean token stream.

```scala sc:nocompile
import alpaca.*

// Stage 1: lex with comments
val Stage1 = lexer:
  case "#.*" => Token["COMMENT"]
  case num @ "[0-9]+" => Token["NUM"](num.toInt)
  case "\\+" => Token["PLUS"]
  case "\\s+" => Token.Ignored

object SumParser extends Parser:
  val Sum: Rule[Int] = rule(
    { case (Sum(a), Stage1.PLUS(_), Sum(b)) => a + b },
    { case Stage1.NUM(n) => n.value },
  )

  val root: Rule[Int] = rule:
    case Sum(s) => s
```

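Stage 2 filters on `Lexeme.name` and hands the cleaned list straight to `parse()` -- a sketch in which the input string and expected value are illustrative:

```scala sc:nocompile
import alpaca.*

// Stage 2: drop COMMENT lexemes, then parse the clean stream.
val (_, lexemes) = Stage1.tokenize("1 + 2 # trailing comment")
val cleaned = lexemes.filter(_.name != "COMMENT")
val (_, result) = SumParser.parse(cleaned)
// result: Int | Null -- 3 if the grammar matches as sketched
```
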
# Whitespace-Sensitive Lexing

Use a custom `LexerCtx` to track indentation depth and emit `INDENT` or `DEDENT` tokens when the indentation level changes between lines.

> **Compile-time processing:** The `lexer[MyCtx]` macro inspects `MyCtx` at compile time; it auto-composes `BetweenStages` hooks from parent traits, and the `ctx` value available in rule bodies is a compile-time alias that is replaced by field accesses in the generated code.

## The LexerCtx Contract

A valid custom context must satisfy three rules:

1. It must be a **case class** -- `LexerCtx` has a `this: Product =>` self-type; the auto-derivation machinery requires a `Product` instance, and regular classes do not satisfy it.
2. It must include **`var text: CharSequence = ""`** -- `LexerCtx` declares this field as abstract; omitting it produces a compile error.
3. **All fields must have default values** -- the `Empty[T]` derivation macro reads default parameter values from the companion object to construct the initial context.

Mutable state fields must be `var` so the lexer can assign to them directly.

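A minimal context satisfying all three rules might look like this (the extra `count` field is illustrative):

```scala sc:nocompile
import alpaca.*

// Case class, includes the required `var text`, and every field has a default.
case class CountCtx(
  var text: CharSequence = "",
  var count: Int = 0, // illustrative extra state
) extends LexerCtx
```
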
## Tracking Indentation

Define a context with `currentIndent` and `prevIndent` fields; when a newline followed by spaces is matched, count the spaces to determine the new indentation level and compare it against the previous level. Guards are not supported in lexer rules (`case "regex" if condition =>` is a compile error), so check the condition inside the rule body instead. Emit `Token["INDENT"]` when indentation increases, `Token["DEDENT"]` when it decreases, and `Token.Ignored` when it stays the same.

```scala sc:nocompile
import alpaca.*

case class IndentCtx(
  var text: CharSequence = "",
  var currentIndent: Int = 0,
  var prevIndent: Int = 0,
) extends LexerCtx

val IndentLexer = lexer[IndentCtx]:
  case "\\n( *)" =>
    val newIndent = ctx.text.toString.count(_ == ' ')
    val prev = ctx.prevIndent
    ctx.prevIndent = newIndent
    ctx.currentIndent = newIndent
    // Guards are not supported -- check condition in body
    if newIndent > prev then Token["INDENT"](newIndent)
    else if newIndent < prev then Token["DEDENT"](newIndent)
    else Token.Ignored
  case word @ "[a-z_][a-z0-9_]*" => Token["WORD"](word)
  case "\\s+" => Token.Ignored
```

The `\\n( *)` pattern matches a newline followed by zero or more spaces. `ctx.text` contains the full match text at the time the rule body runs, so counting its spaces gives the new indentation level. `Token["INDENT"](newIndent)` and `Token["DEDENT"](newIndent)` carry the new depth as their value, which the parser can read.

Because guards are not supported, the `if`/`else` lives inside the rule body rather than after the pattern.

## Reading INDENT and DEDENT in the Parser

The parser sees `INDENT` and `DEDENT` tokens in the lexeme list just like any other token. Use `IndentLexer.INDENT(n)` to extract the new depth from the lexeme value -- `n` is a `Lexeme`, and `n.value` is the `Int` depth passed to `Token["INDENT"](newIndent)`.

```scala sc:nocompile
import alpaca.*

object IndentParser extends Parser:
  val Block: Rule[List[String]] = rule(
    { case (IndentLexer.INDENT(_), Block(inner), IndentLexer.DEDENT(_)) => inner },
    { case IndentLexer.WORD(w) => List(w.value) },
  )

  val root: Rule[List[String]] = rule:
    case Block(b) => b
```

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the parser example, the token accessors for parentheses are written as
CalcLexer.(/ `CalcLexer.`\)but here they appear asCalcLexer.(with only a single backslash ((/)). That token name won’t match the lexer definition above (`case "\\(" => Token["LPAREN"]`, etc.) and conflicts with the accessor form documented elsewhere (`CalcLexer.`\\(). Update the snippet to use the correct backticked accessor names for LPAREN/RPAREN (or useCalcLexer.LPAREN/CalcLexer.RPARENconsistently).