# Cookbook pages #286

Open -- halotukozak wants to merge 4 commits into `tech-debt-cleanup` from `cookbook-pages`.

Commits:

- 104f572 -- feat(14-02): write multi-pass cookbook page (CB-03)
- 6c0394a -- feat(14-01): write expression-evaluator cookbook page (CB-01)
- 6c51ba8 -- feat(14-02): write whitespace-sensitive lexing cookbook page (CB-04)
- eb60c89 -- feat(14-01): write error-messages cookbook page (CB-02)
# Error Messages

Alpaca surfaces errors at three distinct points -- compile time, lex time, and parse time -- each with different behavior and handling strategies.

> **Compile-time processing:** The `lexer` block is a Scala 3 macro. `ShadowException`, invalid regex patterns, and unsupported guards are all detected at compile time and reported as compiler errors, not runtime exceptions. These errors cannot be caught with `try`/`catch` -- they prevent compilation entirely.

## Compile-Time Errors

Compile-time errors are emitted by the Alpaca macro when it processes your `lexer` or `parser` definition. They appear as ordinary compiler errors in your IDE or build output. Because they occur at compile time, there is no way to handle them at runtime -- you must fix the definition and recompile.
### ShadowException

A `ShadowException` occurs when an earlier pattern always matches everything a later pattern would match, making the later pattern unreachable. The macro performs pairwise regex inclusion checks and fails compilation if any pattern is shadowed.

```scala sc:nocompile
import alpaca.*

// This does NOT compile -- ShadowException
val BadLexer = lexer:
  case "[a-zA-Z_][a-zA-Z0-9_]*" => Token["IDENTIFIER"] // general pattern
  case "[a-zA-Z]+" => Token["ALPHABETIC"]              // ERROR: shadowed by IDENTIFIER

// Fix: more-specific patterns before more-general ones
val GoodLexer = lexer:
  case "if" => Token["IF"]                             // keyword first
  case "[a-zA-Z_][a-zA-Z0-9_]*" => Token["IDENTIFIER"] // general pattern last
  case "\\s+" => Token.Ignored
```

The compile error reads: `Pattern [a-zA-Z]+ is shadowed by [a-zA-Z_][a-zA-Z0-9_]*`. The fix is always the same: move the more specific pattern before the more general one.
### Guards Are Not Supported

Scala pattern guards (`case "regex" if condition =>`) are not supported in lexer rule definitions. Using one produces a compile-time error:

```scala sc:nocompile
import alpaca.*

case class MyCtx(var text: CharSequence = "", var flag: Boolean = false) extends LexerCtx

// WRONG -- compile error: "Guards are not supported yet"
val GuardedLexer = lexer[MyCtx]:
  case "token" if ctx.flag => Token["A"]

// Fix: move the condition inside the rule body
val CorrectLexer = lexer[MyCtx]:
  case "token" =>
    if ctx.flag then Token["A"] else Token["B"]
```
## Runtime Lexer Errors

If `tokenize()` encounters a character that does not match any pattern, it throws a `RuntimeException` immediately. There is no skip-and-continue behavior -- lexing stops at the first unrecognized character and the exception propagates to the caller.

```scala sc:nocompile
import alpaca.*

val NumLexer = lexer:
  case num @ "[0-9]+" => Token["NUM"](num.toInt)
  case "\\s+" => Token.Ignored

try
  val (_, lexemes) = NumLexer.tokenize("42 abc")
catch
  case e: RuntimeException =>
    println(e.getMessage) // "Unexpected character: 'a'"
```

The exception message contains the unexpected character but not its position in the input. For position information, use a context that tracks position -- see [Lexer Context](../lexer-context.html).
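As a hedged sketch of what such a context might look like -- the `PositionTracking` mixin is named on the Lexer Context page, but how it composes with a custom context is an assumption here, not verified against the library:

```scala sc:nocompile
import alpaca.*

// Assumption: mixing in PositionTracking contributes position fields to the
// context; LexerCtx still requires the `var text` field with a default.
case class PosCtx(var text: CharSequence = "") extends LexerCtx, PositionTracking

val TrackedLexer = lexer[PosCtx]:
  case num @ "[0-9]+" => Token["NUM"](num.toInt)
  case "\\s+" => Token.Ignored
```

See the Lexer Context page for the authoritative shape of `PositionTracking` and `LineTracking`.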
There is no custom error handler API yet ([GitHub issue #21](https://github.com/bkozak-scancode/alpaca/issues/21) is open).
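Until that lands, one workaround is to wrap `tokenize()` yourself and attach the input to the message -- a sketch using only the documented `RuntimeException` behavior (the helper name is ours, and `NumLexer` is the lexer from the example above):

```scala sc:nocompile
import alpaca.*

// Hypothetical wrapper: re-throw with the offending input attached,
// since the library's message carries only the unexpected character.
def tokenizeOrReport(input: String) =
  try NumLexer.tokenize(input)
  catch
    case e: RuntimeException =>
      throw RuntimeException(s"${e.getMessage} while lexing: $input", e)
```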
## Parser Failure

The result component of `parse()` has type `T | Null`. A `null` result means the input token sequence did not match the grammar. This is not an exception -- it is a normal return value. Always check for `null` before using the result.

```scala sc:nocompile
import alpaca.*

val (_, lexemes) = CalcLexer.tokenize("1 + + 2")
val (_, result) = CalcParser.parse(lexemes)
if result == null then
  println("Parse failed: input did not match the grammar")
else
  println(s"Result: $result")
```

There is no structured parser error reporting yet -- `null` is the only signal that parsing failed ([GitHub issue #51](https://github.com/bkozak-scancode/alpaca/issues/51) and [#65](https://github.com/bkozak-scancode/alpaca/issues/65) are open).
## See Also

- [Lexer Error Recovery](../lexer-error-recovery.html) -- full reference: `ShadowException`, runtime errors, pattern ordering
- [Lexer Context](../lexer-context.html) -- `PositionTracking` and `LineTracking` for position-aware error reporting
- [Parser](../parser.html) -- `parse()` return type, `T | Null` contract
# Expression Evaluator

Alpaca's `before`/`after` DSL resolves operator precedence conflicts in the LR parse table at compile time, letting you build a fully evaluated expression parser with correct precedence and associativity.

> **Compile-time processing:** When you declare `override val resolutions = Set(...)`, the Alpaca macro bakes your precedence rules directly into the LR(1) parse table during compilation. No precedence checks happen at runtime -- the parser executes deterministically from a pre-resolved table.

## The Problem

Arithmetic grammars are ambiguous without explicit precedence declarations. The expression `1 + 2 * 3` can parse as `(1 + 2) * 3 = 9` or `1 + (2 * 3) = 7`, and the LR algorithm cannot choose between them on its own. Alpaca reports these as shift/reduce conflicts at compile time and gives you the `before`/`after` DSL to resolve them by declaring which productions take priority.

## Define the Lexer
```scala sc:nocompile
import alpaca.*

val CalcLexer = lexer:
  case num @ "[0-9]+(\\.[0-9]+)?" => Token["NUMBER"](num.toDouble)
  case "\\+" => Token["PLUS"]
  case "-" => Token["MINUS"]
  case "\\*" => Token["TIMES"]
  case "/" => Token["DIVIDE"]
  case "\\(" => Token["LPAREN"]
  case "\\)" => Token["RPAREN"]
  case "\\s+" => Token.Ignored
```

The regex `[0-9]+(\.[0-9]+)?` matches both integers and decimals. `num.toDouble` converts the matched string to a `Double`, so `Token["NUMBER"]` carries a `Double` value -- this is what makes `Rule[Double]` the right type for the parser.
## Define the Parser

```scala sc:nocompile
import alpaca.*

object CalcParser extends Parser:
  val Expr: Rule[Double] = rule(
    "plus" { case (Expr(a), CalcLexer.PLUS(_), Expr(b)) => a + b },
    "minus" { case (Expr(a), CalcLexer.MINUS(_), Expr(b)) => a - b },
    "times" { case (Expr(a), CalcLexer.TIMES(_), Expr(b)) => a * b },
    "div" { case (Expr(a), CalcLexer.DIVIDE(_), Expr(b)) => a / b },
    { case (CalcLexer.LPAREN(_), Expr(e), CalcLexer.RPAREN(_)) => e },
    { case CalcLexer.NUMBER(n) => n.value },
  )
  val root: Rule[Double] = rule:
    case Expr(e) => e

  override val resolutions = Set(
    production.plus.before(CalcLexer.PLUS, CalcLexer.MINUS),
    production.plus.after(CalcLexer.TIMES, CalcLexer.DIVIDE),
    production.minus.before(CalcLexer.PLUS, CalcLexer.MINUS),
    production.minus.after(CalcLexer.TIMES, CalcLexer.DIVIDE),
    production.times.before(CalcLexer.TIMES, CalcLexer.DIVIDE, CalcLexer.PLUS, CalcLexer.MINUS),
    production.div.before(CalcLexer.TIMES, CalcLexer.DIVIDE, CalcLexer.PLUS, CalcLexer.MINUS),
  )
```

Reading `production.plus.before(CalcLexer.PLUS, CalcLexer.MINUS)`: when the parser has reduced the `plus` production and the next token is `+` or `-`, prefer the reduction. This gives `+` left associativity and equal precedence with `-`.

Reading `production.plus.after(CalcLexer.TIMES, CalcLexer.DIVIDE)`: when the conflict is between reducing `plus` and shifting `*` or `/`, prefer shifting. This makes `*` and `/` bind tighter.
## Run It

```scala sc:nocompile
import alpaca.*

val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2")
val (_, result) = CalcParser.parse(lexemes)
// result: Double | Null -- 11.0 (not 14.0, because * binds tighter than +)
```

Always check for `null` before using the result -- `null` means the input did not match the grammar.
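Spelled out, that check looks like this (the malformed input is chosen so the grammar above cannot match it):

```scala sc:nocompile
import alpaca.*

val (_, lexemes2) = CalcLexer.tokenize("3 + * 2") // malformed: missing an operand
val (_, result2) = CalcParser.parse(lexemes2)
if result2 == null then println("Parse failed: input did not match the grammar")
else println(s"Result: $result2")
```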
## Key Points

- `Rule[Double]` because `NUMBER` yields `Double` (`num.toDouble` in the lexer).
- `n.value` extracts the `Double` from the matched lexeme -- `n` is a `Lexeme`, not a `Double` directly.
- `resolutions` must be the **last `val`** in the parser object -- the macro reads top-to-bottom and must have seen all rule declarations before processing `resolutions`.
- Use `before`/`after` (not `alwaysBefore`/`alwaysAfter` -- the compiler error message suggests those names, but they do not exist in the API).
- `production` is a `@compileTimeOnly` compile-time construct: valid only inside the `resolutions` value.
## See Also

- [Conflict Resolution](../conflict-resolution.html) -- `before`/`after` DSL reference, `Production(symbols*)` selector, token-side resolution
- [Parser](../parser.html) -- rule syntax, `root` requirement, `Rule[T]` types
- [Lexer](../lexer.html) -- token definition, `Token["NAME"](value)` constructor
# Multi-Pass Processing

Alpaca has no dedicated multi-pass API; multi-pass is a composition pattern -- tokenize the input with a first lexer, transform the resulting `List[Lexeme]` in plain Scala, then parse or re-lex as needed.

> **Compile-time processing:** Both the lexer and parser macros are compiled independently; the `List[Lexeme]` boundary between them is an ordinary runtime value you can inspect and transform with any Scala collection operations.

## The Pattern

`tokenize()` returns a named tuple `(ctx, lexemes: List[Lexeme])`; `lexemes` is an ordinary `List` you can `filter`, `map`, or chain to a second stage. `parse()` accepts any `List[Lexeme]` directly -- the type refinement is widened at the call site, so filtered or re-ordered lists are compatible without any casting. Each `Lexeme` has a `name: String` field (the token name) and a `value: Any` field (the extracted value) that you can inspect during transformation.

Important constraint: the `Lexeme` constructor is private to the `alpaca` package. You cannot create new `Lexeme` instances. Multi-pass works by transforming the list of existing lexemes -- filter, reorder, or re-lex string values using a second lexer call.
## Example: Comment Stripping

The most common multi-pass pattern: lex input that contains comments, strip the comment tokens from the list, then parse the clean token stream.

```scala sc:nocompile
import alpaca.*

// Stage 1: lex with comments
val Stage1 = lexer:
  case "#.*" => Token["COMMENT"]
  case num @ "[0-9]+" => Token["NUM"](num.toInt)
  case "\\+" => Token["PLUS"]
  case "\\s+" => Token.Ignored

object SumParser extends Parser:
  val Sum: Rule[Int] = rule(
    { case (Sum(a), Stage1.PLUS(_), Sum(b)) => a + b },
    { case Stage1.NUM(n) => n.value.asInstanceOf[Int] },
  )
  val root: Rule[Int] = rule:
    case Sum(s) => s

// Multi-pass: lex, filter, parse
val (_, stage1Lexemes) = Stage1.tokenize("1 + # ignore this\n2")
val filtered = stage1Lexemes.filter(_.name != "COMMENT")
val (_, result) = SumParser.parse(filtered)
// result: Int | Null -- 3
```
## Example: Re-Lexing Values

For more advanced cases, string values extracted from stage 1 tokens can be tokenized again by a second lexer and then `flatMap`-ed back into the stream:

```scala sc:nocompile
import alpaca.*

// Advanced: re-lex string values extracted from stage 1 tokens
val (_, tokens) = IdentLexer.tokenize(source)
val expanded = tokens.flatMap:
  case lex if lex.name == "MACRO" =>
    MacroLexer.tokenize(expandMacro(lex.value.asInstanceOf[String])).lexemes
  case lex => List(lex)
val (_, result) = MainParser.parse(expanded)
```

`lex.value` is `Any`; cast to the expected type. `IdentLexer`, `MacroLexer`, `MainParser`, and `expandMacro` are application-defined.
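Where `asInstanceOf` feels fragile, a pattern match narrows the value safely. A small sketch -- the helper name is ours, and it assumes only the documented `value: Any` field (`Lexeme` is written with wildcard type parameters, as in the key points below):

```scala sc:nocompile
import alpaca.*

// Hypothetical helper: extract an Int value, or None if the lexeme
// carries something else. Avoids a ClassCastException at a distance.
def intValue(lex: Lexeme[?, ?]): Option[Int] = lex.value match
  case i: Int => Some(i)
  case _      => None
```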
## Key Points

- `tokenize()` returns `(ctx, lexemes)`; use `.lexemes` or destructure to get the `List[Lexeme]`
- `Lexeme.name` is a `String` (the token name); `Lexeme.value` is `Any` (the extracted value)
- The `Lexeme` constructor is private -- you cannot construct new `Lexeme` instances; work with the existing list
- `parse()` accepts any `List[Lexeme[?, ?]]`; the type refinement is widened at the call site
- `Token.Ignored` tokens produce no lexemes and are never in the list
## See Also

- [Between Stages](../between-stages.html) -- `Lexeme` structure, `tokenize()` return type, `BetweenStages` hook
- [Lexer](../lexer.html) -- `tokenize()` API, `Token["NAME"](value)` constructor
- [Parser](../parser.html) -- `parse()` API, `Rule[T]` types
# Whitespace-Sensitive Lexing

Use a custom `LexerCtx` to track indentation depth and emit `INDENT` or `DEDENT` tokens when the indentation level changes between lines.

> **Compile-time processing:** The `lexer[MyCtx]` macro inspects `MyCtx` at compile time; it auto-composes `BetweenStages` hooks from parent traits; the `ctx` value is available in rule bodies as a compile-time alias that is replaced by field accesses in the generated code.

## The LexerCtx Contract

A valid custom context must satisfy three rules:

1. It must be a **case class** -- `LexerCtx` has a `this: Product =>` self-type; the auto-derivation machinery requires a `Product` instance, and regular classes do not satisfy it.
2. It must include **`var text: CharSequence = ""`** -- `LexerCtx` declares this field as abstract; omitting it produces a compile error.
3. **All fields must have default values** -- the `Empty[T]` derivation macro reads default parameter values from the companion object to construct the initial context.

Mutable state fields must be `var` so the lexer can assign to them directly.
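For contrast, here is a sketch of contexts that violate each rule -- definitions the macro would reject (the exact error wording is not specified here):

```scala sc:nocompile
import alpaca.*

// Rule 1 violated: a regular class provides no Product instance
class PlainCtx(var text: CharSequence = "") extends LexerCtx

// Rule 2 violated: the abstract `var text` member is missing
case class NoTextCtx(var flag: Boolean = false) extends LexerCtx

// Rule 3 violated: fields without defaults break Empty[T] derivation
case class NoDefaultsCtx(var text: CharSequence, var depth: Int) extends LexerCtx
```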
## Tracking Indentation

Define a context with `currentIndent` and `prevIndent` fields; when a newline followed by spaces is matched, count the spaces to determine the new indentation level and compare it against the previous level. Guards are not supported in lexer rules (`case "regex" if condition =>` is a compile error); check the condition inside the rule body instead. Emit `Token["INDENT"]` when indentation increases, `Token["DEDENT"]` when it decreases, and `Token.Ignored` when it stays the same.

```scala sc:nocompile
import alpaca.*

case class IndentCtx(
  var text: CharSequence = "",
  var currentIndent: Int = 0,
  var prevIndent: Int = 0,
) extends LexerCtx

val IndentLexer = lexer[IndentCtx]:
  case "\\n( *)" =>
    val newIndent = ctx.text.toString.count(_ == ' ')
    val prev = ctx.prevIndent
    ctx.prevIndent = newIndent
    ctx.currentIndent = newIndent
    // Guards are not supported -- check the condition in the body
    if newIndent > prev then Token["INDENT"](newIndent)
    else if newIndent < prev then Token["DEDENT"](newIndent)
    else Token.Ignored
  case word @ "[a-z_][a-z0-9_]*" => Token["WORD"](word)
  case "\\s+" => Token.Ignored
```

The `\\n( *)` pattern matches a newline followed by zero or more spaces. `ctx.text` contains the full match text at the time the rule body runs; counting the spaces in it gives the new indentation level. `Token["INDENT"](newIndent)` and `Token["DEDENT"](newIndent)` carry the new depth as their value, which the parser can read. Because guards are not supported, the `if`/`else` sits inside the rule body rather than after the pattern.
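A usage sketch, following the rules above (the exact lexeme sequence depends on the library's matching behavior, so it is indicated only roughly):

```scala sc:nocompile
import alpaca.*

val (_, lexemes) = IndentLexer.tokenize("if\n  body\nend")
// roughly: WORD("if"), INDENT(2), WORD("body"), DEDENT(0), WORD("end")
```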
## Reading INDENT and DEDENT in the Parser

The parser sees `INDENT` and `DEDENT` tokens in the lexeme list just like any other token. Use `IndentLexer.INDENT(n)` to extract the new depth from the lexeme value -- `n` is a `Lexeme` and `n.value` is the `Int` depth passed to `Token["INDENT"](newIndent)`.

```scala sc:nocompile
import alpaca.*

object IndentParser extends Parser:
  val Block: Rule[List[String]] = rule(
    { case (IndentLexer.INDENT(_), Block(inner), IndentLexer.DEDENT(_)) => inner },
    { case IndentLexer.WORD(w) => List(w.value.asInstanceOf[String]) },
  )
  val root: Rule[List[String]] = rule:
    case Block(b) => b
```
## Key Points

- A `case class` with `var text: CharSequence = ""` is mandatory; other mutable fields must be `var`
- Guards (`case "regex" if condition =>`) are a compile error; move conditions into the rule body
- `ctx` is a compile-time construct available only inside lexer rule bodies
- `Token["INDENT"](value)` and `Token["DEDENT"](value)` are distinct named token types; the value carries the new depth for use in the parser
- `Token.Ignored` produces no lexeme; the newline pattern emits either `INDENT`, `DEDENT`, or nothing, depending on the depth change
## See Also

- [Lexer Context](../lexer-context.html) -- full `LexerCtx` reference: case class contract, `BetweenStages`, `PositionTracking`, `LineTracking`
- [Lexer Error Recovery](../lexer-error-recovery.html) -- guards limitation and workaround
- [Lexer](../lexer.html) -- `Token["NAME"](value)` constructor, `Token.Ignored`
> **Review comment** (on the parenthesized-expression production): This production uses backticked token accessors like ``CalcLexer.`\(` `` / ``CalcLexer.`\)` ``, but the lexer in this same example defines `LPAREN`/`RPAREN` tokens (and if you do intend to name tokens `"\("` / `"\)"`, the backslashes need to be doubled, as shown in extractors.md). Consider changing this to `CalcLexer.LPAREN` / `CalcLexer.RPAREN` (recommended) or updating the token naming/escaping consistently.