<!-- File: docs/_docs/cookbook/error-messages.md -->
# Error Messages

Alpaca surfaces errors at three distinct points -- at compile time, during lexing, and during parsing -- each with different behavior and handling strategies.

> **Compile-time processing:** The `lexer` block is a Scala 3 macro. `ShadowException`, invalid regex patterns, and unsupported guards are all detected at compile time and reported as compiler errors, not runtime exceptions. These errors cannot be caught with `try`/`catch` -- they prevent compilation entirely.

## Compile-Time Errors

Compile-time errors are emitted by the Alpaca macro when it processes your `lexer` or `parser` definition. They appear as ordinary compiler errors in your IDE or build output. Because they occur at compile time, there is no way to handle them at runtime -- you must fix the definition and recompile.

### ShadowException

A `ShadowException` occurs when an earlier pattern always matches everything a later pattern would match, making the later pattern unreachable. The macro performs pairwise regex inclusion checks and fails compilation if any pattern is shadowed.

```scala sc:nocompile
import alpaca.*

// This does NOT compile -- ShadowException
val BadLexer = lexer:
  case "[a-zA-Z_][a-zA-Z0-9_]*" => Token["IDENTIFIER"] // general pattern
  case "[a-zA-Z]+" => Token["ALPHABETIC"] // ERROR: shadowed by IDENTIFIER

// Fix: more-specific patterns before more-general ones
val GoodLexer = lexer:
  case "if" => Token["IF"] // keyword first
  case "[a-zA-Z_][a-zA-Z0-9_]*" => Token["IDENTIFIER"] // general pattern last
  case "\\s+" => Token.Ignored
```

The compile error reads: `Pattern [a-zA-Z]+ is shadowed by [a-zA-Z_][a-zA-Z0-9_]*`. The fix is always the same: move the more specific pattern before the more general one.

### Guards Are Not Supported

Scala pattern guards (`case "regex" if condition =>`) are not supported in lexer rule definitions. Using one produces a compile-time error:

```scala sc:nocompile
import alpaca.*

// WRONG -- compile error: "Guards are not supported yet"
case class MyCtx(var text: CharSequence = "", var flag: Boolean = false) extends LexerCtx
val GuardedLexer = lexer[MyCtx]:
  case "token" if ctx.flag => Token["A"]

// Fix: move the condition inside the rule body
val CorrectLexer = lexer[MyCtx]:
  case "token" =>
    if ctx.flag then Token["A"] else Token["B"]
```

## Runtime Lexer Errors

If `tokenize()` encounters a character that does not match any pattern, it throws a `RuntimeException` immediately. There is no skip-and-continue behavior -- lexing stops at the first unrecognized character and the exception propagates to the caller.

```scala sc:nocompile
import alpaca.*

val NumLexer = lexer:
  case num @ "[0-9]+" => Token["NUM"](num.toInt)
  case "\\s+" => Token.Ignored

try
  val (_, lexemes) = NumLexer.tokenize("42 abc")
catch
  case e: RuntimeException =>
    println(e.getMessage) // "Unexpected character: 'a'"
```

The exception message contains the unexpected character but not its position in the input. For position information, use a context that tracks position -- see [Lexer Context](../lexer-context.html).

There is no custom error handler API yet ([GitHub issue #21](https://github.com/bkozak-scancode/alpaca/issues/21) is open).

## Parser Failure

`parse()` yields a result of type `T | Null` (the second component of the tuple it returns). A `null` result means the input token sequence did not match the grammar. This is not an exception -- it is a normal return value. Always check for `null` before using the result.

```scala sc:nocompile
import alpaca.*

val (_, lexemes) = CalcLexer.tokenize("1 + + 2")
val (_, result) = CalcParser.parse(lexemes)
if result == null then
  println("Parse failed: input did not match the grammar")
else
  println(s"Result: $result")
```

There is no structured parser error reporting yet -- `null` is the only signal that parsing failed ([GitHub issue #51](https://github.com/bkozak-scancode/alpaca/issues/51), [#65](https://github.com/bkozak-scancode/alpaca/issues/65) are open).
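Until structured error reporting exists, you can fold the `T | Null` contract into an `Option` with a small extension method -- a sketch in plain Scala 3 (`toOption` here is an application-side helper, not part of the Alpaca API):

```scala sc:nocompile
// Hypothetical helper: convert the nullable parse result into an Option
extension [T](result: T | Null)
  def toOption: Option[T] =
    if result == null then None else Some(result.asInstanceOf[T])

val (_, parsed) = CalcParser.parse(lexemes)
parsed.toOption match
  case Some(value) => println(s"Result: $value")
  case None        => println("Parse failed")
```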

## See Also

- [Lexer Error Recovery](../lexer-error-recovery.html) -- full reference: `ShadowException`, runtime errors, pattern ordering
- [Lexer Context](../lexer-context.html) -- `PositionTracking` and `LineTracking` for position-aware error reporting
- [Parser](../parser.html) -- `parse()` return type, `T | Null` contract
<!-- File: docs/_docs/cookbook/expression-evaluator.md -->
# Expression Evaluator

Alpaca's `before`/`after` DSL resolves operator precedence conflicts in the LR parse table at compile time, letting you build a fully evaluated expression parser with correct precedence and associativity.

> **Compile-time processing:** When you declare `override val resolutions = Set(...)`, the Alpaca macro bakes your precedence rules directly into the LR(1) parse table during compilation. No precedence checks happen at runtime -- the parser executes deterministically from a pre-resolved table.

## The Problem

Arithmetic grammars are ambiguous without explicit precedence declarations. The expression `1 + 2 * 3` can parse as `(1 + 2) * 3 = 9` or `1 + (2 * 3) = 11`, and the LR algorithm cannot choose between them on its own. Alpaca reports these as shift/reduce conflicts at compile time and gives you the `before`/`after` DSL to resolve them by declaring which productions take priority.

## Define the Lexer

```scala sc:nocompile
import alpaca.*

val CalcLexer = lexer:
  case num @ "[0-9]+(\\.[0-9]+)?" => Token["NUMBER"](num.toDouble)
  case "\\+" => Token["PLUS"]
  case "-" => Token["MINUS"]
  case "\\*" => Token["TIMES"]
  case "/" => Token["DIVIDE"]
  case "\\(" => Token["LPAREN"]
  case "\\)" => Token["RPAREN"]
  case "\\s+" => Token.Ignored
```

The regex `[0-9]+(\.[0-9]+)?` matches both integers and decimals. `num.toDouble` converts the matched string to a `Double`, so `Token["NUMBER"]` carries a `Double` value -- this is what makes `Rule[Double]` the right type for the parser.

## Define the Parser

```scala sc:nocompile
import alpaca.*

object CalcParser extends Parser:
  val Expr: Rule[Double] = rule(
    "plus" { case (Expr(a), CalcLexer.PLUS(_), Expr(b)) => a + b },
    "minus" { case (Expr(a), CalcLexer.MINUS(_), Expr(b)) => a - b },
    "times" { case (Expr(a), CalcLexer.TIMES(_), Expr(b)) => a * b },
    "div" { case (Expr(a), CalcLexer.DIVIDE(_), Expr(b)) => a / b },
    { case (CalcLexer.LPAREN(_), Expr(e), CalcLexer.RPAREN(_)) => e },
    { case CalcLexer.NUMBER(n) => n.value },
  )
  val root: Rule[Double] = rule:
    case Expr(e) => e

  override val resolutions = Set(
    production.plus.before(CalcLexer.PLUS, CalcLexer.MINUS),
    production.plus.after(CalcLexer.TIMES, CalcLexer.DIVIDE),
    production.minus.before(CalcLexer.PLUS, CalcLexer.MINUS),
    production.minus.after(CalcLexer.TIMES, CalcLexer.DIVIDE),
    production.times.before(CalcLexer.TIMES, CalcLexer.DIVIDE, CalcLexer.PLUS, CalcLexer.MINUS),
    production.div.before(CalcLexer.TIMES, CalcLexer.DIVIDE, CalcLexer.PLUS, CalcLexer.MINUS),
  )
```

Reading `production.plus.before(CalcLexer.PLUS, CalcLexer.MINUS)`: when the parser has reduced the `plus` production and the next token is `+` or `-`, prefer the reduction. This gives `+` left associativity and equal precedence with `-`.

Reading `production.plus.after(CalcLexer.TIMES, CalcLexer.DIVIDE)`: when the conflict is between reducing `plus` and shifting `*` or `/`, prefer shifting. This makes `*` and `/` bind tighter.

## Run It

```scala sc:nocompile
import alpaca.*

val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2")
val (_, result) = CalcParser.parse(lexemes)
// result: Double | Null -- 11.0 (not 14.0, because * binds tighter than +)
```

Always check for `null` before using the result -- `null` means the input did not match the grammar.
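The same precedence rules also fix associativity. Because `production.minus.before(CalcLexer.PLUS, CalcLexer.MINUS)` prefers reducing when the next token is `-`, subtraction associates to the left -- a quick sketch (expected value reasoned from the declared resolutions, not verified output):

```scala sc:nocompile
val (_, lexemes2) = CalcLexer.tokenize("10 - 4 - 3")
val (_, result2) = CalcParser.parse(lexemes2)
// Left associativity: (10 - 4) - 3 = 3.0, not 10 - (4 - 3) = 9.0
```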

## Key Points

- `Rule[Double]` because `NUMBER` yields `Double` (`num.toDouble` in the lexer).
- `n.value` extracts the `Double` from the matched lexeme -- `n` is a `Lexeme`, not a `Double` directly.
- `resolutions` must be the **last `val`** in the parser object -- the macro reads top-to-bottom and must have seen all rule declarations before processing `resolutions`.
- Use `before`/`after` (not `alwaysBefore`/`alwaysAfter` -- the compiler error message suggests those names but they do not exist in the API).
- `production` is a `@compileTimeOnly` compile-time construct: valid only inside the `resolutions` value.

## See Also

- [Conflict Resolution](../conflict-resolution.html) -- `before`/`after` DSL reference, `Production(symbols*)` selector, token-side resolution
- [Parser](../parser.html) -- rule syntax, `root` requirement, `Rule[T]` types
- [Lexer](../lexer.html) -- token definition, `Token["NAME"](value)` constructor
<!-- File: docs/_docs/cookbook/multi-pass.md -->
# Multi-Pass Processing

Alpaca has no dedicated multi-pass API; multi-pass is a composition pattern -- tokenize the input with a first lexer, transform the resulting `List[Lexeme]` in plain Scala, then parse or re-lex as needed.

> **Compile-time processing:** Both the lexer and parser macros are compiled independently; the `List[Lexeme]` boundary between them is an ordinary runtime value you can inspect and transform with any Scala collection operations.

## The Pattern

`tokenize()` returns a named tuple `(ctx, lexemes: List[Lexeme])`; `lexemes` is an ordinary `List` you can `filter`, `map`, or chain to a second stage.
`parse()` accepts any `List[Lexeme]` directly -- the type refinement is widened at the call site, so filtered or re-ordered lists are compatible without any casting.
Each `Lexeme` has a `name: String` field (the token name) and a `value: Any` field (the extracted value) that you can inspect during transformation.

Important constraint: the `Lexeme` constructor is private to the `alpaca` package. You cannot create new `Lexeme` instances. Multi-pass works by transforming the list of existing lexemes -- filter, reorder, or re-lex string values using a second lexer call.

## Example: Comment Stripping

The most common multi-pass pattern: lex input that contains comments, strip the comment tokens from the list, then parse the clean token stream.

```scala sc:nocompile
import alpaca.*

// Stage 1: lex with comments
val Stage1 = lexer:
  case "#.*" => Token["COMMENT"]
  case num @ "[0-9]+" => Token["NUM"](num.toInt)
  case "\\+" => Token["PLUS"]
  case "\\s+" => Token.Ignored

object SumParser extends Parser:
  val Sum: Rule[Int] = rule(
    { case (Sum(a), Stage1.PLUS(_), Sum(b)) => a + b },
    { case Stage1.NUM(n) => n.value.asInstanceOf[Int] },
  )
  val root: Rule[Int] = rule:
    case Sum(s) => s

// Multi-pass: lex, filter, parse
val (_, stage1Lexemes) = Stage1.tokenize("1 + # ignore this\n2")
val filtered = stage1Lexemes.filter(_.name != "COMMENT")
val (_, result) = SumParser.parse(filtered)
// result: Int | Null -- 3
```

## Example: Re-Lexing Values

For more advanced cases, string values extracted from stage 1 tokens can be tokenized again by a second lexer and then `flatMap`-ed back into the stream:

```scala sc:nocompile
import alpaca.*

// Advanced: re-lex string values extracted from stage 1 tokens
val (_, tokens) = IdentLexer.tokenize(source)
val expanded = tokens.flatMap:
  case lex if lex.name == "MACRO" =>
    MacroLexer.tokenize(expandMacro(lex.value.asInstanceOf[String])).lexemes
  case lex => List(lex)
val (_, result) = MainParser.parse(expanded)
```

`lex.value` is `Any`; cast to the expected type. `expandMacro` is application-defined.

## Key Points

- `tokenize()` returns `(ctx, lexemes)`; use `.lexemes` or destructure to get the `List[Lexeme]`
- `Lexeme.name` is a `String` (the token name), `Lexeme.value` is `Any` (the extracted value)
- The `Lexeme` constructor is private -- you cannot construct new `Lexeme` instances; work with the existing list
- `parse()` accepts any `List[Lexeme[?, ?]]`; the type refinement is widened at the call site
- `Token.Ignored` tokens produce no lexemes and are never in the list

## See Also

- [Between Stages](../between-stages.html) -- Lexeme structure, `tokenize()` return type, `BetweenStages` hook
- [Lexer](../lexer.html) -- `tokenize()` API, `Token["NAME"](value)` constructor
- [Parser](../parser.html) -- `parse()` API, `Rule[T]` types
<!-- File: docs/_docs/cookbook/whitespace-sensitive.md -->
# Whitespace-Sensitive Lexing

Use a custom `LexerCtx` to track indentation depth and emit `INDENT` or `DEDENT` tokens when the indentation level changes between lines.

> **Compile-time processing:** The `lexer[MyCtx]` macro inspects `MyCtx` at compile time; it auto-composes `BetweenStages` hooks from parent traits; the `ctx` value is available in rule bodies as a compile-time alias that is replaced by field accesses in the generated code.

## The LexerCtx Contract

A valid custom context must satisfy three rules:

1. It must be a **case class** -- `LexerCtx` has a `this: Product =>` self-type; the auto-derivation machinery requires a `Product` instance and regular classes do not satisfy it.
2. It must include **`var text: CharSequence = ""`** -- `LexerCtx` declares this field as abstract; omitting it produces a compile error.
3. **All fields must have default values** -- the `Empty[T]` derivation macro reads default parameter values from the companion object to construct the initial context.

Mutable state fields must be `var` so the lexer can assign to them directly.
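Putting the three rules together, a minimal conforming context looks like this (the `lineCount` field is a hypothetical example of application state, not something Alpaca requires):

```scala sc:nocompile
import alpaca.*

// Rule 1: case class. Rule 2: `var text` with a default. Rule 3: every field defaulted.
case class MinimalCtx(
  var text: CharSequence = "",
  var lineCount: Int = 0,
) extends LexerCtx
```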

## Tracking Indentation

Define a context with `currentIndent` and `prevIndent` fields; when a newline followed by spaces is matched, count the spaces to determine the new indentation level and compare it against the previous level.
Guards are not supported in lexer rules (`case "regex" if condition =>` is a compile error); check the condition inside the rule body instead.
Emit `Token["INDENT"]` when indentation increases, `Token["DEDENT"]` when it decreases, and `Token.Ignored` when it stays the same.

```scala sc:nocompile
import alpaca.*

case class IndentCtx(
  var text: CharSequence = "",
  var currentIndent: Int = 0,
  var prevIndent: Int = 0,
) extends LexerCtx

val IndentLexer = lexer[IndentCtx]:
  case "\\n( *)" =>
    val newIndent = ctx.text.toString.count(_ == ' ')
    val prev = ctx.prevIndent
    ctx.prevIndent = newIndent
    ctx.currentIndent = newIndent
    // Guards are not supported -- check condition in body
    if newIndent > prev then Token["INDENT"](newIndent)
    else if newIndent < prev then Token["DEDENT"](newIndent)
    else Token.Ignored
  case word @ "[a-z_][a-z0-9_]*" => Token["WORD"](word)
  case "\\s+" => Token.Ignored
```

The `\\n( *)` pattern matches a newline followed by zero or more spaces.
`ctx.text` contains the full match text at the time the rule body runs; counting spaces in it gives the new indentation level.
`Token["INDENT"](newIndent)` and `Token["DEDENT"](newIndent)` carry the new depth as their value, which the parser can read.
Because guards are not supported, the `if`/`else` is inside the rule body rather than after the pattern.
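As a rough sketch of the token stream this produces (token names shown informally; the exact `Lexeme` representation depends on the Alpaca runtime):

```scala sc:nocompile
val (_, lexemes) = IndentLexer.tokenize("block\n  inner\nouter")
// Roughly: WORD("block"), INDENT(2), WORD("inner"), DEDENT(0), WORD("outer")
```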

## Reading INDENT and DEDENT in the Parser

The parser sees `INDENT` and `DEDENT` tokens in the lexeme list just like any other token.
Use `IndentLexer.INDENT(n)` to extract the new depth from the lexeme value -- `n` is a `Lexeme` and `n.value` is the `Int` depth passed to `Token["INDENT"](newIndent)`.

```scala sc:nocompile
import alpaca.*

object IndentParser extends Parser:
  val Block: Rule[List[String]] = rule(
    { case (IndentLexer.INDENT(_), Block(inner), IndentLexer.DEDENT(_)) => inner },
    { case IndentLexer.WORD(w) => List(w.value.asInstanceOf[String]) },
  )
  val root: Rule[List[String]] = rule:
    case Block(b) => b
```

## Key Points

- `case class` with `var text: CharSequence = ""` is mandatory; other mutable fields must be `var`
- Guards (`case "regex" if condition =>`) are a compile error; move conditions into the rule body
- `ctx` is a compile-time construct available only inside lexer rule bodies
- `Token["INDENT"](value)` and `Token["DEDENT"](value)` are distinct named token types; the value carries the new depth for use in the parser
- `Token.Ignored` produces no lexeme; the newline pattern emits either `INDENT`, `DEDENT`, or nothing depending on the depth change

## See Also

- [Lexer Context](../lexer-context.html) -- full `LexerCtx` reference: case class contract, `BetweenStages`, `PositionTracking`, `LineTracking`
- [Lexer Error Recovery](../lexer-error-recovery.html) -- guards limitation and workaround
- [Lexer](../lexer.html) -- `Token["NAME"](value)` constructor, `Token.Ignored`