
Tech debt cleanup#287

Open
halotukozak wants to merge 11 commits into theory-application from
tech-debt-cleanup

Conversation


@halotukozak halotukozak commented Mar 3, 2026

Summary

Add theory application pages and fix documentation tech debt accumulated across earlier phases.

Theory application pages

  • theory/conflicts.md — Parse table conflicts: formal conflict definition, shift/reduce and reduce/reduce examples with Alpaca error messages
  • theory/semantic-actions.md — Semantic actions: how Alpaca's production macro attaches user code to grammar rules, AST construction patterns
  • theory/full-example.md — Capstone narrative: complete walkthrough from grammar definition through lexing, parsing, and AST construction
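The pipeline these pages walk through (grammar → lexing → parsing → typed result) can be sketched in a few dozen lines. The following is a hypothetical Python illustration of the same calculator pipeline, not Alpaca code; token names mirror the docs, and NUMBER yields a float to match the Double-based examples:

```python
import re

# Illustrative stand-in for the CalcLexer/CalcParser pipeline (not Alpaca).
# All token patterns combined into one named-group alternation, as the
# theory pages describe.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>[0-9]+(\.[0-9]+)?)|(?P<PLUS>\+)|(?P<TIMES>\*)"
    r"|(?P<LPAREN>\()|(?P<RPAREN>\))|(?P<WS>\s+)"
)

def tokenize(src):
    pos, out = 0, []
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)   # anchored match at current position
        if not m:
            raise SyntaxError(f"bad char at {pos}")
        if m.lastgroup != "WS":        # skip whitespace
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

def parse(tokens):
    # Precedence climbing: TIMES binds tighter than PLUS.
    PREC = {"PLUS": 1, "TIMES": 2}
    def atom(i):
        kind, text = tokens[i]
        if kind == "NUMBER":
            return float(text), i + 1  # NUMBER yields a float ("Double")
        if kind == "LPAREN":
            val, i = expr(i + 1, 1)
            return val, i + 1          # consume RPAREN
        raise SyntaxError(f"unexpected {kind}")
    def expr(i, min_prec):
        val, i = atom(i)
        while i < len(tokens) and tokens[i][0] in PREC and PREC[tokens[i][0]] >= min_prec:
            op = tokens[i][0]
            rhs, i = expr(i + 1, PREC[op] + 1)
            val = val + rhs if op == "PLUS" else val * rhs
        return val, i
    val, _ = expr(0, 1)
    return val

print(parse(tokenize("1 + 2 * 3")))    # 7.0
print(parse(tokenize("(1 + 2) * 3")))  # 9.0
```

This is a sketch only: Alpaca does the regex compilation and table construction at compile time via macros, whereas everything here happens at runtime.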

Documentation infrastructure

  • Added "Compiler Theory" nested section to sidebar navigation
  • Fixed 6 broken cross-link paths in theory pages

Tech debt fixes

  • Fixed alwaysBefore/alwaysAfter -> before/after correction note in reduce/reduce section
  • Added navigation bullets ("Next: ...") to theory pages for reading flow
  • Corrected Int -> Double type comments in parser.md
  • Added debug-settings cross-links to lexer.md and parser.md
  • Fixed conflicts.md link text and backtick notation in extractors.md

🤖 Generated with Claude Code

- Formal definition block for Parse Table Conflict (state/symbol pair collision)
- Shift/reduce conflict section with CalcParser 1+2+3 example and real Alpaca error message
- alwaysBefore/alwaysAfter discrepancy note immediately after error block
- Reduce/reduce conflict section with Integer/Float example and error message
- LR(1) lookahead disambiguation section
- Resolution by priority section with minimal sc:nocompile example (production.plus only)
- Compile-time detection section with standard blockquote callout
- Cross-links to cfg.md, shift-reduce.md, ../conflict-resolution.md, semantic-actions.md, full-example.md
- Six-step narrative from bare grammar to working calculator (7.0)
- CalcLexer definition, bare CalcParser with ShiftReduceConflict error
- Resolved CalcParser with all 6 resolutions using production.div (not production.divide)
- Pipeline evaluation: 1+2*3=7.0, (1+2)*3=9.0 with null-check note
- Semantic action trace for 1+2*3 showing 2*3 reduces before 1+...
- Formal definition block, compile-time callout blockquote
- Theory-to-code mapping table with cross-links to all theory pages
- Formal definition block for Semantic Action (S-attributed scheme)
- Syntax-directed translation section with synthesized attribute explanation
- Extractor pattern section with complete 7-production CalcParser action table
- No-parse-tree section grounded in Parser.scala loop() implementation
- Typed results section explaining Rule[Double] compile-time type checking
- Compile-time processing callout
- Cross-links to shift-reduce.md, conflicts.md, ../extractors.md, ../parser.md, full-example.md
- No Rule[Int], no n.value.toDouble, no inherited attribute, no L-attributed
- Add 'Compiler Theory' subsection with 9 theory pages in pipeline order
- Pages use theory/pagename.md format resolving to docs/_docs/theory/
- Order: pipeline, tokens, lexer-fa, cfg, why-lr, shift-reduce, conflicts, semantic-actions, full-example
- pipeline.md: tokens.md and lexer-fa.md sibling links no longer use theory/ prefix
- pipeline.md: lexer.md and parser.md reference doc links now use ../ prefix
- tokens.md: cfg.md sibling link no longer uses theory/ prefix
- lexer-fa.md: cfg.md sibling link no longer uses theory/ prefix
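The semantic-action trace mentioned above (2*3 reducing before 1+...) can be reproduced with a tiny operator-precedence shift-reduce loop. This Python sketch is illustrative only and uses none of Alpaca's API; it records each reduction so the order is visible:

```python
# Minimal operator-precedence shift-reduce loop (illustrative, not Alpaca).
# For 1+2*3, the * reduction fires before the + reduction.
PREC = {"+": 1, "*": 2}

def shift_reduce(tokens):
    values, ops, trace = [], [], []
    def reduce_top():
        op = ops.pop()
        b, a = values.pop(), values.pop()
        result = a + b if op == "+" else a * b
        trace.append(f"{a}{op}{b}={result}")   # record the reduction
        values.append(result)
    for tok in tokens:
        if isinstance(tok, float):
            values.append(tok)                 # shift a number
        else:
            while ops and PREC[ops[-1]] >= PREC[tok]:
                reduce_top()                   # reduce higher-precedence ops first
            ops.append(tok)                    # shift the operator
    while ops:
        reduce_top()
    return values[0], trace

result, trace = shift_reduce([1.0, "+", 2.0, "*", 3.0])
print(result)  # 7.0
print(trace)   # ['2.0*3.0=6.0', '1.0+6.0=7.0']
```

The trace shows the same evaluation order the full-example page narrates: the inner product reduces first, then the outer sum.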
…duce section

- Inserted identical correction blockquote after the reduce/reduce compiler output block
- Readers who encounter only the RR error message now learn that alwaysBefore/alwaysAfter do not exist in Alpaca API
- Correct methods are before/after per conflict-resolution.md
…y pages

- semantic-actions.md: replace backtick code span with functional [Parser](../parser.md) hyperlink
- shift-reduce.md: add Next: [Conflicts and Disambiguation](conflicts.md) bullet to Cross-links
- tokens.md: add Next: [The Lexer: Regex to Finite Automata](lexer-fa.md) bullet to Cross-links
- Line 102: n.value: Int -> n.value: Double (CalcLexer.NUMBER yields Double)
- Line 117: where an Int -> where a Double (matching type)
- Line 245: Rule[Int] -> Rule[Double] in conflict-resolution example
- Append See [Debug Settings](debug-settings.html) paragraph at end of lexer.md
- Append See [Debug Settings](debug-settings.html) paragraph at end of parser.md
- Fix [cfg.md](cfg.md) to [Context-Free Grammars](cfg.md) on line 24 (TD-05)
- Add Next: prefix to Semantic Actions bullet in conflicts.md Cross-links (TD-04)
- Add Next: prefix to Full Example bullet in semantic-actions.md Cross-links (TD-04)
- Line 22: n.value: Int -> n.value: Double (CalcLexer.NUMBER yields Double)
- Line 33: where an Int -> where a Double (matching type)
- Line 62: Rule[Int] -> Rule[Double] (CalcLexer.NUMBER binding)
- Line 67: v: Int -> v: Double (matching type annotation in comment)
- Lines 28-29: single-backslash backtick names (\+, \(, \)) -> double-backslash
  (\\+, \\(, \\)) to match parser.md and lexer.md Naming Table style
@halotukozak halotukozak self-assigned this Mar 3, 2026
Copilot AI review requested due to automatic review settings March 3, 2026 19:54
@halotukozak halotukozak changed the base branch from master to theory-application March 3, 2026 20:04

Copilot AI left a comment


Pull request overview

This PR expands the documentation set by introducing a new “Compiler Theory” tutorial section and updating existing reference docs to align examples with the calculator’s Double-based NUMBER token.

Changes:

  • Adds a new “Compiler Theory” section to the docs sidebar with nine new tutorial pages under docs/_docs/theory/.
  • Updates parser/extractor docs examples to reflect NUMBER producing Double (not Int) and adds a cross-link to Debug Settings.
  • Adds a Debug Settings cross-link to the lexer documentation.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| docs/sidebar.yml | Adds “Compiler Theory” section and links to new theory pages. |
| docs/_docs/theory/pipeline.md | Introduces the compile pipeline overview and cross-links into the theory section. |
| docs/_docs/theory/tokens.md | Defines tokens/lexemes and shows CalcLexer examples. |
| docs/_docs/theory/lexer-fa.md | Explains regex → automata concepts and how Alpaca tokenization is implemented. |
| docs/_docs/theory/cfg.md | Introduces context-free grammars with the calculator grammar example. |
| docs/_docs/theory/why-lr.md | Explains LR vs LL and motivates LR(1) for Alpaca. |
| docs/_docs/theory/shift-reduce.md | Walks through a shift/reduce trace and maps it to runtime behavior. |
| docs/_docs/theory/conflicts.md | Explains shift/reduce + reduce/reduce conflicts and resolution. |
| docs/_docs/theory/semantic-actions.md | Explains semantic actions and typed results in Alpaca. |
| docs/_docs/theory/full-example.md | Assembles the full calculator lexer+parser+resolutions example. |
| docs/_docs/parser.md | Updates examples/comments to Double and adds a Debug Settings link. |
| docs/_docs/lexer.md | Adds a Debug Settings link. |
| docs/_docs/extractors.md | Updates examples/comments to Double and aligns special-token accessor examples. |


Comment on lines +1 to +3
A parser recognizes whether an input string belongs to a grammar — it accepts or rejects. But most programs need to *compute* something from the input, not just verify it. Semantic actions bridge structure and computation: they attach a computation to each production rule, so that the parser produces a typed value as a direct outcome of parsing, rather than a parse tree.

## Syntax-Directed Translation

Copilot AI Mar 3, 2026


This new page doesn’t start with an H1 heading (# ...). Existing docs pages in this repo start with an H1 (e.g., docs/_docs/parser.md:1). Add a top-level heading like # Semantic Actions so the page has a visible title.

Comment on lines +1 to +3
A grammar is ambiguous if a string can be parsed in more than one way. In LR parsing, ambiguity manifests as a conflict: the parse table has two valid entries for the same (state, symbol) pair, and the parser cannot proceed deterministically.

## What is a Parse Table Conflict?

Copilot AI Mar 3, 2026


This new page doesn’t start with an H1 heading (# ...). Existing docs pages in this repo consistently start with an H1 (e.g., docs/_docs/parser.md:1). Add a top-level heading like # Conflicts and Disambiguation to ensure the page has a proper title.

Comment on lines 1 to 3

Copilot AI Mar 3, 2026


This new page doesn’t start with an H1 heading (# ...). Most docs pages in this repo do (e.g., docs/_docs/parser.md:1). Add a top-level heading like # Context-Free Grammars so the page renders with a clear title.


This is why `CalcParser.parse(lexemes)` returns a named tuple `(ctx: Ctx, result: Double | Null)` — not a tree. The semantic actions produce the final value during the parse itself.

Decision confirmed: "Parse tree never exposed in Alpaca — semantic actions evaluated immediately during LR(1) reduce; parse() returns typed value directly." (STATE.md)

Copilot AI Mar 3, 2026


This sentence references STATE.md, but there is no STATE.md file in the repository. Consider removing the reference or replacing it with a link to the relevant source/doc page (e.g., src/alpaca/internal/parser/Parser.scala or docs/_docs/parser.md) so readers can actually follow it.

Suggested change
Decision confirmed: "Parse tree never exposed in Alpaca — semantic actions evaluated immediately during LR(1) reduce; parse() returns typed value directly." (STATE.md)
Decision confirmed: "Parse tree never exposed in Alpaca — semantic actions evaluated immediately during LR(1) reduce; parse() returns typed value directly."

Comment on lines +82 to +88
- Next: [Tokens & Lexemes](tokens.md) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](lexer-fa.md) — how regular expressions define token classes and how Alpaca compiles them

For the full API, see the reference pages:

- See [Lexer](lexer.md) for how `CalcLexer` is defined.
- See [Parser](parser.md) for how `CalcParser` is defined and how grammar rules produce a typed result.
- See [Lexer](../lexer.md) for how `CalcLexer` is defined.
- See [Parser](../parser.md) for how `CalcParser` is defined and how grammar rules produce a typed result.

Copilot AI Mar 3, 2026


These cross-links (and other new theory pages) use .md extensions (e.g., tokens.md, ../lexer.md). The rest of the docs consistently link to the rendered .html pages (e.g., docs/_docs/parser.md links to lexer.html). If the site is published as HTML, these .md links will 404. Consider switching the links here (and across the theory section) to .html for consistency with the existing documentation.

Suggested change
- Next: [Tokens & Lexemes](tokens.md) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](lexer-fa.md) — how regular expressions define token classes and how Alpaca compiles them
For the full API, see the reference pages:
- See [Lexer](lexer.md) for how `CalcLexer` is defined.
- See [Parser](parser.md) for how `CalcParser` is defined and how grammar rules produce a typed result.
- See [Lexer](../lexer.md) for how `CalcLexer` is defined.
- See [Parser](../parser.md) for how `CalcParser` is defined and how grammar rules produce a typed result.
- Next: [Tokens & Lexemes](tokens.html) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](lexer-fa.html) — how regular expressions define token classes and how Alpaca compiles them
For the full API, see the reference pages:
- See [Lexer](../lexer.html) for how `CalcLexer` is defined.
- See [Parser](../parser.html) for how `CalcParser` is defined and how grammar rules produce a typed result.

Comment on lines 66 to 83
Alpaca follows the same principle but implements it using Java's regex engine, which is itself
backed by NFA/DFA machinery:

- All token patterns are combined into a single Java regex alternation at compile time:

```
// Conceptual: how Alpaca combines patterns internally
(?<NUMBER>[0-9]+(\.[0-9]+)?)|(?<PLUS>\+)|(?<MINUS>-)|(?<TIMES>\*)|...
```

- `java.util.regex.Pattern.compile(...)` is called inside the `lexerImpl` macro at compile
time. An invalid regex pattern therefore causes a compile error, not a runtime crash.
- At runtime, `Tokenization.tokenize()` uses `matcher.lookingAt()` on the combined pattern at
the current input position. It then checks which named group matched using
`matcher.start(i)` to determine the token class.

This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
through the input, no backtracking.

Copilot AI Mar 3, 2026


This section claims Alpaca’s lexer has the same O(n) guarantee as a hand-built DFA and that there is “no backtracking”. However the implementation uses java.util.regex.Pattern/Matcher.lookingAt() on a combined alternation (see src/alpaca/internal/lexer/Lexer.scala and Tokenization.scala:57-70), and Java’s regex engine is generally backtracking-based and not guaranteed linear-time for all patterns. Suggest softening/rephrasing this (e.g., describe the intended single-pass behavior, and note that worst-case performance depends on the regex engine/patterns), rather than stating a hard O(n) guarantee.
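The reviewer's point is that both match semantics and worst-case performance depend on the regex engine. One concrete, engine-dependent behavior worth noting for a combined-alternation lexer is alternative ordering: with leftmost-alternative semantics (used by java.util.regex and, in this hypothetical Python sketch, by the `re` module), a keyword pattern must precede the identifier pattern or it never wins:

```python
import re

# Illustrative only: with leftmost-alternative matching, pattern order in a
# combined lexer alternation decides which token class a lexeme gets.
good = re.compile(r"(?P<IF>if\b)|(?P<IDENT>[a-z]+)")
bad = re.compile(r"(?P<IDENT>[a-z]+)|(?P<IF>if\b)")

print(good.match("if").lastgroup)  # IF    — keyword alternative tried first
print(bad.match("if").lastgroup)   # IDENT — identifier swallows the keyword
```

This does not settle the O(n) question — that depends on the patterns and the engine's backtracking behavior — but it shows why "same as a hand-built DFA" is too strong a claim for alternation-based lexing.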

Comment on lines 46 to 66
In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries four
pieces of information:

- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
for PLUS
- `position` — the character offset at the end of the match
- `line` — the line number at the end of the match

The tokenization output for a simple expression illustrates this:

```scala sc:nocompile
val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2")
// lexemes: List[Lexeme] =
// NUMBER(3.0), PLUS, NUMBER(4.0), TIMES, NUMBER(2.0)
//
// Each Lexeme carries:
// .name — token class name (e.g., "NUMBER")
// .value — extracted value (e.g., 3.0: Double)
// .position — character offset at end of match
// .line — line number at end of match

Copilot AI Mar 3, 2026


position is described here as a “character offset at the end of the match”, but the default lexer context tracks position as the 1-based column within the current line and resets it on \n (see src/alpaca/internal/lexer/PositionTracking.scala:8-32). Consider updating the wording (and the later .position bullet in the code comment) to avoid implying it’s a global offset across the entire input.
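The 1-based, line-relative column tracking the comment describes (reset on `\n`) can be sketched as follows. This is a plain Python illustration of the behavior attributed to PositionTracking.scala, not the actual implementation:

```python
# Sketch of 1-based (line, column) tracking that resets the column on '\n' —
# the behavior the review comment attributes to the default lexer context.
def positions(text):
    line, col, out = 1, 1, []
    for ch in text:
        out.append((ch, line, col))
        if ch == "\n":
            line, col = line + 1, 1  # newline: advance line, reset column to 1
        else:
            col += 1
    return out

print(positions("ab\ncd"))
# [('a', 1, 1), ('b', 1, 2), ('\n', 1, 3), ('c', 2, 1), ('d', 2, 2)]
```

Under this scheme, `position` is a column within the current line, not a global character offset — which is exactly the wording fix the comment requests.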


Copilot AI Mar 3, 2026


In the LR family table, ANTLR is listed as an example of an LALR(1) parser generator. ANTLR (v4) is generally an LL(*) / adaptive predictive parser rather than LALR(1), so this example is likely misleading. Consider replacing ANTLR with an actual LALR(1) tool (e.g., yacc/Bison only, or another LALR(1) generator) or rewording the examples.

Suggested change
| LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison) |

Comment on lines +23 to +42
- title: Compiler Theory
subsection:
- title: The Compilation Pipeline
page: theory/pipeline.md
- title: Tokens and Lexemes
page: theory/tokens.md
- title: The Lexer — Regex to Finite Automata
page: theory/lexer-fa.md
- title: Context-Free Grammars
page: theory/cfg.md
- title: Why LR Parsing?
page: theory/why-lr.md
- title: Shift-Reduce Parsing
page: theory/shift-reduce.md
- title: Conflicts and Disambiguation
page: theory/conflicts.md
- title: Semantic Actions
page: theory/semantic-actions.md
- title: Full Calculator Example
page: theory/full-example.md

Copilot AI Mar 3, 2026


The PR title is "Tech debt cleanup", but this change adds a substantial new "Compiler Theory" documentation section (new pages + sidebar navigation). Consider updating the PR title/description to reflect the actual scope so reviewers/readers can find it later.

Comment on lines 1 to 3
The shift-reduce loop is the heart of LR parsing. Every LR parser — regardless of whether it uses LR(0), LALR(1), or full LR(1) lookahead — executes the same fundamental loop: shift the next token onto a stack, or reduce the top of the stack to a non-terminal. This page traces that loop step by step for a concrete input.

## The Parse Stack

Copilot AI Mar 3, 2026


This new page doesn’t start with an H1 heading (# ...). Existing docs pages consistently begin with an H1 (e.g., docs/_docs/parser.md:1). Add a top-level heading like # Shift-Reduce Parsing to avoid rendering a page with no title.

@github-actions github-actions bot added documentation Improvements or additions to documentation thesis labels Mar 3, 2026

Labels

documentation (Improvements or additions to documentation), refactoring, thesis
