
Conversation

@seizethedave (Contributor) commented Jun 23, 2024

This adds digit separators (1_000) to Jsonnet's numeric literals.

Companion to the same PR in C++ repo: google/jsonnet#1160

Reference issue with spec proposal: google/jsonnet#1155

@coveralls

Coverage Status

coverage: 68.206% (+0.06%) from 68.143%
when pulling f10caa0 on seizethedave:digitsep
into 2b4d753 on google:master.

@seizethedave seizethedave marked this pull request as ready for review July 6, 2024 20:26
@johnbartholomew johnbartholomew marked this pull request as draft January 26, 2026 19:54
@johnbartholomew johnbartholomew marked this pull request as ready for review January 26, 2026 19:54
@coveralls commented Jan 26, 2026

Coverage Status

coverage: 44.297% (+0.1%) from 44.168%
when pulling a52ac8d on seizethedave:digitsep
into 6a5c085 on google:master.

// Run the postprocessor if the token kind has one defined.
if pp, ok := tokenKindPostprocessors[kind]; ok {
data = pp(data)
}
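For context, here is a self-contained sketch (an editor's illustration, not the PR's actual code) of how the map-based postprocessor hook above operates; the `tokenKind` type, the `tokenNumber` constant, and the map contents are assumptions standing in for the lexer's internals:

```go
package main

import (
	"fmt"
	"strings"
)

// tokenKind stands in for the lexer's token-kind type; only the
// number kind is needed for this sketch.
type tokenKind int

const tokenNumber tokenKind = iota

// tokenKindPostprocessors maps a token kind to a function that
// rewrites the lexed token data after the fact -- here, stripping
// digit separators from a numeric literal.
var tokenKindPostprocessors = map[tokenKind]func(string) string{
	tokenNumber: func(data string) string {
		return strings.ReplaceAll(data, "_", "")
	},
}

func main() {
	kind, data := tokenNumber, "1_000"
	// Run the postprocessor if the token kind has one defined.
	if pp, ok := tokenKindPostprocessors[kind]; ok {
		data = pp(data)
	}
	fmt.Println(data) // 1000
}
```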
Collaborator
To be honest I think this is an unnecessary generalisation. There are various other tokens that already put edited/processed content into the token data field, but they just have the processing inline in the lexer code, and call emitFullToken directly.

Examples:

  • Chomping newlines from a text block:

        var str string = cb.String()
        if chompTrailingNl {
            str = str[:len(str)-1]
        }
        l.emitFullToken(tokenStringBlock, str,
            stringBlockIndent, stringBlockTermIndent)
        l.resetTokenStart()
        return nil

  • Removing the quotes from a string literal:

        if r == '"' {
            // Don't include the quotes in the token data
            l.emitFullToken(tokenStringDouble, l.input[l.tokenStart+1:l.pos.byteNo-1], "", "")
            l.resetTokenStart()
            break
        }

I think we can just do the same for lexNumber - currently it calls emitToken just before returning; it can process the token data and call emitFullToken instead.
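A hedged sketch of what that inline shape could look like (the `numberTokenData` and `endNumber` names are invented for illustration, and the `lexer` type here is a minimal stand-in for the real one, whose fields and `emitFullToken` signature are assumed from the snippets above):

```go
package main

import (
	"fmt"
	"strings"
)

// numberTokenData strips digit separators from a raw number lexeme,
// e.g. "1_000" -> "1000".
func numberTokenData(raw string) string {
	return strings.ReplaceAll(raw, "_", "")
}

// lexer is a minimal stand-in for the real lexer type.
type lexer struct {
	input      string
	tokenStart int
	pos        int
}

func (l *lexer) emitFullToken(kind, data, a, b string) {
	fmt.Println(kind, data)
}

// At the end of lexNumber, instead of calling emitToken, process the
// raw lexeme and call emitFullToken directly, mirroring the inline
// handling of the string-token cases.
func (l *lexer) endNumber() {
	raw := l.input[l.tokenStart:l.pos]
	l.emitFullToken("NUMBER", numberTokenData(raw), "", "")
}

func main() {
	l := &lexer{input: "1_2.3_4", tokenStart: 0, pos: 7}
	l.endNumber() // prints: NUMBER 12.34
}
```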

Contributor Author
Sounds good. I remember being surprised that this would be the first instance of needing to post-process a lexed token. I guess I didn't look hard enough.

@johnbartholomew
Collaborator

Ok, I can go ahead and rebase+add-adjustments+merge this, unless you have further changes in flight.

@seizethedave
Contributor Author

👍 Thanks John, I welcome the assist. I have nothing in flight.

seizethedave and others added 10 commits January 27, 2026 18:43
…onents

See also the corresponding C++ jsonnet commit:
google/jsonnet@82ebe7d

There are some cases which are a little strange but lexically valid.

- `1.2.3.4` lexically tokenises as `1.2` DOT `3.4`, because a dot
  in the fractional or exponent part of a number is simply treated the
  same as any other possible terminating character (any character that
  isn't part of the valid number lexical syntax)
- `1e2.34` lexically is `1e2` DOT `34` (same as the first case)
- `1e2e34` lexically is `1e2` (number) `e34` (identifier)

These behaviours are basically preserved/extrapolated in the case of
digit separators, so for example `1_2.3_4.5_6` is lexically parsed
as `12.34` DOT `56`. And `1e2_3e4` is lexically parsed as
`1e23` (number), `e4` (identifier). These both look very confusing,
but it probably doesn't matter because those token sequences are,
I think, not valid syntactically so they'll just be rejected by
the parser.

Note that in JSON (and jsonnet), leading zeros are not allowed in
numeric literals. This behaviour is explicitly kept with digit
separators, so `0_5` is explicitly rejected. The alternatives are:

- Treat underscore after an initial zero the same as any terminator
  character, so `0_5` lexes as tokens `0` followed by identifier `_5`.
- Allow underscore, thereby breaking the no-leading-zeros rule, so
  `0_5` tokenises as `05`.

Either option seems confusing, hence it seems better to explicitly
reject an underscore after an initial zero.
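To make the boundary behaviour described above concrete, here is a self-contained sketch of a number scanner following those rules (an editor's simplified illustration, not the actual lexer code: it assumes the caller only invokes it when the input starts with a digit, and it does not attempt to reject trailing separators):

```go
package main

import (
	"fmt"
	"strings"
)

// scanNumber consumes the longest prefix of s matching the number
// syntax with digit separators. It returns the token data with
// separators stripped and the unconsumed remainder. An underscore
// immediately after an initial zero is explicitly rejected.
func scanNumber(s string) (data, rest string, err error) {
	i := 0
	digits := func() {
		for i < len(s) && (s[i] == '_' || (s[i] >= '0' && s[i] <= '9')) {
			i++
		}
	}
	// Integer part: a lone '0', or digits with optional separators.
	if i < len(s) && s[i] == '0' {
		i++
		if i < len(s) && s[i] == '_' {
			return "", s, fmt.Errorf("separator not allowed after leading zero")
		}
	} else {
		digits()
	}
	// Optional fractional part: '.' must be followed by a digit,
	// otherwise the dot terminates the number.
	if i+1 < len(s) && s[i] == '.' && s[i+1] >= '0' && s[i+1] <= '9' {
		i += 2
		digits()
	}
	// Optional exponent: 'e'/'E', optional sign, then a digit.
	if i < len(s) && (s[i] == 'e' || s[i] == 'E') {
		j := i + 1
		if j < len(s) && (s[j] == '+' || s[j] == '-') {
			j++
		}
		if j < len(s) && s[j] >= '0' && s[j] <= '9' {
			i = j + 1
			digits()
		}
	}
	return strings.ReplaceAll(s[:i], "_", ""), s[i:], nil
}

func main() {
	data, rest, _ := scanNumber("1_2.3_4.5_6")
	fmt.Println(data, rest) // 12.34 .5_6
	data, rest, _ = scanNumber("1e2_3e4")
	fmt.Println(data, rest) // 1e23 e4
	_, _, err := scanNumber("0_5")
	fmt.Println(err != nil) // true
}
```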
@johnbartholomew johnbartholomew merged commit a52ac8d into google:master Jan 27, 2026
9 checks passed