
Fix out-of-bounds reads on truncated multibyte sequences (utf8nlen/utf8len + utf8codepoint)#136

Open
ZUENS2020 wants to merge 2 commits into sheredom:main from ZUENS2020:fix-utf8nlen-oob-read

@ZUENS2020 ZUENS2020 commented Jun 14, 2026

Two out-of-bounds reads of the same class — a multibyte lead byte whose continuation bytes are truncated by the NUL terminator — both fixed by stopping the scan at the terminator.

1. utf8nlen() / utf8len()

utf8nlen() took the codepoint width from the lead byte and advanced str by that width unconditionally, so a truncated trailing sequence carried str past the NUL terminator; the next '\0' != *str check then read out of bounds. PoC: utf8len("\x2c\xdf").
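For illustration, the bounded advance could look roughly like the sketch below. It is not the patch itself: `sketch_width` and `sketch_utf8nlen` are hypothetical names, and the byte-index loop is an assumption chosen to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expected sequence width taken from the lead byte. */
static size_t sketch_width(unsigned char lead) {
  if ((lead & 0xf8) == 0xf0) return 4;
  if ((lead & 0xf0) == 0xe0) return 3;
  if ((lead & 0xe0) == 0xc0) return 2;
  return 1;
}

/* Hypothetical stand-in for utf8nlen(): counts codepoints in at most n bytes
 * of a NUL-terminated string without ever stepping past the terminator. */
static size_t sketch_utf8nlen(const char *str, size_t n) {
  const unsigned char *s = (const unsigned char *)str;
  size_t used = 0;   /* bytes consumed so far */
  size_t length = 0; /* codepoints counted so far */

  assert(str != NULL);

  while (used < n && s[used] != '\0') {
    size_t width = sketch_width(s[used]);

    /* Advance one byte at a time, stopping early at the NUL or the limit. */
    while (width > 0 && used < n && s[used] != '\0') {
      used++;
      width--;
    }

    /* A truncated trailing sequence still counts as one codepoint. */
    length++;
  }

  return length;
}
```

With this sketch, `sketch_utf8nlen("\x2c\xdf", SIZE_MAX)` returns 2: the comma, plus the truncated 2-byte lead counted as a single codepoint.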

2. utf8codepoint()

utf8codepoint() read str[1]/str[2]/str[3] based only on the lead byte, with no bounds check, so a truncated sequence read past the terminator. This is the OOB read reached via utf8lwr/utf8upr/utf8cmp/… (which call utf8codepoint without pre-validating continuation bytes), and it is the utf8codepoint case noted in #117. PoC: utf8lwr("\xe7") → heap-buffer-overflow READ. Folding in the continuation bytes with a loop that stops at the NUL fixes the OOB read for all utf8codepoint callers at once; for well-formed input the decoded value and returned pointer are unchanged.
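A minimal sketch of the loop-based decode, under stated assumptions: `sketch_codepoint` is a hypothetical name, not the function in utf8.h, and returning the input pointer unchanged when called on the terminator is a choice made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for utf8codepoint(): decodes one codepoint and
 * returns a pointer to the next one, never reading past the NUL. */
static const char *sketch_codepoint(const char *str, int32_t *out) {
  const unsigned char *s = (const unsigned char *)str;
  int32_t cp;
  int extra; /* continuation bytes the lead byte promises */

  assert(str != NULL && out != NULL);

  if (s[0] == '\0') {
    /* Sketch-only choice: do not step past the terminator. */
    *out = 0;
    return str;
  }

  if ((s[0] & 0xf8) == 0xf0) {
    cp = s[0] & 0x07;
    extra = 3;
  } else if ((s[0] & 0xf0) == 0xe0) {
    cp = s[0] & 0x0f;
    extra = 2;
  } else if ((s[0] & 0xe0) == 0xc0) {
    cp = s[0] & 0x1f;
    extra = 1;
  } else {
    cp = s[0];
    extra = 0;
  }
  s++;

  /* Fold in continuation bytes, stopping at the NUL if the sequence is
   * truncated, instead of reading str[1..3] unconditionally. */
  while (extra > 0 && *s != '\0') {
    cp = (cp << 6) | (*s & 0x3f);
    s++;
    extra--;
  }

  *out = cp;
  return (const char *)s;
}
```

For the well-formed input "\xc3\xa9" this sketch yields 0xE9 and a pointer two bytes on; for the truncated input "\xe7" it stops at the terminator, one byte on.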

utf8makevalid() (also in #117) pre-validates continuation bytes with short-circuiting checks and treats a NUL as a non-continuation byte, so it stops at the terminator and is not affected.
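The reason a short-circuiting check is safe can be sketched as follows. This is an illustration of the idea only, with hypothetical names (`sketch_is_cont`, `sketch_seq_complete`), not the code in utf8makevalid(): a NUL byte (0x00) fails the continuation test, so the scan returns before any byte after the terminator is read.

```c
#include <assert.h>
#include <stddef.h>

/* A continuation byte has the bit pattern 10xxxxxx; NUL (0x00) does not. */
static int sketch_is_cont(unsigned char c) {
  return (c & 0xc0) == 0x80;
}

/* Hypothetical check: returns 1 if the lead byte at str is followed by
 * width - 1 continuation bytes, 0 otherwise. The early return means no
 * byte is read after the first non-continuation byte, including the NUL. */
static int sketch_seq_complete(const char *str, size_t width) {
  const unsigned char *s = (const unsigned char *)str;
  size_t i;

  assert(str != NULL);

  for (i = 1; i < width; i++) {
    if (!sketch_is_cont(s[i])) {
      return 0; /* stops here; s[i + 1] is never touched */
    }
  }
  return 1;
}
```

For example, `sketch_seq_complete("\xe7", 3)` returns 0 after reading only the terminator at index 1.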

Testing

Regression tests cover all three entry points (utf8len, utf8codepoint, utf8lwr); the full suite passes under -fsanitize=address (159 tests). Found via libFuzzer + ASan; reproduced on a clean clone.

…e sequence

utf8nlen() (and thus utf8len(), which calls it with SIZE_MAX) determined a
codepoint's byte-width from its lead byte and then advanced `str` by that width
unconditionally. When a multibyte lead byte is the final byte before the NUL
terminator (or the n-byte limit), this marched `str` past the terminator, so the
next `'\0' != *str` loop check read one or more bytes out of bounds.

Minimal trigger: utf8len() on the NUL-terminated 2-byte input {0x2c, 0xdf}
(',' followed by a 2-byte lead byte with no continuation) -> 1-byte
heap-buffer-overflow READ at the loop condition.

Fix: advance one byte at a time over the codepoint width, stopping early at the
NUL terminator or the n-byte limit, so the pointer never steps past the buffer.
Behaviour is unchanged for well-formed input; a truncated trailing sequence is
now counted as a single codepoint instead of triggering an over-read.

This is the same class of bug as sheredom#117 (multibyte lookahead past the end), here
in utf8nlen/utf8len specifically.

Added a regression test (utf8len_truncated_trailing_lead_byte) to the shared
suite; the full suite passes under -fsanitize=address.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
github-actions Bot pushed a commit to ZUENS2020/Sherpa that referenced this pull request Jun 14, 2026
…#136 (#467)

Add to the validation/findings log: a real CWE-125 OOB read in utf8.h
utf8nlen()/utf8len() (reachable via default utf8len, no flag), found by the
pipeline and correctly kept upstream_bug by contract-aware triage (S-465).
Fixed and submitted upstream as sheredom/utf8.h#136 (bounded advance +
regression test; full suite passes under ASan). Same root-cause class as the
open #117 but in a function it does not name.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…te sequence

utf8codepoint() read str[1]/str[2]/str[3] based solely on the lead byte, with no
bounds check, so a truncated trailing multibyte sequence (continuation bytes run
off the end of a NUL-terminated buffer) read out of bounds. This is the OOB
reached via utf8lwr/utf8upr/utf8cmp/... (which call utf8codepoint without
pre-validating continuation bytes); reported for utf8codepoint in sheredom#117.

Minimal trigger: utf8lwr() on the NUL-terminated 1-byte input {0xe7} (a 3-byte
lead) -> heap-buffer-overflow READ in utf8codepoint.

Fix: fold in continuation bytes with a loop that stops at the NUL terminator, so
the cursor never advances past it. For well-formed input the behaviour and the
returned pointer/codepoint are identical to the previous fixed reads. This fixes
the OOB for every utf8codepoint caller at once.

Note: utf8makevalid() (also mentioned in sheredom#117) pre-validates continuation bytes
with short-circuiting checks and a NUL is treated as a non-continuation byte, so
it stops at the terminator and is not affected.

Added regression tests (utf8codepoint + utf8lwr); full suite passes under ASan.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ZUENS2020 ZUENS2020 changed the title from "Fix out-of-bounds read in utf8nlen()/utf8len() on a truncated trailing multibyte sequence" to "Fix out-of-bounds reads on truncated multibyte sequences (utf8nlen/utf8len + utf8codepoint)" Jun 16, 2026