Fix out-of-bounds reads on truncated multibyte sequences (utf8nlen/utf8len + utf8codepoint) #136
Open
ZUENS2020 wants to merge 2 commits into
…e sequence
utf8nlen() (and thus utf8len(), which calls it with SIZE_MAX) determined a
codepoint's byte-width from its lead byte and then advanced `str` by that width
unconditionally. When a multibyte lead byte is the final byte before the NUL
terminator (or the n-byte limit), this marched `str` past the terminator, so the
next `'\0' != *str` loop check read one or more bytes out of bounds.
Minimal trigger: utf8len() on the NUL-terminated 2-byte input {0x2c, 0xdf}
(',' followed by a 2-byte lead byte with no continuation) -> 1-byte
heap-buffer-overflow READ at the loop condition.
Fix: advance one byte at a time over the codepoint width, stopping early at the
NUL terminator or the n-byte limit, so the pointer never steps past the buffer.
Behaviour is unchanged for well-formed input; a truncated trailing sequence is
now counted as a single codepoint instead of triggering an over-read.
This is the same class of bug as sheredom#117 (multibyte lookahead past the end), here
in utf8nlen/utf8len specifically.
Added a regression test (utf8len_truncated_trailing_lead_byte) to the shared
suite; the full suite passes under -fsanitize=address.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
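The bounded advance described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the patch itself: `sketch_lead_width` and `sketch_utf8nlen` are hypothetical names, and the real utf8nlen() in utf8.h may differ in how it classifies lead bytes and handles the n-byte limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte width implied by a lead byte (1 to 4). */
static size_t sketch_lead_width(unsigned char lead) {
  if ((lead & 0xf8) == 0xf0) return 4; /* 11110xxx */
  if ((lead & 0xf0) == 0xe0) return 3; /* 1110xxxx */
  if ((lead & 0xe0) == 0xc0) return 2; /* 110xxxxx */
  return 1;                            /* ASCII, or a stray byte */
}

/* Hypothetical stand-in for utf8nlen(): counts codepoints in at most n
 * bytes of a NUL-terminated string without stepping past the terminator. */
static size_t sketch_utf8nlen(const char *str, size_t n) {
  const char *start = str;
  size_t length = 0;

  assert(str != NULL);

  while ((size_t)(str - start) < n && '\0' != *str) {
    size_t width = sketch_lead_width((unsigned char)*str);

    /* Advance one byte at a time instead of jumping by `width`. The n-byte
     * limit is tested before *str is read, so neither bound is crossed. */
    do {
      str++;
      width--;
    } while (width > 0 && (size_t)(str - start) < n && '\0' != *str);

    /* A truncated trailing sequence still counts as one codepoint. */
    length++;
  }

  return length;
}
```

With this sketch, the trigger input `"\x2c\xdf"` yields a length of 2 (the comma plus the truncated lead byte), and the loop condition only ever reads the terminator itself.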
github-actions Bot pushed a commit to ZUENS2020/Sherpa that referenced this pull request Jun 14, 2026
…#136 (#467) Add to the validation/findings log: a real CWE-125 OOB read in utf8.h utf8nlen()/utf8len() (reachable via default utf8len, no flag), found by the pipeline and correctly kept upstream_bug by contract-aware triage (S-465). Fixed and submitted upstream as sheredom/utf8.h#136 (bounded advance + regression test; full suite passes under ASan). Same root-cause class as the open #117 but in a function it does not name. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…te sequence
utf8codepoint() read str[1]/str[2]/str[3] based solely on the lead byte, with no
bounds check, so a truncated trailing multibyte sequence (continuation bytes run
off the end of a NUL-terminated buffer) read out of bounds. This is the OOB
reached via utf8lwr/utf8upr/utf8cmp/... (which call utf8codepoint without
pre-validating continuation bytes); reported for utf8codepoint in sheredom#117.
Minimal trigger: utf8lwr() on the NUL-terminated 1-byte input {0xe7} (a 3-byte
lead) -> heap-buffer-overflow READ in utf8codepoint.
Fix: fold in continuation bytes with a loop that stops at the NUL terminator, so
the cursor never advances past it. For well-formed input the behaviour and the
returned pointer/codepoint are identical to the previous fixed reads. This fixes
the OOB for every utf8codepoint caller at once.
Note: utf8makevalid() (also mentioned in sheredom#117) pre-validates continuation
bytes with short-circuiting checks and a NUL is treated as a non-continuation
byte, so it stops at the terminator and is not affected.
Added regression tests (utf8codepoint + utf8lwr); full suite passes under ASan.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
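The loop-based decode described in this commit message can be illustrated with a short sketch. It assumes a simplified decoder: `sketch_utf8codepoint` is a hypothetical name, its signature is chosen for illustration, and it does not validate that the folded bytes are real continuation bytes, so it should not be read as the code in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for utf8codepoint(): decodes one codepoint from a
 * NUL-terminated string and returns a pointer to the byte after it. */
static const char *sketch_utf8codepoint(const char *str, int32_t *out) {
  unsigned char lead;
  int32_t cp;
  int remaining; /* continuation bytes the lead byte promises */

  assert(str != NULL && out != NULL);
  lead = (unsigned char)*str;

  if ((lead & 0xf8) == 0xf0) {        /* 4-byte sequence */
    cp = lead & 0x07;
    remaining = 3;
  } else if ((lead & 0xf0) == 0xe0) { /* 3-byte sequence */
    cp = lead & 0x0f;
    remaining = 2;
  } else if ((lead & 0xe0) == 0xc0) { /* 2-byte sequence */
    cp = lead & 0x1f;
    remaining = 1;
  } else {                            /* 1-byte sequence */
    cp = lead;
    remaining = 0;
  }

  /* The lead byte is always consumed, as in a fixed-width decoder. */
  str++;

  /* Fold in continuation bytes, stopping at the terminator rather than
   * reading str[1], str[2], str[3] unconditionally. */
  while (remaining > 0 && '\0' != *str) {
    cp = (cp << 6) | ((unsigned char)*str & 0x3f);
    str++;
    remaining--;
  }

  *out = cp;
  return str;
}
```

For the trigger input `"\xe7"` the sketch consumes only the lead byte and returns a pointer to the terminator. For a complete sequence such as `"\xe2\x82\xac"` it decodes U+20AC and returns the pointer three bytes on, which is what a fixed-read decoder would produce.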
Two out-of-bounds reads of the same class (a multibyte lead byte whose continuation bytes are truncated by the NUL terminator), both fixed by stopping the scan at the terminator.

1. `utf8nlen()` / `utf8len()`

   Took the codepoint width from the lead byte and advanced `str` unconditionally, marching past the NUL on a truncated trailing sequence; the next `'\0' != *str` then read out of bounds. PoC: `utf8len("\x2c\xdf")`.

2. `utf8codepoint()`

   Read `str[1]`/`str[2]`/`str[3]` based only on the lead byte with no bounds check, so a truncated sequence read past the terminator. This is the OOB reached via `utf8lwr`/`utf8upr`/`utf8cmp`/… (which call `utf8codepoint` without pre-validating continuation bytes), the `utf8codepoint` case noted in #117. PoC: `utf8lwr("\xe7")` → heap-buffer-overflow READ. Folding the continuation bytes in a loop that stops at the NUL fixes the OOB for all `utf8codepoint` callers at once; for well-formed input the decoded value and returned pointer are unchanged.

   `utf8makevalid()` (also in #117) pre-validates continuation bytes with short-circuiting checks and treats a NUL as a non-continuation byte, so it stops at the terminator and is not affected.

Testing

Regression tests for all three entry points; the full suite passes under `-fsanitize=address` (159 tests). Found via libFuzzer + ASan; reproduced on a clean clone.