Fix out-of-bounds reads on truncated multibyte sequences (utf8nlen/utf8len + utf8codepoint) by ZUENS2020 · Pull Request #136 · sheredom/utf8.h

ZUENS2020 · 2026-06-14T15:33:18Z

Two out-of-bounds reads of the same class — a multibyte lead byte whose continuation bytes are truncated by the NUL terminator — both fixed by stopping the scan at the terminator.

1. `utf8nlen()` / `utf8len()`

`utf8nlen()` took the codepoint width from the lead byte and advanced `str` by that width unconditionally, marching past the NUL on a truncated trailing sequence; the next `'\0' != *str` loop check then read out of bounds. PoC: `utf8len("\x2c\xdf")`.
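The bounded advance can be sketched roughly as follows. This is an illustration of the idea, not the patch itself: `sketch_width` and `sketch_nlen` are hypothetical names, and how a sequence cut short by the `n`-byte limit is counted is simplified here and may differ from `utf8nlen()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a utf8.h function: byte width implied by a lead byte. */
static size_t sketch_width(unsigned char lead) {
  if ((lead & 0xf8) == 0xf0) return 4; /* 11110xxx */
  if ((lead & 0xf0) == 0xe0) return 3; /* 1110xxxx */
  if ((lead & 0xe0) == 0xc0) return 2; /* 110xxxxx */
  return 1;                            /* ASCII or a stray byte */
}

/* Hypothetical stand-in for utf8nlen(): counts codepoints in at most n bytes. */
static size_t sketch_nlen(const char *str, size_t n) {
  const unsigned char *p = (const unsigned char *)str;
  size_t length = 0;

  while (n > 0 && *p != '\0') {
    size_t width = sketch_width(*p);
    /* Step one byte at a time; stop at the NUL or once n bytes are consumed,
       so the cursor never moves past the terminator. */
    while (width > 0 && n > 0 && *p != '\0') {
      p++;
      n--;
      width--;
    }
    length++;
  }
  return length;
}

/* Usage: the PoC input from the description, plus a well-formed string. */
void sketch_nlen_demo(void) {
  assert(sketch_nlen("\x2c\xdf", SIZE_MAX) == 2);  /* ',' + truncated lead byte */
  assert(sketch_nlen("a\xc3\xa9", SIZE_MAX) == 2); /* 'a' + U+00E9 */
}
```

Tracking a remaining-byte count instead of computing an end pointer avoids forming `str + SIZE_MAX`, which would itself be undefined pointer arithmetic.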

2. `utf8codepoint()`

`utf8codepoint()` read `str[1]`/`str[2]`/`str[3]` based only on the lead byte, with no bounds check, so a truncated sequence read past the terminator. This is the OOB reached via `utf8lwr`/`utf8upr`/`utf8cmp`/… (which call `utf8codepoint` without pre-validating continuation bytes), and it is the `utf8codepoint` case noted in #117. PoC: `utf8lwr("\xe7")` → heap-buffer-overflow READ. Folding the continuation bytes in a loop that stops at the NUL fixes the OOB for all `utf8codepoint` callers at once; for well-formed input the decoded value and returned pointer are unchanged.
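A minimal sketch of folding continuation bytes in a loop that stops at the NUL, assuming a hypothetical `sketch_codepoint` with a signature loosely modelled on `utf8codepoint()`. The value produced for truncated input is a choice of this sketch (it keeps whatever bits were folded so far); the description does not say what the patched function yields in that case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for utf8codepoint(): decodes one codepoint and
   returns a pointer to the byte after it. */
static const char *sketch_codepoint(const char *str, int32_t *out) {
  const unsigned char *p = (const unsigned char *)str;
  int32_t cp;
  int extra; /* continuation bytes the lead byte promises */

  if ((*p & 0xf8) == 0xf0) {
    cp = *p & 0x07; extra = 3;
  } else if ((*p & 0xf0) == 0xe0) {
    cp = *p & 0x0f; extra = 2;
  } else if ((*p & 0xe0) == 0xc0) {
    cp = *p & 0x1f; extra = 1;
  } else {
    cp = *p; extra = 0;
  }
  p++;

  /* Fold in continuation bytes, but never read beyond the terminator. */
  while (extra > 0 && *p != '\0') {
    cp = (cp << 6) | (*p & 0x3f);
    p++;
    extra--;
  }

  *out = cp;
  return (const char *)p;
}

/* Usage: a truncated 3-byte lead and a well-formed 3-byte sequence. */
void sketch_codepoint_demo(void) {
  int32_t cp;
  const char *truncated = "\xe7";
  const char *euro = "\xe2\x82\xac"; /* U+20AC */

  assert(sketch_codepoint(truncated, &cp) == truncated + 1); /* stops at NUL */
  assert(sketch_codepoint(euro, &cp) == euro + 3);
  assert(cp == 0x20ac);
}
```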

utf8makevalid() (also in #117) pre-validates continuation bytes with short-circuiting checks and treats a NUL as a non-continuation byte, so it stops at the terminator and is not affected.
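Why a short-circuiting check stops at the terminator can be shown with a small sketch. The helper names are hypothetical and this is not the `utf8makevalid()` source; it only illustrates that a NUL fails the `10xxxxxx` test, so `&&` never evaluates the bytes after it.

```c
#include <assert.h>

/* Hypothetical helper: true for a continuation byte (10xxxxxx).
   A NUL (0x00) fails this test. */
static int sketch_is_continuation(unsigned char b) {
  return (b & 0xc0) == 0x80;
}

/* Hypothetical check for a 3-byte sequence starting at s. If s[1] is the
   NUL terminator, && short-circuits and s[2] is never read. */
static int sketch_valid3(const char *s) {
  const unsigned char *p = (const unsigned char *)s;
  return sketch_is_continuation(p[1]) && sketch_is_continuation(p[2]);
}

/* Usage: truncated inputs are rejected without reading past the NUL. */
void sketch_valid3_demo(void) {
  assert(sketch_valid3("\xe7") == 0);         /* s[1] is the NUL */
  assert(sketch_valid3("\xe2\x82") == 0);     /* s[2] is the NUL */
  assert(sketch_valid3("\xe2\x82\xac") == 1); /* well-formed U+20AC */
}
```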

Testing

Regression tests for all three entry points; the full suite passes under -fsanitize=address (159 tests). Found via libFuzzer + ASan; reproduced on a clean clone.

…e sequence

utf8nlen() (and thus utf8len(), which calls it with SIZE_MAX) determined a codepoint's byte-width from its lead byte and then advanced `str` by that width unconditionally. When a multibyte lead byte is the final byte before the NUL terminator (or the n-byte limit), this marched `str` past the terminator, so the next `'\0' != *str` loop check read one or more bytes out of bounds.

Minimal trigger: utf8len() on the NUL-terminated 2-byte input {0x2c, 0xdf} (',' followed by a 2-byte lead byte with no continuation) -> 1-byte heap-buffer-overflow READ at the loop condition.

Fix: advance one byte at a time over the codepoint width, stopping early at the NUL terminator or the n-byte limit, so the pointer never steps past the buffer. Behaviour is unchanged for well-formed input; a truncated trailing sequence is now counted as a single codepoint instead of triggering an over-read.

This is the same class of bug as sheredom#117 (multibyte lookahead past the end), here in utf8nlen/utf8len specifically.

Added a regression test (utf8len_truncated_trailing_lead_byte) to the shared suite; the full suite passes under -fsanitize=address.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…#136 (#467)

Add to the validation/findings log: a real CWE-125 OOB read in utf8.h utf8nlen()/utf8len() (reachable via default utf8len, no flag), found by the pipeline and correctly kept upstream_bug by contract-aware triage (S-465). Fixed and submitted upstream as sheredom/utf8.h#136 (bounded advance + regression test; full suite passes under ASan). Same root-cause class as the open #117 but in a function it does not name.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…te sequence

utf8codepoint() read str[1]/str[2]/str[3] based solely on the lead byte, with no bounds check, so a truncated trailing multibyte sequence (continuation bytes run off the end of a NUL-terminated buffer) read out of bounds. This is the OOB reached via utf8lwr/utf8upr/utf8cmp/... (which call utf8codepoint without pre-validating continuation bytes); reported for utf8codepoint in sheredom#117.

Minimal trigger: utf8lwr() on the NUL-terminated 1-byte input {0xe7} (a 3-byte lead) -> heap-buffer-overflow READ in utf8codepoint.

Fix: fold in continuation bytes with a loop that stops at the NUL terminator, so the cursor never advances past it. For well-formed input the behaviour and the returned pointer/codepoint are identical to the previous fixed reads. This fixes the OOB for every utf8codepoint caller at once.

Note: utf8makevalid() (also mentioned in sheredom#117) pre-validates continuation bytes with short-circuiting checks and a NUL is treated as a non-continuation byte, so it stops at the terminator and is not affected.

Added regression tests (utf8codepoint + utf8lwr); full suite passes under ASan.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ZUENS2020 mentioned this pull request Jun 14, 2026

docs(improvements): log utf8.h utf8nlen OOB finding + upstream PR #136 ZUENS2020/Sherpa#467

Merged

ZUENS2020 changed the title ~~Fix out-of-bounds read in utf8nlen()/utf8len() on a truncated trailing multibyte sequence~~ Fix out-of-bounds reads on truncated multibyte sequences (utf8nlen/utf8len + utf8codepoint) Jun 16, 2026

ZUENS2020 wants to merge 2 commits into sheredom:main from ZUENS2020:fix-utf8nlen-oob-read
