Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes an issue where rm_stm incorrectly initializes the last stable offset (LSO) to -1 instead of 0, which is the proper value for an invalid LSO. Since invalid LSO (0) is used throughout the codebase for error translation into retry-able consumer errors, this incorrect initialization prevents proper error handling.
Key Changes
- Updated
rm_stm::last_stable_offset()to initialize LSO withmodel::invalid_lsoinstead of -1 - Updated
source_partition_offsets::last_stable_offsetdefault value to usemodel::offset_cast(model::invalid_lso)instead of -1
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/v/cluster/rm_stm.cc | Changed local variable initialization from -1 to model::invalid_lso in last_stable_offset() function |
| src/v/kafka/client/direct_consumer/api_types.h | Changed struct member default initialization from -1 to model::offset_cast(model::invalid_lso) |
CI test resultstest results on build#78548
|
|
link the jira? |
|
|
||
| auto synced_leader = _raft->is_leader() && _raft->term() == _insync_term; | ||
| model::offset lso{-1}; | ||
| model::offset lso{model::invalid_lso}; |
There was a problem hiding this comment.
q: the new behavior is that it returns offset_not_available and that triggers a retry?
If yes, I wonder (without this fix) if the translation here would throw an exception because it attempts translation before the start offset? How is it returning -1?
There was a problem hiding this comment.
from the logs
TRACE 2025-12-23 01:45:37,297 [shard 1:fetc] tx - [{kafka/source-topic/0}] - rm_stm.cc:1322 - lso update in progress, last_known_lso: -1, last_applied: 0
from the code
auto maybe_lso = _partition->last_stable_offset();
if (maybe_lso == model::invalid_lso) {
return error_code::offset_not_available;
}
return _translator->from_log_offset(maybe_lso);
and then translator is actually just a translator_state, code here
model::offset offset_translator_state::from_log_offset(model::offset o) const {
const auto d = delta(o);
return model::offset(o - d);
}
no checks on valid offset nor exceptions
There was a problem hiding this comment.
IMO, this is a +1 for trying to use name constants as much as possible as opposed to magic numbers
There was a problem hiding this comment.
Let's try to be consistent. We track high watermark as the inclusive offset and add one when it is returned. Would it make sense to track LSO in the same way i.e. leave it as -1 and only add 1 when we need to retrieve it ?
There was a problem hiding this comment.
We track high watermark as the inclusive offset and add one when it is returned
Can you point me in the direction of where this occurs?
I poked around and it seems hwm and lso get the same treatment almost everywhere in code. Its possible that lso actually gets an increment on some fetch path, in which case, maybe invalid_lso should actually be -1
There was a problem hiding this comment.
in which case, maybe invalid_lso should actually be -1
that makes sense.. hopefully nothing breaks :)
There was a problem hiding this comment.
tests pass, looks good afaik
no jira link, this was a failure flagged in private team core |
ecfa185 to
4fab444
Compare
4fab444 to
c5f985a
Compare
Use named constants when possible.
c5f985a to
faf1868
Compare
|
letting ci run its course |
|
/backport v25.3.x |
|
Failed to create a backport PR to v25.3.x branch. I tried: |
rm_stm instantiated its lso to -1, where invalid_lso is 0
Normally, invalid_lso gets converted into lso_unavailable in the fetch path, but -1 != 0, so direct consumer received an lso of -1, thus firing an error log.
This pr changes invalid lso to be -1 in parity with hwm, and then spot updates places where lso's are being instantiated to the magic number "-1" to instead use invalid_lso.
Backports Required
Release Notes
Bug Fixes