KAFKA-19932: adding handling of OOM and avoiding wrapped as timeout #21117

Open
iit2009060 wants to merge 5 commits into apache:trunk from iit2009060:KAFKA-19932

Conversation

@iit2009060
Contributor

@iit2009060 iit2009060 commented Dec 10, 2025

Adds handling of OutOfMemoryError (OOM) so that it is no longer wrapped as a timeout.

@github-actions github-actions bot added the triage, clients, and small labels Dec 10, 2025
@iit2009060
Contributor Author

@kirktrue can you review it

Contributor

@kirktrue kirktrue left a comment

Thanks for the PR @iit2009060. Can you add some unit tests to validate the work? Thanks!

@github-actions github-actions bot removed the triage label Dec 11, 2025
@iit2009060
Contributor Author

> Thanks for the PR @iit2009060. Can you add some unit tests to validate the work? Thanks!

added

@iit2009060
Contributor Author

@kirktrue can you review it.

@iit2009060
Contributor Author

@kirktrue can you please review it

Contributor

@kirktrue kirktrue left a comment

@iit2009060 Sorry for the delay. Thanks for the addition of the unit tests.

I'm going back and forth on whether we should handle this outside of handleTimeoutFailure() or not. On the one hand, the root cause (OOM) is not a timeout, but on the other hand, we might consider hitting an OOM to be a "retriable" error. Thoughts?

Also, I provided some minor feedback on the tests.

Thanks!

Comment on lines +11767 to +11771
// Create a spy of MetadataResponse that throws OutOfMemoryError when topicMetadata() is accessed
// The AdminClient calls response.topicMetadata() in listTopics handleResponse(), which will trigger the OOM
MetadataResponseData data = new MetadataResponseData();
MetadataResponse realResponse = new MetadataResponse(data, ApiKeys.METADATA.latestVersion());
MetadataResponse spyResponse = spy(realResponse);
Contributor

Is it possible to use a generic "mock" here instead of a spy?

Contributor Author

done

Comment on lines +11786 to +11788
ExecutionException exception = assertThrows(ExecutionException.class, () -> result.names().get());
assertInstanceOf(OutOfMemoryError.class, exception.getCause(),
"Expected OutOfMemoryError to be propagated, but got: " + exception.getCause());
Contributor

Can you see if you can use TestUtils.assertFutureThrows() here? It could be slightly less boilerplate.

Contributor Author

done
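
For readers unfamiliar with the pattern behind the suggestion above: a future-assertion helper folds the `assertThrows(ExecutionException.class, ...)` plus `assertInstanceOf(...)` pair into one call that waits on the future, unwraps the `ExecutionException`, and returns the typed cause. The sketch below is a hypothetical stand-in written against the JDK only; it is not Kafka's `TestUtils.assertFutureThrows()`, and the class name, argument order, and messages are assumptions, so check the real helper before relying on them.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical stand-in for a future-assertion helper; NOT Kafka's TestUtils.
final class FutureAssertions {
    private FutureAssertions() { }

    // Waits on the future, expects it to fail, and returns the cause cast to the expected type.
    static <T extends Throwable> T assertFutureThrows(Class<T> expectedCause, Future<?> future) {
        try {
            future.get();
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (expectedCause.isInstance(cause)) {
                return expectedCause.cast(cause);
            }
            throw new AssertionError("Expected " + expectedCause.getName() + " but got: " + cause, cause);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError("Interrupted while waiting for the future", e);
        }
        throw new AssertionError("Expected the future to fail with " + expectedCause.getName());
    }

    public static void main(String[] args) {
        // Simulate a call whose result future failed with an OutOfMemoryError.
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new OutOfMemoryError("Simulated OOM during response handling"));

        OutOfMemoryError oom = assertFutureThrows(OutOfMemoryError.class, failed);
        System.out.println(oom.getMessage());
    }
}
```

Returning the typed cause lets the test follow up with a message or identity check without a second cast.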

ExecutionException exception = assertThrows(ExecutionException.class, () -> result.names().get());
assertInstanceOf(OutOfMemoryError.class, exception.getCause(),
"Expected OutOfMemoryError to be propagated, but got: " + exception.getCause());
assertEquals("Simulated OOM during response handling", exception.getCause().getMessage());
Contributor

Another option is to create the OutOfMemoryError outside of the doThrow() and then perform assertEquals().
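
A minimal sketch of that option, using only the JDK: the error is created once up front, and the test then checks that the very same instance comes back as the cause. `Throwable` inherits `Object.equals`, so an `assertEquals` on two throwables is an identity check, which is stricter than comparing messages. The class and method names here are hypothetical, and the `CompletableFuture` stands in for the admin client call; in the Mockito-based test the same instance would presumably be handed to `doThrow(...)`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// Hypothetical example class; the future stands in for the admin client result.
final class SameErrorInstanceExample {
    private SameErrorInstanceExample() { }

    // Simulates a call whose response handling fails with the supplied error.
    static CompletableFuture<Void> failWith(Throwable error) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        future.completeExceptionally(error);
        return future;
    }

    // Returns the cause that the future reports, or null if it completed normally.
    static Throwable causeOf(CompletableFuture<?> future) {
        try {
            future.get();
            return null;
        } catch (ExecutionException e) {
            return e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the future", e);
        }
    }

    public static void main(String[] args) {
        // Create the error once, outside the stubbing, so the test can compare against it.
        OutOfMemoryError expected = new OutOfMemoryError("Simulated OOM during response handling");

        Throwable actual = causeOf(failWith(expected));

        // Identity comparison: the propagated cause must be the instance created above.
        if (actual != expected) {
            throw new AssertionError("Expected the original error instance, but got: " + actual);
        }
        System.out.println("Propagated the original instance: " + actual.getMessage());
    }
}
```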

@kirktrue
Contributor

@iit2009060—gentle reminder on this. Thanks!

@mjsax mjsax changed the title from "KAFKA-19932 adding handling of OOM and avoiding wrapped as timeout" to "KAFKA-19932: adding handling of OOM and avoiding wrapped as timeout" Jan 29, 2026
@kirktrue
Contributor

@iit2009060—gentle reminder on this. Thanks!

@iit2009060
Contributor Author

> @iit2009060, gentle reminder on this. Thanks!

@kirktrue Thanks for reminding me. I will have a look.

@kirktrue
Contributor

@iit2009060—do you still have the time and motivation to work on this, or should we hand it off to someone else? I just want to make sure we get it into Kafka 4.3. Thanks!

@iit2009060
Contributor Author

> @iit2009060, do you still have the time and motivation to work on this, or should we hand it off to someone else? I just want to make sure we get it into Kafka 4.3. Thanks!

I am really sorry that you had to follow up on this multiple times. I have addressed the comments.

@iit2009060
Contributor Author

> I'm going back and forth on whether we should handle this outside of handleTimeoutFailure() or not. On the one hand, the root cause (OOM) is not a timeout, but on the other hand, we might consider hitting an OOM to be a "retriable" error. Thoughts?

Yes, I think OOM is not a retryable error, and we should not be handling it as a timeout failure. We should fail fast instead of waiting on the timeout. Should I make a change for this?

@github-actions github-actions bot added the streams label and removed the small label Feb 21, 2026
@github-actions github-actions bot added the small label Feb 21, 2026
@kirktrue
Contributor

> Yes, I think OOM is not a retryable error, and we should not be handling it as a timeout failure. We should fail fast instead of waiting on the timeout. Should I make a change for this?

Yes, please.
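
To make the agreed direction concrete, here is a minimal sketch of "fail fast on OOM, wrap everything else as a timeout". It is not the actual `KafkaAdminClient` code: the class and method names are hypothetical, and `java.util.concurrent.TimeoutException` stands in for Kafka's own `TimeoutException` so that the example compiles against the JDK alone.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names are illustrative and do not mirror KafkaAdminClient.
final class TimeoutFailureSketch {
    private TimeoutFailureSketch() { }

    // Decides which throwable the caller's future should be failed with.
    static Throwable failureToReport(Throwable cause) {
        // Fail fast: surface the OutOfMemoryError itself instead of hiding it in a timeout.
        if (cause instanceof OutOfMemoryError) {
            return cause;
        }
        // An existing timeout is reported unchanged.
        if (cause instanceof TimeoutException) {
            return cause;
        }
        // Anything else is reported as a timeout that keeps the original error as its cause.
        TimeoutException wrapped = new TimeoutException("Call timed out; last error: " + cause);
        wrapped.initCause(cause);
        return wrapped;
    }

    public static void main(String[] args) {
        System.out.println(failureToReport(new OutOfMemoryError("Simulated OOM")));
        System.out.println(failureToReport(new IllegalStateException("broker unavailable")));
    }
}
```

With this shape, a caller waiting on the result sees the `OutOfMemoryError` as the direct cause rather than having to dig through a timeout wrapper.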

}
if (cause instanceof TimeoutException) {
// Don't mask OutOfMemoryError as TimeoutException - propagate it directly
if (cause instanceof OutOfMemoryError) {
Contributor

Is there any reason to wrap any errors into a TimeoutException?
I may be wrong, but it looks like it may make sense to treat all errors the same way.
WDYT?

Contributor Author

@Nikita-Shupletsov I am moving OutOfMemoryError out of the timeout handling, as running out of memory is not a retryable error.

Contributor

@Nikita-Shupletsov Nikita-Shupletsov Mar 3, 2026

Yeah, I understand that, sorry for not being crisp. My question was more like: as we found that problem with OutOfMemoryError, should we treat all errors the same way? I am not sure if StackOverflowError, for example, is a retriable error in that context, or NoSuchMethodError.

Contributor Author

Yes, those should also not be treated as timeout errors. We can open a separate JIRA to move these specific errors out of the timeout exception handling.

Contributor

@iit2009060—Can you create that separate PR?

Contributor Author

@kirktrue can we track in a separate ticket? This PR has been open for a long time.

Contributor

@iit2009060
May I ask what your concern is? It should be a one-liner. I am not 100% sure why we would need a separate ticket for that.

Contributor Author

@iit2009060 iit2009060 Mar 11, 2026

@Nikita-Shupletsov Are you suggesting that, instead of catching OutOfMemoryError, we should catch VirtualMachineError, which keeps the other non-retryable errors from being reported as a timeout error?

[Image: Screenshot 2026-03-11 at 10 50 39 PM]

Contributor

effectively yes
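
For reference, a small JDK-only sketch of what a `VirtualMachineError` check would and would not cover. In the standard class hierarchy, `OutOfMemoryError`, `StackOverflowError`, and `InternalError` all extend `VirtualMachineError`, while `NoSuchMethodError` (mentioned earlier in this thread) is a `LinkageError` and so falls outside that check. The class and method names below are hypothetical and are not taken from the PR.

```java
// Hypothetical helper illustrating which errors an `instanceof VirtualMachineError` check matches.
final class VmErrorCheck {
    private VmErrorCheck() { }

    // True for OutOfMemoryError, StackOverflowError, InternalError, and other VirtualMachineError subclasses.
    static boolean isVirtualMachineError(Throwable t) {
        return t instanceof VirtualMachineError;
    }

    public static void main(String[] args) {
        Throwable[] samples = {
            new OutOfMemoryError(),
            new StackOverflowError(),
            new InternalError(),
            new NoSuchMethodError(),   // a LinkageError, not a VirtualMachineError
            new RuntimeException()
        };
        for (Throwable t : samples) {
            System.out.println(t.getClass().getSimpleName() + " -> " + isVirtualMachineError(t));
        }
    }
}
```

If linkage errors should also bypass the timeout wrapping, the check would need to be broadened beyond `VirtualMachineError`, for example to `Error`; whether that is desirable is a design decision for the PR, not something this sketch settles.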

Contributor Author

@Nikita-Shupletsov @kirktrue, I have made the changes. Please check

@iit2009060
Contributor Author

> > Yes, I think OOM is not a retryable error, and we should not be handling it as a timeout failure. We should fail fast instead of waiting on the timeout. Should I make a change for this?
>
> Yes, please.

done

@iit2009060
Contributor Author

@kirktrue Please review it; I have made the changes.

4 participants