KAFKA-19932: adding handling of OOM and avoiding wrapped as timeout#21117
KAFKA-19932: adding handling of OOM and avoiding wrapped as timeout#21117iit2009060 wants to merge 5 commits intoapache:trunkfrom
Conversation
|
@kirktrue can you review it |
kirktrue
left a comment
There was a problem hiding this comment.
Thanks for the PR @iit2009060. Can you add some unit tests to validate the work? Thanks!
13fa78a to
1011d7a
Compare
added |
|
@kirktrue can you review it. |
|
@kirktrue can you please review it |
kirktrue
left a comment
There was a problem hiding this comment.
@iit2009060 Sorry for the delay. Thanks for the addition of the unit tests.
I'm going back and forth on whether we should handle this outside of handleTimeoutFailure() or not. On the one hand, the root cause (OOM) is not a timeout, but on the other hand, we might consider hitting an OOM to be a "retriable" error. Thoughts?
Also, I provided some minor feedback on the tests.
Thanks!
| // Create a spy of MetadataResponse that throws OutOfMemoryError when topicMetadata() is accessed | ||
| // The AdminClient calls response.topicMetadata() in listTopics handleResponse(), which will trigger the OOM | ||
| MetadataResponseData data = new MetadataResponseData(); | ||
| MetadataResponse realResponse = new MetadataResponse(data, ApiKeys.METADATA.latestVersion()); | ||
| MetadataResponse spyResponse = spy(realResponse); |
There was a problem hiding this comment.
Is it possible to use a generic "mock" here instead of a spy?
| ExecutionException exception = assertThrows(ExecutionException.class, () -> result.names().get()); | ||
| assertInstanceOf(OutOfMemoryError.class, exception.getCause(), | ||
| "Expected OutOfMemoryError to be propagated, but got: " + exception.getCause()); |
There was a problem hiding this comment.
Can you see if you can use TestUtils.assertFutureThrows() here? It could be slightly less boilerplate.
| ExecutionException exception = assertThrows(ExecutionException.class, () -> result.names().get()); | ||
| assertInstanceOf(OutOfMemoryError.class, exception.getCause(), | ||
| "Expected OutOfMemoryError to be propagated, but got: " + exception.getCause()); | ||
| assertEquals("Simulated OOM during response handling", exception.getCause().getMessage()); |
There was a problem hiding this comment.
Another option is to create the OutOfMemoryError outside of the doThrow() and then perform assertEquals().
|
@iit2009060—gentle reminder on this. Thanks! |
|
@iit2009060—gentle reminder on this. Thanks! |
@kirktrue Thanks for reminding me. I will have a look. |
|
@iit2009060—do you still have the time and motivation to work on this, or should we hand it off to someone else? I just want to make sure we get it into Kafka 4.3. Thanks! |
i am really sorry if you have to follow up on this multiple times.I have addressed the comments. |
|
0d15324 to
fe7799f
Compare
Yes, please. |
| } | ||
| if (cause instanceof TimeoutException) { | ||
| // Don't mask OutOfMemoryError as TimeoutException - propagate it directly | ||
| if (cause instanceof OutOfMemoryError) { |
There was a problem hiding this comment.
is there any reason to wrap any errors into a TImeoutException?
I may be wrong, but it looks like it may make sense to treat all errors the same way.
wdyt?
There was a problem hiding this comment.
@Nikita-Shupletsov I am moving Out of memory error out of Timeout as out of memory is not a retryable error.
There was a problem hiding this comment.
yeah, I understand that, sorry for not being crisp. My question was more like: as we found that problem with OutOfMemoryError, should we treat all errors the same way? I am not sure if StackOverFlowError for example is a retriable error in that context, or NotSuchMethodError.
There was a problem hiding this comment.
Yes that should also not be treated a timedout error , we can open a separate JIRA to handle this specific errors to move out of timedout exception.
There was a problem hiding this comment.
@kirktrue can we track in a separate ticket? This PR has been open for a long time.
There was a problem hiding this comment.
@iit2009060
May I ask what your concern is? it should be a one-liner. I am not 100% sure why we would need a separate ticket for that
There was a problem hiding this comment.
@Nikita-Shupletsov Are you suggesting instead of catching OutOfMemoryError , we should catch VirtualMachineError which avoid all other non retryable exception to fall under the timedout error?
There was a problem hiding this comment.
@Nikita-Shupletsov @kirktrue, I have made the changes. Please check
done |
|
@kirktrue Please review it , I have made the changes. |
adding handling of OOM and avoiding wrapped as timeout