Skip to content

6818: Gracefully stop compaction tasks#6913

Open
ca61688 wants to merge 24 commits intodevelopfrom
6818-gracefully-stop-compaction-tasks
Open

6818: Gracefully stop compaction tasks#6913
ca61688 wants to merge 24 commits intodevelopfrom
6818-gracefully-stop-compaction-tasks

Conversation

@ca61688
Copy link
Copy Markdown
Collaborator

@ca61688 ca61688 commented Mar 26, 2026

Make sure you have checked all steps below.

Issue

  • My PR fully resolves the following issues. I've referenced an issue in the PR title, for example "Issue 1234 - My
    Feature". Note that before an issue is finished, you can still make a pull request by raising a separate issue
    for your progress.

Tests

  • My PR adds the following tests based on our test strategy OR does not need testing for this extremely good reason:
    -CompactionTaskTerminateTest - StopAfterMaxAliveTime

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it, or I have linked to a
    separate issue for that below.
  • If I have added new Java code, I have added Javadoc that explains it following our conventions and style.
  • If I have added or removed any dependencies from the project, I have updated the NOTICES file.

@ca61688 ca61688 linked an issue Mar 26, 2026 that may be closed by this pull request
@ca61688 ca61688 marked this pull request as ready for review March 31, 2026 09:12
@ca61688 ca61688 added the needs-reviewer Pull requests that need a reviewer to be assigned label Mar 31, 2026
Copy link
Copy Markdown
Collaborator

@patchwork01 patchwork01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a problem we didn't think of here. If we add this as it is, the tasks will shut down at very similar times if they started at similar times. We'll want to avoid a situation where too many compaction tasks shut down at once. For that reason I don't think we can merge this as it is.

One option would be to randomise the lifetime of the task when it starts up. We could have both a maximum lifetime and an amount of time to jitter it by. For example, if the maximum lifetime is 24 hours, and the jitter is 2 hours, and you have 100 tasks starting at once, then they'd stop gradually over a period of 2 hours instead of all at once after 24 hours.

For how to control the jitter in tests, you could look at ExponentialBackoffWithJitterTest, and how we've used ExponentialBackoffWithJitterTestHelper.

@patchwork01 patchwork01 removed the needs-reviewer Pull requests that need a reviewer to be assigned label Mar 31, 2026
@ca61688
Copy link
Copy Markdown
Collaborator Author

ca61688 commented Mar 31, 2026

Comments addressed, back in review

@ca61688 ca61688 removed their assignment Mar 31, 2026
@ca61688 ca61688 added the needs-reviewer Pull requests that need a reviewer to be assigned label Mar 31, 2026
Comment on lines +130 to +133
.description("The max ammount of jitter to add to a new compaction task's keep alive time.\n" +
"A random number of minutes between 0 and this value will be added to the keep alive time above " +
"for each new compaction task.\n" +
"This lowers the chance that all the compaction tasks stop at the same time.")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a typo at "ammount".

It looks like there are some words missing at "keep alive time above for each new compaction task".

Comment on lines +120 to +125
UserDefinedInstanceProperty COMPACTION_TASK_KEEP_ALIVE_TIME_IN_MINMUTES = Index.propertyBuilder("sleeper.compaction.task.keep.alive.time.minutes")
.description("The total time in minutes that a compaction task can be alive for before it is terminated.\n" +
"When the task has finished it's current job, it will calculate how long it's been running and " +
"if it's been running longer than the max alive time it will terminate.\n" +
"A new task will then begin running to process remaining compaction jobs.\n" +
"This allows the user to gracefully upgrade Sleeper to a newer version.")
Copy link
Copy Markdown
Collaborator

@patchwork01 patchwork01 Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some disagreement here between "max alive time" and "keep alive time". I think the idea that this is a keep alive is a little misleading because we don't do anything to keep it alive, we stop it when it would otherwise carry on. I think "max alive time" makes more sense, but it seems inaccurate when the jitter is added on top. I'd consider making the jitter subtract from the max alive time rather than add to it.

This naming is also used in the tests, including the new comments and the nested test class.

* Wrapper class around the Supplier Instant interface to be able to assert cleanly there are no remaining expected
* time calls.
*/
public class TestSupplier implements Supplier<Instant> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name TestSupplier doesn't say what it's supplying. TestInstantSupplier seems better?

private final Supplier<String> jobRunIdSupplier;
private final Supplier<Instant> timeSupplier;
private final ThreadSleep threadSleep;
private final Long maxAliveTime;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could do with stating the units, i.e. minutes rather than time.

It doesn't look like there's a reason to box this value? It can just be a primitive long?

Comment on lines +279 to +284
private CompactionTask generateNewCompactionTask(String taskId) {
return new CompactionTask(instanceProperties, null,
null, null,
null, null,
null, null,
null, null, taskId);
Copy link
Copy Markdown
Collaborator

@patchwork01 patchwork01 Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than passing in lots of nulls, we could extract the code that calls the constructor in the test base class, and reuse that?

I don't think you'll need this method if you add tests for the jitter behaviour though.

Comment on lines +242 to +250
@Test
void shouldGenerateUniqueMaxAliveTimesForUniqueTasks() {
//Given
instanceProperties.setNumber(COMPACTION_TASK_KEEP_ALIVE_JITTER_IN_MINUTES, Integer.MAX_VALUE);
CompactionTask task1 = generateNewCompactionTask("task1");
CompactionTask task2 = generateNewCompactionTask("task2");

//When/Then
assertThat(task1.getMaxAliveTime()).isNotEqualTo(task2.getMaxAliveTime());
Copy link
Copy Markdown
Collaborator

@patchwork01 patchwork01 Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really testing the jitter behaviour. In ExponentialBackoffWithJitter, we passed in a constructor parameter DoubleSupplier randomJitterFraction, that's usually implemented as Math::random. With that, you can replace the Math.random call with your own code, to take control over how much jitter happens in a test. Take a look at ExponentialBackoffWithJitterTest.

That way you won't need to add the getMaxAliveTime method, and you can provide real test coverage of the behaviour, e.g. if the minimum amount of jitter happens, then it'll stop at the first possible time, if the maximum amount of jitter happens it'll only stop after the last possible time.

@patchwork01 patchwork01 removed the needs-reviewer Pull requests that need a reviewer to be assigned label Apr 2, 2026
+ (long) (Math.random() * instanceProperties.getLong(COMPACTION_TASK_KEEP_ALIVE_JITTER_IN_MINUTES));
}

public Long getMaxAliveTime() {
Copy link
Copy Markdown
Collaborator

@patchwork01 patchwork01 Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like this getter should be necessary. I'm not sure the testing against this is really adding much.

I think either it's worth testing the behaviour and we should wire in a test fake for Math.random, or it's not and we just test that it'll terminate using times where it should terminate even with the maximum jitter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gracefully stop compaction tasks after a maximum age

3 participants