Conversation
patchwork01
left a comment
There was a problem hiding this comment.
I think there's a problem we didn't think of here. If we add this as it is, the tasks will shut down at very similar times if they started at similar times. We'll want to avoid a situation where too many compaction tasks shut down at once. For that reason I don't think we can merge this as it is.
One option would be to randomise the lifetime of the task when it starts up. We could have both a maximum lifetime and an amount of time to jitter it by. For example, if the maximum lifetime is 24 hours, and the jitter is 2 hours, and you have 100 tasks starting at once, then they'd stop gradually over a period of 2 hours instead of all at once after 24 hours.
For how to control the jitter in tests, you could look at ExponentialBackoffWithJitterTest, and how we've used ExponentialBackoffWithJitterTestHelper.
...ction/compaction-core/src/test/java/sleeper/compaction/core/task/CompactionTaskTestBase.java
Outdated
Show resolved
Hide resolved
...ction/compaction-core/src/test/java/sleeper/compaction/core/task/CompactionTaskTestBase.java
Outdated
Show resolved
Hide resolved
java/core/src/main/java/sleeper/core/properties/instance/CompactionProperty.java
Outdated
Show resolved
Hide resolved
|
Comments addressed, back in review |
| .description("The max ammount of jitter to add to a new compaction task's keep alive time.\n" + | ||
| "A random number of minutes between 0 and this value will be added to the keep alive time above " + | ||
| "for each new compaction task.\n" + | ||
| "This lowers the chance that all the compaction tasks stop at the same time.") |
There was a problem hiding this comment.
There's a typo at "ammount".
It looks like there are some words missing at "keep alive time above for each new compaction task".
| UserDefinedInstanceProperty COMPACTION_TASK_KEEP_ALIVE_TIME_IN_MINMUTES = Index.propertyBuilder("sleeper.compaction.task.keep.alive.time.minutes") | ||
| .description("The total time in minutes that a compaction task can be alive for before it is terminated.\n" + | ||
| "When the task has finished it's current job, it will calculate how long it's been running and " + | ||
| "if it's been running longer than the max alive time it will terminate.\n" + | ||
| "A new task will then begin running to process remaining compaction jobs.\n" + | ||
| "This allows the user to gracefully upgrade Sleeper to a newer version.") |
There was a problem hiding this comment.
There's some disagreement here between "max alive time" and "keep alive time". I think the idea that this is a keep alive is a little misleading because we don't do anything to keep it alive, we stop it when it would otherwise carry on. I think "max alive time" makes more sense, but it seems inaccurate when the jitter is added on top. I'd consider making the jitter subtract from the max alive time rather than add to it.
This naming is also used in the tests, including the new comments and the nested test class.
| * Wrapper class around the Supplier Instant interface to be able to assert cleanly there are no remaining expected | ||
| * time calls. | ||
| */ | ||
| public class TestSupplier implements Supplier<Instant> { |
There was a problem hiding this comment.
The name TestSupplier doesn't say what it's supplying. TestInstantSupplier seems better?
| private final Supplier<String> jobRunIdSupplier; | ||
| private final Supplier<Instant> timeSupplier; | ||
| private final ThreadSleep threadSleep; | ||
| private final Long maxAliveTime; |
There was a problem hiding this comment.
This could do with stating the units, i.e. minutes rather than time.
It doesn't look like there's a reason to box this value? It can just be a primitive long?
| private CompactionTask generateNewCompactionTask(String taskId) { | ||
| return new CompactionTask(instanceProperties, null, | ||
| null, null, | ||
| null, null, | ||
| null, null, | ||
| null, null, taskId); |
There was a problem hiding this comment.
Rather than passing in lots of nulls, we could extract the code that calls the constructor in the test base class, and reuse that?
I don't think you'll need this method if you add tests for the jitter behaviour though.
| @Test | ||
| void shouldGenerateUniqueMaxAliveTimesForUniqueTasks() { | ||
| //Given | ||
| instanceProperties.setNumber(COMPACTION_TASK_KEEP_ALIVE_JITTER_IN_MINUTES, Integer.MAX_VALUE); | ||
| CompactionTask task1 = generateNewCompactionTask("task1"); | ||
| CompactionTask task2 = generateNewCompactionTask("task2"); | ||
|
|
||
| //When/Then | ||
| assertThat(task1.getMaxAliveTime()).isNotEqualTo(task2.getMaxAliveTime()); |
There was a problem hiding this comment.
This isn't really testing the jitter behaviour. In ExponentialBackoffWithJitter, we passed in a constructor parameter DoubleSupplier randomJitterFraction, that's usually implemented as Math::random. With that, you can replace the Math.random call with your own code, to take control over how much jitter happens in a test. Take a look at ExponentialBackoffWithJitterTest.
That way you won't need to add the getMaxAliveTime method, and you can provide real test coverage of the behaviour, e.g. if the minimum amount of jitter happens, then it'll stop at the first possible time, if the maximum amount of jitter happens it'll only stop after the last possible time.
| + (long) (Math.random() * instanceProperties.getLong(COMPACTION_TASK_KEEP_ALIVE_JITTER_IN_MINUTES)); | ||
| } | ||
|
|
||
| public Long getMaxAliveTime() { |
There was a problem hiding this comment.
It doesn't look like this getter should be necessary. I'm not sure the testing against this is really adding much.
I think either it's worth testing the behaviour and we should wire in a test fake for Math.random, or it's not and we just test that it'll terminate using times where it should terminate even with the maximum jitter.
Make sure you have checked all steps below.
Issue
Feature". Note that before an issue is finished, you can still make a pull request by raising a separate issue
for your progress.
Tests
-CompactionTaskTerminateTest - StopAfterMaxAliveTime
Documentation
separate issue for that below.