Skip to content

Replaced LocalUniqueCache with .forEach methods#14724

Open
Ambeco wants to merge 1 commit intoyairm210:masterfrom
Ambeco:lowAllocStats
Open

Replaced LocalUniqueCache with .forEach methods#14724
Ambeco wants to merge 1 commit intoyairm210:masterfrom
Ambeco:lowAllocStats

Conversation

@Ambeco
Copy link
Copy Markdown
Contributor

@Ambeco Ambeco commented Apr 5, 2026

Replaced LocalUniqueCache with .forEach methods.

These avoid allocating a coroutine, sequence, and iterator. They avoid the expensive yieldAll code, improving performance, and reducing memory allocations, which especially helps on low-memory devices.
with .neighbors. GameInfo.nextTurn now 11.09% faster on my Pixel3a.

Timing changes:

  • CityPopulationManager.autoAssignPopulation 22.68% faster
  • CityTurnManager.startTurn 22.08% faster
  • GameInfo.updateCivilizationState 20.73% faster
  • reassignPopulation 20.39% faster
  • Civilization.updateStatsForNextTurn 16.65% faster
  • CityStats.update 14.06% faster
  • GameInfo.nextTurn 11.09% faster
  • Civilization.setTransients 11.04% faster
  • updateTileStats 10.21% faster
  • moveToTile 9.98% faster
  • UnitTurnManager.startTurn 3.41% faster.
  • automateUnitMoves 14.85% SLOWER (?)

https://docs.google.com/spreadsheets/d/1r-TC1N3wKhZMRJGU-ph5Fcjfu1u31MHfjFx6xM693Ss/edit?usp=sharing

Signed-off-by: MPD mooing_psycho_duck@hotmail.com

@Ambeco
Copy link
Copy Markdown
Contributor Author

Ambeco commented Apr 5, 2026

Notably, this is built on top of #14714

@SomeTroglodyte
Copy link
Copy Markdown
Collaborator

Sounds intriguing. Why would yieldAll be notably expensive...?

@Ambeco
Copy link
Copy Markdown
Contributor Author

Ambeco commented Apr 5, 2026

Honestly, I was expecting like a 0.5% improvement, and was also shocked.
My suspicious are:

  • allocating a sequencing coroutine allocated a massive amount of memory for the call stack, which is expensive on memory constrained devices.
  • the sequence + yield all logic tends to be nested, meaning we actually allocate multiple sequencing callstacks.
  • yieldAll also has to allocate lists of things, which means the iterators have the overhead of iterating over internal lists and switching to the coroutine if empty, and again, these are often nested. The app was spending ~2-4% of the total count time in iterator.hasNext and iterator.next due to this. (I expected a total of 2-4% improvement from removing all sequencing coroutines)
  • Ive replaced these with for-loops, sometimes with an explicit filter parameter, because this eliminates all allocations. The code is not only no longer allocating coroutines, it's now not even allocating iterators or filters.

But some of my testing indicated the problem wasn't the iterators, or even the sequences. The problem seems to be the coroutines. Further testing is needed to confirm.

But the improvement is still stunning. I'll probably remeasure. I think I uninstalled Facebook and Discord from my test phone between the control and test. It's possible that had an unexpected affect on the performance?

@Ambeco Ambeco force-pushed the lowAllocStats branch 2 times, most recently from 568eb51 to 4456e76 Compare April 6, 2026 18:42
@Ambeco
Copy link
Copy Markdown
Contributor Author

Ambeco commented Apr 6, 2026

Further testing shows that measurements are problematic.

  • Control called GameInfo.nextTurn 36 times after loading the file, and treatment only 26 times, so the games apparently differed wildly :(
  • automateUnitMoves called 461 times in Control, 836 times in treatment.

And this time, GameInfo.nextTurn only 4% faster, which wasn't statistically significant.
However, GC data is now working, and we have 50% fewer GCs per turn, which is nice.

@SomeTroglodyte
Copy link
Copy Markdown
Collaborator

... yieldAll also has to allocate lists of things ...

Thanks for the detailed answer. But that's close to my point - in a simple environment, yieldAll can simply wrap the existing iterator to pass elements through. Running inside a coroutine shouldn't really influence that. Oh, don't forget CoroutineContext.yield is an entirely different beast than SequenceBuilder.yield - no relation at all. They "meet" in a Flow, but luckily there it's renamed to emit...

But what I could imagine - is costly closures. As you know, using data from an outer scope inside e.g. a lambda creates a closure, but when that outer scope is not even known to be in the same CoroutineContext, then I imagine things get expensive. Maybe. Back when I was active I had taught myself to look properly, see the closures (the debugger sometimes helps too), and check whether one can reduce them to simpler, fewer or cheaper ones.

@Ambeco
Copy link
Copy Markdown
Contributor Author

Ambeco commented Apr 8, 2026

You say the yields are entirely unrelated, but consider sequence { while(true) yield(0) }. The ONLY way to implement SequenceScope.yield is via CoroutineScope.yield. The two share the same name because they are, indeed, the same thing. Additionally, it's SequenceScope, not SequenceBuilder, because it's the coroutine scope thats yielding, not just some method. As much as I love pedantic debates though, I think this is not key to the PR.

I also considered the closures, but my measurements weren't actually showing many closures. There's a few here and there, but not NEARLY so many as I had expected to find. Most of the memory churn (by bytes) was Sequences and Sequence Iterators, Array Iterators, HashMap Iterators, Object[], Long[], and HashMap.Entry. There's definitely a handful of closures, but not very many. I have a separate WIP that cleans up the most offensive closures.

@SomeTroglodyte
Copy link
Copy Markdown
Collaborator

I think this is not key to the PR

Of course, agreed, but... "pedantic debates" so far this wasn't, at least I thought comparing our views on mechanics is a kind of fun. Not as much as downhill biking through deep mud, but still...

it's SequenceScope, not SequenceBuilder,

Yup, mixed that up in part because the source for public abstract class SequenceScope is in a file named SequenceBuilder.kt

The ONLY way

Not convinced, but I'll file that away (gears grinding)

closures... not NEARLY so many as I had expected...

Excellent.

...WIP...

Even excellenter.

@Ambeco
Copy link
Copy Markdown
Contributor Author

Ambeco commented Apr 8, 2026

(allocation sample from head, without this change) image
Looks like when I sort by bytes allocated in this sample, then the worst offending closure is the 11th item.
@ byte[] 4962K
@ FlatteningSequence (kotlin.sequences) 4289K
@ ArrayList$Itr (java.util) 3573K
@ FlatteningSequence$iterator$1 (kotlin.sequences) 3594K
@ String (java.lang) 3214K
@ FilteringSequence (kotlin.sequences) 2401K
@ Object[] (java.lang) 1831K
@ FilteringSequence$iterator$1 (kotlin.sequences) 1579K
@ Stats (com.unciv.models.stats) 1381K
@ HashMap$Node (java.util) 1176K
@ Civilization$getMatchingUniques$1 (com.unciv.logic.civilization) 1076K

FlatteningSequence and FlatteningSequence$iterator (aka yieldAll) allocate more than 7 times as many bytes as the worst closure :(
And FilteringSequence and FilteringSequence$iterator$1 aren't far behind FlatteningSequence.

To be clear, I love pedantic debates 😄 But the relation between SequenceScope.yield and CoroutineContext.yield isn't really much of a factor in if this PR should be accepted or not.

@Ambeco
Copy link
Copy Markdown
Contributor Author

Ambeco commented Apr 8, 2026

Oh right I forgot that updating the commit description doesn't update here. I redid the timings:

GCs/turn dropped from 11.13 to 6.31, a 56% reduction.

Timing changes:

  • automateUnitMoves 23% faster
  • CityTurnManager.endTurn 21% faster
  • CityStats.update 18% faster
  • reassignPopulation 17% faster
  • ConstructionAutomation.choosenext 15% faster
  • updateTileStats 15% faster
  • CityPopManager.autoAssignPop 14% faster
  • GameInfo.updateCivState 14% faster
  • CityTurnManager.startTurn 12% faster
  • moveToTile 8% faster
  • MapUnit.updateVisibleTiles 10% slower (???)

Not stat sig:

  • GameInfo.nextTurn 4% faster

Notes:

  • Control called GameInfo.nextTurn 36 times after loading the file, and treatment only 26 times, so the games apparently differed wildly :( Removing some unseeded randomness #14737 should help reduce the differences, somewhat.
  • automateUnitMoves called 461 times in Control,  836 times in treatment.
  • MapUnit.updateVisibleTiles stderr increased from 16->29, so became much noisier. Did some GCs become moved inside this?

@github-actions
Copy link
Copy Markdown

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@yairm210
Copy link
Copy Markdown
Owner

Okay, this looks like a lot of changes, but the premise and the results are both good, and I trust your work enough at this point that I don't feel the need to go through everything with a comb :)

@yairm210
Copy link
Copy Markdown
Owner

I will wait to merge this in the next version so we can get #14714 out to not include too much change in one version

@github-actions
Copy link
Copy Markdown

Conflicts have been resolved.

@Ambeco
Copy link
Copy Markdown
Contributor Author

Ambeco commented Apr 28, 2026

I've been contemplating the doc Yairm wrote to prefer for loops for .forEach.
It occurs to me that if my theory is right, then we don't actually need new methods, we just need to replace the kotlin.sequence.sequence coroutine method with any other sequence type. MutableList.asSequence should work fine.

Maybe we should hold off on merging this until I test that theory. Maybe we don't have to deprecate the existing methods.

These avoid allocating a coroutine, sequence, and iterator. They avoid the expensive yieldAll code, improving performance, and reducing memory allocations, which especially helps on low-memory devices.
with .neighbors.  GCs/turn dropped from 11.13 to 6.31, a 56% reduction.

Timing changes:
- automateUnitMoves 23% faster
- CityTurnManager.endTurn 21% faster
- CityStats.update 18% faster
- reassignPopulation 17% faster
- ConstructionAutomation.choosenext 15% faster
- updateTileStats 15% faster
- CityPopManager.autoAssignPop 14% faster
- GameInfo.updateCivState 14% faster
- CityTurnManager.startTurn 12% faster
- moveToTile 8% faster
- MapUnit.updateVisibleTiles 10% slower (???)

Not stat sig:
- GameInfo.nextTurn 4% faster

Notes:
- Control called GameInfo.nextTurn 36 times after loading the file, and treatment only 26 times, so the games apparently differed wildly :(
- automateUnitMoves called 461 times in Control,  836 times in treatment.
- MapUnit.updateVisibleTiles stderr increased from 16->29, so became much noisier. Did some GCs become moved inside this?

https://docs.google.com/spreadsheets/d/1r-TC1N3wKhZMRJGU-ph5Fcjfu1u31MHfjFx6xM693Ss/edit?usp=sharing

Signed-off-by: MPD <mooing_psycho_duck@hotmail.com>
@Ambeco
Copy link
Copy Markdown
Contributor Author

Ambeco commented Apr 29, 2026

Curious. I made a variant where I replaced the sequence methods with a listSequence which evaluates the lambda immediately into a list, so there's no coroutines, but it had unexpected results
https://docs.google.com/spreadsheets/d/1NpVPqQXOsSxrEKUp3K9F-xP6s25-B3XGX4kzuYIsHQw/edit?gid=1346917422#gid=1346917422

119% more GCs, and most methods were slower, often by ~20%. Though UnitTurnManager.startTurn was ~15% faster. GameInfo.nextTurn was 8.8% slower, though not stat-sig.

So apparently the savings isn't the coroutines. Maybe its simply the reduced allocations?

But that means we're clear to merge this, so that's nice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants