
this usually triggers a deadlock... #2559

Open

stalep wants to merge 1 commit into Hyperfoil:master from stalep:deadlock_investigation

Conversation


stalep (Member) commented Oct 20, 2025

Not always, but usually this will trigger a deadlock locally.

2025-10-20 19:28:58,606 INFO  [io.hyp.too.hor.svc.NotificationServiceImpl] (executor-thread-1) Received new changes in test 10 (quarkus-spring-boot-comparison), dataset 12/2 (fingerprint: {"Runtime":"spring3","buildType":"native"})
2025-10-20 19:28:58,713 WARN  [org.hib.orm.jdb.error] (horreum.dataset.pool-10) HHH000247: ErrorCode: 0, SQLState: 40P01
2025-10-20 19:28:58,713 WARN  [org.hib.orm.jdb.error] (horreum.dataset.pool-10) ERROR: deadlock detected
  Detail: Process 74 waits for ShareLock on transaction 1881; blocked by process 68.
Process 68 waits for AccessExclusiveLock on tuple (6,6) of relation 16621 of database 16384; blocked by process 65.
Process 65 waits for ShareLock on transaction 1891; blocked by process 74.
  Hint: See server log for query details.
  Where: while deleting tuple (3,5) in relation "change"
2025-10-20 19:28:58,718 ERROR [io.sma.rea.mes.provider] (horreum.dataset.pool-10) SRMSG00200: The method io.hyperfoil.tools.horreum.svc.ServiceMediator#processDatasetEvents has thrown an exception: jakarta.persistence.PessimisticLockException: JDBC exception executing SQL [ERROR: deadlock detected
  Detail: Process 74 waits for ShareLock on transaction 1881; blocked by process 68.
Process 68 waits for AccessExclusiveLock on tuple (6,6) of relation 16621 of database 16384; blocked by process 65.
Process 65 waits for ShareLock on transaction 1891; blocked by process 74.
  Hint: See server log for query details.
  Where: while deleting tuple (3,5) in relation "change"] [DELETE FROM change cc
WHERE NOT cc.confirmed
  AND cc.variable_id = ?
  AND (cc.timestamp > ? OR (cc.timestamp = ? AND ?))
  AND EXISTS (
      SELECT 1 FROM fingerprint fp
      WHERE fp.dataset_id = cc.dataset_id
        AND fp.fp_hash = ?
  );
]
	at org.hibernate.internal.ExceptionConverterImpl.wrapLockException(ExceptionConverterImpl.java:227)
	at org.hibernate.internal.ExceptionConverterImpl.convert(ExceptionConverterImpl.java:85)
	at org.hibernate.internal.ExceptionConverterImpl.convert(ExceptionConverterImpl.java:155)
	at org.hibernate.query.spi.AbstractQuery.executeUpdate(AbstractQuery.java:647)
	at io.hyperfoil.tools.horreum.svc.AlertingServiceImpl.runChangeDetection(AlertingServiceImpl.java:658)


stalep (Member Author) commented Oct 20, 2025

With the latest commit I cannot trigger the deadlock (I think). @barreiro @lampajr

stalep force-pushed the deadlock_investigation branch from c5c96c5 to 0d04c3e on October 21, 2025 at 07:48
Comment on lines 728 to 730
validateUpTo(variableId, fpHash, nextTimestamp);
//assume not last datapoint if we have found more
tryRunChangeDetection(variableId, testId, fingerprint, fpHash, notify);
Member

I am not sure removing messageBus.executeForTest here is straightforward, because it heavily changes the behavior (see the sketch below):

  • Previous: on every dataset recalculation we enqueue a recursive call to tryRunChangeDetection via executeForTest, ensuring it runs after the dataset has been processed and committed (i.e., the datapoint is created). Moreover, executeForTest guarantees that only ONE processing will be executed for that test.
  • New: we are likely increasing the lifespan of the DB transaction because of the recursion, since we are no longer offloading that execution to a separate async worker. Moreover, calls will run concurrently, and it is not guaranteed that the datapoints have already been created when processing the others.
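
To make the contrast concrete, here is a minimal sketch of the two code paths being compared. Only the two inner calls (messageBus.executeForTest and tryRunChangeDetection) are taken from this PR; the enclosing methods and their signatures are invented for illustration.

// Previous (master): offload the recursive step to the message bus. It runs
// only after the current dataset transaction commits, and executeForTest
// serializes processing per test. (Enclosing method is hypothetical.)
void afterDatapointCreatedOld(int variableId, int testId, String fingerprint, String fpHash, boolean notify) {
    messageBus.executeForTest(testId,
            () -> tryRunChangeDetection(variableId, testId, fingerprint, fpHash, notify));
}

// This PR: recurse directly on the same thread, inside the same DB
// transaction, so the transaction stays open for the whole recursion and
// calls for the same test may overlap.
void afterDatapointCreatedNew(int variableId, int testId, String fingerprint, String fpHash, boolean notify) {
    tryRunChangeDetection(variableId, testId, fingerprint, fpHash, notify);
}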

messageBus.executeForTest(testId, () -> {
    startRecalculation(testId, notify, debug, clearDatapoints == null || clearDatapoints, from, to);
});
startRecalculation(testId, notify, debug, clearDatapoints == null || clearDatapoints, from, to);
Member

Removing executeForTest here won't guarantee that we execute only one recalculation at a time, which was the previous behavior. Moreover, the current expectation is that this method returns immediately and does not block (as this might be a very long process); its status is monitored by the UI using another endpoint. A sketch of that expectation follows.
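
As an illustration of that expectation, here is a minimal sketch: the trigger returns immediately, a single worker runs at most one recalculation at a time, and a separate status lookup is what the UI would poll. All names here (PerTestRecalc, RecalcStatus, the executor) are hypothetical, not Horreum's actual API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PerTestRecalc {
    record RecalcStatus(int done, int total, boolean finished) {}

    private final Map<Integer, RecalcStatus> statusByTest = new ConcurrentHashMap<>();
    // Single worker thread: at most one recalculation runs at a time.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Trigger endpoint: returns immediately, never blocks on the work.
    void startRecalculation(int testId) {
        statusByTest.put(testId, new RecalcStatus(0, 0, false));
        worker.submit(() -> recalculate(testId));
    }

    // Status endpoint polled by the UI.
    RecalcStatus status(int testId) {
        return statusByTest.getOrDefault(testId, new RecalcStatus(0, 0, true));
    }

    private void recalculate(int testId) {
        // ... long-running work, updating statusByTest as datapoints are processed ...
        statusByTest.put(testId, new RecalcStatus(1, 1, true));
    }
}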

Member Author

Regarding one calculation: if we're doing more than one per testId, something else is wrong, since that should not happen AFAIK.
Regarding it being a long-running process: yes, that might be something we should consider, but I would rather we use the same approach for this (using AMQ) as we do for the other offloading; a hedged sketch of that follows.
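
For what "using AMQ as we do for the other offloading" could look like, here is a hedged sketch with SmallRye Reactive Messaging (the SRMSG log lines above show Horreum already uses it). The channel name "recalc-requests", the class, and the payload type are assumptions for illustration, not the actual Horreum wiring.

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.eclipse.microprofile.reactive.messaging.Channel;
import org.eclipse.microprofile.reactive.messaging.Emitter;
import org.eclipse.microprofile.reactive.messaging.Incoming;

@ApplicationScoped
class RecalcOffloadSketch {
    @Inject
    @Channel("recalc-requests")
    Emitter<Integer> recalcEmitter;

    // Producer side: enqueue the testId instead of calling vertx APIs directly.
    void requestRecalculation(int testId) {
        recalcEmitter.send(testId);
    }

    // Consumer side: executed asynchronously by a messaging worker, outside
    // the producer's transaction.
    @Incoming("recalc-requests")
    void processRecalculation(int testId) {
        // ... run the recalculation for testId ...
    }
}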

stalep force-pushed the deadlock_investigation branch from 69d5180 to 227ad9b on October 23, 2025 at 12:40
refactored calculateDatapoints and runChangeDetection
removed calls to executeBlocking, only one usage of messageBus remaining
stalep force-pushed the deadlock_investigation branch from 227ad9b to af4f325 on October 23, 2025 at 15:19
stalep marked this pull request as ready for review on October 23, 2025 at 15:19
lampajr self-requested a review on October 24, 2025 at 08:37
lampajr (Member) left a comment

Hi @stalep,

I did another quick round, and from a correctness standpoint it seems to work as expected, but there are a couple of concerns:

  • If you trigger the recalculation in the Change Detection tab, you will see the progress bar does not advance (I believe because progress is not properly tracked in the backend).
[screenshot: progress bar in the Change Detection tab not advancing]
  • The process of recalculating datasets (consuming messages) seems much much slower. This is my biggest concern; I am trying some tests locally with a prod db to compare before/after to get some broad numbers.


lampajr (Member) commented Oct 24, 2025

  • The process of recalculating datasets (consuming messages) seems much much slower. This is my biggest concern; I am trying some tests locally with a prod db to compare before/after to get some broad numbers.

I can confirm this:

  • Current prod (master) implementation: the messages are processed really fast, i.e., the datapoints are created really fast. Then the change detection is offloaded to a different worker that processes all datapoints and eventually computes all changes (I believe this is acceptable, as change identification is async by nature).

  • This PR: all messages are consumed really slowly, i.e., the datapoints are also processed really slowly. This means there is a quite long period during which the datapoints you see in the charts are wrong/old, until all messages are processed. Moreover, with this approach we are doing "in sync" something that might be redone after some time, so we risk doing completely useless blocking work, which can further slow down identifying the changes.

Overall, just looking at how fast the changes are identified, the current master seems much faster.
As a test I am deleting changes with:
delete from change c where dataset_id in (select id from dataset d where d.testid = 339);
and then checking with:
select count(*) from change c where dataset_id in (select id from dataset d where d.testid = 339);

The main change that introduced this difference seems to be the removal of the executeForTest recursive call.


stalep (Member Author) commented Oct 24, 2025

Yeah, I did notice some slowdown as well, thanks for confirming. Before I had the insight I have now, I wanted to make sure we did it "correctly", as I thought the deadlock could be related to this. However, I now know it's related to the change calculation and not a problem with doing the datapoint calculations in parallel.
I would, however, like it if we could code this without calling vertx.blabla directly (sorry, in a rush, we can discuss more later).


lampajr (Member) commented Oct 24, 2025

However, I now know it's related to the change calculation and not a problem with doing the datapoint calculations in parallel.

Exactly. Removing the vertx.blabla calls (especially messageBus.executeForTest(testId, () -> tryRunChangeDetection(variableId, testId, fingerprint, fpHash, notify));) means the recursive change detection logic is now processed in sync with the datapoints, i.e., the logic cannot proceed to create the next datapoint until the full change detection has completed for the previous one. A sketch of the serialization that executeForTest provided follows below.
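
As a purely illustrative sketch (invented names, assuming only the per-test serialization semantics described earlier in this thread), executeForTest can be thought of as chaining tasks per test, so each task runs asynchronously and strictly after the previous one for the same test:

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class PerTestSerializerSketch {
    private final Map<Integer, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();

    // Chain the task after the previous one for the same testId: it never runs
    // inline on the caller's thread (or inside its transaction), and never
    // concurrently with another task for the same test.
    void executeForTest(int testId, Runnable task) {
        tails.compute(testId, (id, tail) ->
                (tail == null ? CompletableFuture.<Void>completedFuture(null) : tail)
                        .thenRunAsync(task));
    }
}

The real implementation presumably also defers execution until after the enclosing transaction commits, as described earlier in this thread, which is exactly the property the PR loses by calling tryRunChangeDetection inline.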

sorry, in a rush, we can discuss more later

sure thing!

barreiro mentioned this pull request on Nov 7, 2025