
this usually triggers a deadlock... #2559

Open

stalep wants to merge 1 commit into Hyperfoil:master from stalep:deadlock_investigation

Conversation


stalep (Member) commented Oct 20, 2025

Not always, but usually this will trigger a deadlock locally.

2025-10-20 19:28:58,606 INFO  [io.hyp.too.hor.svc.NotificationServiceImpl] (executor-thread-1) Received new changes in test 10 (quarkus-spring-boot-comparison), dataset 12/2 (fingerprint: {"Runtime":"spring3","buildType":"native"})
2025-10-20 19:28:58,713 WARN  [org.hib.orm.jdb.error] (horreum.dataset.pool-10) HHH000247: ErrorCode: 0, SQLState: 40P01
2025-10-20 19:28:58,713 WARN  [org.hib.orm.jdb.error] (horreum.dataset.pool-10) ERROR: deadlock detected
  Detail: Process 74 waits for ShareLock on transaction 1881; blocked by process 68.
Process 68 waits for AccessExclusiveLock on tuple (6,6) of relation 16621 of database 16384; blocked by process 65.
Process 65 waits for ShareLock on transaction 1891; blocked by process 74.
  Hint: See server log for query details.
  Where: while deleting tuple (3,5) in relation "change"
2025-10-20 19:28:58,718 ERROR [io.sma.rea.mes.provider] (horreum.dataset.pool-10) SRMSG00200: The method io.hyperfoil.tools.horreum.svc.ServiceMediator#processDatasetEvents has thrown an exception: jakarta.persistence.PessimisticLockException: JDBC exception executing SQL [ERROR: deadlock detected
  Detail: Process 74 waits for ShareLock on transaction 1881; blocked by process 68.
Process 68 waits for AccessExclusiveLock on tuple (6,6) of relation 16621 of database 16384; blocked by process 65.
Process 65 waits for ShareLock on transaction 1891; blocked by process 74.
  Hint: See server log for query details.
  Where: while deleting tuple (3,5) in relation "change"] [DELETE FROM change cc
WHERE NOT cc.confirmed
  AND cc.variable_id = ?
  AND (cc.timestamp > ? OR (cc.timestamp = ? AND ?))
  AND EXISTS (
      SELECT 1 FROM fingerprint fp
      WHERE fp.dataset_id = cc.dataset_id
        AND fp.fp_hash = ?
  );
]
	at org.hibernate.internal.ExceptionConverterImpl.wrapLockException(ExceptionConverterImpl.java:227)
	at org.hibernate.internal.ExceptionConverterImpl.convert(ExceptionConverterImpl.java:85)
	at org.hibernate.internal.ExceptionConverterImpl.convert(ExceptionConverterImpl.java:155)
	at org.hibernate.query.spi.AbstractQuery.executeUpdate(AbstractQuery.java:647)
	at io.hyperfoil.tools.horreum.svc.AlertingServiceImpl.runChangeDetection(AlertingServiceImpl.java:658)


stalep (Member Author) commented Oct 20, 2025

With the latest commit I cannot trigger the deadlock (I think). @barreiro @lampajr

stalep force-pushed the deadlock_investigation branch from c5c96c5 to 0d04c3e on October 21, 2025 at 07:48
Comment on lines 728 to 730
validateUpTo(variableId, fpHash, nextTimestamp);
//assume not last datapoint if we have found more
tryRunChangeDetection(variableId, testId, fingerprint, fpHash, notify);
Member

I am not sure removing messageBus.executeForTest here is straightforward, because it heavily changes the behavior (see the sketch below):

  • Previous: on every dataset recalculation we enqueue a recursive call to tryRunChangeDetection via executeForTest, ensuring it runs after the dataset has been processed and committed (i.e., the datapoint is created). Moreover, executeForTest guarantees that only ONE processing will be executed for that test.
  • New: we are likely increasing the lifespan of the DB transaction because of the recursion, since we are no longer offloading that execution to a separate async worker. Moreover, calls will run concurrently, and it is not guaranteed that the datapoints have already been created when processing the others.
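
To make the contrast concrete, here is a minimal sketch of the two code paths being compared. Only the two inner calls (messageBus.executeForTest and tryRunChangeDetection) are taken from this PR; the enclosing methods and their signatures are invented for illustration.

// Previous (master): offload the recursive step to the message bus. It runs
// only after the current dataset transaction commits, and executeForTest
// serializes processing per test. (Enclosing method is hypothetical.)
void afterDatapointCreatedOld(int variableId, int testId, String fingerprint, String fpHash, boolean notify) {
    messageBus.executeForTest(testId,
            () -> tryRunChangeDetection(variableId, testId, fingerprint, fpHash, notify));
}

// This PR: recurse directly on the same thread, inside the same DB
// transaction, so the transaction stays open for the whole recursion and
// calls for the same test may overlap.
void afterDatapointCreatedNew(int variableId, int testId, String fingerprint, String fpHash, boolean notify) {
    tryRunChangeDetection(variableId, testId, fingerprint, fpHash, notify);
}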

messageBus.executeForTest(testId, () -> {
    startRecalculation(testId, notify, debug, clearDatapoints == null || clearDatapoints, from, to);
});
startRecalculation(testId, notify, debug, clearDatapoints == null || clearDatapoints, from, to);
Member

Removing executeForTest here won't guarantee that we execute only one recalculation at a time, which was the previous behavior. Moreover, the current expectation is that this method returns immediately and does not block (as this might be a very long process); its status is monitored by the UI using another endpoint. A sketch of that expectation follows.
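
As an illustration of that expectation, here is a minimal sketch: the trigger returns immediately, a single worker runs at most one recalculation at a time, and a separate status lookup is what the UI would poll. All names here (PerTestRecalc, RecalcStatus, the executor) are hypothetical, not Horreum's actual API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PerTestRecalc {
    record RecalcStatus(int done, int total, boolean finished) {}

    private final Map<Integer, RecalcStatus> statusByTest = new ConcurrentHashMap<>();
    // Single worker thread: at most one recalculation runs at a time.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Trigger endpoint: returns immediately, never blocks on the work.
    void startRecalculation(int testId) {
        statusByTest.put(testId, new RecalcStatus(0, 0, false));
        worker.submit(() -> recalculate(testId));
    }

    // Status endpoint polled by the UI.
    RecalcStatus status(int testId) {
        return statusByTest.getOrDefault(testId, new RecalcStatus(0, 0, true));
    }

    private void recalculate(int testId) {
        // ... long-running work, updating statusByTest as datapoints are processed ...
        statusByTest.put(testId, new RecalcStatus(1, 1, true));
    }
}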

Member Author

Regarding one calculation: if we're doing more than one per testId, something else is wrong, since that should not happen AFAIK.
Regarding it being a long-running process: yes, that might be something we should consider, but I would rather we use the same approach for this (using AMQ) as we do for the other offloading; a hedged sketch of that follows.
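
For what "using AMQ as we do for the other offloading" could look like, here is a hedged sketch with SmallRye Reactive Messaging (the SRMSG log lines above show Horreum already uses it). The channel name "recalc-requests", the class, and the payload type are assumptions for illustration, not the actual Horreum wiring.

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.eclipse.microprofile.reactive.messaging.Channel;
import org.eclipse.microprofile.reactive.messaging.Emitter;
import org.eclipse.microprofile.reactive.messaging.Incoming;

@ApplicationScoped
class RecalcOffloadSketch {
    @Inject
    @Channel("recalc-requests")
    Emitter<Integer> recalcEmitter;

    // Producer side: enqueue the testId instead of calling vertx APIs directly.
    void requestRecalculation(int testId) {
        recalcEmitter.send(testId);
    }

    // Consumer side: executed asynchronously by a messaging worker, outside
    // the producer's transaction.
    @Incoming("recalc-requests")
    void processRecalculation(int testId) {
        // ... run the recalculation for testId ...
    }
}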

stalep force-pushed the deadlock_investigation branch from 69d5180 to 227ad9b on October 23, 2025 at 12:40
refactored calculateDatapoints and runChangeDetection
removed calls to executeBlocking, only one usage of messageBus remaining
stalep force-pushed the deadlock_investigation branch from 227ad9b to af4f325 on October 23, 2025 at 15:19
stalep marked this pull request as ready for review on October 23, 2025 at 15:19
lampajr self-requested a review on October 24, 2025 at 08:37
lampajr (Member) left a comment

Hi @stalep,

I did another quick round, and from a correctness standpoint it seems to work as expected, but there are a couple of concerns:

  • If you trigger the recalculation in the Change Detection tab, you will see the progress bar does not advance (I believe because progress is not properly tracked in the backend).
[screenshot: progress bar in the Change Detection tab not advancing]
  • The process of recalculating datasets (consuming messages) seems much much slower. This is my biggest concern; I am trying some tests locally with a prod db to compare before/after to get some broad numbers.


lampajr (Member) commented Oct 24, 2025

  • The process of recalculating datasets (consuming messages) seems much much slower. This is my biggest concern; I am trying some tests locally with a prod db to compare before/after to get some broad numbers.

I can confirm this:

  • Current prod (master) implementation: the messages are processed really fast, i.e., the datapoints are created really fast. Then the change detection is offloaded to a different worker that processes all datapoints and eventually computes all changes (I believe this is acceptable, as change identification is async by nature).

  • This PR: all messages are consumed really slowly, i.e., the datapoints are also processed really slowly. This means there is a quite long period during which the datapoints you see in the charts are wrong/old, until all messages are processed. Moreover, with this approach we are doing "in sync" something that might be redone after some time, so we risk doing completely useless blocking work, which can further slow down identifying the changes.

Overall, just looking at how fast the changes are identified, the current master seems much faster.
As a test I am deleting changes with:
delete from change c where dataset_id in (select id from dataset d where d.testid = 339);
and then checking with:
select count(*) from change c where dataset_id in (select id from dataset d where d.testid = 339);

The main change that introduced this difference seems to be the removal of the executeForTest recursive call.


stalep (Member Author) commented Oct 24, 2025

Yeah, I did notice some slowdown as well, thanks for confirming. Before I had the insight I have now, I wanted to make sure we did it "correctly", as I thought the deadlock could be related to this. However, I now know it's related to the change calculation and not a problem with doing the datapoint calculations in parallel.
I would, however, like it if we could code this without calling vertx.blabla directly (sorry, in a rush, we can discuss more later).


lampajr (Member) commented Oct 24, 2025

However, I now know it's related to the change calculation and not a problem with doing the datapoint calculations in parallel.

Exactly. Removing the vertx.blabla calls (especially messageBus.executeForTest(testId, () -> tryRunChangeDetection(variableId, testId, fingerprint, fpHash, notify));) means the recursive change detection logic is now processed in sync with the datapoints, i.e., the logic cannot proceed to create the next datapoint until the full change detection has completed for the previous one. A sketch of the serialization that executeForTest provided follows below.
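
As a purely illustrative sketch (invented names, assuming only the per-test serialization semantics described earlier in this thread), executeForTest can be thought of as chaining tasks per test, so each task runs asynchronously and strictly after the previous one for the same test:

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class PerTestSerializerSketch {
    private final Map<Integer, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();

    // Chain the task after the previous one for the same testId: it never runs
    // inline on the caller's thread (or inside its transaction), and never
    // concurrently with another task for the same test.
    void executeForTest(int testId, Runnable task) {
        tails.compute(testId, (id, tail) ->
                (tail == null ? CompletableFuture.<Void>completedFuture(null) : tail)
                        .thenRunAsync(task));
    }
}

The real implementation presumably also defers execution until after the enclosing transaction commits, as described earlier in this thread, which is exactly the property the PR loses by calling tryRunChangeDetection inline.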

sorry, in a rush, we can discuss more later

sure thing!

barreiro mentioned this pull request on Nov 7, 2025