From 29006aa17082e091ae4fe0d098b1467589683734 Mon Sep 17 00:00:00 2001
From: Tom Gabsow
Date: Tue, 26 May 2026 10:53:12 +0300
Subject: [PATCH 1/5] MOD-15862 Bound SSL tests with per-test + step timeouts to surface CI hang

Linux CI runs were getting cancelled at the 6h job cap because some test
under the new `--tls` (single-shard) invocation of `make test_ssl` blocks
indefinitely. RLTest was being invoked with `--no-progress` and the
default `--test-timeout 0`, so neither the test name nor a hang signal
made it into the log.

- Drop `--no-progress` from run_tests.sh and add `--test-timeout 120`, so
  any single test that hangs gets killed at 2 min and shows up by name.
- Add `timeout-minutes: 45` to the Linux job and `timeout-minutes: 25` to
  the SSL step so a wedged run fails fast instead of burning 6h.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/workflows/linux.yml               | 2 ++
 tests/mr_test_module/pytests/run_tests.sh | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/linux.yml b/.github/workflows/linux.yml
index 1a95545..efdaadc 100644
--- a/.github/workflows/linux.yml
+++ b/.github/workflows/linux.yml
@@ -13,6 +13,7 @@ jobs:
   build:
     runs-on: ubuntu-latest
+    timeout-minutes: 45
     strategy:
       fail-fast: false
@@ -60,6 +61,7 @@ jobs:
         env:
           PYTHON: python
       - name: SSL tests
+        timeout-minutes: 25
         run: make run_tests_ssl
         env:
           PYTHON: python
diff --git a/tests/mr_test_module/pytests/run_tests.sh b/tests/mr_test_module/pytests/run_tests.sh
index ca1cb05..bc52201 100755
--- a/tests/mr_test_module/pytests/run_tests.sh
+++ b/tests/mr_test_module/pytests/run_tests.sh
@@ -19,4 +19,4 @@ else
 fi
-"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command
+"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --test-timeout 120 --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command

From d9b3ed00fb9dd430163656c9f06b98f90849c36b Mon Sep 17 00:00:00 2001
From: Tom Gabsow
Date: Tue, 26 May 2026 11:08:32 +0300
Subject: [PATCH 2/5] MOD-15862 Scope SSL diagnostic flags to Linux only
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous attempt set --test-timeout via run_tests.sh, which is shared
between Linux and macOS workflows, so macOS Default tests began hitting
the cap and failing even though they were passing in the past.

Move the diagnostics into env vars consumed by run_tests.sh only when the
Linux SSL step sets them:

- run_tests.sh: restore the original --no-progress, but interpolate an
  optional RLTEST_EXTRA_ARGS env var (empty by default -- no behaviour
  change for any consumer that does not opt in).
- linux.yml SSL step: PYTHONUNBUFFERED=1 so test names flush in real time
  (CI stdout has no TTY, so otherwise prints were batching at end and the
  hung test name never made it out), plus
  RLTEST_EXTRA_ARGS="--test-timeout 180" so an individual hung test fails
  fast and is named in the log.

macOS workflow is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/workflows/linux.yml               | 2 ++
 tests/mr_test_module/pytests/run_tests.sh | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/linux.yml b/.github/workflows/linux.yml
index efdaadc..d75a182 100644
--- a/.github/workflows/linux.yml
+++ b/.github/workflows/linux.yml
@@ -65,6 +65,8 @@ jobs:
         run: make run_tests_ssl
         env:
           PYTHON: python
+          PYTHONUNBUFFERED: "1"
+          RLTEST_EXTRA_ARGS: "--test-timeout 180"
       - name: Valgrind tests
         run: make run_tests_valgrind
         env:
diff --git a/tests/mr_test_module/pytests/run_tests.sh b/tests/mr_test_module/pytests/run_tests.sh
index bc52201..2c84565 100755
--- a/tests/mr_test_module/pytests/run_tests.sh
+++ b/tests/mr_test_module/pytests/run_tests.sh
@@ -19,4 +19,4 @@ else
 fi
-"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --test-timeout 120 --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command
+"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress ${RLTEST_EXTRA_ARGS:-} --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command

From 175346e4f3b105bd85c54f0b0f243b509caad5ea Mon Sep 17 00:00:00 2001
From: Tom Gabsow
Date: Tue, 26 May 2026 11:46:31 +0300
Subject: [PATCH 3/5] MOD-15862 Make testSendRetriesMechanizm tolerant of TLS retry-count

Root cause of the Linux SSL hang: testSendRetriesMechanizm hardcoded
three `-Err` exchanges, but under TLS only TWO INNERCOMMUNICATION sends
actually go out before libmr gives up.

Why the count differs by TLS:

- Non-TLS: NETWORKTEST runs after the node reaches NodeStatus_Connected,
  so MR_ClusterSendMsgToNode sends INNERCOMMUNICATION synchronously
  (retries stays 0). The first `-Err` triggers a disconnect, after which
  MR_HelloResponseArrived re-sends from pendingMessages, incrementing
  retries to 1, 2, then 3. At retries==MSG_MAX_RETRIES (=3) libmr logs
  `Gave up of message...`.
  Total: 1 initial + 2 resends = 3 sends.
- TLS: the TLS+AUTH+HELLO handshake takes longer; NETWORKTEST runs while
  the node status is still NodeStatus_HelloSent. The "message was not
  sent because status is not connected" path queues the msg in
  pendingMessages and the actual first send happens via the resend loop
  in MR_HelloResponseArrived -- which DOES increment retries. So the
  initial send burns retry #1, leaving only one further resend before
  give-up. Total: 2 sends.

The old test waited unboundedly for a fourth GetConnection on Linux TLS
and hung until the 6h job cap. The single-shard `--tls` invocation that
2857cfe added is what surfaced this, since this test has
skipOnCluster=True and was previously only run in cluster mode under TLS
(where it was skipped entirely).

Rewrite the test to express the actual invariant -- libmr sends the
message between 1 and MSG_MAX_RETRIES times, then stops -- with bounded
read/connection waits so it cannot hang regardless of where the
retry-count boundary lands.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 tests/mr_test_module/pytests/test_network.py | 70 ++++++++++----------
 1 file changed, 36 insertions(+), 34 deletions(-)

diff --git a/tests/mr_test_module/pytests/test_network.py b/tests/mr_test_module/pytests/test_network.py
index a4bc3d9..cfd9634 100644
--- a/tests/mr_test_module/pytests/test_network.py
+++ b/tests/mr_test_module/pytests/test_network.py
@@ -431,47 +431,49 @@ def testMessageNotResentAfterCrash(env, conn):
 @MRTestDecorator(skipOnCluster=True)
 def testSendRetriesMechanizm(env, conn):
+    # MSG_MAX_RETRIES in src/cluster.c.
+    MSG_MAX_RETRIES = 3
+    expected_msg = ['MRTESTS.INNERCOMMUNICATION', '0000000000000000000000000000000000000001',
+                    None, '0', 'test msg', '0']
     for host in _get_hosts():
         with ShardMock(env, host) as shardMock:
+            expected_msg[2] = shardMock.runId
             conn = shardMock.GetConnection()
             env.expect('MRTESTS.NETWORKTEST').equal('OK')
-            env.assertEqual(conn.read_request(), ['MRTESTS.INNERCOMMUNICATION', '0000000000000000000000000000000000000001', shardMock.runId, '0', 'test msg', '0'])
-
-            conn.send('-Err\r\n')
-
-            env.assertTrue(conn.is_close())
-
-            # should be a retry
-
-            conn = shardMock.GetConnection()
-
-            env.assertEqual(conn.read_request(), ['MRTESTS.INNERCOMMUNICATION', '0000000000000000000000000000000000000001', shardMock.runId, '0', 'test msg', '0'])
-
-            conn.send('-Err\r\n')
-
-            env.assertTrue(conn.is_close())
-
-            # should be a retry
-
-            conn = shardMock.GetConnection()
-
-            env.assertEqual(conn.read_request(), ['MRTESTS.INNERCOMMUNICATION', '0000000000000000000000000000000000000001', shardMock.runId, '0', 'test msg', '0'])
-
-            conn.send('-Err\r\n')
-
-            env.assertTrue(conn.is_close())
-
-            # should not retry
-
-            conn = shardMock.GetConnection()
-
-            # make sure message will not be sent again
+            # libmr should resend INNERCOMMUNICATION up to MSG_MAX_RETRIES times.
+            # Whether the initial send counts as one of those retries depends on
+            # whether the node has reached NodeStatus_Connected by the time
+            # NETWORKTEST runs -- under TLS the HELLO handshake is slower, so the
+            # initial send is queued and goes through the resend loop, which
+            # counts as a retry. We therefore accept any count in [1, MSG_MAX_RETRIES].
+            attempts = 0
+            for _ in range(MSG_MAX_RETRIES + 2):
+                try:
+                    with TimeLimit(3):
+                        req = conn.read_request()
+                except Exception:
+                    break  # libmr stopped sending -- gave up
+                env.assertEqual(req, expected_msg)
+                attempts += 1
+                conn.send('-Err\r\n')
+                # libmr will disconnect on error and may reconnect to retry.
+                try:
+                    with TimeLimit(3):
+                        conn = shardMock.GetConnection()
+                except Exception:
+                    break  # no reconnect -- gave up
+
+            env.assertGreaterEqual(attempts, 1, message='libmr did not send the initial message')
+            env.assertLessEqual(attempts, MSG_MAX_RETRIES,
+                                message='libmr exceeded MSG_MAX_RETRIES (=%d) sends' % MSG_MAX_RETRIES)
+
+            # After giving up, libmr must not reconnect to retry the same msg.
             try:
-                with TimeLimit(1):
-                    conn.read_request()
-                    env.assertTrue(False) # we should not get any data after crash
+                with TimeLimit(2):
+                    shardMock.GetConnection()
+                    env.assertTrue(False, message='Unexpected reconnect after MSG_MAX_RETRIES')
             except Exception:
                 pass

From 27b00e1488c725d003b0663b6c45d9b1579a701b Mon Sep 17 00:00:00 2001
From: Tom Gabsow
Date: Tue, 26 May 2026 12:38:31 +0300
Subject: [PATCH 4/5] MOD-15862 Bump per-test timeout to 300s for slow 7.2 cluster TLS tests

testSendRetriesMechanizm hang is now fixed, but the SSL run on 7.2 with
--env oss-cluster --shards-count 3 surfaced a separate pre-existing
flake: testUnevenWork can take >180s on that runner combination (it has
an internal TimeLimit(2) but the SIGALRM evidently isn't interrupting the
redis-py TLS handshake reliably in this configuration). On 7.4 and
unstable the same invocation completes in ~144s, so 180s was just too
tight.

Bump to 300s. This will be tracked separately -- the actual fix likely
belongs in test_basic.py / redis-py connection setup rather than the
workflow timeout.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/workflows/linux.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/linux.yml b/.github/workflows/linux.yml
index d75a182..0cfa08d 100644
--- a/.github/workflows/linux.yml
+++ b/.github/workflows/linux.yml
@@ -66,7 +66,7 @@ jobs:
         env:
           PYTHON: python
           PYTHONUNBUFFERED: "1"
-          RLTEST_EXTRA_ARGS: "--test-timeout 180"
+          RLTEST_EXTRA_ARGS: "--test-timeout 300"
       - name: Valgrind tests
         run: make run_tests_valgrind
         env:

From 09446a25af6ff1e234f39eb91e8c1067004f4549 Mon Sep 17 00:00:00 2001
From: Tom Gabsow
Date: Wed, 27 May 2026 11:25:50 +0300
Subject: [PATCH 5/5] MOD-15862 Add timeout-minutes to macOS workflow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This PR's CI run hit a 6h cancel on macOS 7.4 SSL -- a runner-side flake
(other macOS jobs in the same run passed in under 8 minutes). The macOS
workflow had no timeout-minutes set, so a single hung test could burn the
full job cap.

Mirror the Linux workflow: 45min job cap + 25min SSL step cap. macOS
keeps the default --test-timeout (no per-test cap) since it isn't where
the original MOD-15862 hang occurred and we don't want to risk regressing
the runs that currently pass.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/workflows/macos.yml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/.github/workflows/macos.yml b/.github/workflows/macos.yml
index 1e60b30..2e4c1db 100644
--- a/.github/workflows/macos.yml
+++ b/.github/workflows/macos.yml
@@ -15,6 +15,7 @@ jobs:
   build:
     runs-on: macos-latest
+    timeout-minutes: 45
     strategy:
       fail-fast: false
@@ -51,6 +52,7 @@ jobs:
         env:
           PYTHON: python
       - name: SSL tests
+        timeout-minutes: 25
         run: make run_tests_ssl
         env:
           PYTHON: python
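
Note on patch 3/5: the retry accounting described in its commit message
can be checked with a small model. This is a minimal sketch, not libmr
code. `count_sends` and its `connected_at_send` parameter are
hypothetical names introduced for illustration, and the mock shard is
assumed to answer every send with `-Err`. It assumes, as the message
describes, that a synchronous initial send leaves retries at 0, that
every pass through the resend loop increments retries, and that libmr
gives up once retries reaches MSG_MAX_RETRIES.

```python
MSG_MAX_RETRIES = 3  # value quoted in the patch 3/5 commit message


def count_sends(connected_at_send: bool, max_retries: int = MSG_MAX_RETRIES) -> int:
    """Count how many times the message reaches the wire in this model.

    connected_at_send=True models the non-TLS case (node already connected
    when NETWORKTEST runs); False models the TLS case (message is queued
    and first sent by the resend loop).
    """
    sends = 0
    retries = 0

    if connected_at_send:
        # Synchronous initial send: does not touch the retry counter.
        sends += 1

    # Every later attempt goes through the resend loop. The mock always
    # replies with an error, so the loop runs until the give-up check fires.
    while True:
        retries += 1
        if retries >= max_retries:
            break  # modelled give-up: no further send
        sends += 1

    return sends


print(count_sends(True))   # non-TLS: 1 initial + 2 resends -> 3
print(count_sends(False))  # TLS: first send already counts as a retry -> 2
```

Both results fall inside the [1, MSG_MAX_RETRIES] window that the
rewritten test accepts, which is why the test no longer depends on
which side of the handshake NETWORKTEST lands on.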