From 29006aa17082e091ae4fe0d098b1467589683734 Mon Sep 17 00:00:00 2001
From: Tom Gabsow <gabsow.tom@gmail.com>
Date: Tue, 26 May 2026 10:53:12 +0300
Subject: [PATCH 1/5] MOD-15862 Bound SSL tests with per-test + step timeouts
 to surface CI hang

Linux CI runs were being cancelled at the 6h job cap because a test in
the new `--tls` (single-shard) invocation of `make run_tests_ssl`
blocks indefinitely. RLTest was invoked with `--no-progress` and the
default `--test-timeout 0`, so neither the hung test's name nor any
hang signal reached the log.

- Drop `--no-progress` from run_tests.sh and add `--test-timeout 120`, so
  any single test that hangs gets killed at 2 min and shows up by name.
- Add `timeout-minutes: 45` to the Linux job and `timeout-minutes: 25` to
  the SSL step so a wedged run fails fast instead of burning 6h.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .github/workflows/linux.yml               | 2 ++
 tests/mr_test_module/pytests/run_tests.sh | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/linux.yml b/.github/workflows/linux.yml
index 1a95545..efdaadc 100644
--- a/.github/workflows/linux.yml
+++ b/.github/workflows/linux.yml
@@ -13,6 +13,7 @@ jobs:
   build:
 
     runs-on: ubuntu-latest
+    timeout-minutes: 45
 
     strategy:
       fail-fast: false
@@ -60,6 +61,7 @@ jobs:
       env:
         PYTHON: python
     - name: SSL tests
+      timeout-minutes: 25
       run: make run_tests_ssl
       env:
         PYTHON: python
diff --git a/tests/mr_test_module/pytests/run_tests.sh b/tests/mr_test_module/pytests/run_tests.sh
index ca1cb05..bc52201 100755
--- a/tests/mr_test_module/pytests/run_tests.sh
+++ b/tests/mr_test_module/pytests/run_tests.sh
@@ -19,4 +19,4 @@ else
 fi
 
 
-"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command
+"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --test-timeout 120 --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command

From d9b3ed00fb9dd430163656c9f06b98f90849c36b Mon Sep 17 00:00:00 2001
From: Tom Gabsow <gabsow.tom@gmail.com>
Date: Tue, 26 May 2026 11:08:32 +0300
Subject: [PATCH 2/5] MOD-15862 Scope SSL diagnostic flags to Linux only
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous commit set --test-timeout in run_tests.sh, which the Linux
and macOS workflows share, so macOS Default tests started hitting the
120s cap and failing even though they had passed before.

Move the diagnostics into env vars consumed by run_tests.sh only when
the Linux SSL step sets them:

- run_tests.sh: restore the original --no-progress, but interpolate an
  optional RLTEST_EXTRA_ARGS env var (empty by default — no behaviour
  change for any consumer that does not opt in).
- linux.yml SSL step: PYTHONUNBUFFERED=1 so test names flush in real
  time (CI stdout is not a TTY, so output was otherwise held in a
  buffer and the hung test's name never reached the log), plus
  RLTEST_EXTRA_ARGS="--test-timeout 180" so an individual hung test
  fails fast and is named in the log.
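
For illustration only (not part of this change), a minimal Python
sketch of the buffering effect described above. The child script and
the test name `testHypotheticalHang` are made up, and the sketch
assumes a POSIX runner where select() works on pipes.

```python
import os
import select
import subprocess
import sys

# Hypothetical child: prints a test name, then "hangs" like the stuck test.
CHILD = "import time; print('testHypotheticalHang'); time.sleep(30)"


def first_output(unbuffered, wait=3.0):
    """Return what the parent can read within `wait` seconds of startup."""
    env = dict(os.environ)
    env.pop("PYTHONUNBUFFERED", None)  # start from a known state
    if unbuffered:
        env["PYTHONUNBUFFERED"] = "1"
    # stdout is a pipe, not a TTY, which is the situation in CI.
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, env=env)
    try:
        ready, _, _ = select.select([proc.stdout], [], [], wait)
        if not ready:
            return ""  # nothing left the child's buffer in time
        return os.read(proc.stdout.fileno(), 4096).decode()
    finally:
        proc.kill()  # the child never exits on its own
        proc.wait()
        proc.stdout.close()


if __name__ == "__main__":
    print("buffered:  ", repr(first_output(unbuffered=False)))
    print("unbuffered:", repr(first_output(unbuffered=True)))
```

With the variable unset, the name should stay in the child's buffer and
the parent should see nothing before the child is killed; with it set,
the name should arrive while the child is still "hung".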

macOS workflow is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .github/workflows/linux.yml               | 2 ++
 tests/mr_test_module/pytests/run_tests.sh | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/linux.yml b/.github/workflows/linux.yml
index efdaadc..d75a182 100644
--- a/.github/workflows/linux.yml
+++ b/.github/workflows/linux.yml
@@ -65,6 +65,8 @@ jobs:
       run: make run_tests_ssl
       env:
         PYTHON: python
+        PYTHONUNBUFFERED: "1"
+        RLTEST_EXTRA_ARGS: "--test-timeout 180"
     - name: Valgrind tests
       run: make run_tests_valgrind
       env:
diff --git a/tests/mr_test_module/pytests/run_tests.sh b/tests/mr_test_module/pytests/run_tests.sh
index bc52201..2c84565 100755
--- a/tests/mr_test_module/pytests/run_tests.sh
+++ b/tests/mr_test_module/pytests/run_tests.sh
@@ -19,4 +19,4 @@ else
 fi
 
 
-"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --test-timeout 120 --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command
+"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress ${RLTEST_EXTRA_ARGS:-} --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command

From 175346e4f3b105bd85c54f0b0f243b509caad5ea Mon Sep 17 00:00:00 2001
From: Tom Gabsow <gabsow.tom@gmail.com>
Date: Tue, 26 May 2026 11:46:31 +0300
Subject: [PATCH 3/5] MOD-15862 Make testSendRetriesMechanizm tolerant of TLS
 retry-count

Root cause of the Linux SSL hang: testSendRetriesMechanizm hardcoded
three `-Err` exchanges, but under TLS only TWO INNERCOMMUNICATION
sends actually go out before libmr gives up.

Why the count differs by TLS:

  - Non-TLS: NETWORKTEST runs after the node reaches NodeStatus_Connected,
    so MR_ClusterSendMsgToNode sends INNERCOMMUNICATION synchronously
    (retries stays 0). The first `-Err` triggers a disconnect, after which
    MR_HelloResponseArrived re-sends from pendingMessages, incrementing
    retries to 1, 2, then 3. At retries==MSG_MAX_RETRIES (=3) libmr logs
    `Gave up of message...`. Total: 1 initial + 2 resends = 3 sends.

  - TLS: the TLS+AUTH+HELLO handshake takes longer; NETWORKTEST runs
    while the node status is still NodeStatus_HelloSent. The "message
    was not sent because status is not connected" path queues the msg
    in pendingMessages and the actual first send happens via the
    resend loop in MR_HelloResponseArrived -- which DOES increment
    retries. So the initial send burns retry #1, leaving only one
    further resend before give-up. Total: 2 sends.
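
The two cases can be sketched as a toy model in Python. This is a
simplification inferred from the description above, not libmr's code:
`count_sends` and its parameter are hypothetical names, every reply is
assumed to be `-Err`, and the resend loop is assumed to increment
retries before checking it against MSG_MAX_RETRIES.

```python
MSG_MAX_RETRIES = 3  # value quoted in this commit message


def count_sends(connected_at_send):
    """Count INNERCOMMUNICATION sends when every reply is -Err.

    connected_at_send=True models the non-TLS ordering (node already
    Connected when NETWORKTEST runs); False models the TLS ordering
    (node still in HelloSent, so the message is only queued).
    """
    retries = 0
    sends = 0

    if connected_at_send:
        # Synchronous send: goes out directly, retries stays 0.
        sends += 1
    # Otherwise the message waits in the pending queue and nothing has
    # been sent yet.

    while True:
        # Assumption: each HELLO response replays the pending message,
        # bumping retries first and giving up once the cap is reached.
        retries += 1
        if retries >= MSG_MAX_RETRIES:
            break  # corresponds to the "Gave up" log line
        sends += 1

    return sends


if __name__ == "__main__":
    print("non-TLS ordering:", count_sends(connected_at_send=True))
    print("TLS ordering:    ", count_sends(connected_at_send=False))
```

Under these assumptions the model yields 3 sends for the non-TLS
ordering and 2 for the TLS ordering, matching the totals above.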

The old test waited with no timeout for a fourth GetConnection on Linux
TLS and hung until the 6h job cap. The single-shard `--tls` invocation
added in 2857cfe is what surfaced this: the test has skipOnCluster=True,
and TLS runs previously used cluster mode only, where it was skipped
entirely.

Rewrite the test to express the actual invariant -- libmr sends the
message between 1 and MSG_MAX_RETRIES times, then stops -- with
bounded read/connection waits so it cannot hang regardless of where
the retry-count boundary lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 tests/mr_test_module/pytests/test_network.py | 70 ++++++++++----------
 1 file changed, 36 insertions(+), 34 deletions(-)

diff --git a/tests/mr_test_module/pytests/test_network.py b/tests/mr_test_module/pytests/test_network.py
index a4bc3d9..cfd9634 100644
--- a/tests/mr_test_module/pytests/test_network.py
+++ b/tests/mr_test_module/pytests/test_network.py
@@ -431,47 +431,49 @@ def testMessageNotResentAfterCrash(env, conn):
 
 @MRTestDecorator(skipOnCluster=True)
 def testSendRetriesMechanizm(env, conn):
+    # MSG_MAX_RETRIES in src/cluster.c.
+    MSG_MAX_RETRIES = 3
+    expected_msg = ['MRTESTS.INNERCOMMUNICATION', '0000000000000000000000000000000000000001',
+                    None, '0', 'test msg', '0']
     for host in _get_hosts():
         with ShardMock(env, host) as shardMock:
+            expected_msg[2] = shardMock.runId
             conn = shardMock.GetConnection()
 
             env.expect('MRTESTS.NETWORKTEST').equal('OK')
 
-            env.assertEqual(conn.read_request(), ['MRTESTS.INNERCOMMUNICATION', '0000000000000000000000000000000000000001', shardMock.runId, '0', 'test msg', '0'])
-
-            conn.send('-Err\r\n')
-
-            env.assertTrue(conn.is_close())
-
-            # should be a retry
-
-            conn = shardMock.GetConnection()
-
-            env.assertEqual(conn.read_request(), ['MRTESTS.INNERCOMMUNICATION', '0000000000000000000000000000000000000001', shardMock.runId, '0', 'test msg', '0'])
-
-            conn.send('-Err\r\n')
-
-            env.assertTrue(conn.is_close())
-
-            # should be a retry
-
-            conn = shardMock.GetConnection()
-
-            env.assertEqual(conn.read_request(), ['MRTESTS.INNERCOMMUNICATION', '0000000000000000000000000000000000000001', shardMock.runId, '0', 'test msg', '0'])
-
-            conn.send('-Err\r\n')
-
-            env.assertTrue(conn.is_close())
-
-            # should not retry
-
-            conn = shardMock.GetConnection()
-
-            # make sure message will not be sent again
+            # libmr should resend INNERCOMMUNICATION up to MSG_MAX_RETRIES times.
+            # Whether the initial send counts as one of those retries depends on
+            # whether the node has reached NodeStatus_Connected by the time
+            # NETWORKTEST runs -- under TLS the HELLO handshake is slower, so the
+            # initial send is queued and goes through the resend loop, which
+            # counts as a retry. We therefore accept any count in [1, MSG_MAX_RETRIES].
+            attempts = 0
+            for _ in range(MSG_MAX_RETRIES + 2):
+                try:
+                    with TimeLimit(3):
+                        req = conn.read_request()
+                except Exception:
+                    break  # libmr stopped sending -- gave up
+                env.assertEqual(req, expected_msg)
+                attempts += 1
+                conn.send('-Err\r\n')
+                # libmr will disconnect on error and may reconnect to retry.
+                try:
+                    with TimeLimit(3):
+                        conn = shardMock.GetConnection()
+                except Exception:
+                    break  # no reconnect -- gave up
+
+            env.assertGreaterEqual(attempts, 1, message='libmr did not send the initial message')
+            env.assertLessEqual(attempts, MSG_MAX_RETRIES,
+                                message='libmr exceeded MSG_MAX_RETRIES (=%d) sends' % MSG_MAX_RETRIES)
+
+            # After giving up, libmr must not reconnect to retry the same msg.
             try:
-                with TimeLimit(1):
-                    conn.read_request()
-                    env.assertTrue(False)  # we should not get any data after crash
+                with TimeLimit(2):
+                    shardMock.GetConnection()
+                    env.assertTrue(False, message='Unexpected reconnect after MSG_MAX_RETRIES')
             except Exception:
                 pass
 

From 27b00e1488c725d003b0663b6c45d9b1579a701b Mon Sep 17 00:00:00 2001
From: Tom Gabsow <gabsow.tom@gmail.com>
Date: Tue, 26 May 2026 12:38:31 +0300
Subject: [PATCH 4/5] MOD-15862 Bump per-test timeout to 300s for slow 7.2
 cluster TLS tests

The testSendRetriesMechanizm hang is now fixed, but the SSL run on 7.2
with --env oss-cluster --shards-count 3 surfaced a separate,
pre-existing flake: testUnevenWork can take >180s on that runner
combination (it has an internal TimeLimit(2), but the SIGALRM evidently
isn't interrupting the redis-py TLS handshake reliably in this
configuration). On 7.4 and unstable the same invocation completes in
~144s, so 180s was simply too tight.

Bump to 300s. The flake will be tracked separately -- the actual fix
likely belongs in test_basic.py / redis-py connection setup rather than
the workflow timeout.
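
For context, a SIGALRM-based time limit usually looks something like
the Python sketch below. This is an assumption about the general shape
of such a helper, not a copy of the TimeLimit used by the tests;
`TimeLimitSketch` and `timed_out` are hypothetical names. CPython runs
Python-level signal handlers only in the main thread, and only once
control returns to the interpreter, so a blocking call inside C code
can delay the exception. That may be what happens here.

```python
import signal
import time


class TimeLimitSketch:
    """Raise TimeoutError in the main thread after `seconds` (POSIX only)."""

    def __init__(self, seconds):
        self.seconds = seconds

    def _on_alarm(self, signum, frame):
        # Runs only when the interpreter gets a chance to handle the signal.
        raise TimeoutError("time limit of %ss exceeded" % self.seconds)

    def __enter__(self):
        self._previous = signal.signal(signal.SIGALRM, self._on_alarm)
        signal.setitimer(signal.ITIMER_REAL, self.seconds)
        return self

    def __exit__(self, exc_type, exc, tb):
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel a pending alarm
        signal.signal(signal.SIGALRM, self._previous)
        return False  # never swallow exceptions


def timed_out(seconds, work):
    """Return True if `work()` was cut short by the time limit."""
    try:
        with TimeLimitSketch(seconds):
            work()
    except TimeoutError:
        return True
    return False


if __name__ == "__main__":
    print(timed_out(0.2, lambda: time.sleep(5)))  # sleep is interruptible
    print(timed_out(2, lambda: None))
```

A plain time.sleep() is interrupted promptly, which is why such a
helper works for most tests; the per-test --test-timeout acts as a
backstop for calls that are not.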

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .github/workflows/linux.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/linux.yml b/.github/workflows/linux.yml
index d75a182..0cfa08d 100644
--- a/.github/workflows/linux.yml
+++ b/.github/workflows/linux.yml
@@ -66,7 +66,7 @@ jobs:
       env:
         PYTHON: python
         PYTHONUNBUFFERED: "1"
-        RLTEST_EXTRA_ARGS: "--test-timeout 180"
+        RLTEST_EXTRA_ARGS: "--test-timeout 300"
     - name: Valgrind tests
       run: make run_tests_valgrind
       env:

From 09446a25af6ff1e234f39eb91e8c1067004f4549 Mon Sep 17 00:00:00 2001
From: Tom Gabsow <gabsow.tom@gmail.com>
Date: Wed, 27 May 2026 11:25:50 +0300
Subject: [PATCH 5/5] MOD-15862 Add timeout-minutes to macOS workflow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This PR's CI run hit a 6h cancel on macOS 7.4 SSL — a runner-side flake
(other macOS jobs in the same run passed in under 8 minutes). The macOS
workflow had no timeout-minutes set, so a single hung test could burn
the full job cap.

Mirror the Linux workflow: 45min job cap + 25min SSL step cap. macOS
keeps the default --test-timeout (no per-test cap) since it isn't
where the original MOD-15862 hang occurred and we don't want to risk
regressing the runs that currently pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .github/workflows/macos.yml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/.github/workflows/macos.yml b/.github/workflows/macos.yml
index 1e60b30..2e4c1db 100644
--- a/.github/workflows/macos.yml
+++ b/.github/workflows/macos.yml
@@ -15,6 +15,7 @@ jobs:
   build:
 
     runs-on: macos-latest
+    timeout-minutes: 45
 
     strategy:
       fail-fast: false
@@ -51,6 +52,7 @@ jobs:
       env:
         PYTHON: python
     - name: SSL tests
+      timeout-minutes: 25
       run: make run_tests_ssl
       env:
         PYTHON: python