Skip to content

Codex/primary blocking bzpopmin#372

Merged
bootjp merged 8 commits intomainfrom
codex/primary-blocking-bzpopmin
Mar 20, 2026
Merged

Codex/primary blocking bzpopmin#372
bootjp merged 8 commits intomainfrom
codex/primary-blocking-bzpopmin

Conversation

@bootjp
Copy link
Owner

@bootjp bootjp commented Mar 20, 2026

No description provided.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the operational resilience and Redis compatibility of the ElasticKV system. It provides a crucial migration path for legacy Raft storage, ensuring data integrity during upgrades. Furthermore, it refines the Redis proxy's behavior by correctly handling blocking commands with dynamic timeouts and implementing intelligent caching for Lua scripts, which is vital for dual-write scenarios. The changes also include enhancements to the Jepsen testing framework and a more robust Pub/Sub shadowing mechanism, collectively contributing to a more stable, maintainable, and performant distributed key-value store.

Highlights

  • Raft Storage Migration: Introduced a new command-line tool and updated documentation to facilitate the migration of Raft logs and stable state from legacy BoltDB files to the more modern Pebble storage format. This ensures smoother upgrades and better data management.
  • Redis Blocking Commands Handling: Enhanced the Redis proxy to correctly identify and apply appropriate timeouts for various Redis blocking commands (e.g., BZPOPMIN, BLPOP, XREAD). This prevents proxy timeouts from prematurely interrupting long-running blocking operations.
  • Redis Script Caching: Implemented a caching mechanism for Lua scripts within the Redis proxy. This allows the proxy to remember scripts sent via EVAL and automatically fall back to EVAL if a subsequent EVALSHA command fails due to the script not being present on a secondary Redis instance.
  • Jepsen Test Robustness: Improved the Jepsen test suite by adding a fail-on-invalid! function. This ensures that the test runner explicitly fails if Jepsen's analysis determines the history of operations to be invalid, making test outcomes more reliable.
  • Shadow Pub/Sub Refactoring: Refactored the shadow Pub/Sub divergence detection mechanism to manage unmatched secondary messages directly within each shadowPubSub instance, rather than using a global buffer. This improves memory management, simplifies concurrency control, and enhances the accuracy of divergence reporting.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@bootjp bootjp changed the base branch from main to feature/rolling-update March 20, 2026 11:12
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant set of features, including a migration tool for legacy BoltDB Raft storage to Pebble, comprehensive documentation, and a robust rolling update script. It also enhances the proxy by adding support for blocking commands with dynamic timeouts and improving Lua script handling with caching and NOSCRIPT error fallbacks. The changes are well-tested and substantially improve the system's operational capabilities. My review includes a few suggestions to enhance code clarity and reduce duplication in the new migration logic and helper scripts.

I am having trouble creating individual review comments. Click here to see my feedback.

internal/raftstore/migrate.go (144-158)

medium

The functions copyLegacyStable and copyLegacyLogs both contain boilerplate code for opening a read-only bbolt transaction and iterating over a bucket. This can be extracted into a helper function to reduce duplication and make the migration logic clearer. For example:

func forEachInBucket(db *bbolt.DB, bucketName []byte, fn func(k, v []byte) error) error {
    return errors.WithStack(db.View(func(tx *bbolt.Tx) error {
        bucket := tx.Bucket(bucketName)
        if bucket == nil {
            return errors.Newf("legacy bucket %q not found", string(bucketName))
        }
        return bucket.ForEach(fn)
    }))
}

Using this helper would simplify both copyLegacyStable and copyLegacyLogs.

References
  1. Centralize common complex logic into shared helper functions to avoid code duplication and improve maintainability.

proxy/script_cache.go (105-114)

medium

The stringArg function's default case is a bit inefficient as it creates a new slice []any{arg} just to call argsToBytes. This can be simplified for better readability and performance by handling the default case with fmt.Sprintf directly and moving the strings.ToLower call to the end.

func stringArg(arg any) string {
	var s string
	switch v := arg.(type) {
	case []byte:
		s = string(v)
	case string:
		s = v
	default:
		s = fmt.Sprintf("%v", v)
	}
	return strings.ToLower(s)
}

scripts/rolling-update.sh (386-393)

medium

The functions grpc_healthy and peer_grpc_healthy have very similar logic for checking TCP port availability. You can combine them into a single function that accepts the host as an argument to reduce code duplication and improve maintainability.

You could replace both with a single function:

is_grpc_healthy() {
  local host="$1"
  bash -lc "exec 3<>/dev/tcp/${host}/${RAFT_PORT}" 2>/dev/null
}

Then you can replace calls to grpc_healthy with is_grpc_healthy "$NODE_HOST" and peer_grpc_healthy "$some_host" with is_grpc_healthy "$some_host".

@bootjp bootjp requested a review from Copilot March 20, 2026 11:26
Base automatically changed from feature/rolling-update to main March 20, 2026 11:28
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds proxy-side support to better handle Redis blocking commands and script replication in dual-write mode.

Changes:

  • Add per-command timeout handling for blocking commands (e.g., BZPOPMIN, BLMOVE, XREAD BLOCK) via a timeout-aware backend path.
  • Add a lightweight script cache to enable EVALSHAEVAL fallback on secondary when it returns NOSCRIPT.
  • Force the go-redis client to use RESP2 for upstream connections, with tests covering these behaviors.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
proxy/script_cache.go Introduces script digesting, caching, NOSCRIPT detection, and EVALSHA fallback argument construction.
proxy/blocking.go Adds blocking-command timeout parsing utilities and a timeout-capable backend interface.
proxy/dualwrite.go Uses per-call timeout override for blocking commands; remembers scripts after script commands; retries secondary writes on NOSCRIPT with fallback args.
proxy/backend.go Sets upstream protocol to RESP2 and adds DoWithTimeout implementation on RedisBackend.
proxy/proxy_test.go Adds unit tests for blocking timeout parsing, timeout-aware backend usage, RESP2 enforcement, and EVALSHA fallback behavior.

Comment on lines +66 to +85
func (d *DualWriter) evalFallbackArgs(cmd string, iArgs []any) ([]any, bool) {
upper := strings.ToUpper(cmd)
if upper != cmdEvalSHA && upper != "EVALSHA_RO" {
return nil, false
}
if len(iArgs) < minEvalSHAArgs {
return nil, false
}

sha := stringArg(iArgs[1])
script, ok := d.lookupScript(sha)
if !ok {
return nil, false
}

fallback := make([]any, len(iArgs))
fallback[0] = []byte(cmdEval)
fallback[1] = []byte(script)
copy(fallback[2:], iArgs[2:])
return fallback, true
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

evalFallbackArgs treats EVALSHA_RO the same as EVALSHA but always falls back to EVAL. For read-only scripts this changes semantics (and may fail on servers enforcing read-only script execution). Preserve the original command by falling back to EVAL_RO when the incoming command is EVALSHA_RO (and similarly keep EVAL for EVALSHA).

Copilot uses AI. Check for mistakes.
Comment on lines 193 to 197
result := d.primary.Do(ctx, iArgs...)
resp, err := result.Result()
d.metrics.CommandDuration.WithLabelValues(cmd, d.primary.Name()).Observe(time.Since(start).Seconds())
d.rememberScript(cmd, args)

Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rememberScript(cmd, args) is called before checking err from the primary. This means a failed SCRIPT FLUSH would still clear the local script cache, creating an inconsistent state where the primary still has scripts cached but the proxy can no longer fall back from EVALSHA to EVAL for secondary writes. Update the flow so cache mutations (especially SCRIPT FLUSH) only happen when the primary successfully applied the command (e.g., call rememberScript after the error return, or pass err into rememberScript and skip clearScripts() on failure).

Copilot uses AI. Check for mistakes.
Comment on lines +45 to +51
func (d *DualWriter) storeScript(script string) {
sha := scriptSHA(script)

d.scriptMu.Lock()
defer d.scriptMu.Unlock()
d.scripts[sha] = script
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script cache stores full script bodies in an unbounded map (d.scripts[sha] = script) with no size cap or eviction. In a long-running proxy, clients that generate many unique scripts (or large scripts) can cause unbounded memory growth. Consider adding a maximum entry count and eviction strategy (e.g., LRU), and/or limiting stored script size to reduce OOM risk.

Copilot uses AI. Check for mistakes.
@bootjp
Copy link
Owner Author

bootjp commented Mar 20, 2026

@copilot open a new pull request to apply changes based on the comments in this thread

Copy link
Contributor

Copilot AI commented Mar 20, 2026

@bootjp I've opened a new pull request, #373, to work on those changes. Once the pull request is ready, I'll request review from you.

bootjp and others added 4 commits March 20, 2026 20:47
…add bounded eviction

Co-authored-by: bootjp <1306365+bootjp@users.noreply.github.com>
proxy: fix EVALSHA_RO fallback, guard script cache on primary error, bounded eviction
@bootjp bootjp enabled auto-merge March 20, 2026 11:52
@bootjp bootjp merged commit 0f2a871 into main Mar 20, 2026
8 checks passed
@bootjp bootjp deleted the codex/primary-blocking-bzpopmin branch March 20, 2026 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants