Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the operational resilience and Redis compatibility of the ElasticKV system. It provides a crucial migration path for legacy Raft storage, ensuring data integrity during upgrades. Furthermore, it refines the Redis proxy's behavior by correctly handling blocking commands with dynamic timeouts and implementing intelligent caching for Lua scripts, which is vital for dual-write scenarios. The changes also include enhancements to the Jepsen testing framework and a more robust Pub/Sub shadowing mechanism, collectively contributing to a more stable, maintainable, and performant distributed key-value store. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant set of features, including a migration tool for legacy BoltDB Raft storage to Pebble, comprehensive documentation, and a robust rolling update script. It also enhances the proxy by adding support for blocking commands with dynamic timeouts and improving Lua script handling with caching and NOSCRIPT error fallbacks. The changes are well-tested and substantially improve the system's operational capabilities. My review includes a few suggestions to enhance code clarity and reduce duplication in the new migration logic and helper scripts.
I am having trouble creating individual review comments. Click here to see my feedback.
internal/raftstore/migrate.go (144-158)
The functions copyLegacyStable and copyLegacyLogs both contain boilerplate code for opening a read-only bbolt transaction and iterating over a bucket. This can be extracted into a helper function to reduce duplication and make the migration logic clearer. For example:
func forEachInBucket(db *bbolt.DB, bucketName []byte, fn func(k, v []byte) error) error {
return errors.WithStack(db.View(func(tx *bbolt.Tx) error {
bucket := tx.Bucket(bucketName)
if bucket == nil {
return errors.Newf("legacy bucket %q not found", string(bucketName))
}
return bucket.ForEach(fn)
}))
}Using this helper would simplify both copyLegacyStable and copyLegacyLogs.
References
- Centralize common complex logic into shared helper functions to avoid code duplication and improve maintainability.
proxy/script_cache.go (105-114)
The stringArg function's default case is a bit inefficient as it creates a new slice []any{arg} just to call argsToBytes. This can be simplified for better readability and performance by handling the default case with fmt.Sprintf directly and moving the strings.ToLower call to the end.
func stringArg(arg any) string {
var s string
switch v := arg.(type) {
case []byte:
s = string(v)
case string:
s = v
default:
s = fmt.Sprintf("%v", v)
}
return strings.ToLower(s)
}
scripts/rolling-update.sh (386-393)
The functions grpc_healthy and peer_grpc_healthy have very similar logic for checking TCP port availability. You can combine them into a single function that accepts the host as an argument to reduce code duplication and improve maintainability.
You could replace both with a single function:
is_grpc_healthy() {
local host="$1"
bash -lc "exec 3<>/dev/tcp/${host}/${RAFT_PORT}" 2>/dev/null
}Then you can replace calls to grpc_healthy with is_grpc_healthy "$NODE_HOST" and peer_grpc_healthy "$some_host" with is_grpc_healthy "$some_host".
There was a problem hiding this comment.
Pull request overview
Adds proxy-side support to better handle Redis blocking commands and script replication in dual-write mode.
Changes:
- Add per-command timeout handling for blocking commands (e.g.,
BZPOPMIN,BLMOVE,XREAD BLOCK) via a timeout-aware backend path. - Add a lightweight script cache to enable
EVALSHA→EVALfallback on secondary when it returnsNOSCRIPT. - Force the go-redis client to use RESP2 for upstream connections, with tests covering these behaviors.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
proxy/script_cache.go |
Introduces script digesting, caching, NOSCRIPT detection, and EVALSHA fallback argument construction. |
proxy/blocking.go |
Adds blocking-command timeout parsing utilities and a timeout-capable backend interface. |
proxy/dualwrite.go |
Uses per-call timeout override for blocking commands; remembers scripts after script commands; retries secondary writes on NOSCRIPT with fallback args. |
proxy/backend.go |
Sets upstream protocol to RESP2 and adds DoWithTimeout implementation on RedisBackend. |
proxy/proxy_test.go |
Adds unit tests for blocking timeout parsing, timeout-aware backend usage, RESP2 enforcement, and EVALSHA fallback behavior. |
| func (d *DualWriter) evalFallbackArgs(cmd string, iArgs []any) ([]any, bool) { | ||
| upper := strings.ToUpper(cmd) | ||
| if upper != cmdEvalSHA && upper != "EVALSHA_RO" { | ||
| return nil, false | ||
| } | ||
| if len(iArgs) < minEvalSHAArgs { | ||
| return nil, false | ||
| } | ||
|
|
||
| sha := stringArg(iArgs[1]) | ||
| script, ok := d.lookupScript(sha) | ||
| if !ok { | ||
| return nil, false | ||
| } | ||
|
|
||
| fallback := make([]any, len(iArgs)) | ||
| fallback[0] = []byte(cmdEval) | ||
| fallback[1] = []byte(script) | ||
| copy(fallback[2:], iArgs[2:]) | ||
| return fallback, true |
There was a problem hiding this comment.
evalFallbackArgs treats EVALSHA_RO the same as EVALSHA but always falls back to EVAL. For read-only scripts this changes semantics (and may fail on servers enforcing read-only script execution). Preserve the original command by falling back to EVAL_RO when the incoming command is EVALSHA_RO (and similarly keep EVAL for EVALSHA).
| result := d.primary.Do(ctx, iArgs...) | ||
| resp, err := result.Result() | ||
| d.metrics.CommandDuration.WithLabelValues(cmd, d.primary.Name()).Observe(time.Since(start).Seconds()) | ||
| d.rememberScript(cmd, args) | ||
|
|
There was a problem hiding this comment.
rememberScript(cmd, args) is called before checking err from the primary. This means a failed SCRIPT FLUSH would still clear the local script cache, creating an inconsistent state where the primary still has scripts cached but the proxy can no longer fall back from EVALSHA to EVAL for secondary writes. Update the flow so cache mutations (especially SCRIPT FLUSH) only happen when the primary successfully applied the command (e.g., call rememberScript after the error return, or pass err into rememberScript and skip clearScripts() on failure).
| func (d *DualWriter) storeScript(script string) { | ||
| sha := scriptSHA(script) | ||
|
|
||
| d.scriptMu.Lock() | ||
| defer d.scriptMu.Unlock() | ||
| d.scripts[sha] = script | ||
| } |
There was a problem hiding this comment.
The script cache stores full script bodies in an unbounded map (d.scripts[sha] = script) with no size cap or eviction. In a long-running proxy, clients that generate many unique scripts (or large scripts) can cause unbounded memory growth. Consider adding a maximum entry count and eviction strategy (e.g., LRU), and/or limiting stored script size to reduce OOM risk.
|
@copilot open a new pull request to apply changes based on the comments in this thread |
…add bounded eviction Co-authored-by: bootjp <1306365+bootjp@users.noreply.github.com>
proxy: fix EVALSHA_RO fallback, guard script cache on primary error, bounded eviction
Hot fix/grpc resourceexhausted
No description provided.