fix(workflow): prevent NodePath corruption for nested sub-workflows in loop/batch #2649
Open
shentongmartin wants to merge 1 commit into main
…n loop/batch

When BatchInfo leaks from composite nodes (loop/batch) through PrepareSubExeCtx into sub-workflow contexts, PrepareNodeExeCtx incorrectly injected the interrupt_event_index_N prefix for ALL descendant nodes, not just direct children of the composite node. This corrupted the NodePath, causing resume detection to fail and generate new sub-execute-IDs instead of restoring existing ones.

Fix: align the PrepareNodeExeCtx condition with initNodeCtx, so that the batch index prefix is injected only when the current node is a direct child of the composite node (c.BatchInfo.CompositeNodeKey == c.NodeCtx.NodeKey).

Add TestLoop_SubWorkflow_Nested_Interrupt to verify that the loop -> sub_wf -> sub_wf -> interrupt scenario correctly restores (not recreates) the inner sub-workflow context.
Codecov Report
All modified and coverable lines are covered by tests.
... and 4 files with indirect coverage changes
JonXSnow
approved these changes
Apr 3, 2026
NodePath Corruption in Nested Sub-Workflows During Resume
Problem
When a workflow has a loop containing nested sub-workflows (loop → sub_wf_A → sub_wf_B → interruptible_node), resuming after an interrupt generates new sub-execute-IDs for the inner sub-workflow instead of restoring the existing ones. This produces duplicate execution records and re-executes the interrupt node.
The loop has only one item, yet the inner sub-workflow gets a different SubExecuteID before the interrupt than after resume.
Expected: On resume, the inner_sub_wf context is restored via restoreWorkflowCtx (same SubExecuteID).
Actual: On resume, the inner_sub_wf context is recreated via PrepareSubExeCtx (new SubExecuteID).
Solution
One-line condition fix in PrepareNodeExeCtx (context.go).
The interrupt_event_index_N prefix was being injected into NodePath for all descendant nodes whenever BatchInfo was non-nil, not just for direct children of the composite node. Since PrepareSubExeCtx propagates BatchInfo from the composite node into sub-workflow contexts, nodes deep inside nested sub-workflows got a corrupted NodePath with a spurious index prefix.
Before (buggy): the prefix was injected whenever c.BatchInfo was non-nil.
After (fixed): the prefix is injected only when c.BatchInfo.CompositeNodeKey == c.NodeCtx.NodeKey.
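The difference between the two conditions can be illustrated with a minimal, self-contained sketch. It is not the project's code: the ExeCtx and BatchInfo types, the childPath and descend helpers, the node keys, and the choice to attach BatchInfo to the loop's own context are all simplifying assumptions made for illustration. Only the prefix name and the CompositeNodeKey comparison come from the description above.

```go
package main

import (
	"fmt"
	"strconv"
)

// Prefix named in the PR description; the trailing index is appended per item.
const interruptEventIndexPrefix = "interrupt_event_index_"

// BatchInfo and ExeCtx are hypothetical, simplified stand-ins for the real types.
type BatchInfo struct {
	CompositeNodeKey string // key of the loop/batch node that created this info
	Index            int    // index of the current loop/batch item
}

type ExeCtx struct {
	NodeKey   string     // key of the node that owns this context
	NodePath  []string   // path from the root to this node
	BatchInfo *BatchInfo // inherited from an enclosing loop/batch, if any
}

// childPath builds the NodePath of a child prepared under parent.
// guarded == false mimics the old check (BatchInfo non-nil is enough);
// guarded == true additionally requires parent to be the composite node.
func childPath(parent ExeCtx, childKey string, guarded bool) []string {
	path := append([]string{}, parent.NodePath...) // copy to avoid aliasing
	inject := parent.BatchInfo != nil
	if guarded {
		inject = inject && parent.BatchInfo.CompositeNodeKey == parent.NodeKey
	}
	if inject {
		path = append(path, interruptEventIndexPrefix+strconv.Itoa(parent.BatchInfo.Index))
	}
	return append(path, childKey)
}

// descend creates the child's context and propagates BatchInfo unchanged,
// which is the "leak" the PR describes as intentional.
func descend(parent ExeCtx, childKey string, guarded bool) ExeCtx {
	return ExeCtx{
		NodeKey:   childKey,
		NodePath:  childPath(parent, childKey, guarded),
		BatchInfo: parent.BatchInfo,
	}
}

// buildPath walks loop -> outer_sub_wf -> inner_sub_wf and returns the inner path.
func buildPath(guarded bool) []string {
	loop := ExeCtx{
		NodeKey:   "loop",
		NodePath:  []string{"loop"},
		BatchInfo: &BatchInfo{CompositeNodeKey: "loop", Index: 0},
	}
	outer := descend(loop, "outer_sub_wf", guarded)
	inner := descend(outer, "inner_sub_wf", guarded)
	return inner.NodePath
}

func main() {
	fmt.Println(buildPath(false)) // old condition: prefix appears twice
	fmt.Println(buildPath(true))  // guarded condition: prefix appears once
}
```

Under these assumptions the unguarded variant yields a five-element path with a second interrupt_event_index_0 at index 3, while the guarded variant yields a four-element path with inner_sub_wf at index 3.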
This aligns PrepareNodeExeCtx with the correct condition already used in initNodeCtx (callback.go:551-555): inject the batch index prefix only when c.NodeCtx.NodeKey equals the composite node key, i.e. when the node being prepared is a direct child of the loop/batch.
Key Insight
BatchInfo leaks through PrepareSubExeCtx into sub-workflow contexts. This is by design: the sub-workflow needs BatchInfo to report its composite index. However, two NodePath construction sites must handle this consistently:
- initNodeCtx (callback.go), used during resume detection, correctly guards with c.BatchInfo.CompositeNodeKey == c.NodeCtx.NodeKey
- PrepareNodeExeCtx (context.go), used during initial execution, only checked c.BatchInfo == nil, causing the prefix to be injected too broadly
The resulting NodePath mismatch: at index 3, "interrupt_event_index_0" != "inner_sub_wf" → resume detection fails → the system calls PrepareSubExeCtx instead of restoreWorkflowCtx.
Summary
- Nodes inside nested sub-workflows got a NodePath with a spurious batch index prefix.
- The fix guards index prefix injection with a CompositeNodeKey == NodeKey check, matching the existing logic in initNodeCtx.
- Before the fix, the inner sub-workflow got a new SubExecuteID on resume instead of being restored.
- After the fix, NodePath matches correctly, so resume detection triggers restoreWorkflowCtx.
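To make the effect on resume concrete, here is a small hypothetical sketch of path-based resume detection. The function names firstMismatch and resolveSubExecuteID, the int64 ID type, the example paths, and the ID values are assumptions for illustration, not the project's API. The sketch only assumes that a stored context is restored when the path recorded during execution equals the path computed on resume, and that a fresh ID is issued otherwise.

```go
package main

import "fmt"

// firstMismatch returns the first index at which a and b differ,
// or -1 when the two paths are identical.
func firstMismatch(a, b []string) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n // one path is a strict prefix of the other
	}
	return -1
}

// resolveSubExecuteID is a hypothetical stand-in for resume detection:
// it reuses storedID when the paths match and falls back to freshID otherwise.
func resolveSubExecuteID(storedPath, resumePath []string, storedID, freshID int64) (int64, bool) {
	if firstMismatch(storedPath, resumePath) == -1 {
		return storedID, true // restore the existing context
	}
	return freshID, false // recreate: the symptom described in the PR
}

func main() {
	// Assumed path recorded during initial execution with the unguarded condition.
	stored := []string{"loop", "interrupt_event_index_0", "outer_sub_wf", "interrupt_event_index_0", "inner_sub_wf"}
	// Assumed path computed during resume detection with the guarded condition.
	resume := []string{"loop", "interrupt_event_index_0", "outer_sub_wf", "inner_sub_wf"}

	fmt.Println(firstMismatch(stored, resume)) // 3

	id, restored := resolveSubExecuteID(stored, resume, 1001, 2002)
	fmt.Println(id, restored) // 2002 false

	id, restored = resolveSubExecuteID(resume, resume, 1001, 2002)
	fmt.Println(id, restored) // 1001 true
}
```

With the example paths the comparison fails at index 3, matching the mismatch quoted under Key Insight, and the sketch falls back to the fresh ID. Once both sites build the same path, the stored ID is reused.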