fix(workflow): prevent NodePath corruption for nested sub-workflows in loop/batch #2649

Open
shentongmartin wants to merge 1 commit into main from fix/nested-sub-wf-nodepath

@shentongmartin
Collaborator

NodePath Corruption in Nested Sub-Workflows During Resume

Problem

When a workflow has a loop containing nested sub-workflows (loop → sub_wf_A → sub_wf_B → interruptible_node), resuming after an interrupt generates a new SubExecuteID for the inner sub-workflow instead of restoring the existing one. This produces duplicate execution records and re-executes the interrupt node.

The loop has only one item, yet the inner sub-workflow gets one SubExecuteID before the interrupt and a different one after resume.

Expected: On resume, inner_sub_wf context is restored via restoreWorkflowCtx (same SubExecuteID).
Actual: On resume, inner_sub_wf context is recreated via PrepareSubExeCtx (new SubExecuteID).

Solution

One-line condition fix in PrepareNodeExeCtx (context.go).

The interrupt_event_index_N prefix was being injected into NodePath for all descendant nodes when BatchInfo was non-nil, not just direct children of the composite node. Since PrepareSubExeCtx propagates BatchInfo from the composite node into sub-workflow contexts, nodes deep inside nested sub-workflows got a corrupted NodePath with a spurious index prefix.

Before (buggy):

if c.BatchInfo == nil {
    newC.NodeCtx.NodePath = append(c.NodeCtx.NodePath, string(nodeKey))
} else {
    newC.NodeCtx.NodePath = append(c.NodeCtx.NodePath, InterruptEventIndexPrefix+strconv.Itoa(c.BatchInfo.Index), string(nodeKey))
}

After (fixed):

if c.BatchInfo != nil && c.BatchInfo.CompositeNodeKey == c.NodeCtx.NodeKey {
    newC.NodeCtx.NodePath = append(c.NodeCtx.NodePath, InterruptEventIndexPrefix+strconv.Itoa(c.BatchInfo.Index), string(nodeKey))
} else {
    newC.NodeCtx.NodePath = append(c.NodeCtx.NodePath, string(nodeKey))
}

This aligns PrepareNodeExeCtx with the condition already used in initNodeCtx (callback.go:551-555): inject the batch index prefix only when the current context's node (c.NodeCtx.NodeKey) is the composite node itself, i.e., when the node being prepared is a direct child of the loop/batch.
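
The effect of the two conditions can be reproduced with a small standalone model. This is only a sketch: `batchInfo`, `exeCtx`, `childCtx`, and `innerPath` are hypothetical stand-ins, not the types or functions in context.go, and the model assumes each context simply inherits its parent's batch info, as the description says PrepareSubExeCtx does.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const interruptEventIndexPrefix = "interrupt_event_index_"

// batchInfo and exeCtx are simplified, hypothetical stand-ins for the real types.
type batchInfo struct {
	Index            int
	CompositeNodeKey string
}

type exeCtx struct {
	NodeKey  string
	NodePath []string
	Batch    *batchInfo // inherited by every descendant, including sub-workflows
}

// childCtx derives the context of node `key` from its parent context `c`.
// guarded=false mimics the old condition, guarded=true the new one.
func childCtx(c exeCtx, key string, guarded bool) exeCtx {
	// Copy the parent path so sibling contexts never share a backing array.
	path := append([]string(nil), c.NodePath...)
	inject := c.Batch != nil
	if guarded {
		inject = c.Batch != nil && c.Batch.CompositeNodeKey == c.NodeKey
	}
	if inject {
		path = append(path, interruptEventIndexPrefix+strconv.Itoa(c.Batch.Index))
	}
	path = append(path, key)
	return exeCtx{NodeKey: key, NodePath: path, Batch: c.Batch}
}

// innerPath walks loop_node -> outer_sub_wf -> inner_sub_wf for iteration 0
// and returns the resulting path of inner_sub_wf.
func innerPath(guarded bool) string {
	loop := exeCtx{
		NodeKey:  "loop_node",
		NodePath: []string{"loop_node"},
		Batch:    &batchInfo{Index: 0, CompositeNodeKey: "loop_node"},
	}
	outer := childCtx(loop, "outer_sub_wf", guarded)
	inner := childCtx(outer, "inner_sub_wf", guarded)
	return strings.Join(inner.NodePath, " / ")
}

func main() {
	fmt.Println("old condition:", innerPath(false))
	fmt.Println("new condition:", innerPath(true))
}
```

With the old condition the model prints a second `interrupt_event_index_0` between `outer_sub_wf` and `inner_sub_wf`; with the guarded condition the prefix appears only once, directly under `loop_node`.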

Key Insight

BatchInfo leaks through PrepareSubExeCtx into sub-workflow contexts. This is by design — the sub-workflow needs BatchInfo to report its composite index. However, two NodePath construction sites must handle this consistently:

  1. initNodeCtx (callback.go) — used during resume detection — correctly guards with c.BatchInfo.CompositeNodeKey == c.NodeCtx.NodeKey
  2. PrepareNodeExeCtx (context.go) — used during initial execution — only checked c.BatchInfo == nil, causing the prefix to be injected too broadly

The resulting NodePath mismatch:

Saved:   [loop_node, interrupt_event_index_0, outer_sub_wf, interrupt_event_index_0, inner_sub_wf]
                                                             ^^^^^^^^^^^^^^^^^^^^^^^^ WRONG
Resume:  [loop_node, interrupt_event_index_0, outer_sub_wf, inner_sub_wf, lambda]

At index 3, "interrupt_event_index_0" != "inner_sub_wf" → resume detection fails → system calls PrepareSubExeCtx instead of restoreWorkflowCtx.
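
The index-3 mismatch can be checked mechanically. The helper below is a hypothetical illustration of an element-by-element path comparison; it is an assumption about the shape of the check, not the project's actual resume-detection code.

```go
package main

import "fmt"

// firstMismatch returns the first (0-based) index at which two node paths
// differ, or -1 when one path is equal to, or a prefix of, the other.
func firstMismatch(saved, current []string) int {
	n := len(saved)
	if len(current) < n {
		n = len(current)
	}
	for i := 0; i < n; i++ {
		if saved[i] != current[i] {
			return i
		}
	}
	return -1
}

func main() {
	// The two paths quoted in the description above.
	saved := []string{"loop_node", "interrupt_event_index_0", "outer_sub_wf", "interrupt_event_index_0", "inner_sub_wf"}
	resume := []string{"loop_node", "interrupt_event_index_0", "outer_sub_wf", "inner_sub_wf", "lambda"}

	fmt.Println(firstMismatch(saved, resume))      // 3
	fmt.Println(firstMismatch(resume[:4], resume)) // -1: a path without the spurious prefix lines up
}
```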

Summary

| Problem | Solution |
| --- | --- |
| Nested sub-workflows inside loop/batch get corrupted NodePath with spurious batch index prefix | Guard index prefix injection with CompositeNodeKey == NodeKey check, matching the existing logic in initNodeCtx |
| Inner sub-workflow gets new SubExecuteID on resume instead of being restored | Fix ensures NodePath matches correctly, so resume detection triggers restoreWorkflowCtx |

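NodePath Corruption When Nested Sub-Workflows Resume in Loop/Batch

Problem

When a workflow's loop contains nested sub-workflows (loop → sub_wf_A → sub_wf_B → interruptible node), resuming after an interrupt generates a new SubExecuteID for the inner sub-workflow instead of restoring the existing one. This produces duplicate execution records and re-executes the interrupt node.

The loop has only one item, yet the inner sub-workflow gets a different SubExecuteID before the interrupt and after resume.

Expected: On resume, inner_sub_wf restores its context via restoreWorkflowCtx (same SubExecuteID).
Actual: On resume, inner_sub_wf recreates its context via PrepareSubExeCtx (new SubExecuteID).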

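Solution

A one-line condition change in PrepareNodeExeCtx (context.go).

The interrupt_event_index_N prefix was injected into the NodePath of every descendant node whenever BatchInfo was non-nil, not only into direct children of the composite node. Because PrepareSubExeCtx propagates BatchInfo from the composite node into sub-workflow contexts, nodes inside nested sub-workflows ended up with an incorrect NodePath containing a spurious index prefix.

The fixed condition matches the correct logic already in initNodeCtx (callback.go:551-555): inject the batch index prefix only when the current context's node is the composite node (that is, the node being prepared is a direct child of the loop/batch).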

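Key Insight

BatchInfo leaks through PrepareSubExeCtx into sub-workflow contexts. This is by design: the sub-workflow needs BatchInfo to report its composite index. However, the two NodePath construction sites must handle it consistently:

  1. initNodeCtx (callback.go), used during resume detection, correctly guards with c.BatchInfo.CompositeNodeKey == c.NodeCtx.NodeKey
  2. PrepareNodeExeCtx (context.go), used during initial execution, only checked c.BatchInfo == nil, so the prefix was injected too broadly

The resulting NodePath mismatch makes resume detection fail, so the system calls PrepareSubExeCtx (generating a new SubExecuteID) instead of restoreWorkflowCtx (restoring the old one).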

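Summary

| Problem | Solution |
| --- | --- |
| Nested sub-workflows inside loop/batch get a NodePath with a wrongly injected batch index prefix | Guard index prefix injection with the CompositeNodeKey == NodeKey condition, matching the existing logic in initNodeCtx |
| Inner sub-workflow gets a new SubExecuteID on resume instead of being restored | The fix ensures NodePath matches, so resume detection correctly triggers restoreWorkflowCtx |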

…n loop/batch

When BatchInfo leaks from composite nodes (loop/batch) through PrepareSubExeCtx
into sub-workflow contexts, PrepareNodeExeCtx incorrectly injected the
interrupt_event_index_N prefix for ALL descendant nodes, not just direct children
of the composite node. This corrupted the NodePath, causing resume detection to
fail and generate new sub-execute-IDs instead of restoring existing ones.

Fix: align PrepareNodeExeCtx condition with initNodeCtx - only inject the batch
index prefix when the current node is a direct child of the composite node
(c.BatchInfo.CompositeNodeKey == c.NodeCtx.NodeKey).

Add TestLoop_SubWorkflow_Nested_Interrupt to verify loop -> sub_wf -> sub_wf ->
interrupt scenario correctly restores (not recreates) inner sub-workflow context.
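
The expectation behind such a regression test, that the inner sub-workflow keeps its SubExecuteID across interrupt and resume, can be illustrated with a toy model. Everything here is hypothetical: `store` and `subExecuteID` are invented for illustration, and keying persisted state by the joined NodePath is an assumption, not the project's storage scheme or the actual test.

```go
package main

import (
	"fmt"
	"strings"
)

// store is a toy stand-in for persisted interrupt state: it maps a joined
// NodePath to the SubExecuteID issued before the interrupt.
type store struct {
	saved  map[string]int64
	nextID int64
}

func newStore() *store { return &store{saved: map[string]int64{}, nextID: 1} }

// subExecuteID restores the ID recorded for path, or issues a new one.
// The bool reports whether the ID was restored rather than created.
func (s *store) subExecuteID(path []string) (int64, bool) {
	key := strings.Join(path, "/")
	if id, ok := s.saved[key]; ok {
		return id, true
	}
	id := s.nextID
	s.nextID++
	s.saved[key] = id
	return id, false
}

func main() {
	good := []string{"loop_node", "interrupt_event_index_0", "outer_sub_wf", "inner_sub_wf"}
	bad := []string{"loop_node", "interrupt_event_index_0", "outer_sub_wf", "interrupt_event_index_0", "inner_sub_wf"}

	// Path written with the old condition, then looked up with the resume-time path.
	s := newStore()
	before, _ := s.subExecuteID(bad)
	after, restored := s.subExecuteID(good)
	fmt.Println(before == after, restored) // false false

	// Both sites agree on the path, so the ID is restored.
	s = newStore()
	before, _ = s.subExecuteID(good)
	after, restored = s.subExecuteID(good)
	fmt.Println(before == after, restored) // true true
}
```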
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...ackend/domain/workflow/internal/execute/context.go | 67.88% <100.00%> (+1.37%) ⬆️ |

... and 4 files with indirect coverage changes

shentongmartin enabled auto-merge April 3, 2026 02:36