
[Bug][LinkisManager] Resource calculation exception when task has not requested resources for a long time #5317

@v-kkhuang

Linkis Component

linkis-computation-governance/linkis-manager

What happened

English:

When a task goes a long time without requesting resources, the resource calculation in LinkisManager goes wrong.

Problem Description:
The faulty calculation makes the recorded used resources grow for no apparent reason, which wastes resources. Resource tracking becomes inaccurate and shows more resources in use than are actually allocated.


What you expected to happen

English:

The resource calculation logic should be fixed so that resource usage is calculated correctly. The system should:

  1. Accurately track resource allocation and release
  2. Handle long-running tasks without resource calculation drift
  3. Periodically verify and reconcile resource usage
  4. Provide accurate resource metrics for scheduling decisions
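
To make point 3 concrete, here is a minimal sketch of a periodic reconciliation check. It assumes the manager can list what each live engine instance actually holds. `ResourceReconciler`, its methods, and the MB-based accounting are hypothetical names made up for illustration and are not taken from the Linkis codebase.

```java
import java.util.Map;

// Hypothetical sketch: none of these names come from the Linkis codebase.
final class ResourceReconciler {
    private ResourceReconciler() {}

    /** Sums the memory actually held by live engine instances (instance id -> MB). */
    static long actualUsedMb(Map<String, Long> liveAllocationsMb) {
        long total = 0;
        for (long mb : liveAllocationsMb.values()) {
            total += mb;
        }
        return total;
    }

    /** Positive drift means the counter reports more usage than is really allocated. */
    static long driftMb(long recordedUsedMb, Map<String, Long> liveAllocationsMb) {
        return recordedUsedMb - actualUsedMb(liveAllocationsMb);
    }

    /** Logs any drift and returns the corrected value; meant to be called from a periodic job. */
    static long reconcile(long recordedUsedMb, Map<String, Long> liveAllocationsMb) {
        long drift = driftMb(recordedUsedMb, liveAllocationsMb);
        if (drift != 0) {
            System.out.printf("resource drift detected: recorded=%dMB actual=%dMB drift=%dMB%n",
                    recordedUsedMb, recordedUsedMb - drift, drift);
        }
        return recordedUsedMb - drift;
    }
}
```

With a recorded value of 4096 MB and live instances holding 2048 MB and 1024 MB, `driftMb` would report 1024 MB of phantom usage, which is the symptom described in this issue.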

How to reproduce

English:

  1. Submit a task to LinkisManager
  2. Let the task run without requesting resources for an extended period
  3. Monitor the resource usage metrics in LinkisManager
  4. Observe that the reported used resources increase without actual resource allocation

Anything else

English:

Potential Root Causes:

  1. Resource tracking state not properly synchronized
  2. Resource cleanup logic not triggered for idle tasks
  3. Race condition in resource calculation during concurrent operations
  4. Memory leak or state accumulation in resource manager
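
Causes 2 and 3 above are guesses, but the kind of guard that would rule them out can be sketched. The table below keys reservations by ticket so that a repeated reserve or release cannot be counted twice, serializes updates with `synchronized`, and expires reservations that have been idle too long. `ReservationTable` and everything in it are hypothetical and do not describe how LinkisManager is currently implemented.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch; class and method names are illustrative, not Linkis APIs.
final class ReservationTable {
    // ticket id -> {reserved MB, time of the last request in ms}
    private final Map<String, long[]> reservations = new HashMap<>();
    private long usedMb = 0;

    synchronized void reserve(String ticketId, long mb, long nowMs) {
        long[] previous = reservations.put(ticketId, new long[] {mb, nowMs});
        // Replacing an existing reservation must not count it twice.
        usedMb += mb - (previous == null ? 0 : previous[0]);
    }

    synchronized void release(String ticketId) {
        long[] entry = reservations.remove(ticketId);
        if (entry != null) {
            usedMb -= entry[0]; // releasing an unknown or already released ticket is a no-op
        }
    }

    /** Releases reservations idle for longer than timeoutMs and returns the MB freed. */
    synchronized long expireStale(long nowMs, long timeoutMs) {
        long freed = 0;
        Iterator<Map.Entry<String, long[]>> it = reservations.entrySet().iterator();
        while (it.hasNext()) {
            long[] entry = it.next().getValue();
            if (nowMs - entry[1] > timeoutMs) {
                freed += entry[0];
                it.remove();
            }
        }
        usedMb -= freed;
        return freed;
    }

    synchronized long usedMb() {
        return usedMb;
    }
}
```

Because the used total only ever changes together with the per-ticket map, it cannot drift away from the sum of the entries, which is the invariant this issue suggests is being violated.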

Suggested Investigation:

  1. Review resource tracking logic in LinkisManager
  2. Check resource cleanup and garbage collection mechanisms
  3. Add detailed logging for resource allocation/release events
  4. Implement resource reconciliation to detect and fix inconsistencies
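
For item 3, one possible shape for the event logging is an audit log that records every allocate and release together with the running total, so the total can later be recomputed from the events alone and compared with the live counter. This is only a sketch under that assumption: `ResourceAuditLog`, `Kind`, and `Event` are hypothetical names, and the in-memory list stands in for whatever logging backend is actually used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the Linkis logging or resource API.
final class ResourceAuditLog {
    enum Kind { ALLOCATE, RELEASE }

    static final class Event {
        final Kind kind;
        final String ticketId;
        final long mb;
        final long usedAfterMb;

        Event(Kind kind, String ticketId, long mb, long usedAfterMb) {
            this.kind = kind;
            this.ticketId = ticketId;
            this.mb = mb;
            this.usedAfterMb = usedAfterMb;
        }

        @Override
        public String toString() {
            return kind + " ticket=" + ticketId + " mb=" + mb + " usedAfter=" + usedAfterMb;
        }
    }

    private final List<Event> events = new ArrayList<>();
    private long usedMb = 0;

    /** Applies the change to the running total and records it with the resulting total. */
    synchronized void record(Kind kind, String ticketId, long mb) {
        usedMb += (kind == Kind.ALLOCATE) ? mb : -mb;
        events.add(new Event(kind, ticketId, mb, usedMb));
    }

    /** Recomputes usage from the events alone, for comparison with a live counter. */
    synchronized long replayUsedMb() {
        long total = 0;
        for (Event e : events) {
            total += (e.kind == Kind.ALLOCATE) ? e.mb : -e.mb;
        }
        return total;
    }

    /** Returns one formatted line per event, in the order they were recorded. */
    synchronized List<String> lines() {
        List<String> out = new ArrayList<>();
        for (Event e : events) {
            out.add(e.toString());
        }
        return out;
    }
}
```

If the value reported by the manager ever differs from `replayUsedMb()`, the recorded lines show which ticket was allocated without a matching release.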

Impact:

  • Resource waste due to phantom resource allocation
  • Reduced cluster capacity for new tasks
  • Potential scheduling failures when resources appear exhausted
