Description
Linkis Component
linkis-computation-governance/linkis-manager
What happened
When a task has not requested resources for a long time, the resource calculation goes wrong.
Problem Description:
The faulty calculation makes the recorded used resources grow with no corresponding allocation, which wastes capacity. Resource tracking becomes inaccurate and reports more resources in use than are actually allocated.
What you expected to happen
The resource calculation logic should be fixed so that resource accounting stays correct. The system should:
- Accurately track resource allocation and release
- Handle long-running tasks without resource calculation drift
- Periodically verify and reconcile resource usage
- Provide accurate resource metrics for scheduling decisions
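The "periodically verify and reconcile" expectation could look roughly like the sketch below. It is a language-neutral illustration in Python, not LinkisManager code: the `Resource` type, its two fields, and the `reconcile` function are all hypothetical names chosen for this example. The idea is to recompute the used total from the allocations that are actually live and compare it with the recorded value.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource vector; the real Linkis resource types differ."""
    memory_mb: int = 0
    cores: int = 0

    def __add__(self, other: "Resource") -> "Resource":
        return Resource(self.memory_mb + other.memory_mb, self.cores + other.cores)

    def __sub__(self, other: "Resource") -> "Resource":
        return Resource(self.memory_mb - other.memory_mb, self.cores - other.cores)


def reconcile(recorded_used: Resource,
              live_allocations: Dict[str, Resource]) -> Tuple[Resource, Resource]:
    """Recompute usage from live allocations.

    Returns (corrected_used, drift), where drift = recorded_used - corrected_used.
    A positive drift means the record claims more than is really allocated.
    """
    actual = Resource()
    for resource in live_allocations.values():
        actual = actual + resource
    return actual, recorded_used - actual


# Example: the record claims 6 GB / 6 cores, but only two engines are alive.
recorded = Resource(memory_mb=6144, cores=6)
live = {
    "engine-1": Resource(memory_mb=2048, cores=2),
    "engine-2": Resource(memory_mb=2048, cores=2),
}
corrected, drift = reconcile(recorded, live)
```

Here `drift` captures the phantom usage (2048 MB and 2 cores in this made-up example), and `corrected` is the value the record would be reset to. Whether to reset automatically or only raise an alert is a design decision this sketch leaves open.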
How to reproduce
- Submit a task to LinkisManager
- Let the task run for an extended period without requesting resources
- Monitor the resource usage metrics in LinkisManager
- Observe that the reported used resources increase although no resources were actually allocated
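To make the last two steps checkable rather than a matter of eyeballing a dashboard, the sampled metrics can be compared against known allocation events. The sketch below assumes the used-resource metric has already been collected as `(timestamp, used)` pairs and that the timestamps of real allocations are known. Both inputs, and the function name `find_phantom_increases`, are hypothetical; how the samples are gathered from LinkisManager is outside this example.

```python
def find_phantom_increases(samples, allocation_times):
    """Return intervals where usage rose without any allocation.

    samples: list of (timestamp, used) tuples, ordered by timestamp.
    allocation_times: timestamps at which resources were really allocated.
    Each result is (t_prev, t_curr, delta) for an unexplained increase.
    """
    phantom = []
    for (t_prev, used_prev), (t_curr, used_curr) in zip(samples, samples[1:]):
        increased = used_curr > used_prev
        explained = any(t_prev < t <= t_curr for t in allocation_times)
        if increased and not explained:
            phantom.append((t_prev, t_curr, used_curr - used_prev))
    return phantom


# Made-up samples: usage rises at t=120 (explained by an allocation at t=100)
# and again at t=240 (no allocation in that interval).
samples = [(0, 4), (60, 4), (120, 6), (180, 6), (240, 8)]
suspicious = find_phantom_increases(samples, allocation_times=[100])
```

Any entry in `suspicious` marks an interval worth correlating with the manager logs.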
Anything else
Potential Root Causes:
- Resource tracking state is not properly synchronized
- Resource cleanup logic is not triggered for idle tasks
- Race condition in resource calculation during concurrent operations
- Memory leak or state accumulation in the resource manager
Suggested Investigation:
- Review the resource tracking logic in LinkisManager
- Check the resource cleanup and garbage collection mechanisms
- Add detailed logging for resource allocation/release events
- Implement resource reconciliation to detect and fix inconsistencies
Impact:
- Resource waste due to phantom resource allocation
- Reduced cluster capacity for new tasks
- Potential scheduling failures when resources appear exhausted
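Several of the suspected causes (unsynchronized state, missed cleanup for idle tasks, races, accumulated state) point toward the same kind of defensive design. The sketch below is one possible shape for it, written in Python for brevity and not taken from the Linkis code base: `ResourceLedger`, the ticket IDs, and the idle timeout are all assumptions. It guards every update with a lock, logs each allocation and release, ignores duplicate operations, and derives the used total from live entries instead of keeping a separate counter that could drift.

```python
import logging
import threading
import time

logger = logging.getLogger("resource_ledger")


class ResourceLedger:
    """Hypothetical in-memory ledger; not the LinkisManager implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._allocations = {}  # ticket_id -> (amount, last_seen)

    def allocate(self, ticket_id, amount, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            if ticket_id in self._allocations:
                # A repeated request must not be counted twice.
                logger.warning("duplicate allocate ignored: %s", ticket_id)
                return False
            self._allocations[ticket_id] = (amount, now)
            logger.info("allocate ticket=%s amount=%s", ticket_id, amount)
            return True

    def release(self, ticket_id):
        with self._lock:
            entry = self._allocations.pop(ticket_id, None)
            if entry is None:
                logger.warning("release of unknown ticket ignored: %s", ticket_id)
                return False
            logger.info("release ticket=%s amount=%s", ticket_id, entry[0])
            return True

    def expire_idle(self, max_idle_s, now=None):
        """Drop entries not seen for longer than max_idle_s; return their IDs."""
        now = time.monotonic() if now is None else now
        with self._lock:
            stale = [t for t, (_, seen) in self._allocations.items()
                     if now - seen > max_idle_s]
            for ticket_id in stale:
                del self._allocations[ticket_id]
                logger.info("expired idle ticket=%s", ticket_id)
        return stale

    def used(self):
        # Derived from live entries, so it cannot diverge from them.
        with self._lock:
            return sum(amount for amount, _ in self._allocations.values())
```

Whether expiring idle entries is appropriate depends on how LinkisManager defines a stale resource request, which the issue does not specify. The timeout here only illustrates where such cleanup would hook in.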