Can radical.entk add support for conditional scheduling/execution of tasks/pipelines #632

@GKNB

Description

A common workflow pattern in ML involves multiple units of work, each consisting of three stages: data generation, training, and data analysis. It is natural to submit each unit of work as a separate pipeline so that they can run asynchronously. However, in adaptive learning, different units of work can have dependencies. For example, the data generation stage of work_2 might depend on the data generation or training stage of work_1, while work_1 and work_2 are not completely serial (i.e., the data generation stage of work_2 does not depend on the data analysis stage of work_1).

My current solution to this problem is as follows: I first create n pipelines, where n equals the number of units of work. Then, for every pipeline except the first, I insert a "monitoring stage" containing a single "monitoring task" at the beginning. This task monitors whether the required stages in the previous pipeline have finished; once they have, the task exits so that the pipeline can continue with its actual work.
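For context, the monitoring task is essentially a polling loop like the following minimal sketch (the signal-file path and poll interval are hypothetical; the real script may differ):

```shell
#!/bin/bash
# Hypothetical path where pipeline_{k-1} writes its signal file
# once the stages required by pipeline_k have finished.
SIGNAL_FILE="/tmp/pipeline_prev.done"

# Poll until the signal file appears, then exit with success so
# that the rest of this pipeline can proceed.
while [ ! -f "$SIGNAL_FILE" ]; do
    sleep 5
done
```

Note that this loop occupies one CPU core for its entire lifetime, even though it does no useful work.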

However, this implementation wastes resources. It requires n-1 monitoring tasks; the one in pipeline_k is a simple bash script that polls for a "signal file" created by pipeline_{k-1} once pipeline_{k-1} has generated all the data required by pipeline_k. Like any other task, each monitoring task consumes resources (one CPU core each), so the overhead grows with the number of pipelines. A more serious problem is that it can introduce deadlock, because entk does not guarantee the order in which pipelines are launched. In the extreme case where the number of pipelines exceeds the number of cores, all resources could be occupied by the monitoring tasks of pipelines 2 through n, so that the first real task in pipeline_1 can never start: a deadlock.

If we were allowed conditional scheduling/execution of tasks/pipelines, we would not need monitoring tasks at all. For example, an API similar to CUDA streams would solve this issue: task.post_exec_create_event(event_name) would create an event when a task finishes, and pipeline.wait_event(event_name) would make a pipeline wait for that event before continuing to its next stage.
