Update resource managing by nicelhc13 · Pull Request #125 · ut-parla/Parla.py

nicelhc13 · 2022-05-03T19:39:05Z

This PR updates two main things:

replace lock-based resource tracking with lock-free one by making the scheduler manage.
move resource updates from mapping to launching; it blocked task scheduling critically.

With this update, on 2000 independent tasks + with GIL + 76000 microsec for each task + depth 100 (5MB),
it took 7 sec for each iteration.

Without this, it took more than 130 sec for each iteration.

wlruys · 2022-05-03T20:03:21Z

Main concern is that I think @sestephens73 had a bunch of reasons for moving resource allocation for mapping. Want to make sure we can resolve those before moving it back.

nicelhc13 · 2022-05-03T20:07:40Z

yes please point me out. regardless of that this pr works, the main bottleneck was that updating resources at mapping is bad.

nicelhc13 · 2022-05-03T20:08:07Z

let me also add @yinengy for the review.

wlruys · 2022-05-05T20:15:39Z

+                if len(task.req.environment.placement) > 1:
+                    raise NotImplementedError("Multidevice not supported")
+                for device in task.req.environment.placement:
+                    self._available_resources.register_parray_move(parray, device)


This seems wrong. The data movement tasks already exist and are already run before this code executes. This is only updating the information once the "parent" task is starting to be run.

Wait it also happens at creation, I'm not sure then. Why is it registered twice onto the same device?

I think this should be a mistake. Then is it fine just to remove this registeration at here?

wlruys · 2022-05-05T20:22:12Z

            if task:
                if not task.assigned:
+                    _new_launched_task = []
+                    for n_task in self._launched_tasks:


Looping over all launched tasks for every mapping decision seems really expensive.

Agree, this should be also refined. but the point of this PR is that we should update resource at launching, not mapping.

wlruys · 2022-05-05T20:25:02Z

+                        task_state = n_task.state
+                        if isinstance(task_state, TaskCompleted):
+                            for dev in n_task.req.devices:
+                                self._available_resources.deallocate_resources(dev, n_task.req.resources)


I think that this doesn't handle Data Movement Tasks? Their resources are still bundled with their "parents", but their resources are no longer allocated 'ahead' of time. Only compute tasks are being tracked and deallocated. This means that they currently are considered to take 'no resources' and can be over-scheduled.

hmm, maybe not. Seem like in this Data Movement tasks are still allocated at mapping time and freed when the "parent" completes. Sorry.

Although (sorry for the stream of consciousness comments on this), doesn't this mean it can still deadlock in the same way as the current mapper? Since it can allocate a bunch of data movement tasks out of order (as there is no guaranteed order to the mapping phase) and prevent the next needed dependency from being mapped (and running) due to being out of resources?

Yes. still deadlock happens. Ok, let's consider this PR as the immediate solution for the performance. I think your pointing out could not be resolved by this PR. Let's consider that issue on the separate PR with more discussion.

sestephens73 · 2022-05-05T21:14:04Z

+        for name, v in resources.items():
+            if not self._update_resource(d, name, v * multiplier, block):
+                success = False
+                break


If this function no longer uses the monitor, it should not be named "atomically"

sestephens73 · 2022-05-05T21:20:23Z

        :param resources: The resources to deallocate.
        """
        logger.debug("[ResourcePool] Acquiring monitor in check_resources_availability()")
-        with self._monitor:


Are we sure removing this monitor is safe?

I suppose it is if only the scheduling thread makes this call.

I checked it by printing a thread and it was called only by the scheduler. But let me double check this.

It can also be called by any thread that creates a PArray:

Parla.py/parla/parray/core.py

Line 52 in d215415

get_scheduler_context().scheduler._available_resources.track_parray(self)

sestephens73 · 2022-05-05T21:20:51Z

@@ -1126,48 +1123,45 @@ def check_resources_availability(self, d: Device, resources: ResourceDict):
        :param resources: The resources to deallocate.
        """
        logger.debug("[ResourcePool] Acquiring monitor in check_resources_availability()")


Delete this line too if deleting the monitor

sestephens73 · 2022-05-05T21:22:01Z

-                if amount > dres[name]:
-                    is_available = False
-                logger.debug("Resource check for %d %s on device %r: %s", amount, name, d, "Passed" if is_available else "Failed")
-            logger.debug("[ResourcePool] Releasing monitor in check_resources_availability()")


delete this too if removing the monitor

sestephens73 · 2022-05-05T21:24:29Z

-                    assert dres[res] >= 0, "{}.{} was over allocated".format(dev, res)
+                    # TODO(lhc): Due to floating point, it doesn't work.
+                    #assert dres[res] <= dev.resources[res], "{}.{} was over deallocated".format(dev, res)
+                    #assert dres[res] >= 0, "{}.{} was over allocated".format(dev, res)


Gross. Can we make these integers?

one possible way is use the default vcu sum as the number of threads or something bigger numbers, not 1.
@dialecticDolt is it fine?

Also gross. We might add the resources back up and exceed the original amount bc of rounding errors and the order that tasks complete matters too.
Probably using: https://docs.python.org/3/library/fractions.html ? might be the cleanest way to handle this.

Update resource managing

d215415

nicelhc13 requested review from sestephens73 and wlruys May 3, 2022 19:39

nicelhc13 requested a review from yinengy May 3, 2022 20:08

wlruys reviewed May 5, 2022

View reviewed changes

sestephens73 reviewed May 5, 2022

View reviewed changes

Conversation

nicelhc13 commented May 3, 2022

Uh oh!

wlruys commented May 3, 2022

Uh oh!

nicelhc13 commented May 3, 2022

Uh oh!

nicelhc13 commented May 3, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wlruys May 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wlruys May 5, 2022 •

edited

Loading