
Handle HTCondor "Unable to locate local daemon" Error#12172

Open
hassan11196 wants to merge 4 commits into dmwm:master from hassan11196:handle-condor-local-daemon

Conversation

@hassan11196
Member

Fixes #9703

Status

in development

Description

This pull request adds exception handling to catch the "Unable to locate local daemon" error thrown when a schedd instance is created, i.e. schedd = htcondor.Schedd(), in SimpleCondorPlugin.py.
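The pattern described above can be sketched as follows. This is an illustrative sketch, not the actual PR diff: `createSchedd` and the injected `factory` callable are hypothetical names, and the factory stands in for `htcondor.Schedd` so the sketch runs without a condor installation.

```python
class CondorScheddUnavailable(Exception):
    """Raised when the local schedd daemon cannot be located."""

def createSchedd(factory):
    """Instantiate a schedd via `factory` (htcondor.Schedd in the real
    plugin), translating failures into a dedicated exception so callers
    can distinguish them from other errors."""
    try:
        return factory()
    except Exception as ex:
        raise CondorScheddUnavailable(
            "Unable to locate local daemon: %s" % str(ex)) from ex
```

An upstream caller can then catch CondorScheddUnavailable specifically and skip the polling cycle instead of crashing the component.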

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

No

@hassan11196 hassan11196 self-assigned this Nov 19, 2024
@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 1 warnings
    • 45 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/89/artifact/artifacts/PullRequestReport.html

@hassan11196
Member Author

Hi @amaltaro,

I have wrapped the bossAir submit call in JobSubmitterPoller and the track call in StatusPoller.
However, since we are re-raising the exception, wouldn't the issue remain?

Should I modify the Exception type to something specific and handle it somewhere upstream?
Let me know what you think.

Thanks.

Contributor

@amaltaro amaltaro left a comment


@hassan11196 thank you for proposing this fix.

This is not a complete review, as I need to look into the exception propagation with more attention, but I think we should move all lines schedd = htcondor.Schedd() in this module under the try/except clause as well (according to the tracebacks reported in the original issue).

With that, I believe some of these try/except that you provided are no longer relevant.

@hassan11196
Member Author

> @hassan11196 thank you for proposing this fix.
>
> This is not a complete review, as I need to look into the exception propagation with more attention, but I think we should move all lines schedd = htcondor.Schedd() in this module under the try/except clause as well (according to the tracebacks reported in the original issue).
>
> With that, I believe some of these try/except that you provided are no longer relevant.

I looked at all the schedd = htcondor.Schedd() instances and their exceptions were handled upstream in several places, but I agree that since it's the source, we should wrap it, and maybe refactor it into a separate method with exception handling.

Let me know what you think. Thanks for looking into it.

@mapellidario mapellidario self-requested a review December 9, 2024 15:12
Member

@mapellidario mapellidario left a comment


@hassan11196 , changes look good and reflect the discussion you had with Alan in the original issue. I have a small question though :)

Comment on lines +764 to +765
myThread.transaction.rollback()
raise WMException(msg) from ex
Member


I am not sure I understand the need for this rollback, since it seems that if you raise a WMException the rollback already happens here.

Member Author


Hi @mapellidario, I agree it was redundant; I have removed the rollback from the submitJobs method.

@hassan11196
Member Author

Thank you for the review @mapellidario. Can you provide a suggestion on how to handle the exception thrown at
https://github.com/hassan11196/WMCore/blob/handle-condor-local-daemon/src/python/WMComponent/JobSubmitter/JobSubmitterPoller.py#L764

which is re-raised in the algorithm method?
https://github.com/hassan11196/WMCore/blob/024268088c77e29805feb2fd44c8ca529f1ee7bc/src/python/WMComponent/JobSubmitter/JobSubmitterPoller.py#L842

If this is not handled, the component will crash.

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 1 warnings
    • 45 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/178/artifact/artifacts/PullRequestReport.html

@mapellidario
Member

I thought about this for a while. The only way out that I see is adding a new exception, let's say WMSoftException, which you throw when bossAir fails (L765), catch one level up (L840), and then roll back the transaction without raising again.

You can start by defining the new exception in the JobSubmitterPoller module; then, in case we need it in more places, we can create a new file. Even a simple one would do for the time being:

class WMSoftException(Exception):
    pass
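The throw/catch flow described above (raise at the bossAir failure, catch one level up, roll back without re-raising) can be sketched like this; `submitJobs`, `algorithm`, and `FakeTransaction` are simplified stand-ins for the real JobSubmitterPoller code:

```python
class WMSoftException(Exception):
    """Soft failure: the component logs, rolls back, and skips the cycle."""

class FakeTransaction:
    """Minimal stand-in for myThread.transaction."""
    def __init__(self):
        self.rolledBack = False
    def rollback(self):
        self.rolledBack = True

def submitJobs(bossAirSubmit):
    """Roughly the L765 site: translate a bossAir failure into the soft exception."""
    try:
        return bossAirSubmit()
    except RuntimeError as ex:
        raise WMSoftException("bossAir submission failed: %s" % str(ex)) from ex

def algorithm(bossAirSubmit, transaction):
    """Roughly the L840 site: catch one level up, roll back, do NOT re-raise."""
    try:
        submitJobs(bossAirSubmit)
        return True
    except WMSoftException:
        transaction.rollback()
        return False
```

Because `algorithm` swallows the soft exception after rolling back, the component survives the cycle and simply tries again on the next poll.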

@amaltaro
Contributor

@hassan11196 @mapellidario I think the code already implemented will be needed, so please keep it around.

In addition, I would suggest creating a (private) method to instantiate a schedd object, which would basically contain the htcondor.Schedd() instantiation and raise a custom exception in case of problems. I would suggest naming it ScheddUnavailable, or a variation of this and of Dario's suggestion above.

We need then to catch this exception upstream and ensure that:

  • a transaction is rolled back, if needed
  • this new exception is caught and gracefully handled by the component (JobSubmitter, JobStatusLite, anyone else?)
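A minimal sketch of the private-method idea, assuming a SimpleCondorPlugin-like class; `CondorPluginSketch`, `_getSchedd`, and the injected factory are illustrative names, with the factory replacing `htcondor.Schedd` so the sketch runs without a condor install:

```python
class ScheddUnavailable(Exception):
    """Custom exception for schedd instantiation failures (name suggested above)."""

class CondorPluginSketch:
    """Toy stand-in for SimpleCondorPlugin."""

    def __init__(self, scheddFactory):
        # the real plugin would use htcondor.Schedd here
        self.scheddFactory = scheddFactory

    def _getSchedd(self):
        """Single place where the schedd is instantiated, so every call
        site gets the same error handling."""
        try:
            return self.scheddFactory()
        except Exception as ex:
            raise ScheddUnavailable(str(ex)) from ex
```

Upstream, the components (JobSubmitter, JobStatusLite) would catch ScheddUnavailable, roll back any open transaction if needed, and skip the cycle gracefully.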

@mapellidario
Member

Sorry for throwing yet another idea onto the table.

In JobSubmitterPoller we call bossair.submit() [1], which I think eventually calls [2]. A few lines below, we create a schedd object.

Instead, we could actually benefit from the PyCondorAPI.py module in SimpleCondorPlugin.py, so that we can recreate the schedd object when the agent loses connection, similar to what is done here [3].

This could actually be done when addressing #12238 , since SimpleCondorPlugin already needs some changes.


[1]

successList, failList = self.bossAir.submit(jobs=jobList)

[2]

def submit(self, jobs, info=None):

[3]

except Exception:
    self.recreateSchedd()
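The recreate-on-failure pattern quoted in [3] might look roughly like this; `CondorAPISketch`, `query`, and the injected factory are simplified stand-ins for the PyCondorAPI code:

```python
class CondorAPISketch:
    """Toy version of the recreate-the-schedd-on-failure pattern."""

    def __init__(self, scheddFactory):
        self.scheddFactory = scheddFactory
        self.schedd = scheddFactory()

    def recreateSchedd(self):
        # drop the (possibly stale) object and build a fresh one
        self.schedd = self.scheddFactory()

    def query(self, queryFunc):
        """Run queryFunc against the schedd; on failure, recreate the
        schedd so the next attempt starts with a fresh connection."""
        try:
            return queryFunc(self.schedd)
        except Exception:
            self.recreateSchedd()
            return None
```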

@amaltaro
Contributor

That is a good point and it might indeed be a good idea, Dario. I have two concerns with that though:

  1. is it a problem to also create a collector object when you don't necessarily need it?
  2. I fear that in many scenarios, if we try to recreate a schedd object right away, it might just fail again (e.g., condor restart or something like that).

That said, if we want to have retry logic in place, I think the best would be to give it some very short grace period (<= 10 seconds).
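The retry-with-grace-period idea could be sketched like this; `createScheddWithRetry` and its parameters are hypothetical, with the 10-second default taken from the suggestion above:

```python
import time

def createScheddWithRetry(factory, retries=1, gracePeriod=10):
    """Try to create a schedd via `factory`; on failure, wait a short
    grace period (<= 10 seconds, per the discussion) and retry.
    Re-raises the last error once the retries are exhausted."""
    for attempt in range(retries + 1):
        try:
            return factory()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(gracePeriod)
```

Keeping `retries` low matters here: if the schedd is down for a condor restart, hammering it in a loop only delays the polling cycle without improving the odds of success.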

@mapellidario
Member

No, I do not think that a collector object would create any problem. It should not be a big object; we are going to create a single one every time, not 100k, so the increase in memory usage is going to be negligible.

And yes, I agree with adding a grace period; we do not really want to spam the schedd while it is busy with something else, or with a high duty cycle :)

@amaltaro
Contributor

@mapellidario my concern is not so much with the memory footprint, but with another point of failure that can make this implementation weak. In other words, if there is no need to have a collector object - which I understand opens a connection to the actual HTCondor collector - I'd rather not give it a chance to fail in there.

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 1 warnings
    • 45 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/629/artifact/artifacts/PullRequestReport.html

@hassan11196
Member Author

Hi @mapellidario @amaltaro, I have added the new exception class CondorScheddUnavailable. Can you please review the current implementation and let me know if it aligns with what we need?
I can then clean it up further.
Thanks

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 15 warnings and errors that must be fixed
    • 7 warnings
    • 109 comments to review
  • Pycodestyle check: succeeded
    • 39 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/631/artifact/artifacts/PullRequestReport.html

@mapellidario
Member

So, if I understand correctly, when you want to submit jobs and fail to talk to the schedd, you catch the CondorScheddUnavailable, write to the logs, return, wait for the next cycle, and hope that you will be able to talk to the schedd then, right?
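That catch-log-return flow might be sketched as follows; `submitCycle` and the injected `submitFunc` are illustrative stand-ins, and `CondorScheddUnavailable` is redefined here so the sketch is self-contained:

```python
import logging

class CondorScheddUnavailable(Exception):
    """Raised when the schedd cannot be contacted."""

def submitCycle(submitFunc):
    """One polling cycle: on CondorScheddUnavailable, log a warning and
    return early; the next cycle retries naturally."""
    try:
        submitFunc()
        return True
    except CondorScheddUnavailable as ex:
        logging.warning("Schedd unavailable, skipping this cycle: %s", ex)
        return False
```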

Contributor

@amaltaro amaltaro left a comment


@hassan11196 thank you for providing these changes, Ahmed. I left a few comments along the code for your consideration.


try:
    runningJobs = self.bossAir.track()
except Exception as ex:
Contributor


I think this code needs to be rolled back. Otherwise, it defeats the new CondorScheddUnavailable catch implemented in the algorithm above.


def getScheddObject(self):
    """
    __getScheddObject_
Contributor


Feel free to remove this line (old docstring style that we are trying not to use anymore) - same for the new exception implemented above.

# if there are jobs in wmbs executing state, update their prio in condor
if self.executingJobsDAO.execute(workflow) > 0:
    logging.info("Updating condor jobs priority for request: %s", workflow)
    # TODO: verify if we should wrap this in a try/except for the CondorScheddException as well?
Contributor


We could, but I would rather address this in a new/separate issue such that we only print a friendly error message, instead of:

2025-03-01 12:42:46,832:139956067981056:ERROR:SimpleCondorPlugin:Unable to edit jobs matching constraint
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 486, in updateJobInformation
    schedd.edit(constraint, 'JobPrio', classad.Literal(newPriority))
  File "/usr/local/lib/python3.8/site-packages/htcondor/_lock.py", line 70, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Unable to edit jobs matching constraint
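The friendlier handling suggested above could be sketched like this; `updateJobPriority` and `editFunc` are hypothetical stand-ins for the `schedd.edit` call site, and the point is logging one concise error line instead of a full traceback:

```python
import logging

def updateJobPriority(editFunc, constraint, newPriority):
    """Apply the priority edit; on failure, log a single friendly line
    (logging.error, not logging.exception) instead of a traceback."""
    try:
        editFunc(constraint, "JobPrio", newPriority)
        return True
    except Exception as ex:
        logging.error("Could not update job priority for constraint %s: %s",
                      constraint, ex)
        return False
```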

except CondorScheddUnavailable as ex:
    msg = "Condor Schedd is unavailable: %s" % str(ex)
    logging.error(msg)
    myThread.logdbClient.post("JobSubmitter_submitWork", msg, "error")
Contributor


LogDB usage is annoying in the sense that documents are not automatically cleaned up. In other words, if we create an error record, the agent/component would keep this error in the LogDB (and WMStats) until someone decides to delete it.

I would be in favor of not adding LogDB here and rely on other ways to monitor agent job submission.

# don't raise WMException, just return
logging.warning("JobSubmitter didn't submit any jobs due to condor schedd being unavailable.")
# TODO: verify if we should roll back the transaction or not?
myThread.transaction.rollback()
Contributor


Can you please investigate further if self.bossAir.submit(jobs=jobList) is actually persisting anything in the relational database? I think it relies on the lines below for persisting data in the database. If that is true, then there is no need to add rollback logic in here.



Development

Successfully merging this pull request may close these issues.

JobStatusLite crashing while tracking jobs - unable to locate daemon

4 participants