Handle HTCondor "Unable to locate local daemon" Error#12172
hassan11196 wants to merge 4 commits into dmwm:master
Conversation
…ler and StatusPoller
Jenkins results:
Hi @amaltaro, I have wrapped the bossAir submit call in JobSubmitterPoller and the track call in StatusPoller. Should I modify the exception type to something specific and handle it somewhere upstream? Thanks.
amaltaro
left a comment
@hassan11196 thank you for proposing this fix.
This is not a complete review, as I need to look into the exception propagation with more attention, but I think we should move all of the `schedd = htcondor.Schedd()` lines in this module under the try/except clause as well (according to the tracebacks reported in the original issue).
With that, I believe some of these try/except that you provided are no longer relevant.
Let me know what you think. Thanks for looking into it.
mapellidario
left a comment
@hassan11196 , changes look good and reflect the discussion you had with alan in the original issue. I have a small question though :)
```python
myThread.transaction.rollback()
raise WMException(msg) from ex
```

I am not sure I understand the need for this rollback, since it seems that if you raise a WMException, the rollback already happens here.
Hi @mapellidario, I agree it was redundant; I have removed the rollback from the submitJobs method.
Thank you for the review @mapellidario. Can you provide a suggestion on how to handle the exception that is re-raised in the algorithm method? If this is not handled, the component will crash.
Jenkins results:
I thought about this for a while. The only way out that I see is adding a new exception. You can start defining the new exception in the JobSubmitterPoller module; then, in case we need it in more places, we can create a new file. Even a simple class would do for the time being:

```python
class WMSoftException(Exception):
    pass
```
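To illustrate Dario's idea, here is a minimal, self-contained sketch of how such a "soft" exception lets a poller skip a cycle instead of crashing. The `pollOnce` helper is hypothetical; in WMCore the catch would live in the poller's `algorithm` method:

```python
import logging

class WMSoftException(Exception):
    """Non-fatal error: log it and let the component continue polling."""

def pollOnce(work):
    """Run one polling cycle, surviving soft failures instead of crashing.

    Returns True when the cycle succeeded, False when it was skipped
    because of a soft (recoverable) failure.
    """
    try:
        work()
    except WMSoftException as ex:
        logging.warning("Skipping cycle, will retry next time: %s", ex)
        return False
    return True
```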
@hassan11196 @mapellidario I think the code already implemented will be needed, so please keep it around. In addition, I would suggest creating a (private) method to instantiate a schedd object, which would basically contain the `schedd = htcondor.Schedd()` call. We then need to catch this exception upstream and ensure that:
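A minimal sketch of that private schedd-factory method, assuming a `CondorScheddUnavailable` exception and using a stand-in `schedd_factory` callable instead of the real `htcondor.Schedd` (so the sketch runs without HTCondor installed; the real code would catch `htcondor.HTCondorIOError` rather than `RuntimeError`):

```python
class CondorScheddUnavailable(Exception):
    """Raised when the local schedd daemon cannot be located."""

def getScheddObject(schedd_factory):
    """Instantiate a schedd object, translating low-level failures.

    `schedd_factory` stands in for `htcondor.Schedd` here; the real
    method would catch `htcondor.HTCondorIOError` instead.
    """
    try:
        return schedd_factory()
    except RuntimeError as ex:
        raise CondorScheddUnavailable(
            "Unable to locate local daemon: %s" % ex) from ex
```

Callers then catch `CondorScheddUnavailable` upstream instead of a bare `Exception`.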
Sorry for throwing yet another idea onto the table. In JobSubmitterPoller we call bossair.submit() [1], which I think eventually calls [2]. A few lines below, we create a schedd object. Instead, we could actually benefit from the PyCondorAPI.py module in SimpleCondorPlugin.py, so that we can recreate the schedd object when the agent loses connection, similar to what is done here [3]. This could actually be done when addressing #12238, since SimpleCondorPlugin already needs some changes.

[1] [2] [3] WMCore/src/python/WMCore/Services/PyCondor/PyCondorAPI.py, lines 71 to 72 in 59d47b8
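The recreate-on-connection-loss idea referenced in [3] could look roughly like the following sketch. It caches a schedd object and rebuilds it once when a query fails; the `factory` callable is a hypothetical stand-in for `htcondor.Schedd`, and `RuntimeError` stands in for the real HTCondor I/O error:

```python
class PyCondorAPISketch:
    """Cache a schedd object and recreate it when a query fails once."""

    def __init__(self, factory):
        self._factory = factory
        self._schedd = factory()

    def getScheddJobs(self, query):
        try:
            return query(self._schedd)
        except RuntimeError:
            # connection lost: recreate the schedd object and retry once
            self._schedd = self._factory()
            return query(self._schedd)
```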
That is a good point and it might indeed be a good idea, Dario. I have two concerns with that though:
That said, if we want to have retry logic in place, I think the best would be to give it some very short grace period (<= 10 seconds).
No, I do not think that a collector object would create any problem. It should not be a big object, and we are going to create a single one every time, not 100k, so the increase in memory usage is going to be negligible. And yes, I agree with adding a grace period; we do not really want to spam the schedd while it is busy with something else, or with a high duty cycle :)
@mapellidario my concern is not so much with the memory footprint, but with another point of failure that can make this implementation weak. In other words, if there is no need to have a collector object (which I understand opens a connection to the actual HTCondor collector), I'd rather not give it a chance to fail.
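The short grace period agreed on above could be sketched as a small retry helper (hypothetical names; `time.sleep` is injectable so the logic is testable without actually waiting):

```python
import time

def callWithGrace(call, retries=1, gracePeriod=10, sleep=time.sleep):
    """Call `call`; on ConnectionError wait a short grace period and retry.

    `gracePeriod` is kept short (<= 10 seconds) so a busy schedd is not
    hammered, per the discussion above. The last failure is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == retries:
                raise
            sleep(gracePeriod)
```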
Jenkins results:
…n JobSubmitterPoller and StatusPoller
Hi @mapellidario @amaltaro, I have added the new exception class `CondorScheddUnavailable`. Can you please review the current implementation and let me know if it aligns with what we need?
Jenkins results:
So, if I understand correctly, when you want to submit jobs and fail to talk to the schedd, you catch the
amaltaro
left a comment
@hassan11196 thank you for providing these changes, Ahmed. I left a few comments along the code for your consideration.
```python
try:
    runningJobs = self.bossAir.track()
except Exception as ex:
```
I think this code needs to be rolled back. Otherwise, it defeats the new `CondorScheddUnavailable` catch implemented in the algorithm above.
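The issue being pointed out can be shown in isolation: a broad `except Exception` inside the tracking call swallows the specific exception before the caller's handler ever sees it (a toy sketch, not WMCore code):

```python
class CondorScheddUnavailable(Exception):
    pass

def trackWithBroadExcept():
    """Mimics wrapping the bossAir.track() call in a broad try/except."""
    try:
        raise CondorScheddUnavailable("Unable to locate local daemon")
    except Exception:
        return []  # swallowed here: the caller never sees the real error

def algorithm():
    try:
        trackWithBroadExcept()
    except CondorScheddUnavailable:
        return "handled by the specific catch"
    return "specific catch never triggered"
```

Because the inner broad catch wins, `algorithm()` returns "specific catch never triggered", which is exactly why the inner try/except should be removed.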
```python
def getScheddObject(self):
    """
    __getScheddObject_
```
Feel free to remove this line (old docstring style that we are trying not to use anymore) - same for the new exception implemented above.
```python
# if there are jobs in wmbs executing state, update their prio in condor
if self.executingJobsDAO.execute(workflow) > 0:
    logging.info("Updating condor jobs priority for request: %s", workflow)
    # TODO: verify if we should wrap this in a try/except for the CondorScheddException as well?
```
We could, but I would rather address this in a new/separate issue, such that we only print a friendly error message instead of:

```
2025-03-01 12:42:46,832:139956067981056:ERROR:SimpleCondorPlugin:Unable to edit jobs matching constraint
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 486, in updateJobInformation
    schedd.edit(constraint, 'JobPrio', classad.Literal(newPriority))
  File "/usr/local/lib/python3.8/site-packages/htcondor/_lock.py", line 70, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Unable to edit jobs matching constraint
```
```python
except CondorScheddUnavailable as ex:
    msg = "Condor Schedd is unavailable: %s" % str(ex)
    logging.error(msg)
    myThread.logdbClient.post("JobSubmitter_submitWork", msg, "error")
```
LogDB usage is annoying in the sense that documents are not automatically cleaned up. In other words, if we create an error record, the agent/component would keep this error in the LogDB (and WMStats) until someone decides to delete it.
I would be in favor of not adding LogDB here and rely on other ways to monitor agent job submission.
```python
# don't raise WMException, just return
logging.warning("JobSubmitter didn't submit any jobs due to condor schedd being unavailable.")
# TODO: verify if we should rollback the transaction or not?
myThread.transaction.rollback()
```
Can you please investigate further whether `self.bossAir.submit(jobs=jobList)` is actually persisting anything in the relational database? I think it relies on the lines below for persisting data in the database. If that is true, then there is no need to add rollback logic here.
Fixes #9703
Status
in development
Description
This pull request adds exception handling to catch the `Unable to locate local daemon` error thrown when a schedd instance is created, i.e. `schedd = htcondor.Schedd()` in `SimpleCondorPlugin.py`.

Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
No