-
Notifications
You must be signed in to change notification settings - Fork 117
Description
Code snippet
What happened?
Description
I have encountered a recurrent issue regarding custodian error handlers that monitor VASP calculations periodically (specifically those with is_monitor=True). When these handlers detect anomalies during a run, they terminate the calculation to modify input files (e.g., INCAR) and restart.
However, after such an intervention, the final validation step fails because the vasprun.xml file from the terminated run is corrupted (truncated or missing closing tags). Even though the handler successfully intervened, the validation throws a parsing error similar to the following:
xml.etree.ElementTree.ParseError: mismatched tag: line 822, column 23
[2025-11-07 07:46:24,381] [ERROR] [custodian.py:398:run] Validation failed: VasprunXMLValidator
I have confirmed the following behaviors:
-
The parsing error occurs specifically when an error handler terminates the VASP job early.
-
Disabling the error handlers with is_monitor=True eliminates the issue.
-
Modifying the code to perform a graceful termination using STOPCAR also resolves the issue completely.
These findings strongly suggest that the default abrupt termination mechanism triggered by these monitors is the root cause.
This behavior has been observed in the following environments:
-
VASP Versions: Both GPU and CPU versions (Tested on VASP 6.5.1 and 6.4.3).
-
Note: The error occurs very frequently with the GPU version of VASP. While it is rare with the CPU version, it does happen occasionally.
-
Custodian Version: Tested on the latest version.
-
Infrastructure: Occurs on both local sandbox environments and HPC computing centers.
Analysis & Suggestion
It seems the main culprit is the abrupt termination signal sent to VASP when a monitor triggers. This "hard kill" cuts VASP off right in the middle of writing vasprun.xml, leaving it incomplete.
To fix this, I implemented a graceful exit mechanism using STOPCAR (e.g., LABORT = .TRUE.) instead of killing the process immediately. This gives VASP enough time to close file handles and finish writing the XML tags. After this change, the parsing errors completely disappeared in all my tests.
I think it would be really useful to add a flag (e.g., graceful_termination=False by default) to the ErrorHandler or VaspErrorHandler. If enabled, the handler would try to stop the job via STOPCAR first and wait for the process to finish cleanly before resorting to a hard kill. This would be a huge lifesaver for users running VASP on GPUs where this I/O issue is frequent.
Version
2025.8.13
Which OS?
- MacOS
- Windows
- Linux