Skip to content

[Bug]: ErrorHandlers with monitoring features sometimes cause vasprun.xml corruption due to abrupt termination (frequent on GPU VASP) #425

@simplecaryu

Description

@simplecaryu

Code snippet

What happened?

Description

I have encountered a recurrent issue regarding custodian error handlers that monitor VASP calculations periodically (specifically those with is_monitor=True). When these handlers detect anomalies during a run, they terminate the calculation to modify input files (e.g., INCAR) and restart.

However, after such an intervention, the final validation step fails because the vasprun.xml file from the terminated run is corrupted (truncated or missing closing tags). Even though the handler successfully intervened, the validation throws a parsing error similar to the following:

xml.etree.ElementTree.ParseError: mismatched tag: line 822, column 23
[2025-11-07 07:46:24,381] [ERROR] [custodian.py:398:run] Validation failed: VasprunXMLValidator

I have confirmed the following behaviors:

  • The parsing error occurs specifically when an error handler terminates the VASP job early.

  • Disabling the error handlers with is_monitor=True eliminates the issue.

  • Modifying the code to perform a graceful termination using STOPCAR also resolves the issue completely.

These findings strongly suggest that the default abrupt termination mechanism triggered by these monitors is the root cause.

This behavior has been observed in the following environments:

  • VASP Versions: Both GPU and CPU versions (Tested on VASP 6.5.1 and 6.4.3).

  • Note: The error occurs very frequently with the GPU version of VASP. While it is rare with the CPU version, it does happen occasionally.

  • Custodian Version: Tested on the latest version.

  • Infrastructure: Occurs on both local sandbox environments and HPC computing centers.

Analysis & Suggestion

It seems the main culprit is the abrupt termination signal sent to VASP when a monitor triggers. This "hard kill" cuts VASP off right in the middle of writing vasprun.xml, leaving it incomplete.

To fix this, I implemented a graceful exit mechanism using STOPCAR (e.g., LABORT = .TRUE.) instead of killing the process immediately. This gives VASP enough time to close file handles and finish writing the XML tags. After this change, the parsing errors completely disappeared in all my tests.

I think it would be really useful to add a flag (e.g., graceful_termination=False by default) to the ErrorHandler or VaspErrorHandler. If enabled, the handler would try to stop the job via STOPCAR first and wait for the process to finish cleanly before resorting to a hard kill. This would be a huge lifesaver for users running VASP on GPUs where this I/O issue is frequent.

Version

2025.8.13

Which OS?

  • MacOS
  • Windows
  • Linux

Log output

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions