How to run SSCHA+QE in parallel on two different computing nodes of a cluster? #156
Dear Development Team,

I am currently working with a cluster composed of multiple computing nodes. I have successfully configured SSCHA to run in conjunction with pw.x on a single computing node. For this, I have been using PBS for job scheduling and submission from the login node. However, when I attempt to run SSCHA+QE across two different computing nodes, I run into issues. Here is a detailed breakdown.

I created an input file, nvt_para.py, for SSCHA, modifying it from the tutorial file nvt_local.py. I adjusted the 'command' value as follows:

command = '~/software/openmpi/4.0.5/bin/mpiexec -n 8 ~/software/QE/6.6_openmpi/bin/pw.x -npool 2 -i PREFIX.pwi > PREFIX.pwo'

The whole content of the file nvt_para.py is shown below:

import sscha, sscha.Ensemble, sscha.SchaMinimizer, sscha.Relax, sscha.Utilities
import cellconstructor as CC, cellconstructor.Phonons
import cellconstructor.Structure, cellconstructor.calculators
import cellconstructor.calculators
from ase.calculators.espresso import Espresso
import numpy as np, matplotlib.pyplot as plt
import sys, os
#
# Initialize the DFT (Quantum Espresso) calculator for H3S
# The input data is a dictionary that encodes the pw.x input file namelist
input_data = {
    'control' : {
        # Avoid writing wavefunctions on the disk
        'disk_io' : 'None',
        # Where to find the pseudopotential
        'pseudo_dir' : '.',
        'outdir' : './outdir',
        'wfcdir' : './wfcdir',
        'tprnfor' : True,
        'tstress' : True
    },
    'system' : {
        # Specify the basis set cutoffs
        'ecutwfc' : 35,        # Cutoff for the wavefunction
        'ecutrho' : 350,       # Cutoff for the density
        'input_dft' : 'blyp',  # Exchange-correlation functional
        # Information about smearing (it is a metal)
        'occupations' : 'smearing',
        'smearing' : 'mv',
        'degauss' : 0.02
    },
    'electrons' : {
        'conv_thr' : 1e-8
    }
}
#
# the pseudopotential for each chemical element
pseudopotentials = {'H' : 'H.pbe-rrkjus_psl.1.0.0.UPF', 'S' : 's_pbe_v1.4.uspp.F.UPF'}
# the kpoints mesh and the offset
kpts = (8,8,8)
koffset = (1,1,1)
#
# Specify the command to call quantum espresso
command = '~/software/openmpi/4.0.5/bin/mpiexec -n 8 ~/software/QE/6.6_openmpi/bin/pw.x -npool 2 -i PREFIX.pwi > PREFIX.pwo'
#
# Prepare the quantum espresso calculator
calculator = CC.calculators.Espresso(input_data,
                                     pseudopotentials,
                                     command = command,
                                     kpts = kpts,
                                     koffset = koffset)
#calculator = Espresso(pseudopotentials = pseudopotentials, input_data = input_data,
# command = command, kpts = kpts, koffset = koffset)
TEMPERATURE = 300
N_CONFIGS = 50
MAX_ITERATIONS = 20
START_DYN = 'start_sscha'
NQIRR = 3
# Let us load the starting dynamical matrix
dyn = CC.Phonons.Phonons(START_DYN, NQIRR)
dyn.Symmetrize()
#
# Initialize the random ionic ensemble
ensemble = sscha.Ensemble.Ensemble(dyn, TEMPERATURE)
#
# Initialize the free energy minimizer
minim = sscha.SchaMinimizer.SSCHA_Minimizer(ensemble)
minim.set_minimization_step(0.01)
#
# Initialize the NVT simulation
relax = sscha.Relax.SSCHA(minim, calculator, N_configs = N_CONFIGS,
                          max_pop = MAX_ITERATIONS)
#
# Define the I/O operations
# To save info about the free energy minimization after each step
ioinfo = sscha.Utilities.IOInfo()
ioinfo.SetupSaving("minim_info")
relax.setup_custom_functions(custom_function_post = ioinfo.CFP_SaveAll)
#
# Run the NVT simulation (save the stress to compute the pressure)
relax.relax(get_stress = True)
#
# If instead you want to run a NPT simulation, use
# The target pressure is given in GPa.
#relax.vc_relax(target_press = 0)
#
# You can also run a mixed simulation (NVT) but with variable lattice parameters
#relax.vc_relax(fix_volume = True)
#
# Now we can save the final dynamical matrix
# And print in stdout the info about the minimization
relax.minim.finalize()
relax.minim.dyn.save_qe("sscha_T{}_dyn".format(TEMPERATURE))

To submit the job, I crafted a script named sub.sh and submitted it with the following command:

qsub sub.sh

The file sub.sh reads:

#!/bin/bash
#PBS -N noname
#PBS -o LogPBS
#PBS -e LogPBS.err
#PBS -m ae
#PBS -l nodes=node1:ppn=16+node2:ppn=16
cd $PBS_O_WORKDIR
echo Job started at `date`
echo Directory is $PWD
echo This job runs on the following nodes:
cat $PBS_NODEFILE
#mpiexec -n 2 -machinefile $PBS_NODEFILE python nvt_para.py >out.log
mpiexec -n 2 -machinefile nodefile python nvt_para.py >out.log

Despite these efforts, I found that all of the pw.x processes (2x8 = 16) spawned exclusively on the first computing node (node1), completely ignoring the second node (node2). The machine file 'nodefile' lists both nodes:

node1
node2

I then modified the command in sub.sh to:

mpiexec -n 2 -machinefile nodefile python nvt_para.py >out.log

However, only 8 pw.x processes appear on node1, while node2 remains inactive. Could you offer some guidance on how to run pw.x in parallel across multiple nodes under the control of SSCHA?

Best regards!
Replies: 3 comments 6 replies
Hi, you could try to use the cluster object and point it at localhost. Setting up a cluster calculation is explained in the SSCHA documentation (http://sscha.eu/Tutorials/tutorial_02_advanced_submission/); you just change the cluster hostname to 127.0.0.1, which loops back to the machine itself.
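A minimal sketch of that setup, adapted from the advanced-submission tutorial linked above; the hostname is the loopback address, while the paths, wall time, and resource numbers below are placeholders (not values from this thread), and the attribute names should be double-checked against your python-sscha version:

import sscha, sscha.Cluster

# Hedged sketch: let SSCHA submit each pw.x calculation through the scheduler
# instead of launching mpiexec directly from the Python process.
my_hpc = sscha.Cluster.Cluster(pwd = None)
my_hpc.hostname = "127.0.0.1"    # loopback: "connect" to the login node itself
my_hpc.workdir = "/home/user/sscha_run"   # hypothetical working directory for the calculations
my_hpc.binary = "~/software/QE/6.6_openmpi/bin/pw.x -npool NPOOL -i PREFIX.pwi > PREFIX.pwo"
my_hpc.mpi_cmd = "~/software/openmpi/4.0.5/bin/mpiexec -n NPROC"
my_hpc.load_modules = ""         # environment setup lines for the job script, if any
my_hpc.n_nodes = 2               # request two nodes per job
my_hpc.n_cpu = 32                # total cores (16 per node, as in the PBS script above)
my_hpc.n_pool = 2                # pw.x k-point pools
my_hpc.time = "12:00:00"         # wall time per job
my_hpc.batch_size = 10           # configurations bundled into one job
my_hpc.job_number = 5            # jobs submitted at the same time
my_hpc.setup_workdir()

# The cluster object is then passed to the relaxation in place of the local
# command, roughly as:
# relax = sscha.Relax.SSCHA(minim, calculator, N_configs = N_CONFIGS,
#                           max_pop = MAX_ITERATIONS, cluster = my_hpc)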
The line "skip targets (VXE FMA4 NEON XOP VSX2 VSX4 VX VXE2 AVX512_SKX VSX3 ASIMD VSX AVX512F) not part of baseline or dispatch-able features" suggest that your machine is not supported. Your system compiler does not support AVX-512. Maybe a solution is in the OpenBLAS FAQ. |
As a curiosity, this is GPT-3.5's answer to the error:

The compilation error "gcc: error: unrecognized command line option '-mavx512vl'" means that the compiler is encountering an option ("-mavx512vl") that it does not recognize. This could be due to a few reasons:
To resolve this error, you can try the following steps:
It's worth noting that the specific error message you provided ("gcc: error: unrecognized command line option '-mavx512vl'") is a common one that can occur in various situations, not just in relation to AVX-512 instructions. So, the steps provided above are general troubleshooting steps for resolving unrecognized-option errors in GCC. Sources:
I agree with Diego. It would be better to submit each calculation as its own job, so that PBS automatically initializes the MPI environment and distributes the workload correctly across your nodes.
The problem is related to the nested MPI calls, which confuse the system.
There are two solutions:
However, by default, the cluster module works with SLURM. To make it work with PBS, you can override the SLURM-specific commands with their PBS equivalents, starting from the commands that are configured for SLURM by default.
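One hedged way to see those defaults on your own installation (and therefore which attributes to override with PBS counterparts such as qsub instead of sbatch, '#PBS' directives instead of '#SBATCH', and resource requests like '-l nodes=...:ppn=...' and '-l walltime=...') is simply to print the attributes of a Cluster object; the snippet below only uses standard Python introspection, since the exact attribute names depend on the python-sscha version:

import sscha, sscha.Cluster

# Hedged sketch: list the public attributes of the Cluster object and their
# default values, so that the SLURM-specific ones (submission command,
# queue directives, resource keywords) can be identified and overridden with
# the equivalent PBS settings. Check Cluster.py of your python-sscha version
# for the authoritative names.
cluster = sscha.Cluster.Cluster(hostname = "127.0.0.1")
for name, value in sorted(vars(cluster).items()):
    if not name.startswith("_"):
        print("{} = {!r}".format(name, value))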