
MPI job with multiple nodes cannot be launched correctly on Polaris #639

@GKNB

Description


I have a workflow that consists of a single stage with a single task. The task is an MPI job that uses multiple processes and is supposed to run on multiple nodes, but I find that it actually runs on only a single node. This is on the Polaris machine, where the resource manager is PBS and task launching is handled by mpiexec. Below is my EnTK script:

from radical import entk
import os
import argparse, sys, math

class MVP(object):

    def __init__(self):
        self.am = entk.AppManager()

    def set_resource(self, res_desc):
        self.am.resource_desc = res_desc

    def generate_task(self):
        t = entk.Task()
        t.pre_exec = []
        t.executable = '/bin/echo'
        t.arguments = ["mytest"]
        t.post_exec = []
        t.cpu_reqs = {
                'cpu_processes'     : 8,
                'cpu_process_type'  : 'MPI',
                'cpu_threads'       : 16,
                'cpu_thread_type'   : 'OpenMP'
                }
        return t

    def generate_pipeline(self):
        p = entk.Pipeline()
        s = entk.Stage()
        t = self.generate_task()
        s.add_tasks(t)
        p.add_stages(s)
        return p

    def run_workflow(self):
        p = self.generate_pipeline()
        self.am.workflow = [p]
        self.am.run()


if __name__ == '__main__':

    mvp = MVP()
    n_nodes = 2
    mvp.set_resource(res_desc = {
        'resource'  : 'anl.polaris',
        'queue'     : 'debug',
        'walltime'  : 60,
        'cpus'      : 64 * n_nodes,
        'gpus'      : 4 * n_nodes,
        'project'   : 'CSC249ADCD08'
        })
    mvp.run_workflow()

Here my MPI job is basically an echo command. I launch 8 processes, each with 16 cores; since Polaris has 64 cores per node (32 physical cores with 64 hardware threads, and in resource_anl.json cpu_per_node is set to 64), I ask for two Polaris nodes. This should produce an output file with 8 lines of "mytest". However, I only see 4 lines of "mytest" (see task.0000.out in the sandbox below). The script executes without any error message.
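To make the expected placement concrete, here is a small sketch of the node arithmetic behind the resource request above (my own illustration, not RCT internals):

```python
# Node arithmetic implied by the task description above
# (illustration only, not RCT code).
cpu_processes  = 8    # MPI ranks requested in cpu_reqs
cpu_threads    = 16   # OpenMP threads per rank
cores_per_node = 64   # Polaris: 32 physical cores, 64 hardware threads

cores_needed   = cpu_processes * cpu_threads         # 128
nodes_needed   = -(-cores_needed // cores_per_node)  # ceiling division -> 2
ranks_per_node = cores_per_node // cpu_threads       # 4 ranks fit per node

print(cores_needed, nodes_needed, ranks_per_node)    # 128 2 4
```

So the expected placement is 4 ranks on each of the two nodes, for 8 ranks total.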

My understanding is that RADICAL is not generating the mpiexec command correctly. Looking at task.0000.launch.sh in the sandbox, the generated mpiexec command is:

/opt/cray/pe/pals/1.1.7/bin/mpiexec -host x3006c0s1b0n0,x3006c0s1b0n0,x3006c0s1b0n0,x3006c0s1b0n0 -n 4 -host x3006c0s1b1n0,x3006c0s1b1n0,x3006c0s1b1n0,x3006c0s1b1n0 -n 4 $RP_TASK_SANDBOX/task.0000.exec.sh

However, I don't think this command can launch a job across two nodes (x3006c0s1b0n0 and x3006c0s1b1n0); my guess is that only the first -host flag is recognized. I did a small test on interactive nodes: I first requested two interactive nodes on Polaris, then ran the two commands below:

a). /opt/cray/pe/pals/1.1.7/bin/mpiexec -host x3004c0s25b1n0,x3004c0s25b1n0,x3004c0s25b1n0,x3004c0s25b1n0 -n 4 -host x3004c0s31b0n0,x3004c0s31b0n0,x3004c0s31b0n0,x3004c0s31b0n0 -n 4 echo "mytest"
(Here the two hostnames were obtained from $PBS_NODEFILE; this mimics what RCT is doing.) This outputs only four lines of "mytest".

b). mpiexec -n 8 --ppn 4 echo "mytest"
This outputs eight lines of "mytest", which is what we want.
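If it helps, the fix presumably amounts to emitting a single mpiexec invocation with one combined host list instead of repeated per-node -host/-n pairs. A rough sketch of the command I would expect, assuming PALS mpiexec accepts a comma-separated --hosts list together with --ppn (the hostnames are the ones from my interactive test):

```python
# Hypothetical construction of a single-invocation PALS mpiexec command
# spanning both nodes; the --hosts/--ppn usage is my assumption, not RCT code.
nodes = ["x3004c0s25b1n0", "x3004c0s31b0n0"]  # from $PBS_NODEFILE
n_ranks, ppn = 8, 4                           # 8 ranks total, 4 per node

cmd = "mpiexec -n {} --ppn {} --hosts {} echo mytest".format(
    n_ranks, ppn, ",".join(nodes))
print(cmd)
```

This would place 4 ranks on each node, matching test b).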

Because of this, I think there is an issue with the mpiexec command RCT generates on Polaris. Could you take a look? Thanks!

PS. It seems GitHub does not allow tar files, so I wrapped it in a zip.
mpi_issue.zip
