1. OpenMPI error when running across multiple nodes

Problem description: node 1 (host3) uses mpirun to launch an MPI program on node 2 (host4), and the following error is reported.

$ mpirun -np 1 --host host4 ./main

[[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 367
[[INVALID],INVALID]-[[59225,0],0] mca_oob_tcp_peer_try_connect: connect to 255.255.255.255:51754 failed: Network is unreachable (101)
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

Solution

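First confirm that node 1 and node 2 can each run an OpenMPI program on their own. Then check whether the two nodes have the same OpenMPI version installed. If the versions differ, reinstall OpenMPI so that they match.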
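The version check can be scripted. The following is a minimal sketch, not a definitive tool. It assumes passwordless ssh from the current machine to both hosts, and that `mpirun --version` prints the version as the last word of its first line (for example `mpirun (Open MPI) 4.1.2`). The function names `ompi_version_of`, `query_host`, and `compare_hosts` are made up for this illustration.

```shell
#!/bin/sh
# Sketch: compare the Open MPI version reported on two nodes.
# Assumptions: passwordless ssh works, and mpirun is on the PATH of the
# non-interactive ssh shell (if it is not, the version shows as "unknown",
# which is itself one of the causes listed in the error message).

# Extract the version from raw `mpirun --version` output:
# take the first line and print its last whitespace-separated field.
ompi_version_of() {
    printf '%s\n' "$1" | head -n 1 | awk '{print $NF}'
}

# Run `mpirun --version` on a remote host and print the raw output.
query_host() {
    ssh "$1" 'mpirun --version' 2>/dev/null
}

# Print the version found on each host; return 0 only if both
# versions are non-empty and identical.
compare_hosts() {
    v1=$(ompi_version_of "$(query_host "$1")")
    v2=$(ompi_version_of "$(query_host "$2")")
    echo "$1: ${v1:-unknown}"
    echo "$2: ${v2:-unknown}"
    [ -n "$v1" ] && [ "$v1" = "$v2" ]
}

# Usage (host names taken from the problem description):
#   compare_hosts host3 host4 && echo "versions match" || echo "versions differ"
```

If the script reports different versions, or "unknown" for one host, that node is the one to reinstall or whose PATH needs fixing.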

References

[1. OpenMPI error] https://www.slothparadise.com/fix-orte-error-unknown-option-hnp-topo-sig/
