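A simple MPI application fails with the following error when host1 is included in the host file.
Error: Fatal error in PMPI_Init: Other MPI error, error stack: Missing hostname or invalid host/port description in business card
The application works fine when host1 is excluded from the host file. I tried running Cluster Checker and have attached the corresponding log. Could you help me interpret it? It seems to consist mainly of differences between the various hosts specified with "-f (machine list)", without really highlighting any problem on host-e8 that would explain this error. Please find the log below.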
SUMMARY
Command-line: clck -f machinesToTest -c clck.xml -Fhealth_user -Fhealth_base
-Fhealth_extended_user -Fmpi_prereq_user -l debug
Tests Run: health_user, health_base, health_extended_user,
mpi_prereq_user
**WARNING**: 9 tests failed to run. Information may be incomplete. See
clck_execution_warnings.log for more information.
Overall Result: 33 issues found - FUNCTIONALITY (3), HARDWARE UNIFORMITY (11),
PERFORMANCE (9), SOFTWARE UNIFORMITY (10)
--------------------------------------------------------------------------------
7 nodes tested: host-a2, host-b[1,3,6], host1,
host-c1, host-d
0 nodes with no issues:
7 nodes with issues: host-a2, host-b[1,3,6], host1,
host-c1, host-d
--------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
1. mpi-local-broken
Message: The single node MPI "Hello World" program did not run
successfully.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_local_functionality
2. memlock-too-small
Message: The memlock limit, '64', is smaller than recommended.
Remedy: We recommend correcting the limit of locked memory in
/etc/security/limits.conf to the following values: "* hard
memlock unlimited" "* soft memlock unlimited"
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: memory_uniformity_user
3. memlock-too-small-ethernet
Message: The memlock limit, '64', is smaller than recommended.
Remedy: We recommend correcting the limit of locked memory in
/etc/security/limits.conf to the following values: "* hard
memlock unlimited" "* soft memlock unlimited"
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_ethernet
HARDWARE UNIFORMITY
The following hardware uniformity issues were detected:
1. memory-not-uniform
Message: The amount of physical memory is not within the range of
792070572.0 KiB +/- 262144.0 KiB defined by nodes in the same
grouping.
5 nodes: host-b[1,3], host1, host-c1, host-d
Test: memory_uniformity_base
Details:
#Nodes Memory Nodes
1 1584974816.0 KiB host-c1
1 2113513608.0 KiB host1
1 529153152.0 KiB host-d
1 790940180.0 KiB host-b1
1 790940184.0 KiB host-b3
2. logical-cores-not-uniform:24
Message: The logical cores, '24', is not uniform across all nodes in the
same grouping. 67% of nodes in the same grouping have the same
number of logical cores.
Remedy: Please ensure that BIOS settings that can influence the number
of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
are uniform across nodes in the same grouping.
2 nodes: host-b[1,3]
Test: cpu_base
3. logical-cores-not-uniform:48
Message: The logical cores, '48', is not uniform across all nodes in the
same grouping. 33% of nodes in the same grouping have the same
number of logical cores.
Remedy: Please ensure that BIOS settings that can influence the number
of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
are uniform across nodes in the same grouping.
1 node: host-b6
Test: cpu_base
4. threads-per-core-not-uniform:1
Message: The number of threads available per core, '1', is not uniform.
67% of nodes in the same grouping have the same number of
threads available per core.
Remedy: Please enable/disable hyper-threading uniformly on Intel(R)
CPUs.
2 nodes: host-b[1,3]
Test: cpu_base
5. threads-per-core-not-uniform:2
Message: The number of threads available per core, '2', is not uniform.
33% of nodes in the same grouping have the same number of
threads available per core.
Remedy: Please enable/disable hyper-threading uniformly on Intel(R)
CPUs.
1 node: host-b6
Test: cpu_base
6. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-a2
Test: cpu_base
7. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) Gold 6256 CPU @ 3.60GHz', is
not uniform. 43% of nodes in the same grouping have the same
CPU model.
3 nodes: host-b[1,3,6]
Test: cpu_base
8. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host1
Test: cpu_base
9. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-c1
Test: cpu_base
10. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-d
Test: cpu_base
11. ethernet-firmware-version-is-not-consistent
Message: Inconsistent Ethernet firmware version.
3 nodes: host-a2, host1, host-c1
Test: ethernet
Details:
#Nodes Firmware Version Nodes
1 0x80000887, 1.2028.0 host-c1
1 0x800008e8 host-a2
1 4.0.596 host1
1 5719-v1.46 NCSI v1.3.16.0 host-a2
PERFORMANCE
The following performance issues were detected:
1. process-is-high-cpu
Message: Processes using high CPU.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
5 nodes: host-a2, host-b[3,6], host1, host-c1
Test: node_process_status
Details:
#Nodes User PID %CPU Process Nodes
1 usera 204058 98.9 /med/code7/usera/blue4/rnd/software/amd64.linux.gnu.product/distribVelsyn host-b3
1 userb 120854 98.5 /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn host1
1 userb 71486 98.6 /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn host1
1 wvgrid 11116 37.2 /wv/wv-med/sge/bin/lx-amd64/sge_execd host-a2
1 wvgrid 19160 21.1 /wv/wv-med/sge/bin/lx-amd64/sge_execd host-b6
1 wvgrid 25097 79.7 /wv/wv-med/sge/bin/lx-amd64/sge_execd host-c1
1 wvgrid 90731 58.1 /wv/wv-med/sge/bin/lx-amd64/sge_execd host1
2. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 1.528 TFLOPS is due to
a conflicting process, pid '204058', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b3
Test: node_process_status
3. substandard-sgemm-due-to-high-cpu-process
Message: The substandard SGEMM benchmark result of 3.277 TFLOPS is due to
a conflicting process, pid '204058', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b3
Test: node_process_status
4. sgemm-data-is-substandard-avx512
Message: The following SGEMM benchmark results are below the accepted
4.147 TFLOPS(100%). The acceptable fraction (90%) can be set
using the <sgemm-peak-fraction> option in the configuration
file. For more details, please refer to the Intel(R) Cluster
Checker User Guide.
3 nodes: host-b[1,3,6]
Test: sgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 2.355 TFLOPS 57 host-b6
1 3.181 TFLOPS 77 host-b1
1 3.277 TFLOPS 79 host-b3
5. substandard-sgemm-due-to-high-cpu-process
Message: The substandard SGEMM benchmark result of 2.355 TFLOPS is due to
a conflicting process, pid '19160', using a large amount of cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b6
Test: node_process_status
6. dgemm-data-is-substandard-avx512
Message: The DGEMM benchmark result is below the accepted 2.074
TFLOPS(100%). The acceptable fraction (90%) can be set using the
<dgemm-peak-fraction> option in the configuration file. For more
details, please refer to the Intel(R) Cluster Checker User
Guide.
3 nodes: host-b[1,3,6]
Test: dgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 1.389 TFLOPS 67 host-b1
1 1.528 TFLOPS 74 host-b3
1 1.570 TFLOPS 76 host-b6
7. dgemm-data-is-substandard
Message: The following DGEMM benchmark results are below the theoretical
peak of 1.165 TFLOPS.
1 node: host-a2
Test: dgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 845.441 GFLOPS 73 host-a2
8. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 845.441 GFLOPS is due
to a conflicting process, pid '11116', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-a2
Test: node_process_status
9. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 1.570 TFLOPS is due to
a conflicting process, pid '19160', using a large amount of cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b6
Test: node_process_status
SOFTWARE UNIFORMITY
The following software uniformity issues were detected:
1. ethernet-driver-is-not-consistent
Message: Inconsistent Ethernet driver.
2 nodes: host-a2, host1
Test: ethernet
Details:
#Nodes Driver Nodes
1 netxen_nic host1
1 tg3 host-a2
2. kernel-not-uniform
Message: The Linux kernel version, '3.10.0-957.27.2.el7.x86_64', is not
uniform. 86% of nodes in the same grouping have the same
version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: kernel_version_uniformity
3. kernel-not-uniform
Message: The Linux kernel version, '2.6.32-573.26.1.el6.x86_64', is not
uniform. 14% of nodes in the same grouping have the same
version.
1 node: host-d
Test: kernel_version_uniformity
4. environment-variable-not-uniform
Message: Environment variables are not uniform across the nodes.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: environment_variables_uniformity
Details:
#Nodes Variable Value Nodes
6 G_BROKEN_FILENAMES host-a2, host-b[1,3,6], host1, host-c1
6 KDE_IS_PRELINKED host-a2, host-b[1,3,6], host1, host-c1
6 MODULEPATH host-a2, host-b[1,3,6], host1, host-c1
6 MODULESHOME host-a2, host-b[1,3,6], host1, host-c1
1 G_BROKEN_FILENAMES 1 host-d
1 KDE_IS_PRELINKED 1 host-d
1 MODULEPATH /usr/share/Modules/modulefiles:/etc/modulefiles host-d
1 MODULESHOME /usr/share/Modules host-d
5. perl-not-uniform
Message: The Perl version, '5.16.3', is not uniform. 86% of nodes in the
same grouping have the same version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: perl_functionality
6. perl-not-uniform
Message: The Perl version, '5.10.1', is not uniform. 14% of nodes in the
same grouping have the same version.
1 node: host-d
Test: perl_functionality
7. python-not-uniform
Message: The Python version, '2.7.5', is not uniform. 86% of nodes in
the same grouping have the same version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: python_functionality
8. python-not-uniform
Message: The Python version, '2.6.6', is not uniform. 14% of nodes in
the same grouping have the same version.
1 node: host-d
Test: python_functionality
9. ethernet-driver-version-is-not-consistent
Message: Inconsistent Ethernet driver version.
2 nodes: host-a2, host1
Test: ethernet
Details:
#Nodes Version Nodes
1 3.137 host-a2
1 4.0.82 host1
10. ethernet-interrupt-coalescing-state-not-uniform
Message: Ethernet interrupt coalescing is not enabled/disabled uniformly
across nodes in the same grouping.
Remedy: Append "/sbin/ethtool -C eno1 rx-usecs <value>" to the site
specific system startup script. Use '0' to permanently disable
Ethernet interrupt coalescing or other value as needed. The
site specific system startup script is typically
/etc/rc.d/rc.local or /etc/rc.d/boot.local.
1 node: host1
Test: ethernet
Details:
#Nodes State Interface Nodes
1 enabled eno1 host1
1 enabled eno3 host1
--------------------------------------------------------------------------------
INFORMATIONAL
The following additional information was detected:
1. mpi-network-interface
Message: The cluster has 1 network interfaces (Ethernet). Intel(R) MPI
Library uses by default the first interface detected in the
order of: (1) Intel(R) Omni-Path Architecture (Intel(R) OPA),
(2) InfiniBand, (3) Ethernet. You can set a specific interface
by setting the environment variable I_MPI_OFI_PROVIDER.
Ethernet: I_MPI_OFI_PROVIDER=sockets mpiexec.hydra; InfiniBand:
I_MPI_OFI_PROVIDER=verbs mpiexec.hydra; Intel(R) OPA:
I_MPI_OFI_PROVIDER=psm2 mpiexec.hydra.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_prereq_user
--------------------------------------------------------------------------------
Intel(R) Cluster Checker 2021 Update 1
00:34:46 April 23 2021 UTC
Nodefile used: machinesToTest
Databases used: $HOME/.clck/2021.1.1/clck.db
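
Because the summary groups findings by issue rather than by host, it can help to filter it down to the entries that name only the host in question. Below is a minimal Python sketch of that idea. It is not part of Intel(R) Cluster Checker, and the function names (`expand_nodes`, `parse_issues`, `issues_only_on`) are made up for illustration. It assumes the summary layout shown above ("N. issue-id", then "N node(s): ...", then "Test: ...") and only expands comma lists in brackets such as host-b[1,3,6]; ranges like [1-3] are not handled.

```python
import re

# Assumed layout of a summary entry (taken from the log above):
#   "8. cpu-model-name-not-uniform"
#   "1 node: host1"            (may wrap onto the next line)
#   "Test: cpu_base"
ISSUE_RE = re.compile(r"^(\d+)\.\s+(\S+)$")
NODES_RE = re.compile(r"^\d+ nodes?:\s*(.*)$")
NODE_TOKEN_RE = re.compile(r"[^,\[\]\s]+(?:\[[^\]]*\])?")


def expand_nodes(spec):
    """Expand 'host-b[1,3,6], host1' into individual host names."""
    names = []
    for token in NODE_TOKEN_RE.findall(spec):
        grouped = re.fullmatch(r"(.+)\[([^\]]*)\]", token)
        if grouped:
            prefix, suffixes = grouped.groups()
            names.extend(prefix + s.strip() for s in suffixes.split(","))
        else:
            names.append(token)
    return names


def parse_issues(summary_text):
    """Return a list of (issue_id, [nodes]) in the order they appear."""
    issues = []
    issue_id = None
    node_spec = None
    for raw in summary_text.splitlines():
        line = raw.strip()
        header = ISSUE_RE.match(line)
        if header:
            issue_id, node_spec = header.group(2), None
            continue
        if issue_id is None:
            continue  # outside an issue entry (headers, Details tables)
        nodes_line = NODES_RE.match(line)
        if nodes_line:
            node_spec = nodes_line.group(1)
        elif line.startswith("Test:"):
            # "Test:" closes the entry; anything after it is Details.
            if node_spec is not None:
                issues.append((issue_id, expand_nodes(node_spec)))
            issue_id, node_spec = None, None
        elif node_spec is not None:
            node_spec += " " + line  # wrapped node list
    return issues


def issues_only_on(summary_text, host, max_nodes=1):
    """Issue ids that name `host` and at most `max_nodes` nodes in total."""
    return [issue for issue, nodes in parse_issues(summary_text)
            if host in nodes and len(nodes) <= max_nodes]


# Short excerpt in the same shape as the log above.
sample = """\
1. memlock-too-small
Message: The memlock limit, '64', is smaller than recommended.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: memory_uniformity_user
8. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz', is
not uniform.
1 node: host1
Test: cpu_base
10. ethernet-interrupt-coalescing-state-not-uniform
1 node: host1
Test: ethernet
"""

if __name__ == "__main__":
    print(issues_only_on(sample, "host1"))
```

Raising `max_nodes` also shows small groups that include the host, for example the two-node Ethernet driver entries above (host-a2 and host1). Issues reported on all seven nodes, such as mpi-local-broken, are filtered out because they cannot by themselves explain a failure that occurs only when host1 is in the host file.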