A simple MPI application fails with the following error when host1 is included in the hostfile.

Error: Fatal error in PMPI_Init: Other MPI error, error stack: Missing hostname or invalid host/port description in business card

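When host1 is excluded from the hostfile, the application works fine. I tried using Cluster Checker and have attached the corresponding log. Can you help me interpret it? It seems to mostly report differences between the various hosts specified with "-f" (the machine list), without really highlighting any problem on host-e8 that could explain this error. Please find the log below.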

SUMMARY
  Command-line:   clck -f machinesToTest -c clck.xml -Fhealth_user -Fhealth_base
                  -Fhealth_extended_user -Fmpi_prereq_user -l debug
  Tests Run:      health_user, health_base, health_extended_user,
                  mpi_prereq_user
  **WARNING**:    9 tests failed to run. Information may be incomplete. See
                  clck_execution_warnings.log for more information.
  Overall Result: 33 issues found - FUNCTIONALITY (3), HARDWARE UNIFORMITY (11),
                  PERFORMANCE (9), SOFTWARE UNIFORMITY (10)
--------------------------------------------------------------------------------
7 nodes tested:         host-a2, host-b[1,3,6], host1,
                        host-c1, host-d
0 nodes with no issues: 
7 nodes with issues:    host-a2, host-b[1,3,6], host1,
                        host-c1, host-d
--------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
  1. mpi-local-broken
       Message: The single node MPI "Hello World" program did not run
                successfully.
       7 nodes: host-a2, host-b[1,3,6], host1,
                host-c1, host-d
       Test:    mpi_local_functionality
  2. memlock-too-small
       Message: The memlock limit, '64', is smaller than recommended.
       Remedy:  We recommend correcting the limit of locked memory in
                /etc/security/limits.conf to the following values: "* hard
                memlock unlimited" "* soft memlock unlimited"
       7 nodes: host-a2, host-b[1,3,6], host1,
                host-c1, host-d
       Test:    memory_uniformity_user
  3. memlock-too-small-ethernet
       Message: The memlock limit, '64', is smaller than recommended.
       Remedy:  We recommend correcting the limit of locked memory in
                /etc/security/limits.conf to the following values: "* hard
                memlock unlimited" "* soft memlock unlimited"
       7 nodes: host-a2, host-b[1,3,6], host1,
                host-c1, host-d
       Test:    mpi_ethernet

HARDWARE UNIFORMITY
The following hardware uniformity issues were detected:
  1.  memory-not-uniform
        Message: The amount of physical memory is not within the range of
                 792070572.0 KiB +/- 262144.0 KiB defined by nodes in the same
                 grouping.
        5 nodes: host-b[1,3], host1, host-c1, host-d
        Test:    memory_uniformity_base
        Details: 
          #Nodes Memory           Nodes           
          1      1584974816.0 KiB host-c1    
          1      2113513608.0 KiB host1    
          1      529153152.0 KiB  host-d    
          1      790940180.0 KiB  host-b1 
          1      790940184.0 KiB  host-b3 
  2.  logical-cores-not-uniform:24
        Message: The logical cores, '24', is not uniform across all nodes in the
                 same grouping. 67% of nodes in the same grouping have the same
                 number of logical cores.
        Remedy:  Please ensure that BIOS settings that can influence the number
                 of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
                 are uniform across nodes in the same grouping.
        2 nodes: host-b[1,3]
        Test:    cpu_base
  3.  logical-cores-not-uniform:48
        Message: The logical cores, '48', is not uniform across all nodes in the
                 same grouping. 33% of nodes in the same grouping have the same
                 number of logical cores.
        Remedy:  Please ensure that BIOS settings that can influence the number
                 of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
                 are uniform across nodes in the same grouping.
        1 node:  host-b6
        Test:    cpu_base
  4.  threads-per-core-not-uniform:1
        Message: The number of threads available per core, '1', is not uniform.
                 67% of nodes in the same grouping have the same number of
                 threads available per core.
        Remedy:  Please enable/disable hyper-threading uniformly on Intel(R)
                 CPUs.
        2 nodes: host-b[1,3]
        Test:    cpu_base
  5.  threads-per-core-not-uniform:2
        Message: The number of threads available per core, '2', is not uniform.
                 33% of nodes in the same grouping have the same number of
                 threads available per core.
        Remedy:  Please enable/disable hyper-threading uniformly on Intel(R)
                 CPUs.
        1 node:  host-b6
        Test:    cpu_base
  6.  cpu-model-name-not-uniform
        Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz', is
                 not uniform. 14% of nodes in the same grouping have the same
                 CPU model.
        1 node:  host-a2
        Test:    cpu_base
  7.  cpu-model-name-not-uniform
        Message: The CPU model, 'Intel(R) Xeon(R) Gold 6256 CPU @ 3.60GHz', is
                 not uniform. 43% of nodes in the same grouping have the same
                 CPU model.
        3 nodes: host-b[1,3,6]
        Test:    cpu_base
  8.  cpu-model-name-not-uniform
        Message: The CPU model, 'Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz', is
                 not uniform. 14% of nodes in the same grouping have the same
                 CPU model.
        1 node:  host1
        Test:    cpu_base
  9.  cpu-model-name-not-uniform
        Message: The CPU model, 'Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz', is
                 not uniform. 14% of nodes in the same grouping have the same
                 CPU model.
        1 node:  host-c1
        Test:    cpu_base
  10. cpu-model-name-not-uniform
        Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz', is
                 not uniform. 14% of nodes in the same grouping have the same
                 CPU model.
        1 node:  host-d
        Test:    cpu_base
  11. ethernet-firmware-version-is-not-consistent
        Message: Inconsistent Ethernet firmware version.
        3 nodes: host-a2, host1, host-c1
        Test:    ethernet
        Details: 
          #Nodes Firmware Version          Nodes         
          1      0x80000887, 1.2028.0      host-c1  
          1      0x800008e8                host-a2 
          1      4.0.596                   host1  
          1      5719-v1.46 NCSI v1.3.16.0 host-a2 

PERFORMANCE
The following performance issues were detected:
  1. process-is-high-cpu
       Message: Processes using high CPU.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       5 nodes: host-a2, host-b[3,6], host1, host-c1
       Test:    node_process_status
       Details: 
         #Nodes User     PID    %CPU Process                                                                      Nodes           
         1      usera 204058 98.9 /med/code7/usera/blue4/rnd/software/amd64.linux.gnu.product/distribVelsyn host-b3 
         1      userb 120854 98.5 /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn            host1    
         1      userb 71486  98.6 /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn            host1    
         1      wvgrid   11116  37.2 /wv/wv-med/sge/bin/lx-amd64/sge_execd                                        host-a2   
         1      wvgrid   19160  21.1 /wv/wv-med/sge/bin/lx-amd64/sge_execd                                        host-b6 
         1      wvgrid   25097  79.7 /wv/wv-med/sge/bin/lx-amd64/sge_execd                                        host-c1    
         1      wvgrid   90731  58.1 /wv/wv-med/sge/bin/lx-amd64/sge_execd                                        host1    
  2. substandard-dgemm-due-to-high-cpu-process
       Message: The substandard DGEMM benchmark result of 1.528 TFLOPS is due to
                a conflicting process, pid '204058', using a large amount of
                cpu.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       1 node:  host-b3
       Test:    node_process_status
  3. substandard-sgemm-due-to-high-cpu-process
       Message: The substandard SGEMM benchmark result of 3.277 TFLOPS is due to
                a conflicting process, pid '204058', using a large amount of
                cpu.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       1 node:  host-b3
       Test:    node_process_status
  4. sgemm-data-is-substandard-avx512
       Message: The following SGEMM benchmark results are below the accepted
                4.147 TFLOPS(100%). The acceptable fraction (90%) can be set
                using the <sgemm-peak-fraction> option in the configuration
                file. For more details, please refer to the Intel(R) Cluster
                Checker User Guide.
       3 nodes: host-b[1,3,6]
       Test:    sgemm_cpu_performance
       Details: 
         #Nodes Result       %Below Peak Nodes           
         1      2.355 TFLOPS 57          host-b6 
         1      3.181 TFLOPS 77          host-b1 
         1      3.277 TFLOPS 79          host-b3 
  5. substandard-sgemm-due-to-high-cpu-process
       Message: The substandard SGEMM benchmark result of 2.355 TFLOPS is due to
                a conflicting process, pid '19160', using a large amount of cpu.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       1 node:  host-b6
       Test:    node_process_status
  6. dgemm-data-is-substandard-avx512
       Message: The DGEMM benchmark result is below the accepted 2.074
                TFLOPS(100%). The acceptable fraction (90%) can be set using the
                <dgemm-peak-fraction> option in the configuration file. For more
                details, please refer to the Intel(R) Cluster Checker User
                Guide.
       3 nodes: host-b[1,3,6]
       Test:    dgemm_cpu_performance
       Details: 
         #Nodes Result       %Below Peak Nodes           
         1      1.389 TFLOPS 67          host-b1 
         1      1.528 TFLOPS 74          host-b3 
         1      1.570 TFLOPS 76          host-b6 
  7. dgemm-data-is-substandard
       Message: The following DGEMM benchmark results are below the theoretical
                peak of 1.165 TFLOPS.
       1 node:  host-a2
       Test:    dgemm_cpu_performance
       Details: 
         #Nodes Result         %Below Peak Nodes         
         1      845.441 GFLOPS 73          host-a2 
  8. substandard-dgemm-due-to-high-cpu-process
       Message: The substandard DGEMM benchmark result of 845.441 GFLOPS is due
                to a conflicting process, pid '11116', using a large amount of
                cpu.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       1 node:  host-a2
       Test:    node_process_status
  9. substandard-dgemm-due-to-high-cpu-process
       Message: The substandard DGEMM benchmark result of 1.570 TFLOPS is due to
                a conflicting process, pid '19160', using a large amount of cpu.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       1 node:  host-b6
       Test:    node_process_status

SOFTWARE UNIFORMITY
The following software uniformity issues were detected:
  1.  ethernet-driver-is-not-consistent
        Message: Inconsistent Ethernet driver.
        2 nodes: host-a2, host1
        Test:    ethernet
        Details: 
          #Nodes Driver     Nodes         
          1      netxen_nic host1  
          1      tg3        host-a2 
  2.  kernel-not-uniform
        Message: The Linux kernel version, '3.10.0-957.27.2.el7.x86_64', is not
                 uniform. 86% of nodes in the same grouping have the same
                 version.
        6 nodes: host-a2, host-b[1,3,6], host1,
                 host-c1
        Test:    kernel_version_uniformity
  3.  kernel-not-uniform
        Message: The Linux kernel version, '2.6.32-573.26.1.el6.x86_64', is not
                 uniform. 14% of nodes in the same grouping have the same
                 version.
        1 node:  host-d
        Test:    kernel_version_uniformity
  4.  environment-variable-not-uniform
        Message: Environment variables are not uniform across the nodes.
        7 nodes: host-a2, host-b[1,3,6], host1,
                 host-c1, host-d
        Test:    environment_variables_uniformity
        Details: 
          #Nodes Variable           Value                                           Nodes         
          6      G_BROKEN_FILENAMES                                                 host-a2, host-b[1,3,6], host1, host-c1
          6      KDE_IS_PRELINKED                                                   host-a2, host-b[1,3,6], host1, host-c1
          6      MODULEPATH                                                         host-a2, host-b[1,3,6], host1, host-c1
          6      MODULESHOME                                                        host-a2, host-b[1,3,6], host1, host-c1
          1      G_BROKEN_FILENAMES 1                                               host-d  
          1      KDE_IS_PRELINKED   1                                               host-d  
          1      MODULEPATH         /usr/share/Modules/modulefiles:/etc/modulefiles host-d  
          1      MODULESHOME        /usr/share/Modules                              host-d  
  5.  perl-not-uniform
        Message: The Perl version, '5.16.3', is not uniform. 86% of nodes in the
                 same grouping have the same version.
        6 nodes: host-a2, host-b[1,3,6], host1,
                 host-c1
        Test:    perl_functionality
  6.  perl-not-uniform
        Message: The Perl version, '5.10.1', is not uniform. 14% of nodes in the
                 same grouping have the same version.
        1 node:  host-d
        Test:    perl_functionality
  7.  python-not-uniform
        Message: The Python version, '2.7.5', is not uniform. 86% of nodes in
                 the same grouping have the same version.
        6 nodes: host-a2, host-b[1,3,6], host1,
                 host-c1
        Test:    python_functionality
  8.  python-not-uniform
        Message: The Python version, '2.6.6', is not uniform. 14% of nodes in
                 the same grouping have the same version.
        1 node:  host-d
        Test:    python_functionality
  9.  ethernet-driver-version-is-not-consistent
        Message: Inconsistent Ethernet driver version.
        2 nodes: host-a2, host1
        Test:    ethernet
        Details: 
          #Nodes Version Nodes         
          1      3.137   host-a2 
          1      4.0.82  host1  
  10. ethernet-interrupt-coalescing-state-not-uniform
        Message: Ethernet interrupt coalescing is not enabled/disabled uniformly
                 across nodes in the same grouping.
        Remedy:  Append "/sbin/ethtool -C eno1 rx-usecs <value>" to the site
                 specific system startup script. Use '0' to permanently disable
                 Ethernet interrupt coalescing or other value as needed. The
                 site specific system startup script is typically
                 /etc/rc.d/rc.local or /etc/rc.d/boot.local.
        1 node:  host1
        Test:    ethernet
        Details: 
          #Nodes State   Interface Nodes        
          1      enabled eno1      host1 
          1      enabled eno3      host1 

--------------------------------------------------------------------------------
INFORMATIONAL
The following additional information was detected:
  1. mpi-network-interface
       Message: The cluster has 1 network interfaces (Ethernet). Intel(R) MPI
                Library uses by default the first interface detected in the
                order of: (1) Intel(R) Omni-Path Architecture (Intel(R) OPA),
                (2) InfiniBand, (3) Ethernet. You can set a specific interface
                by setting the environment variable I_MPI_OFI_PROVIDER.
                Ethernet: I_MPI_OFI_PROVIDER=sockets mpiexec.hydra; InfiniBand:
                I_MPI_OFI_PROVIDER=verbs mpiexec.hydra; Intel(R) OPA:
                I_MPI_OFI_PROVIDER=psm2 mpiexec.hydra.
       7 nodes: host-a2, host-b[1,3,6], host1,
                host-c1, host-d
       Test:    mpi_prereq_user

--------------------------------------------------------------------------------
Intel(R) Cluster Checker 2021 Update 1
00:34:46 April 23 2021 UTC
Nodefile used: machinesToTest
Databases used: $HOME/.clck/2021.1.1/clck.db
1 Answer

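I tried using a consistent Ethernet driver version on host1 and followed the remedy that the log gives for ethernet-interrupt-coalescing-state-not-uniform, then ran the sample on heterogeneous nodes, including host1.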
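
The two findings mentioned here are among the few in the log that name host1 together with only one or two other nodes, rather than the whole machine list. One rough way to surface such findings is to keep only the issues whose node list is short and contains the host of interest. The following is a minimal sketch, not part of Cluster Checker: the helper names `expand_nodes` and `issues_for_host` and the `max_nodes` cutoff are hypothetical, it assumes the summary layout shown in the log above, and it only expands comma-separated bracket lists such as `host-b[1,3,6]` (not ranges).

```python
import re

# An issue header looks like "  10. ethernet-driver-is-not-consistent".
ISSUE_RE = re.compile(r"^\s+\d+\.\s+(\S+)\s*$")
# A node line looks like "  2 nodes: host-a2, host1" or "  1 node:  host1".
NODES_RE = re.compile(r"^\s+(\d+) nodes?:\s+(.*)$")
# A host name with an optional bracketed suffix list, e.g. "host-b[1,3,6]".
NAME_RE = re.compile(r"([^,\s\[]+)(?:\[([^\]]*)\])?")


def expand_nodes(text):
    """Expand 'host-b[1,3,6], host1' into individual host names."""
    hosts = []
    for prefix, suffixes in NAME_RE.findall(text):
        if suffixes:
            hosts.extend(prefix + s.strip() for s in suffixes.split(","))
        else:
            hosts.append(prefix)
    return hosts


def issues_for_host(lines, host, max_nodes=3):
    """Return (issue_id, hosts) pairs that name `host` and at most `max_nodes` nodes."""
    found = []
    issue = None
    lines = iter(lines)
    for line in lines:
        header = ISSUE_RE.match(line)
        if header:
            issue = header.group(1)
            continue
        nodes = NODES_RE.match(line)
        if nodes and issue:
            text = nodes.group(2).strip()
            # A node list that ends with a comma continues on the next line.
            while text.endswith(","):
                text += " " + next(lines, "").strip()
            hosts = expand_nodes(text)
            if host in hosts and len(hosts) <= max_nodes:
                found.append((issue, hosts))
    return found


# A few entries copied from the log above, trimmed to the relevant fields.
sample = """\
  1. mpi-local-broken
       7 nodes: host-a2, host-b[1,3,6], host1,
                host-c1, host-d
       Test:    mpi_local_functionality
  1.  ethernet-driver-is-not-consistent
        2 nodes: host-a2, host1
        Test:    ethernet
  10. ethernet-interrupt-coalescing-state-not-uniform
        1 node:  host1
        Test:    ethernet
""".splitlines()

for issue, hosts in issues_for_host(sample, "host1"):
    print(issue, hosts)
```

On this sample the cluster-wide `mpi-local-broken` entry is dropped and the two Ethernet entries remain. The cutoff is only a heuristic: an issue reported on every node can still be the real cause, so the filtered list is a starting point for reading the log rather than a diagnosis.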

Answered 2021-06-18T09:47:27.137