We have a cluster of 176 nodes with both an InfiniBand fabric and a 10GbE
network, and we use the 10GbE side for SSH/management traffic. The nodes
delivered at launch have the older Marvell 10GbE cards:
https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_117b0672d7ef4c5bb0eca02886
and the current batch of nodes has QLogic 10GbE cards:
https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_9bd8f647238c4a5f8c72a5221b&tab=revisionHistory
We are running Slurm 20.11.4 on the server, and the node health check
daemon is also deployed using the OpenHPC method. The nodes with the
Marvell 10GbE cards give us no trouble - they never bounce between the
Slurm down and idle states. The nodes with the QLogic cards, however,
keep flip-flopping between the down <--> idle states.
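In case the detail helps, this is roughly how we watch the flapping from
the head node - just a minimal polling sketch (not our NHC setup; the
interval is only an example):

#!/usr/bin/env python3
# Poll node states via sinfo and log transitions, to catch the
# down <--> idle flip-flopping on the affected nodes.
import subprocess
import time

POLL_SECONDS = 30  # example interval

def node_states():
    """Return {node: state} from 'sinfo -h -N -o "%N %t"'."""
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %t"],
        capture_output=True, text=True, check=True
    ).stdout
    states = {}
    for line in out.splitlines():
        node, state = line.split()
        states[node] = state
    return states

prev = node_states()
while True:
    time.sleep(POLL_SECONDS)
    cur = node_states()
    for node, state in cur.items():
        if prev.get(node) != state:
            print(f"{time.strftime('%F %T')} {node}: {prev.get(node)} -> {state}")
    prev = cur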
We have tried increasing the ARP cache limits and upgrading the client
side to the 20.11.9 point release, but neither helps with the situation.
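For completeness, by "increasing the ARP cache limits" we mean raising the
kernel neighbour-table sysctls on the nodes; a minimal sketch to check the
current values (the exact numbers we set are not important here):

#!/usr/bin/env python3
# Print the IPv4 neighbour-table (ARP cache) tunables on this node.
from pathlib import Path

NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
KEYS = ["gc_thresh1", "gc_thresh2", "gc_thresh3", "gc_stale_time"]

for key in KEYS:
    value = (NEIGH_DIR / key).read_text().strip()
    print(f"net.ipv4.neigh.default.{key} = {value}")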
Has anyone faced a similar situation?