We have a cluster of 176 nodes with both an InfiniBand fabric and 10GbE, and we use the 10GbE network for SSH/management traffic. The nodes delivered at launch have the older Marvell 10GbE cards:
https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_117b0672d7ef4c5bb0eca02886

and the current batch of nodes have QLogic 10GbE cards:
https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_9bd8f647238c4a5f8c72a5221b&tab=revisionHistory

We are running Slurm 20.11.4 on the controller, and the node health check daemon is deployed following the OpenHPC method. The nodes with the Marvell 10GbE cards give us no trouble: they never flip between the down and idle states in Slurm. The nodes with the QLogic cards, however, keep flip-flopping between the down and idle states.
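For context, this is roughly how the flapping shows up on our side (the node name cn042 and the log path are placeholders; SlurmctldLogFile may point elsewhere in your setup):

    # nodes currently down/drained and the reason slurmctld recorded
    sinfo -R
    # per-node state and reason, typically "Not responding" during a flap
    scontrol show node cn042 | grep -Ei 'state|reason'
    # controller log showing when the node was marked DOWN and when it came back
    grep -i cn042 /var/log/slurmctld.log | grep -iE 'down|responding'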

We have tried increasing the ARP cache settings and moving the client (slurmd) side to the 20.11.9 point release, but neither change helped.
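For reference, the ARP cache change was along these lines (a sketch only; the values below are illustrative placeholders, not our exact settings):

    # /etc/sysctl.d/90-arp.conf -- raise neighbour table limits, keep entries longer
    net.ipv4.neigh.default.gc_thresh1 = 4096      # entries below this are never garbage-collected
    net.ipv4.neigh.default.gc_thresh2 = 8192      # soft limit on the neighbour table
    net.ipv4.neigh.default.gc_thresh3 = 16384     # hard limit on the neighbour table
    net.ipv4.neigh.default.gc_stale_time = 120    # seconds between checks for stale entries
    # applied with: sysctl --system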

We would like to know whether anyone has faced a similar situation.
