We have a cluster of 176 nodes with both an InfiniBand fabric and a 10GbE
network, and we use the 10GbE side for SSH/management traffic. The nodes
delivered at launch have the older Marvell 10GbE cards:
https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_117b0672d7ef4c5bb0eca02886
and the current batch of nodes has QLogic 10GbE cards:
https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_9bd8f647238c4a5f8c72a5221b&tab=revisionHistory
We are running Slurm 20.11.4 on the server, and the node health check
daemon is also deployed using the OpenHPC method. The nodes with the
Marvell 10GbE cards give us no trouble - they never bounce between the
Slurm down and idle states. The nodes with the QLogic cards, however,
keep flip-flopping between the down <--> idle states.
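In case the detail helps, this is roughly how we watch the flapping from
the head node - just a minimal polling sketch (not our NHC setup; the
interval is only an example):

#!/usr/bin/env python3
# Poll node states via sinfo and log transitions, to catch the
# down <--> idle flip-flopping on the affected nodes.
import subprocess
import time

POLL_SECONDS = 30  # example interval

def node_states():
    """Return {node: state} from 'sinfo -h -N -o "%N %t"'."""
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %t"],
        capture_output=True, text=True, check=True
    ).stdout
    states = {}
    for line in out.splitlines():
        node, state = line.split()
        states[node] = state
    return states

prev = node_states()
while True:
    time.sleep(POLL_SECONDS)
    cur = node_states()
    for node, state in cur.items():
        if prev.get(node) != state:
            print(f"{time.strftime('%F %T')} {node}: {prev.get(node)} -> {state}")
    prev = cur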
We have tried increasing the ARP cache limits and upgrading the client
side to the 20.11.9 point release, but neither helps with the situation.
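For completeness, by "increasing the ARP cache limits" we mean raising the
kernel neighbour-table sysctls on the nodes; a minimal sketch to check the
current values (the exact numbers we set are not important here):

#!/usr/bin/env python3
# Print the IPv4 neighbour-table (ARP cache) tunables on this node.
from pathlib import Path

NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
KEYS = ["gc_thresh1", "gc_thresh2", "gc_thresh3", "gc_stale_time"]

for key in KEYS:
    value = (NEIGH_DIR / key).read_text().strip()
    print(f"net.ipv4.neigh.default.{key} = {value}")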
Has anyone faced a similar situation?