We stuck an Avere cache between Isilon and a cluster to get us over the hump 
until the next budget cycle ... then we replaced it with Spectrum Scale for 
mid-level storage.  Still use Lustre as scratch, of course.

On 2/22/19, 12:24 PM, "slurm-users on behalf of Will Dennis" 
<slurm-users-boun...@lists.schedmd.com on behalf of wden...@nec-labs.com> wrote:

    (replies inline)
    
    On Friday, February 22, 2019 1:03 PM, Alex Chekholko said:
    
    >Hi Will,
    >
    >If your bottleneck is now your network, you may want to upgrade the 
    >network.  Then the disks will become your bottleneck :)
    >
    
    Via network bandwidth analysis, it's not really the network that's the 
    problem... it's the NFS/disk I/O...
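
    (For anyone who wants to sanity-check the same thing from a client node, 
    here's a rough Python sketch using psutil; the interface name and the 10G 
    link capacity below are placeholders, not our actual config. If the NIC 
    sits well below line rate while jobs are stuck waiting on I/O, the 
    bottleneck is the NFS server and its disks rather than the network.)

        # Rough sketch: sample client NIC throughput and compare it to link capacity.
        # Assumes psutil is installed; IFACE and LINK_GBPS are placeholders.
        import time
        import psutil

        IFACE = "eth0"      # hypothetical interface name
        LINK_GBPS = 10      # assumed 10GbE link
        INTERVAL = 5        # seconds between samples

        def nic_bytes(iface):
            c = psutil.net_io_counters(pernic=True)[iface]
            return c.bytes_recv + c.bytes_sent

        prev = nic_bytes(IFACE)
        while True:
            time.sleep(INTERVAL)
            cur = nic_bytes(IFACE)
            gbps = (cur - prev) * 8 / INTERVAL / 1e9
            print(f"{gbps:.2f} Gb/s ({100 * gbps / LINK_GBPS:.0f}% of link)")
            prev = cur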
    
    >For GPU training-type jobs that load the same set of data over and over 
    >again, local node SSD is a good solution.  Especially with the dropping 
    >SSD prices.
    >
    
    Good to hear :)
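
    For concreteness, here's a rough sketch of what per-node staging to local 
    SSD could look like; the NFS path, the local scratch mount, and the 
    directory layout below are placeholder assumptions, not anything standard:

        # Minimal sketch: stage a training dataset to node-local SSD once, then
        # train against the local copy.  /nfs/datasets/train and /local/scratch
        # are placeholder paths for illustration only.
        import os
        import shutil

        NFS_DATASET = "/nfs/datasets/train"   # shared copy on the NFS filer (assumed path)
        LOCAL_ROOT = "/local/scratch"         # node-local SSD mount (assumed path)
        job_id = os.environ.get("SLURM_JOB_ID", "manual")
        local_copy = os.path.join(LOCAL_ROOT, job_id, os.path.basename(NFS_DATASET))

        if not os.path.isdir(local_copy):
            os.makedirs(os.path.dirname(local_copy), exist_ok=True)
            shutil.copytree(NFS_DATASET, local_copy)  # one slow pull over NFS per job

        # ...the training loop then reads local_copy each epoch instead of hitting NFS...
        print(f"training data staged at {local_copy}")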
    
    >For an example architecture, take a look at the DDN "AI" or IBM "AI" 
    >solutions. I think they generally take a storage box with lots of flash 
    >storage and connect it via 2 or 4 100Gb links to something like an nvidia 
    >DGX (compute node with 8 GPUs).  Presumably they are doing mostly small 
    >file reads.
    >
    >In my case, I have whitebox compute nodes with GPUs and SSDs and whitebox 
    >ZFS servers connected at 40GbE.  A fraction of the performance at a 
    >fraction of the price.
    >
    
    Same here, but connected at only 10G... Again, no budget (as of yet, 
    anyhow) to do a 25/40/50/100G network or all-flash storage :(
    
    >Regards,
    >Alex
    
