Sent: Tuesday, February 26, 2019 22:25
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters
But rsync -a will only help you if people are using identical or at
least overlapping data sets? And you don't need rsync to prune out old
files.
On 2/26/19 1:53 AM, Janne Blomqvist wrote:
> On 22/02/2019 18.50, Will Dennis wrote:
>> Hi folks,
>>
>> Not directly Slurm-related, but... We have a
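(For reference, a minimal sketch of the staging pattern being debated here: mirror a shared dataset from NFS to node-local scratch with rsync -a, where --delete also prunes files removed upstream. The paths are assumptions, not taken from the thread.)

    # Assumed paths; mirror one dataset to node-local scratch and prune
    # anything that no longer exists on the NFS side.
    rsync -a --delete /nfs/datasets/projA/ /mnt/local/datasets/projA/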
Hi,
I'd like to share our set-up as well, even though it's very
specialized and thus probably won't work in most places. However, it's
also very efficient in terms of budget when it does.
Our users don't usually have shared data sets, so we don't need high
bandwidth at any particular point -- the
On 26.02.19 at 09:20, Tru Huynh wrote:
> On Fri, Feb 22, 2019 at 04:46:33PM -0800, Christopher Samuel wrote:
>> On 2/22/19 3:54 PM, Aaron Jackson wrote:
>>
>>> Happy to answer any questions about our setup.
>>
>>
>
>> Email me directly to get added (I had to disable the Mailman web
> Could you add me to that list?
Hi Janne,
On Tue, Feb 26, 2019 at 3:56 PM Janne Blomqvist wrote:
> When reaping, it searches for these special .datasync directories (up to
> a configurable recursion depth, say 2 by default), and based on the
> LAST_SYNCED timestamps, deletes entire datasets starting with the oldest
> LAST_SYNCED timestamp.
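(A hypothetical sketch of such a reaper in shell; the .datasync/LAST_SYNCED layout, paths and thresholds below are assumptions for illustration, not Janne's actual implementation.)

    #!/bin/bash
    # Each dataset directory is expected to contain .datasync/LAST_SYNCED
    # holding a Unix timestamp; whole datasets are removed, oldest first,
    # until filesystem usage drops below a limit.
    SCRATCH=/local/scratch   # assumed scratch root
    MAXDEPTH=3               # how deep to look for .datasync directories
    LIMIT=90                 # reap until usage (%) is below this

    usage_pct() { df --output=pcent "$SCRATCH" | tail -n1 | tr -dc '0-9'; }

    find "$SCRATCH" -maxdepth "$MAXDEPTH" -type d -name .datasync -printf '%h\n' |
    while read -r dataset; do
        printf '%s %s\n' "$(cat "$dataset/.datasync/LAST_SYNCED")" "$dataset"
    done |
    sort -n |
    while read -r ts dataset; do
        [ "$(usage_pct)" -le "$LIMIT" ] && break
        rm -rf -- "$dataset"
    done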
On Fri, Feb 22, 2019 at 04:46:33PM -0800, Christopher Samuel wrote:
> On 2/22/19 3:54 PM, Aaron Jackson wrote:
>
> >Happy to answer any questions about our setup.
>
>
>
> Email me directly to get added (I had to disable the Mailman web
Could you add me to that list?
Thanks
Tru
--
Dr Tru Huynh
On 22/02/2019 18.50, Will Dennis wrote:
Hi folks,
Not directly Slurm-related, but... We have a couple of research groups
that have large data sets they are processing via Slurm jobs
(deep-learning applications) and are presently consuming the data via
NFS mounts (both groups have 10G ethernet
Will, there are some excellent responses here.
I agree that moving data to local fast storage on a node is a great idea.
Regarding the NFS storage, I would look at implementing BeeGFS if you can
get some new hardware or free up existing hardware.
BeeGFS is a skoosh case to set up.
(*) Scottish slang for "very easy".
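(For anyone curious, a rough sketch of the BeeGFS quick-start flow as I recall it from the BeeGFS docs; the hostnames, paths and IDs are assumptions, so check the current manual before running anything.)

    # One management daemon, one metadata and one storage target, then a
    # client pointed at the management host (mgmt01 is an assumed name).
    /opt/beegfs/sbin/beegfs-setup-mgmtd -p /data/beegfs/mgmtd
    /opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta -s 1 -m mgmt01
    /opt/beegfs/sbin/beegfs-setup-storage -p /data/beegfs/storage -s 1 -i 101 -m mgmt01
    /opt/beegfs/sbin/beegfs-setup-client -m mgmt01
    systemctl start beegfs-mgmtd beegfs-meta beegfs-storage beegfs-helperd beegfs-client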
Hi Will,
On 23/2/2019 1:50 AM, Will Dennis wrote:
For one of my groups, on the GPU servers in their cluster, I have provided a RAID-0 md array of multi-TB SSDs
(for I/O speed) mounted on a given path ("/mnt/local" for historical reasons) that they can use for
local scratch space. Their othe
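(A minimal sketch of that kind of node-local scratch array; the device names and filesystem choice are assumptions, not Will's actual layout.)

    # Stripe two local SSDs into a RAID-0 md array and mount it as scratch.
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
    mkfs.xfs /dev/md0
    mkdir -p /mnt/local
    mount /dev/md0 /mnt/local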
On 2/22/19 3:54 PM, Aaron Jackson wrote:
Happy to answer any questions about our setup.
If folks are interested in a mailing list where this discussion would be
decidedly on-topic then I'm happy to add people to the Beowulf list
where there's a lot of other folks with expertise in this area.
Hi Will,
I look after our GPU cluster in our vision lab. We have a similar setup
- we are working from a single ZFS file server. We have two pools:
/db which is about 40TB spinning SAS built out of two raidz vdevs, with
16TB of L2ARC (across 4 SSDs). This reduces the size of ARC quite
significantly.
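(A minimal sketch of a pool along those lines; the disk names and counts are assumptions, not Aaron's actual layout.)

    # Two raidz vdevs of spinning disks plus four SSDs as L2ARC cache
    # devices; the pool mounts at /db by default.
    zpool create db \
        raidz sda sdb sdc sdd sde \
        raidz sdf sdg sdh sdi sdj \
        cache nvme0n1 nvme1n1 nvme2n1 nvme3n1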
We stuck Avere between Isilon and a cluster to get us over the hump until the next
budget cycle ... then we replaced it with Spectrum Scale for mid-level storage.
Still use Lustre as scratch, of course.
On 2/22/19, 12:24 PM, "slurm-users on behalf of Will Dennis" wrote:
(replies inline)
Yes, we've thought about using FS-Cache, but it doesn't help on the first
read-in, and the cache eviction may affect subsequent read attempts...
(different people are using different data sets, and the cache will probably
not hold all of them at the same time...)
On Friday, February 22, 2019 2
applications) and are presently consuming the data via NFS mounts (both
groups have 10G ethernet interconnects between the Slurm nodes and the NFS
servers.) They are both now complaining of "too-long loading times" for the
how about just using cachefs (backed by a local filesystem on ssd)?
http
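(For reference, a minimal sketch of wiring up FS-Cache via cachefilesd with the cache on a local SSD filesystem; the paths and export names are assumptions.)

    # /etc/cachefilesd.conf (cache directory on the local SSD):
    #   dir /ssd/fscache
    #   tag nfscache
    systemctl enable --now cachefilesd
    # The 'fsc' mount option turns on client-side caching for this mount.
    mount -t nfs -o fsc nfsserver:/export/data /data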
(replies inline)
On Friday, February 22, 2019 1:03 PM, Alex Chekholko said:
>Hi Will,
>
>If your bottleneck is now your network, you may want to upgrade the network.
>Then the disks will become your bottleneck :)
>
Via network bandwidth analysis, it's not really network that's the problem...
Hi Will,
You have bumped into the old adage: "HPC is just about moving the
bottlenecks around".
If your bottleneck is now your network, you may want to upgrade the
network. Then the disks will become your bottleneck :)
For GPU training-type jobs that load the same set of data over and over
again
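(A minimal sketch of the stage-in pattern this points at: copy the dataset to node-local scratch once per job and train from the local copy. The paths, scratch location and train.py are assumptions.)

    #!/bin/bash
    #SBATCH --job-name=train
    #SBATCH --gres=gpu:1
    SCRATCH=/mnt/local/$SLURM_JOB_ID
    mkdir -p "$SCRATCH"
    # One bulk copy from NFS, then every epoch reads from local SSD.
    rsync -a /nfs/datasets/projA/ "$SCRATCH/projA/"
    python train.py --data "$SCRATCH/projA"
    rm -rf "$SCRATCH"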
At least in our case we use a Lustre filesystem for scratch access; we
have it mounted over IB, though. That said, some of our nodes only access
it over 1GbE and I have never heard any complaints about
performance. In general for large scale production work Lustre tends to
be more resilient
Thanks for the reply, Ray.
For one of my groups, on the GPU servers in their cluster, I have provided a
RAID-0 md array of multi-TB SSDs (for I/O speed) mounted on a given path
("/mnt/local" for historical reasons) that they can use for local scratch
space. Their other servers in the cluster ha
Hi Will,
On 23/2/2019 12:50 AM, Will Dennis wrote:
...
would be considered “scratch space”, not for long-term data
storage, but for use over the lifetime of a job, or perhaps
a few sequential jobs (given the nature of the work.)
“Permanent” storage would remain the existing NFS servers
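(A hypothetical sketch of enforcing that lifetime with a Slurm Epilog script, configured via Epilog= in slurm.conf; the /mnt/local path is an assumption.)

    #!/bin/bash
    # Runs on each node when a job finishes; removes that job's
    # node-local scratch directory so the space stays short-lived.
    [ -n "$SLURM_JOB_ID" ] && rm -rf -- "/mnt/local/$SLURM_JOB_ID"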