Converting the files to TFRecords or similar would be one obvious approach if you are concerned about metadata. That said, I'd understand why some people would not want that (size, augmentation process). I assume you are doing the training in a distributed fashion using MPI via Horovod or similar, and it might be tempting to do file partitioning across the nodes. However, doing so introduces a bias into minibatches (and requires custom preprocessing). If you partition carefully by mapping classes to nodes it may work, but I also understand why some wouldn't be totally happy with that. I've trained Keras/TF/Horovod models on ImageNet using up to 6 nodes, each with four P100/V100 GPUs, and it worked reasonably well. As the training still took a few days, copying to local NVMe disks was a good option. HTH
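For what it's worth, the "map classes to nodes" idea can be done without any coordination if each worker derives its shard deterministically. A minimal sketch (pure Python, no Horovod dependency; the directory-per-class layout and node count are just illustrative assumptions):

```python
# Sketch: deterministic class -> node sharding. Every worker computes the
# same assignment independently, so nodes read disjoint subsets of the
# dataset with no communication. Class names here are hypothetical
# ImageNet-style synset ids.
import hashlib

def node_for_class(class_name: str, num_nodes: int) -> int:
    """Stable class -> node assignment; identical on every worker."""
    h = hashlib.md5(class_name.encode()).hexdigest()
    return int(h, 16) % num_nodes

def shard_for_node(classes, node_rank: int, num_nodes: int):
    """Return the classes this node should stage/read locally."""
    return [c for c in classes if node_for_class(c, num_nodes) == node_rank]

if __name__ == "__main__":
    classes = [f"n{i:08d}" for i in range(1000)]
    shards = [shard_for_node(classes, r, 6) for r in range(6)]
    # Shards are disjoint and cover every class exactly once.
    assert sum(len(s) for s in shards) == len(classes)
```

Note the caveat from above still applies: each node's minibatches only contain its own classes, so you need per-node shuffling (or periodic re-sharding) to keep gradients from drifting.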
On Fri, 28 Jun 2019, 18:47 Mark Hahn, <h...@mcmaster.ca> wrote:
> Hi all,
> I wonder if anyone has comments on ways to avoid metadata bottlenecks
> for certain kinds of small-io-intensive jobs. For instance, ML on imagenet,
> which seems to be a massive collection of trivial-sized files.
>
> A good answer is "beef up your MD server, since it helps everyone".
> That's a bit naive, though (no money-trees here.)
>
> How about things like putting the dataset into squashfs or some other
> image that can be loop-mounted on demand? sqlite? perhaps even a format
> that can simply be mmaped as a whole?
>
> personally, I tend to dislike the approach of having a job stage tons of
> stuff onto node storage (when it exists) simply because that guarantees a
> waste of cpu/gpu/memory resources for however long the stagein takes...
>
> thanks, mark hahn.
> --
> operator may differ from spokesperson. h...@mcmaster.ca
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf