+1 Would be best to have a controlled environment so we can reason about how MXNet is being built and which libraries are linked. I'm happy to help here. I wouldn't expect Docker to have a big impact on the measurements or to distort the results much.
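For instance, once we have that container, a few lines of Python would show exactly which OpenMP runtimes actually end up loaded (a minimal sketch, assuming a Linux image with a CPU build of MXNet installed, since it reads /proc):

import mxnet  # noqa: F401 - loading libmxnet.so pulls in whatever OpenMP it links

# Scan this process's memory mappings for OpenMP runtimes. Seeing more than
# one of libgomp / libiomp5 / libomp here would confirm the double-linking
# problem discussed below.
with open("/proc/self/maps") as maps:
    omp_libs = {line.split()[-1] for line in maps
                if "omp" in line.rsplit("/", 1)[-1]}
for lib in sorted(omp_libs):
    print(lib)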
On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <[email protected]> wrote:

> I've also quite often seen two versions of OpenMP linked. I think we can all agree we probably want to avoid linking in two libraries that do effectively the same thing.
>
> The performance questions should be fairly straightforward to demonstrate, right? Could we just collaborate on a few minimal Dockerfiles that show (or don't show) Intel OpenMP performance speedups with the workloads Chris is referencing?
>
> On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <[email protected]> wrote:
>
> > Hi, Chris!
> >
> > Stas here - I've gathered that performance data. Sure thing, I can be wrong, but please elaborate a bit on what we are missing. Be assured, intentional misdirection was never the case.
> >
> > Thanks a lot for being constructive.
> >
> > > Turning Intel OMP on and off (and MKL as well, since it tends to pull in omp, depending which one is linked in).
> >
> > We never ever considered turning MKL off - why should we? We are on the same page here: MKL is crucial for performance, and there's a GOMP-linked version of MKL that we can use.
> >
> > What we did was measure whether using the compiler's default OpenMP implementation, instead of the referenced source-code distribution of OpenMP, makes anything slower. We found the impact to be hardly measurable: the difference between GOMP and iOMP is <5% on our benchmarks, most of the time less than that.
> >
> > We simply suggest simplifying the build of mxnet by removing the unnecessary dependency.
> >
> > While doing that, we discovered, for example, the following issue: https://github.com/apache/incubator-mxnet/issues/14087
> >
> > Best Regards
> >
> > Stas
> >
> > On 18.06.19, 18:24, "Chris Olivier" <[email protected]> wrote:
> >
> > I am very reluctant to feed the trolls again, and this will be the last time I address Pedro or Anton on the subject, but since I think the numbers being presented are incorrect (either by the builders not really understanding what they are building, or possibly intentional misdirection):
> >
> > Turning Intel OMP on and off (and MKL as well, since it tends to pull in omp, depending which one is linked in):
> >
> > There is a HUGE difference. This is consistent with my experience before, when it was added.
> > default mnist:
> >
> > python ../example/image-classification/train_mnist.py
> > INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
> >
> > INTEL OMP:
> >
> > ldd libmxnet.so | grep omp
> > libomp.so => /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so (0x00007f978fde7000)
> >
> > INFO:root:Epoch[0] Batch [0-100] Speed: 31548.09 samples/sec accuracy=0.780012
> > INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21 samples/sec accuracy=0.920469
> > INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec accuracy=0.928281
> > INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec accuracy=0.942813
> > INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec accuracy=0.938750
> > INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec accuracy=0.946562
> > INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec accuracy=0.953281
> > INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec accuracy=0.951562
> > INFO:root:Epoch[0] Batch [800-900] Speed: 13959.88 samples/sec accuracy=0.957500
> > INFO:root:Epoch[0] Train-accuracy=0.925423
> > INFO:root:Epoch[0] Time cost=3.806
> > INFO:root:Epoch[0] Validation-accuracy=0.962580
> > INFO:root:Epoch[1] Batch [0-100] Speed: 24560.21 samples/sec accuracy=0.968131
> > INFO:root:Epoch[1] Batch [100-200] Speed: 23457.03 samples/sec accuracy=0.966250
> >
> > LIBGOMP:
> >
> > ldd libmxnet.so | grep omp
> > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f25c25dd000)
> >
> > INFO:root:Epoch[0] Batch [0-100] Speed: 1731.01 samples/sec accuracy=0.782488
> > INFO:root:Epoch[0] Batch [100-200] Speed: 3551.32 samples/sec accuracy=0.907813
> > INFO:root:Epoch[0] Batch [200-300] Speed: 1991.00 samples/sec accuracy=0.927188
> > INFO:root:Epoch[0] Batch [300-400] Speed: 2175.45 samples/sec accuracy=0.937969
> > INFO:root:Epoch[0] Batch [400-500] Speed: 1644.95 samples/sec accuracy=0.942187
> > INFO:root:Epoch[0] Batch [500-600] Speed: 6444.58 samples/sec accuracy=0.950156
> > INFO:root:Epoch[0] Batch [600-700] Speed: 7842.16 samples/sec accuracy=0.947969
> > INFO:root:Epoch[0] Batch [700-800] Speed: 9412.07 samples/sec accuracy=0.953750
> > INFO:root:Epoch[0] Batch [800-900] Speed: 12707.58 samples/sec accuracy=0.953125
> >
> > That being said, there are other issues beyond speed. The DEFAULT build from the Makefile (not CMake) uses Intel OMP MKL (I showed this before) and mysteriously has no issues? That seems highly suspicious. All I see is a lot of hand-waving and conjecture, and pointing to StackOverflow posts made by people who may be of questionable pedigree to begin with. This smells of a Pedro-ego-fight rather than one of purely technical merit.
> > Also, if one knows how OMP works, one would be very suspicious of the "intermittent hangs" claim -- that's probably just broken race conditions elsewhere until proven otherwise. If something were wrong, it would tend to freeze on the first use (try using libgomp after a fork and see), since worker threads wouldn't be assigned/joined properly. Intel OMP is faster, but it also has other advantages, such as allowing OMP after a fork.
> >
> > I actually addressed a lot of issues and asked for clarification in the original PR's way back when, but they were all just ignored.
> >
> > -Chris
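As an aside, Chris's fork experiment above is easy to try from Python (a minimal sketch, assuming a CPU build of MXNet; on a libgomp-linked build the child would be expected to hang rather than print):

import os
import mxnet as mx

# Warm up: the first parallel operator spins up the OpenMP worker pool in
# the parent process.
a = mx.nd.random.uniform(shape=(2048, 2048))
mx.nd.dot(a, a).wait_to_read()

pid = os.fork()
if pid == 0:
    # fork() duplicates only the calling thread. libgomp's worker pool is
    # gone in the child, so this parallel region can deadlock; Intel/LLVM
    # OpenMP re-initializes its runtime after fork and proceeds.
    mx.nd.dot(a, a).wait_to_read()
    print("child finished - this OpenMP runtime tolerates fork")
    os._exit(0)
os.waitpid(pid, 0)

Running that inside the proposed containers would at least separate the hang-after-fork question from the raw-speed one.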
