Hi, Chris!

Stas here - I've gathered that performance data.
Of course I may be wrong, so please elaborate a bit on what we are missing.
Rest assured, intentional misdirection was never the case.

Thanks a lot for being constructive. 

> Turning Intel OMP on and off (and MKL as well, since it tends to pull in omp, 
> depending which one is linked in).

We never considered turning MKL off. We are on the same page here: MKL is 
crucial for performance. 
Why would we turn it off? There is a GOMP-linked version of MKL that we can use.
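
To be concrete: MKL ships a GNU threading layer (libmkl_gnu_thread) precisely
so that it can run on top of GOMP. Below is a minimal sketch of such a build;
the paths and the exact link line are assumptions, please check the MKL
link-line advisor for your version.

/* Hypothetical check that MKL works when threaded via GOMP.
 * Assumed link line (GNU threading layer instead of Intel OMP):
 *   gcc gemm_check.c -fopenmp -I$MKLROOT/include -L$MKLROOT/lib/intel64 \
 *       -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
 */
#include <mkl.h>
#include <stdio.h>

int main(void) {
    const int n = 512;
    double *a = mkl_malloc(n * n * sizeof(double), 64);
    double *b = mkl_malloc(n * n * sizeof(double), 64);
    double *c = mkl_malloc(n * n * sizeof(double), 64);
    for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* MKL parallelizes the GEMM with whichever OpenMP runtime it was linked
       against - here GOMP, via libmkl_gnu_thread. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %.1f (expected 1024.0)\n", c[0]);
    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}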

What we did was measure whether using the compiler's default OpenMP 
implementation, instead of the OpenMP runtime built from the bundled sources, 
makes anything slower.
We found the impact to be hardly measurable: 
the difference between GOMP and iOMP is below 5% on our benchmarks, and most 
of the time well under that. 
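
If it helps to reproduce the comparison outside of mxnet, here is a rough
micro-benchmark sketch. It is only illustrative (our numbers came from the
mxnet training scripts): build it once with gcc -O2 -fopenmp (GOMP) and once
with clang -O2 -fopenmp (LLVM libomp, the same codebase as Intel OMP), then
compare the reported times.

/* Hypothetical micro-benchmark for comparing OpenMP runtimes. Many short
 * parallel regions stress fork/join overhead, which is where runtimes
 * differ most. */
#include <omp.h>
#include <stdio.h>

#define ITERS 10000
#define N 4096

int main(void) {
    static double data[N];
    double start = omp_get_wtime();

    for (int it = 0; it < ITERS; it++) {
        /* Each iteration spawns and joins a parallel region. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            data[i] += 1.0;
    }

    printf("threads=%d elapsed=%.3fs\n",
           omp_get_max_threads(), omp_get_wtime() - start);
    return 0;
}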

We merely suggest simplifying the build of mxnet by removing an unnecessary 
dependency.

Along the way we discovered, for example, this amazing issue:
https://github.com/apache/incubator-mxnet/issues/14087

Best Regards

Stas

On 18.06.19, 18:24, "Chris Olivier" <[email protected]> wrote:

    I am very reluctant to feed the trolls again, and this will be the last
    time I address Pedro or Anton on the subject, but since I think the numbers
    being presented are incorrect (either by the builders not really
    understanding what they are building, or possibly intentional misdirection):
    
    Turning Intel OMP on and off (and MKL as well, since it tends to pull in
    omp, depending which one is linked in).
    There is a HUGE difference.  This is consistent with my experience before
    when it was added.
    
    
    default mnist:
    
    python ../example/image-classification/train_mnist.py
    INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
    disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
    gpus=None, image_shape='1, 28, 28', initializer='default',
    kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
    lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
    monitor=0, network='mlp', num_classes=10, num_epochs=20,
    num_examples=60000, num_layers=None, optimizer='sgd',
    profile_server_suffix='', profile_worker_suffix='', save_period=1,
    test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
    
    INTEL OMP:
    
    ldd libmxnet.so | grep omp
            libomp.so =>
    /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
    (0x00007f978fde7000)
    
    INFO:root:Epoch[0] Batch [0-100]        Speed: 31548.09 samples/sec
    accuracy=0.780012
    INFO:root:Epoch[0] Batch [100-200]      Speed: 16073.21 samples/sec
    accuracy=0.920469
    INFO:root:Epoch[0] Batch [200-300]      Speed: 19075.91 samples/sec
    accuracy=0.928281
    INFO:root:Epoch[0] Batch [300-400]      Speed: 23211.36 samples/sec
    accuracy=0.942813
    INFO:root:Epoch[0] Batch [400-500]      Speed: 22139.79 samples/sec
    accuracy=0.938750
    INFO:root:Epoch[0] Batch [500-600]      Speed: 23225.52 samples/sec
    accuracy=0.946562
    INFO:root:Epoch[0] Batch [600-700]      Speed: 19547.41 samples/sec
    accuracy=0.953281
    INFO:root:Epoch[0] Batch [700-800]      Speed: 24111.73 samples/sec
    accuracy=0.951562
    INFO:root:Epoch[0] Batch [800-900]      Speed: 13959.88 samples/sec
    accuracy=0.957500
    INFO:root:Epoch[0] Train-accuracy=0.925423
    INFO:root:Epoch[0] Time cost=3.806
    INFO:root:Epoch[0] Validation-accuracy=0.962580
    INFO:root:Epoch[1] Batch [0-100]        Speed: 24560.21 samples/sec
    accuracy=0.968131
    INFO:root:Epoch[1] Batch [100-200]      Speed: 23457.03 samples/sec
    accuracy=0.966250
    
    
    LIBGOMP:
    
    ldd libmxnet.so | grep omp
            libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
    (0x00007f25c25dd000)
    
    INFO:root:Epoch[0] Batch [0-100]        Speed: 1731.01 samples/sec
     accuracy=0.782488
    INFO:root:Epoch[0] Batch [100-200]      Speed: 3551.32 samples/sec
     accuracy=0.907813
    INFO:root:Epoch[0] Batch [200-300]      Speed: 1991.00 samples/sec
     accuracy=0.927188
    INFO:root:Epoch[0] Batch [300-400]      Speed: 2175.45 samples/sec
     accuracy=0.937969
    INFO:root:Epoch[0] Batch [400-500]      Speed: 1644.95 samples/sec
     accuracy=0.942187
    INFO:root:Epoch[0] Batch [500-600]      Speed: 6444.58 samples/sec
     accuracy=0.950156
    INFO:root:Epoch[0] Batch [600-700]      Speed: 7842.16 samples/sec
     accuracy=0.947969
    INFO:root:Epoch[0] Batch [700-800]      Speed: 9412.07 samples/sec
     accuracy=0.953750
    INFO:root:Epoch[0] Batch [800-900]      Speed: 12707.58 samples/sec
    accuracy=0.953125
    
    That being said, there are other issues beyond speed.  The DEFAULT build from
    the makefile (not CMake) uses Intel OMP with MKL (I showed this before) and
    mysteriously it has no issues?  This seems highly suspicious.  All I see is a
    lot of hand-waving and conjecture and pointing to StackOverflow posts made by
    people who may be of questionable pedigree to begin with.  This smells of a
    Pedro-ego-fight rather than one of purely technical merit.  Also, if one
    knows how OMP works, they would be very suspicious of the "intermittent
    hangs" claim -- that's probably just broken race conditions elsewhere until
    proven otherwise.  It'd tend to freeze on the first use if something were
    wrong (try using libgomp after a fork and see), since worker threads
    wouldn't be assigned/joined properly.  Intel OMP is faster, but also has
    other advantages, such as allowing OMP after a fork.
    
    I actually addressed a lot of issues and asked for clarification in the
    original PRs way back when, but they were all just ignored.
    
    -Chris
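
For anyone who wants to try the fork experiment Chris describes, here is a
minimal sketch. Compile with gcc -fopenmp to exercise libgomp and with
clang -fopenmp for the LLVM/Intel runtime; the exact behaviour varies by
runtime and version, so treat this as an illustration, not a guarantee.

#include <omp.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Warm up the runtime: the worker-thread pool is created here. */
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("parent: %d threads\n", omp_get_num_threads());
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: only the forking thread survives fork(), so the runtime's
           worker pool is gone.  With libgomp this first region after the
           fork can hang; LLVM/Intel OMP tends to tolerate it. */
        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)
                printf("child: %d threads\n", omp_get_num_threads());
        }
        _exit(0);
    }
    wait(NULL);
    return 0;
}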
    

