This vote has been closed. We will create another tag and start the vote again. -sz
> On Jun 18, 2019, at 5:24 PM, Lin Yuan <[email protected]> wrote:
>
> With the PR https://github.com/apache/incubator-mxnet/pull/15213 I could
> verify that building Horovod is successful with MXNet built from source. So
> I will remove my previous -1 vote.
>
> Best,
>
> Lin
>
>> On Tue, Jun 18, 2019 at 2:10 PM Junru Shao <[email protected]> wrote:
>>
>> Dear community,
>>
>> I am happy to share some results with regard to commit 83d2c2d0e (PR
>> #14192, link: https://github.com/apache/incubator-mxnet/pull/14192), which
>> Pedro mentioned causes a regression.
>>
>> First, using the exact model that Pedro provided, we did rigorous profiling
>> and found that PR #14192 slows it down by 7.26 ms (from 235.65 ms
>> to 242.91 ms).
>>
>> Then, we submitted a follow-up PR #15262 (link:
>> https://github.com/apache/incubator-mxnet/pull/15262) to fix the
>> regression. By applying the patch to commit 83d2c2d0e, we could verify that
>> we get comparable performance. Please refer to the PR if you are interested
>> in our experiment.
>>
>> That is to say, the regression caused by commit 83d2c2d0e should have been
>> addressed. Please let me know if there are any future issues.
>>
>> Thank you so much,
>> Junru
>>
>> On Thu, Jun 13, 2019 at 3:05 PM Pedro Larroy <[email protected]>
>> wrote:
>>
>>> I reached out to you in private; the model is not public. We should be able to
>>> see this problem in a public model using LSTM I think.
>>>
>>> On Thu, Jun 13, 2019 at 11:15 AM Junru Shao <[email protected]>
>>> wrote:
>>>>
>>>> Hi Pedro,
>>>>
>>>> Thanks for bringing this up!
>>>>
>>>> Could you provide your model so that we can dig into this?
>>>>
>>>> Thanks,
>>>> Junru
>>>>
>>>> On Thu, Jun 13, 2019 at 10:33 Pedro Larroy <[email protected]>
>>>> wrote:
>>>>
>>>>> I have isolated some of the commits that are causing performance
>>>>> regressions in wavenet like models:
>>>>>
>>>>> Title: 83d2c2d0e:[MXNET-1324] Add NaiveRunGraph to imperative utils
>>>>> (#14192)
>>>>>
>>>>> Causes a regression making hybridize with static slower using GPU
>>>>> inference.
>>>>>
>>>>> [0f63659be5070af218095a6a460427d2a1b67aba] add a compiler flag to use
>>>>> int64 as tensor size (#14570)
>>>>>
>>>>> Causes overall regressions in CPU inference.
>>>>>
>>>>> Pedro.
>>>>>
>>>>> On Wed, Jun 12, 2019 at 11:52 AM Lai Wei <[email protected]> wrote:
>>>>>>
>>>>>> Hi @dev,
>>>>>>
>>>>>> I am canceling the vote as the issue Lin discovered requires a fix [1] and
>>>>>> the solution is not ready yet.
>>>>>> It's a general problem when building from source with MXNet, not only
>>>>>> impacting horovod use cases. Any help is appreciated.
>>>>>>
>>>>>> Other issues we are tracking:
>>>>>> 1. Regression on hybridize with static_alloc. (not a blocker for now)
>>>>>> 2. Scala doc issue [2], already merged in master, need to backport to
>>>>>> 1.5.x
>>>>>>
>>>>>> Thanks for everyone's help! Please let us know if there is any other issue
>>>>>> with 1.5.0
>>>>>>
>>>>>> [1] https://github.com/apache/incubator-mxnet/pull/15213
>>>>>> [2] https://github.com/apache/incubator-mxnet/pull/15216
>>>>>>
>>>>>> Best Regards
>>>>>>
>>>>>> Lai
>>>>>>
>>>>>> On Tue, Jun 11, 2019 at 5:04 PM Pedro Larroy <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Tested with CPU: 2.6x slower, comparing master vs 1.4.1.
>>>>>>>
>>>>>>> Looks like a general regression.
>>>>>>>
>>>>>>> On Tue, Jun 11, 2019 at 2:31 PM Lai Wei <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> Thanks for the updates.
>>>>>>>> Currently, we are able to confirm Lin's issue with
>>>>>>>> Horovod, and there is a fix pending. [1]
>>>>>>>> Will update later today to see if we need to cancel this vote for the fix.
>>>>>>>>
>>>>>>>> As for the hybridize with static alloc performance regression, IMO it does
>>>>>>>> not need to be a blocker if we have the following speed order:
>>>>>>>> 1.5.0 w/o static > 1.5.0 w/ static > 1.4.1 w/ static > 1.4.1 w/o static
>>>>>>>> and it will be great to know the following to better make a decision on
>>>>>>>> whether this should block the release.
>>>>>>>> 1) if this is a model specific or a general regression.
>>>>>>>> 2) if this is platform specific or general (w/ or w/o CUDA, w/ or w/o
>>>>>>>> MKLDNN)
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/incubator-mxnet/pull/15213
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Best Regards
>>>>>>>>
>>>>>>>> Lai
>>>>>>>>
>>>>>>>> On Tue, Jun 11, 2019 at 1:46 PM Zhi Zhang <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On 2019/06/11 18:53:56, Pedro Larroy <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>> The stack trace doesn't seem to come from MXNet, do you have more
>>>>>>>>>> info?
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 2019/06/11 17:36:09, Pedro Larroy <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> A bit more background into this:
>>>>>>>>>>>>
>>>>>>>>>>>> While tuning a model using LSTM and convolutions we find that using
>>>>>>>>>>>> hybridize with static_alloc and static_shape is 15% slower in the
>>>>>>>>>>>> latest revision vs in version 1.4.1 in which using hybridize with
>>>>>>>>>>>> static_alloc and static_shape is 10% faster than without.
>>>>>>>>>>>>
>>>>>>>>>>>> Overall we are still 33% faster when comparing master to 1.5.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me know if you think this is a release blocker or not.
>>>>>>>>>>>>
>>>>>>>>>>>> Pedro.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -1
>>>>>>>>>>>>>
>>>>>>>>>>>>> We found a performance regression vs 1.4 related to CachedOp which
>>>>>>>>>>>>> affects Hybrid forward, which we are looking into.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Pedro.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -1 (Tentatively until resolved)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried to build MXNet 1.5.0 from source and pip install horovod but got
>>>>>>>>>>>>>> the following error:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reproduce:
>>>>>>>>>>>>>> 1) cp make/config.mk .
>>>>>>>>>>>>>> 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
>>>>>>>>>>>>>> 3) make -j
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MXNet can build successfully.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 4) pip install horovod
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
>>>>>>>>>>>>>> fatal error: mkldnn_version.h: No such file or directory
>>>>>>>>>>>>>> compilation terminated.
>>>>>>>>>>>>>> INFO: Unable to build MXNet plugin, will skip it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I did not change any setting of MKLDNN in my config.mk.
>>>>>>>>>>>>>> I am building on
>>>>>>>>>>>>>> DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Lai Wei <[email protected]> wrote on Sun, Jun 9, 2019 at 4:12 AM:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear MXNet community,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is the 3-day vote to release Apache MXNet (incubating) version
>>>>>>>>>>>>>>>> 1.5.0.
>>>>>>>>>>>>>>>> Voting on dev@ will start June 8, 23:59:59(PST) and close on June 11,
>>>>>>>>>>>>>>>> 23:59:59.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) Link to release notes:
>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2) Link to release candidate:
>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 3) Link to source and signatures on apache dist server:
>>>>>>>>>>>>>>>> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please remember to TEST first before voting accordingly:
>>>>>>>>>>>>>>>> +1 = approve
>>>>>>>>>>>>>>>> +0 = no opinion
>>>>>>>>>>>>>>>> -1 = disapprove (provide reason)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best Regards
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Lai
>>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -1. Built from source; import mxnet in python causes a segfault.
>>>>>>>>>>>
>>>>>>>>>>> back trace:
>>>>>>>>>>>
>>>>>>>>>>> Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
>>>>>>>>>>> 0x00007fff3e8a9f20 in ?? ()
>>>>>>>>>>> (gdb) bt
>>>>>>>>>>> #0 0x00007fff3e8a9f20 in ?? ()
>>>>>>>>>>> #1 0x00007fffebbf440c in ReadConfigFile(Configuration&,
>>>>>>>>>>> std::__cxx11::basic_string<char, std::char_traits<char>,
>>>>>>>>>>> std::allocator<char> > const&, bool const&, unsigned int const&) ()
>>>>>>>>>>> from /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
>>>>>>>>>>> #2 0x00007fffebbf3d97 in ReadConfigDir(Configuration&,
>>>>>>>>>>> std::__cxx11::basic_string<char, std::char_traits<char>,
>>>>>>>>>>> std::allocator<char> > const&, bool const&, unsigned int const&) ()
>>>>>>>>>>> from /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
>>>>>>>>>>> #3 0x00007fffebc5e9aa in pkgInitConfig(Configuration&) () from
>>>>>>>>>>> /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
>>>>>>>>>>> #4 0x00007ffff29d5c48 in ?? () from
>>>>>>>>>>> /usr/lib/python3/dist-packages/apt_pkg.cpython-35m-x86_64-linux-gnu.so
>>>>>>>>>>> #5 0x00000000004ea10f in PyCFunction_Call ()
>>>>>>>>>>> #6 0x0000000000536d94 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #7 0x000000000053fc97 in ?? ()
>>>>>>>>>>> #8 0x00000000005409bf in PyEval_EvalCode ()
>>>>>>>>>>> #9 0x000000000054a328 in ?? ()
>>>>>>>>>>> #10 0x00000000004ea1c6 in PyCFunction_Call ()
>>>>>>>>>>> #11 0x000000000053d353 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #12 0x000000000053fc97 in ?? ()
>>>>>>>>>>> #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #14 0x000000000053b294 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #15 0x000000000053b294 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #16 0x000000000053b294 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
>>>>>>>>>>> #18 0x00000000004ec2e3 in ?? ()
>>>>>>>>>>> #19 0x00000000005c20e7 in PyObject_Call ()
>>>>>>>>>>>
>>>>>>>>>>> I was using fresh DLAMI ubuntu 18.0 and CUDA 10.0, built with USE_CUDA=1,
>>>>>>>>>>> USE_CUDNN=1, the rest are default values.
>>>>>>>>>>>
>>>>>>>>>>> -Zhi
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Change to +1, I figured out that it was due to the dependencies. I still
>>>>>>>>> have an issue using DL base AMI with python3, but I will not regard it as a
>>>>>>>>> blocker to the 1.5 release.
>>>>>>>>> Tested Gluon-CV training and it works fine.
>>>>>>>>>
>>>>>>>>> -Zhi
