Hello MXNet community,

The Windows GPU pipeline on CI is working again. To fix it, we updated the AMI used by the tests and preinstalled VS2019, which uses a 64-bit toolchain and resolves the linker OOM error. Earlier attempts to use the VS2017 64-bit toolchain had failed, which made the change to the AMI necessary.
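For anyone reproducing a 64-bit setup locally, the idea can be sketched as below. This is a minimal illustration and not the CI configuration: the helper name `cmake_configure_command` and the directory names are hypothetical, and it assumes CMake 3.13 or later with the "Visual Studio 16 2019" generator. `-A x64` selects the target platform, and `-T host=x64` asks for the 64-bit compiler and linker binaries, which are not limited by the 32-bit address space.

```python
from typing import List


def cmake_configure_command(source_dir: str,
                            build_dir: str,
                            generator: str = "Visual Studio 16 2019",
                            host_arch: str = "x64") -> List[str]:
    """Build (but do not run) a CMake configure command line.

    Hypothetical helper for illustration only; the real CI scripts may
    look quite different.
    """
    return [
        "cmake",
        "-S", source_dir,           # source tree (CMake >= 3.13)
        "-B", build_dir,            # build tree (CMake >= 3.13)
        "-G", generator,            # Visual Studio generator name
        "-A", "x64",                # target platform: 64-bit binaries
        "-T", f"host={host_arch}",  # architecture of the compiler/linker tools
    ]
```

The returned list can be passed to `subprocess.run(...)` on a machine that has CMake and Visual Studio installed; building the list separately keeps the sketch testable without either.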
VS2019 only works with CUDA 10, so we also preinstalled CUDA 10.2 on the AMI. As the Windows GPU build was the only build testing CUDA 9, we currently have no tests with CUDA 9. We will start a separate discussion thread to decide whether to add CUDA 9 tests back on one of the Unix platforms or to drop CUDA 9 support in MXNet 2.

The AMI was also updated with the ninja build tool and a recent version of cmake, which helps speed up the build further. Previously, cmake was installed as part of every CI run because the version on the AMI had been outdated for some time.

All these updates were built manually on top of the existing AMI. The team is continuing to work on updating the automated AMI build process to include the updated toolchain. Unfortunately, the automation scripts for the recent VS2019 do not work on the Microsoft Windows Server 2016 version used so far, so we intend to switch to Windows Server 2019. The new AMI will be tested in a CI dev environment, and we will notify you when it is ready for adoption on the CI.

https://github.com/apache/incubator-mxnet/pull/17962 contains the changes to the master branch required to make use of the updated toolchain. On your side, all you have to do is rebase your work on the latest master to pick up the changes. For the Git commands to rebase on master, see <https://gist.github.com/ChaiBapchya/2c52bce4b3d52ab03ccbc875a49996df>.

Thanks to Joe, Leo, Marco, Ningyuan, Sandeep, Sheng and Zhi for their help, support and guidance.

Thanks once again to the community for your patience. Apologies for the inconvenience caused.

Regards,
Chai,
on behalf of the MXNet CI team

On Wed, 1 Apr 2020 at 20:38, Chaitanya Bapat wrote:

> Hello MXNet Community,
>
> Since a week, CI is blocked due to Windows-GPU failure.
> PR to fix it is still WIP :
> https://github.com/apache/incubator-mxnet/pull/17808
>
> This updates the toolchain from 32bit to 64bit [to resolve the 2GB memory
> linker error currently facing CI]
> Along with host of other updates that are long time coming -
> [VSCode2019,opencv,cudnn,etc]
> We have 2 pending issues:
> 1. cuda segfault in Py3 Windows GPU test
> OSError: exception: access violation writing 0x0000000000000000
>
> 2. Jenkins Channel Connection
> "hudson.remoting.ChannelClosedException: Channel
> "hudson.remoting.Channel@5cca06e6:JNLP4-connect connection from [...]
> failed. The channel is closing down or has closed down"
>
> We are hard at work to unblock the CI & get the PR fix merged.
>
> Since we want to focus on fixing the windows-gpu issue and avoid
> complicating the situation further, we are not disabling the windows-gpu
> build as of now. As a backup plan, we will disable the windows-gpu builds
> by 4/5 Sunday EOD if things don’t recover by then.
>
> Thanks for the continued patience.
> Chai,
> on behalf of the MXNet CI team
>
> On Thu, 26 Mar 2020 at 21:16, Chaitanya Bapat wrote:
>
>> Hello MXNet community,
>>
>> It’s been over 3 days now that windows-gpu builds are failing on CI.
>> The team (me, Leo, Ningyuan, Joe, Pedro) are at work trying to identify
>> the root-cause and fix.
>>
>> Issue: Linker is running OOM due to 32bit toolchain not able to address
>> the available memory of the machine.
>>
>> Multiple attempts have been made (albeit with limited success)
>> 1. Reduce the number of builds per worker (for window-cpu node) from 3 to 1
>> 2. Updated the toolchain from 32bit to 64bit (as pointed out by multiple
>> people)
>> PR : https://github.com/apache/incubator-mxnet/pull/17916
>> [related to Leo’s PR :
>> https://github.com/apache/incubator-mxnet/pull/17912]
>>
>> Road to unblock:
>> Updated AMI coupled with toolchain should possibly help
>> Ningyuan has an updated AMI for windows (PR :
>> https://github.com/apache/incubator-mxnet/pull/17808) - vs2019,
>> cuda10.2, cmake fixes etc.
>>
>> We will get it deployed by tomorrow and update the status accordingly.
>>
>> Thanks for the patience. Apologies for the inconvenience caused.
>> Thank you 🙏
>> Chai,
>> on behalf of the MXNet CI team

--
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*
<https://github.com/ChaiBapchya>
<https://www.facebook.com/chaibapchya>
<https://twitter.com/ChaiBapchya>
<https://www.linkedin.com//in/chaibapchya/>
