Re: Update : CI windows-gpu Failure

Chaitanya Bapat Fri, 03 Apr 2020 20:51:24 -0700

Hello MXNet community,

The Windows GPU pipeline on CI is working again. To fix it, we updated the AMI
used by the tests and preinstalled VS2019, which uses a 64-bit toolchain and
resolves the OOM error. Earlier attempts to use the VS2017 64-bit toolchain
had failed, which made the change to the AMI necessary.
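
For anyone wondering why a 32-bit linker can run out of memory on a machine
with plenty of RAM, here is a rough back-of-the-envelope sketch. It assumes the
default Windows split, where a 32-bit process gets half of its 4 GiB address
space for user-mode code; the helper name is made up for illustration and the
64-bit figure is the theoretical pointer range, not what Windows actually
grants a process.

```python
GIB = 2 ** 30  # bytes in one GiB


def addressable_gib(pointer_bits: int) -> float:
    """Size in GiB of the address space reachable with pointers of this width."""
    return (2 ** pointer_bits) / GIB


# Full 32-bit address space.
total_32 = addressable_gib(32)

# Assumption: default Windows layout, half of the 32-bit space is user-mode.
user_32 = total_32 / 2

# Theoretical 64-bit pointer range (real per-process limits are lower).
total_64 = addressable_gib(64)

print(f"32-bit total address space: {total_32:.0f} GiB")
print(f"32-bit user-mode space:     {user_32:.0f} GiB")
print(f"64-bit theoretical space:   {total_64:.0f} GiB")
```

The 2 GiB user-mode ceiling lines up with the 2GB linker memory error
mentioned earlier in this thread.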

VS2019 only works with CUDA 10, so we also preinstalled CUDA 10.2 on the AMI.
As the Windows GPU build was the only build testing CUDA 9, we currently have
no tests with CUDA 9. We will start a separate discussion thread to decide
whether to add CUDA 9 tests back on one of the Unix platforms or to drop
CUDA 9 support in MXNet 2.

The AMI was also updated with the ninja build tool and a recent version of
cmake, which helps speed up the build further. Previously, cmake was installed
as part of every CI run, because the version on the AMI had been outdated for
some time.
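
To make the ninja and cmake part concrete, here is a minimal sketch of what a
Ninja-based CMake invocation looks like when driven from Python. This is not
the CI script itself: the function names, the directory arguments and the
Release build type are assumptions for illustration, and the MXNet-specific
cmake options are left out on purpose. On Windows the commands would also need
to run from a shell where the 64-bit Visual Studio environment is set up.

```python
import shutil
import subprocess
from typing import List


def cmake_ninja_commands(source_dir: str, build_dir: str) -> List[List[str]]:
    """Return the configure and build command lines for a Ninja-based build."""
    configure = [
        "cmake",
        "-G", "Ninja",                  # use the Ninja generator
        "-DCMAKE_BUILD_TYPE=Release",   # assumption: release build
        "-S", source_dir,               # source tree (needs CMake >= 3.13)
        "-B", build_dir,                # out-of-source build directory
    ]
    build = ["cmake", "--build", build_dir]
    return [configure, build]


def run_build(source_dir: str, build_dir: str) -> None:
    """Check that the preinstalled tools are on PATH, then configure and build."""
    for tool in ("cmake", "ninja"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    for cmd in cmake_ninja_commands(source_dir, build_dir):
        subprocess.run(cmd, check=True)


# Show the commands without running them.
for cmd in cmake_ninja_commands(".", "build"):
    print(" ".join(cmd))
```

With both tools already on the AMI, the `shutil.which` check passes
immediately and no per-run cmake installation step is needed.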

All these updates were manually built on top of the existing AMI. The team is
continuing to work on updating the automated AMI building process to include
the updated toolchain. Unfortunately, the automation scripts for the recent
VS2019 do not work on the Microsoft Windows Server 2016 version used so far,
so we intend to switch to Microsoft Windows Server 2019. The new AMI will be
tested in a CI Dev environment, and we will notify you when it's ready for
adoption on the CI.

https://github.com/apache/incubator-mxnet/pull/17962 contains the changes to
the master branch required to make use of the updated toolchain.

On your side, all you have to do is rebase your branch on the latest master to
pick up the changes. The Git commands for rebasing on master are listed here:
<https://gist.github.com/ChaiBapchya/2c52bce4b3d52ab03ccbc875a49996df>.
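
For reference, a typical rebase boils down to a fetch followed by a rebase.
The sketch below is not a copy of the linked gist; it assumes the
apache/incubator-mxnet repository is configured as a remote named `upstream`
(adjust if yours is called something else), and the function names are made up
for illustration.

```python
import subprocess
from typing import List


def rebase_commands(remote: str = "upstream", branch: str = "master") -> List[List[str]]:
    """Return the git commands that rebase the current branch on remote/branch."""
    return [
        ["git", "fetch", remote],                 # get the latest commits
        ["git", "rebase", f"{remote}/{branch}"],  # replay local commits on top
    ]


def rebase_on_master(repo_dir: str, remote: str = "upstream") -> None:
    """Run the rebase in repo_dir; stops at the first failing command."""
    for cmd in rebase_commands(remote):
        subprocess.run(cmd, cwd=repo_dir, check=True)


# Show the commands without running them.
for cmd in rebase_commands():
    print(" ".join(cmd))
```

If the branch already backs an open PR, the rebased history usually has to be
pushed with `git push --force-with-lease` so that CI picks up the new base.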

Thanks to Joe, Leo, Marco, Ningyuan, Sandeep, Sheng and Zhi for their help,
support & guidance.

Thanks once again to the community for your patience. Apologies for the
inconvenience caused.

Regards,
Chai,
on behalf of MXNet CI team

On Wed, 1 Apr 2020 at 20:38, Chaitanya Bapat wrote:

> Hello MXNet Community,
>
> CI has been blocked for a week due to the Windows-GPU failure.
> The PR to fix it is still WIP:
> https://github.com/apache/incubator-mxnet/pull/17808
>
> It updates the toolchain from 32-bit to 64-bit (to resolve the 2GB linker
> memory error that CI is currently facing), along with a host of other
> long-overdue updates (VS2019, opencv, cudnn, etc.).
>
> We have 2 pending issues:
> 1. cuda segfault in Py3 Windows GPU test
> OSError: exception: access violation writing 0x0000000000000000
>
> 2. Jenkins Channel Connection
> "hudson.remoting.ChannelClosedException: Channel
> "hudson.remoting.Channel@5cca06e6:JNLP4-connect connection from [...]
> failed. The channel is closing down or has closed down"
>
> We are hard at work to unblock the CI & get the PR fix merged.
>
> Since we want to focus on fixing the windows-gpu issue and avoid
> complicating the situation further, we are not disabling the windows-gpu
> build as of now. As a backup plan, we will disable the windows-gpu builds
> by 4/5 Sunday EOD if things don’t recover by then.
>
> Thanks for the continued patience.
> Chai,
> on behalf of the MXNet CI team
>
>
>
> On Thu, 26 Mar 2020 at 21:16, Chaitanya Bapat wrote:
>
>> Hello MXNet community,
>>
>> The windows-gpu builds have been failing on CI for over 3 days now.
>> The team (me, Leo, Ningyuan, Joe, Pedro) is working to identify the root
>> cause and fix it.
>>
>> Issue: The linker is running OOM because the 32-bit toolchain cannot
>> address the available memory of the machine.
>>
>> Multiple attempts have been made (albeit with limited success):
>> 1. Reduced the number of builds per worker (for the windows-cpu node)
>> from 3 to 1
>> 2. Updated the toolchain from 32-bit to 64-bit (as pointed out by multiple
>> people)
>> PR: https://github.com/apache/incubator-mxnet/pull/17916
>> (related to Leo’s PR:
>> https://github.com/apache/incubator-mxnet/pull/17912)
>>
>> Road to unblock:
>> An updated AMI coupled with the updated toolchain should help.
>> Ningyuan has an updated AMI for Windows (PR:
>> https://github.com/apache/incubator-mxnet/pull/17808) with VS2019,
>> CUDA 10.2, cmake fixes, etc.
>>
>> We will get it deployed by tomorrow and update the status accordingly.
>>
>> Thanks for the patience. Apologies for the inconvenience caused.
>> Thank you 🙏
>> Chai,
>> on behalf of the MXNet CI team
>>
>>
>
>
>

-- 
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*
