Re: Update : CI windows-gpu Failure

Marco de Abreu Tue, 07 Apr 2020 02:53:26 -0700

Done:  https://issues.apache.org/jira/browse/INFRA-20085


Since you didn't provide the required events, I took an educated guess.
Next time, please include the events.

-Marco

On Tue, Apr 7, 2020 at 6:57 AM Chaitanya Bapat <[email protected]> wrote:

> Hello MXNet community,
>
> > The new AMI will be tested in a CI Dev environment and we will notify
> > you when it's ready for adoption on the CI.
>
> As we continue to improve and update the CI, I want to notify the
> community about the steps being taken to test in the CI Dev environment.
>
> To ensure the updated AMI is error-free, we plan to test it in the CI Dev
> environment for at least 1000 builds. To reach that many builds, we want
> to route GitHub PR events to the CI Dev environment. In addition to
> testing the AMI, there is a proposal to migrate instances from p3/g3
> [unix-gpu/centos-gpu/windows-gpu] to G4 for cost and speed reasons.
> That migration needs to be tested, and data gathered, before it is
> carried out.
>
> To that end, I request that Marco file a ticket with Apache Infra for
> another GitHub webhook. This webhook in the apache/incubator-mxnet
> repository will point to the Jenkins CI Dev server [used for A/B testing].
>
> Once testing succeeds, I will notify the community of the proposed
> migrations to make CI faster, better and less error-prone.
>
> Thank you,
> Chai,
> on behalf of MXNet CI team.
>
>
>
> On Fri, 3 Apr 2020 at 21:30, sandeep krishnamurthy <
> [email protected]> wrote:
>
> > Thanks a lot Chai, Joe, Leo, Marco, Ningyuan, Sheng, Zhi for pulling many
> > all-nighters to fix this issue.
> > Thanks for pushing further to automate the AMI building and for Leo's
> > proposal to update the toolchain. That should stabilize the CI for the
> > project.
> >
> > Best,
> > Sandeep
> >
> > On Fri, Apr 3, 2020 at 8:50 PM Chaitanya Bapat <[email protected]>
> > wrote:
> >
> > > Hello MXNet community,
> > >
> > > The Windows GPU pipeline on CI works again. To fix it, we updated the
> > > AMI used by the tests and preinstalled VS2019, which uses a 64-bit
> > > toolchain and resolves the OOM error. Prior attempts using the VS2017
> > > 64-bit toolchain had failed, making the change to the AMI necessary.
> > >
> > > VS2019 only works with CUDA 10, so we also preinstalled CUDA 10.2 on
> > > the AMI. As the Windows GPU build was the only build testing CUDA 9, we
> > > now do not have any tests with CUDA 9. We will start a separate
> > > discussion thread to decide whether to add back CUDA 9 tests on one of
> > > the Unix platforms or to drop CUDA 9 support in MXNet 2. The AMI was
> > > further updated with the ninja build tool and a recent version of
> > > cmake, which speeds up the build further. Previously cmake was
> > > installed as part of every CI run, as the version on the AMI had been
> > > outdated for some time.
> > >
> > > All these updates were manually built on top of the existing AMI. The
> > > team is continuing to work on updating the automated AMI building
> > > process to include the updated toolchain. Unfortunately the automation
> > > scripts for the recent VS2019 do not work on the Microsoft Windows
> > > Server 2016 version used so far, so we intend to switch to Microsoft
> > > Windows Server 2019. The new AMI will be tested in a CI Dev environment
> > > and we will notify you when it's ready for adoption on the CI.
> > >
> > > https://github.com/apache/incubator-mxnet/pull/17962 contains the
> > > changes to the master branch required to make use of the updated
> > > toolchain.
> > >
> > > On your side, all you have to do is rebase your branch onto master to
> > > pick up the changes. The Git commands for rebasing are here:
> > > <https://gist.github.com/ChaiBapchya/2c52bce4b3d52ab03ccbc875a49996df>.
> > >
> > > Thanks to Joe, Leo, Marco, Ningyuan, Sandeep, Sheng, Zhi for their
> > > help, support & guidance.
> > >
> > > Thanks once again to the community for your patience. Apologies for the
> > > inconvenience caused.
> > > Regards,
> > > Chai,
> > > on behalf of MXNet CI team
> > >
> > > On Wed, 1 Apr 2020 at 20:38, Chaitanya Bapat <[email protected]>
> > > wrote:
> > >
> > > > Hello MXNet Community,
> > > >
> > > > CI has been blocked for a week due to the Windows-GPU failure.
> > > > The PR to fix it is still WIP:
> > > > https://github.com/apache/incubator-mxnet/pull/17808
> > > >
> > > > It updates the toolchain from 32-bit to 64-bit [to resolve the 2GB
> > > > memory linker error currently facing CI], along with a host of other
> > > > long-overdue updates [VS2019, opencv, cudnn, etc.].
> > > > We have 2 pending issues:
> > > > 1. cuda segfault in Py3 Windows GPU test
> > > > OSError: exception: access violation writing 0x0000000000000000
> > > >
> > > > 2. Jenkins Channel Connection
> > > > "hudson.remoting.ChannelClosedException: Channel
> > > > "hudson.remoting.Channel@5cca06e6:JNLP4-connect connection from [...]
> > > > failed. The channel is closing down or has closed down"
> > > >
> > > > We are hard at work to unblock the CI & get the PR fix merged.
> > > >
> > > > Since we want to focus on fixing the windows-gpu issue and avoid
> > > > complicating the situation further, we are not disabling the
> > > > windows-gpu build as of now. As a backup plan, we will disable the
> > > > windows-gpu builds by EOD Sunday, 4/5, if things don’t recover by then.
> > > >
> > > > Thanks for the continued patience.
> > > > Chai,
> > > > on behalf of the MXNet CI team
> > > >
> > > >
> > > >
> > > > On Thu, 26 Mar 2020 at 21:16, Chaitanya Bapat <[email protected]>
> > > > wrote:
> > > >
> > > >> Hello MXNet community,
> > > >>
> > > > >> It’s been over 3 days now that windows-gpu builds are failing on CI.
> > > > >> The team (me, Leo, Ningyuan, Joe, Pedro) is at work trying to
> > > > >> identify the root cause and fix it.
> > > > >>
> > > > >> Issue: The linker is running OOM because the 32-bit toolchain cannot
> > > > >> address the available memory of the machine.
> > > > >>
> > > > >> Multiple attempts have been made (albeit with limited success):
> > > > >> 1. Reduced the number of builds per worker (for the windows-cpu
> > > > >> node) from 3 to 1
> > > > >> 2. Updated the toolchain from 32-bit to 64-bit (as pointed out by
> > > > >> multiple people)
> > > > >> PR: https://github.com/apache/incubator-mxnet/pull/17916
> > > > >> (related to Leo’s PR:
> > > > >> https://github.com/apache/incubator-mxnet/pull/17912)
> > > > >>
> > > > >> Road to unblock:
> > > > >> The updated AMI coupled with the toolchain update should help.
> > > > >> Ningyuan has an updated AMI for Windows (PR:
> > > > >> https://github.com/apache/incubator-mxnet/pull/17808) - vs2019,
> > > > >> cuda10.2, cmake fixes etc.
> > > > >>
> > > > >> We will get it deployed by tomorrow and update the status accordingly.
> > > >> Thanks for the patience. Apologies for the inconvenience caused.
> > > >> Thank you 🙏
> > > >> Chai,
> > > >> on behalf of the MXNet CI team
> > > >>
> > > >> --
> > > >> *Chaitanya Prakash Bapat*
> > > >> *+1 (973) 953-6299*
> > > >>
> > > > >> <https://github.com/ChaiBapchya>
> > > > >> <https://www.facebook.com/chaibapchya>
> > > > >> <https://twitter.com/ChaiBapchya>
> > > > >> <https://www.linkedin.com//in/chaibapchya/>
> > > >>
> > > >
> > > >
> > > > --
> > > > *Chaitanya Prakash Bapat*
> > > > *+1 (973) 953-6299*
> > > >
> > > > [image: https://www.linkedin.com//in/chaibapat25]
> > > > <https://github.com/ChaiBapchya>[image:
> > > > https://www.facebook.com/chaibapat] <
> > > https://www.facebook.com/chaibapchya>[image:
> > > > https://twitter.com/ChaiBapchya] <https://twitter.com/ChaiBapchya
> > > >[image:
> > > > https://www.linkedin.com//in/chaibapat25]
> > > > <https://www.linkedin.com//in/chaibapchya/>
> > > >
> > >
> > >
> > > --
> > > *Chaitanya Prakash Bapat*
> > > *+1 (973) 953-6299*
> > >
> > > [image: https://www.linkedin.com//in/chaibapat25]
> > > <https://github.com/ChaiBapchya>[image:
> > https://www.facebook.com/chaibapat
> > > ]
> > > <https://www.facebook.com/chaibapchya>[image:
> > > https://twitter.com/ChaiBapchya] <https://twitter.com/ChaiBapchya
> > >[image:
> > > https://www.linkedin.com//in/chaibapat25]
> > > <https://www.linkedin.com//in/chaibapchya/>
> > >
> >
> >
> > --
> > Sandeep Krishnamurthy
> >
>
>
>
