Done: https://issues.apache.org/jira/browse/INFRA-20085
Since you didn't provide the required events, I took an educated guess.
Next time, please include the events.

-Marco

On Tue, Apr 7, 2020 at 6:57 AM Chaitanya Bapat <[email protected]> wrote:

> Hello MXNet community,
>
> > The new AMI will be tested in a CI Dev environment and we will notify
> > you when it's ready for adoption on the CI.
>
> As we continue to work on improving & updating the CI, I want to notify
> the community about steps taken for testing CI Dev environments.
>
> In order to ensure the updated AMI is error-free, we plan to test it in
> the CI Dev environment for at least 1000 builds. To test that many
> builds, we want to route GitHub PR events to the CI Dev environment. In
> addition to testing the AMI, there's a proposal to migrate instances from
> p3/g3 [unix-gpu/centos-gpu/windows-gpu] to G4 for cost and speed reasons.
> But it needs to be tested and data needs to be gathered before carrying
> out the migration.
>
> To that end, I request Marco to cut a ticket to Apache Infra for
> another GitHub webhook. This GitHub webhook in the apache/incubator-mxnet
> repository will point to the Jenkins CI Dev server [used for A/B testing].
>
> Once tested successfully, I will notify the community of proposed
> migrations to make CI faster, better and less error-prone.
>
> Thank you,
> Chai,
> on behalf of MXNet CI team.
>
> On Fri, 3 Apr 2020 at 21:30, sandeep krishnamurthy <[email protected]> wrote:
>
> > Thanks a lot Chai, Joe, Leo, Marco, Ningyuan, Sheng, Zhi for pulling
> > many all-nighters to fix this issue.
> > Thanks for pushing further to automate the AMI building and Leo's
> > proposal to update the tool chain. That should stabilize the CI for
> > the project.
> >
> > Best,
> > Sandeep
> >
> > On Fri, Apr 3, 2020 at 8:50 PM Chaitanya Bapat <[email protected]> wrote:
> >
> > > Hello MXNet community,
> > >
> > > The Windows GPU pipeline on CI works again now.
> > > To fix it, we updated the AMI used by the tests and preinstalled
> > > VS2019, which uses a 64-bit toolchain and resolves the OOM error.
> > > Prior attempts using the VS2017 64-bit toolchain had failed, making
> > > the change to the AMI necessary.
> > >
> > > VS2019 only works with CUDA 10, and we thus also preinstalled CUDA
> > > 10.2 on the AMI. As the Windows GPU build was the only build testing
> > > CUDA 9, we now do not have any tests with CUDA 9. We will start a
> > > separate discussion thread to decide whether to add back CUDA 9 tests
> > > on one of the Unix platforms or to drop CUDA 9 support in MXNet 2.
> > > The AMI was further updated with the ninja build tool and a recent
> > > version of cmake, helping us to speed up the build further.
> > > Previously cmake was installed as part of every CI run, as the
> > > version on the AMI has been too outdated for some time already.
> > >
> > > All these updates were manually built on top of the existing AMI. The
> > > team is continuing to work on updating the automated AMI building
> > > process to include the updated toolchain. Unfortunately the
> > > automation scripts for the recent VS2019 do not work on the Microsoft
> > > Server 2016 version used so far, and we thus intend to switch to
> > > Microsoft Server 2019. The new AMI will be tested in a CI Dev
> > > environment and we will notify you when it's ready for adoption on
> > > the CI.
> > >
> > > https://github.com/apache/incubator-mxnet/pull/17962 contains the
> > > changes to the master branch required to make use of the updated
> > > toolchain.
> > >
> > > On your side, all you have to do is rebase onto master to pick up the
> > > changes. Refer to the Git commands to rebase master here
> > > <https://gist.github.com/ChaiBapchya/2c52bce4b3d52ab03ccbc875a49996df>.
> > >
> > > Thanks to Joe, Leo, Marco, Ningyuan, Sandeep, Sheng, Zhi for help,
> > > support & guidance.
> > >
> > > Thanks once again to the community for patience. Apologies for
> > > inconvenience caused.
> > > Regards,
> > > Chai,
> > > on behalf of MXNet CI team
> > >
> > > On Wed, 1 Apr 2020 at 20:38, Chaitanya Bapat <[email protected]> wrote:
> > >
> > > > Hello MXNet Community,
> > > >
> > > > CI has been blocked for a week due to the Windows-GPU failure.
> > > > The PR to fix it is still WIP:
> > > > https://github.com/apache/incubator-mxnet/pull/17808
> > > >
> > > > This updates the toolchain from 32-bit to 64-bit [to resolve the
> > > > 2GB memory linker error CI is currently facing], along with a host
> > > > of other long-overdue updates [VS2019, opencv, cudnn, etc.].
> > > > We have 2 pending issues:
> > > > 1. cuda segfault in Py3 Windows GPU test
> > > > OSError: exception: access violation writing 0x0000000000000000
> > > >
> > > > 2. Jenkins Channel Connection
> > > > "hudson.remoting.ChannelClosedException: Channel
> > > > "hudson.remoting.Channel@5cca06e6:JNLP4-connect connection from [...]
> > > > failed. The channel is closing down or has closed down"
> > > >
> > > > We are hard at work to unblock the CI & get the PR fix merged.
> > > >
> > > > Since we want to focus on fixing the windows-gpu issue and avoid
> > > > complicating the situation further, we are not disabling the
> > > > windows-gpu build as of now. As a backup plan, we will disable the
> > > > windows-gpu builds by 4/5 Sunday EOD if things don’t recover by then.
> > > >
> > > > Thanks for the continued patience.
> > > > Chai,
> > > > on behalf of the MXNet CI team
> > > >
> > > > On Thu, 26 Mar 2020 at 21:16, Chaitanya Bapat <[email protected]> wrote:
> > > >
> > > >> Hello MXNet community,
> > > >>
> > > >> It’s been over 3 days now that windows-gpu builds are failing on CI.
> > > >> The team (me, Leo, Ningyuan, Joe, Pedro) is at work trying to
> > > >> identify the root cause and fix it.
> > > >>
> > > >> Issue: The linker is running OOM because the 32-bit toolchain
> > > >> cannot address the available memory of the machine.
> > > >>
> > > >> Multiple attempts have been made (albeit with limited success):
> > > >> 1. Reduced the number of builds per worker (for window-cpu node)
> > > >> from 3 to 1
> > > >> 2. Updated the toolchain from 32-bit to 64-bit (as pointed out by
> > > >> multiple people)
> > > >> PR: https://github.com/apache/incubator-mxnet/pull/17916
> > > >> [related to Leo’s PR:
> > > >> https://github.com/apache/incubator-mxnet/pull/17912]
> > > >>
> > > >> Road to unblock:
> > > >> An updated AMI coupled with the updated toolchain may help.
> > > >> Ningyuan has an updated AMI for Windows (PR:
> > > >> https://github.com/apache/incubator-mxnet/pull/17808) - vs2019,
> > > >> cuda10.2, cmake fixes etc.
> > > >>
> > > >> We will get it deployed by tomorrow and update the status
> > > >> accordingly.
> > > >>
> > > >> Thanks for the patience. Apologies for the inconvenience caused.
> > > >> Thank you 🙏
> > > >> Chai,
> > > >> on behalf of the MXNet CI team
> > > >>
> > > >> --
> > > >> *Chaitanya Prakash Bapat*
> > > >> *+1 (973) 953-6299*
> > > >>
> > > >> <https://github.com/ChaiBapchya>
> > > >> <https://www.facebook.com/chaibapchya>
> > > >> <https://twitter.com/ChaiBapchya>
> > > >> <https://www.linkedin.com//in/chaibapchya/>
> >
> > --
> > Sandeep Krishnamurthy
