Thanks, Leonard. Naively dividing by test files would certainly be an easy
and feasible first step before going into proper nose parallelization. Great idea!

Scalability in terms of nodes is not an issue. Our system can handle at
least 600 slaves (didn't want to go higher for obvious reasons). But I
think we don't even have to go that far because most of the time, our
machines are heavily underutilized due to the single-threaded nature of
most tests. Thus, parallel test execution on the same machine would already
speed up the process considerably.
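
As a rough illustration of dividing by test files on a single machine, here
is a minimal sketch. The helper names (shard_files, run_shard, run_parallel)
are hypothetical, and launching nose via "python -m nose" is an assumption
about how the tests are started, not how our CI currently does it.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Assumed default launcher; swap in whatever command actually starts the tests.
DEFAULT_RUNNER = (sys.executable, "-m", "nose")


def shard_files(test_files, num_shards):
    """Distribute test files round-robin over at most num_shards groups."""
    files = sorted(test_files)
    shards = [files[i::num_shards] for i in range(num_shards)]
    # Drop empty groups when there are fewer files than shards.
    return [shard for shard in shards if shard]


def run_shard(shard, runner=DEFAULT_RUNNER):
    """Run one group of test files in its own process; return the exit code."""
    return subprocess.call(list(runner) + list(shard))


def run_parallel(test_files, num_shards, runner=DEFAULT_RUNNER):
    """Run all shards concurrently; True only if every shard exits with 0."""
    shards = shard_files(test_files, num_shards)
    if not shards:
        return True
    # Threads are enough here: each one only waits on a child process.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        codes = list(pool.map(lambda shard: run_shard(shard, runner), shards))
    return all(code == 0 for code in codes)
```

Round-robin assignment ignores how long each file takes, so one slow file can
still dominate a shard; balancing by recorded durations would be a natural
refinement. Tests that share resources (a GPU, fixed ports, temp files) may
also need isolation before they can safely run side by side.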

-Marco

P.S. The structure of the Jenkinsfiles seems pretty familiar :P I am glad
my approach is considered helpful :)

Leonard Lausen <[email protected]> wrote on Thu, Aug 15, 2019, 18:59:

> To parallelize across machines: For GluonNLP we started submitting test
> jobs to AWS Batch. Just adding a for-loop over the units in the
> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
> Jenkins just waits for all jobs to finish and retrieves their status.
> This works since AWS Batch added GPU support this April [3].
>
> For MXNet, naively parallelizing over the files defining the test cases
> that are in the longest running Pipeline stage may already help?
>
> [1]:
> https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
> [2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
>
> Marco de Abreu <[email protected]> writes:
>
> > A first step wrt parallelization could certainly be to start adding
> > parallel test execution in nosetests.
> >
> > -Marco
> >
> > Aaron Markham <[email protected]> wrote on Thu, Aug 15, 2019, 05:39:
> >
> >> The PRs Thomas and I are working on for the new docs and website share the
> >> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> >>
> >> On Wed, Aug 14, 2019, 18:16 Chris Olivier <[email protected]> wrote:
> >>
> >> > I see it done daily now, and while I can’t share all the details, it’s
> >> not
> >> > an incredibly complex thing, and involves not much more than nfs/efs
> >> > sharing and remote ssh commands.  All it takes is a little ingenuity
> and
> >> > some imagination.
> >> >
> >> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> >> [email protected]
> >> > >
> >> > wrote:
> >> >
> >> > > Sounds good in theory. I think there are complex details with regard to
> >> > > resource sharing during parallel execution. Still, I think both ways can be
> >> > > explored. I think some tests run for unreasonably long times for what they
> >> > > are doing. We already scale parts of the pipeline horizontally across
> >> > > workers.
> >> > >
> >> > >
> >> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> [email protected]>
> >> > > wrote:
> >> > >
> >> > > > +1
> >> > > >
> >> > > > Rather than remove tests (which doesn’t scale as a solution), why
> not
> >> > > scale
> >> > > > them horizontally so that they finish more quickly? Across
> processes
> >> or
> >> > > > even on a pool of machines that aren’t necessarily the build
> machine?
> >> > > >
> >> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> >> > [email protected]
> >> > > >
> >> > > > wrote:
> >> > > >
> >> > > > > With regards to time I rather prefer us spending a bit more
> time on
> >> > > > > maintenance than somebody running into an error that could've
> been
> >> > > caught
> >> > > > > with a test.
> >> > > > >
> >> > > > > I mean, our Publishing pipeline for Scala GPU has been broken
> for
> >> > quite
> >> > > > > some time now, but nobody noticed that. Basically my stance on
> that
> >> > > > matter
> >> > > > > is that as soon as something is not blocking, you can also just
> >> > > > deactivate
> >> > > > > it since you don't have a forcing function in an open source
> >> project.
> >> > > > > People will rarely come back and fix the errors of some nightly
> >> test
> >> > > that
> >> > > > > they introduced.
> >> > > > >
> >> > > > > -Marco
> >> > > > >
> >> > > > > Carin Meier <[email protected]> wrote on Wed, Aug 14, 2019, 21:59:
> >> > > > >
> >> > > > > > If a language binding test is failing for an unimportant reason, then it
> >> > > > > > is too brittle and needs to be fixed (we have fixed some of these with the
> >> > > > > > Clojure package [1]).
> >> > > > > > But in general, if we think of the MXNet project as one project that is
> >> > > > > > across all the language bindings, then we want to know if some fundamental
> >> > > > > > code change is going to break a downstream package.
> >> > > > > > I can't speak for all the high level package binding
> maintainers,
> >> > but
> >> > > > I'm
> >> > > > > > always happy to pitch in to provide code fixes to help the
> base
> >> PR
> >> > > get
> >> > > > > > green.
> >> > > > > >
> >> > > > > > The time costs to maintain such a large CI project obviously need to be
> >> > > > > > considered as well.
> >> > > > > >
> >> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> >> > > > > >
> >> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> >> > > > > [email protected]
> >> > > > > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > From what I have seen Clojure is 15 minutes, which I think
> is
> >> > > > > reasonable.
> >> > > > > > > The only question is that when a binding such as R, Perl or
> >> > Clojure
> >> > > > > > fails,
> >> > > > > > > some devs are a bit confused about how to fix them since
> they
> >> are
> >> > > not
> >> > > > > > > familiar with the testing tools and the language.
> >> > > > > > >
> >> > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> >> > [email protected]
> >> > > >
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Great idea Marco! Anything that you think would be
> valuable
> >> to
> >> > > > share
> >> > > > > > > would
> >> > > > > > > > be good. The duration of each node in the test stage
> sounds
> >> > like
> >> > > a
> >> > > > > good
> >> > > > > > > > start.
> >> > > > > > > >
> >> > > > > > > > - Carin
> >> > > > > > > >
> >> > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> >> > > > > > [email protected]>
> >> > > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi,
> >> > > > > > > > >
> >> > > > > > > > > we record a bunch of metrics about run statistics (down
> to
> >> > the
> >> > > > > > duration
> >> > > > > > > > of
> >> > > > > > > > > every individual step). If you tell me which ones you're
> >> > > > > particularly
> >> > > > > > > > > interested in (probably total duration of each node in
> the
> >> > test
> >> > > > > > stage),
> >> > > > > > > > I'm
> >> > > > > > > > > happy to provide them.
> >> > > > > > > > >
> >> > > > > > > > > Dimensions are (in hierarchical order):
> >> > > > > > > > > - job
> >> > > > > > > > > - branch
> >> > > > > > > > > - stage
> >> > > > > > > > > - node
> >> > > > > > > > > - step
> >> > > > > > > > >
> >> > > > > > > > > Unfortunately I don't have the possibility to export
> them
> >> > since
> >> > > > we
> >> > > > > > > store
> >> > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
> >> > > exports.
> >> > > > > > > > >
> >> > > > > > > > > Best regards,
> >> > > > > > > > > Marco
> >> > > > > > > > >
> >> > > > > > > > > Carin Meier <[email protected]> wrote on Wed, Aug 14, 2019, 19:43:
> >> > > > > > > > >
> >> > > > > > > > > > I would prefer to keep the language binding in the PR
> >> > > process.
> >> > > > > > > Perhaps
> >> > > > > > > > we
> >> > > > > > > > > > could do some analytics to see how much each of the
> >> > language
> >> > > > > > bindings
> >> > > > > > > > is
> >> > > > > > > > > > contributing to overall run time.
> >> > > > > > > > > > If we have some metrics on that, maybe we can come up
> >> with
> >> > a
> >> > > > > > > guideline
> >> > > > > > > > of
> >> > > > > > > > > > how much time each should take. Another possibility is to leverage the
> >> > > > > > > > > > parallel builds more.
> >> > > > > > > > > >
> >> > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> >> > > > > > > > > [email protected]
> >> > > > > > > > > > >
> >> > > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hi Carin.
> >> > > > > > > > > > >
> >> > > > > > > > > > > That's a good point, all things considered would
> your
> >> > > > > preference
> >> > > > > > be
> >> > > > > > > > to
> >> > > > > > > > > > keep
> >> > > > > > > > > > > the Clojure tests as part of the PR process or in
> >> > Nightly?
> >> > > > > > > > > > > Some options are having notifications here or in Slack. But if we think
> >> > > > > > > > > > > breakages would go unnoticed, maybe it is not a good idea to fully remove
> >> > > > > > > > > > > bindings from the PR process and just streamline the process.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Pedro.
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> >> > > > > > [email protected]>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Before any binding tests are moved to nightly, I
> >> think
> >> > we
> >> > > > > need
> >> > > > > > to
> >> > > > > > > > > > figure
> >> > > > > > > > > > > > out how the community can get proper
> notifications of
> >> > > > failure
> >> > > > > > and
> >> > > > > > > > > > success
> >> > > > > > > > > > > > on those nightly runs. Otherwise, I think that
> >> > breakages
> >> > > > > would
> >> > > > > > go
> >> > > > > > > > > > > > unnoticed.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > -Carin
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> >> > > > > > > > > > > [email protected]
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Hi
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Seems we are hitting some problems in CI. I
> propose
> >> > the
> >> > > > > > > following
> >> > > > > > > > > > > action
> >> > > > > > > > > > > > > items to remedy the situation and accelerate
> turn
> >> > > around
> >> > > > > > times
> >> > > > > > > in
> >> > > > > > > > > CI,
> >> > > > > > > > > > > > > reduce cost, complexity and probability of
> failure
> >> > > > blocking
> >> > > > > > PRs
> >> > > > > > > > and
> >> > > > > > > > > > > > > frustrating developers:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > * Upgrade Windows Visual Studio from VS 2015 to VS 2017. The
> >> > > > > > > > > > > > > build_windows.py infrastructure should easily work with the new version.
> >> > > > > > > > > > > > > Currently some PRs are blocked by this:
> >> > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> >> > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly.
> Tracked at
> >> > > > > > > > > > > > >
> >> > https://github.com/apache/incubator-mxnet/issues/15295
> >> > > > > > > > > > > > > * Move non-python bindings tests to nightly. If
> a
> >> > > commit
> >> > > > is
> >> > > > > > > > > touching
> >> > > > > > > > > > > > other
> >> > > > > > > > > > > > > bindings, the reviewer should ask for a full run
> >> > which
> >> > > > can
> >> > > > > be
> >> > > > > > > > done
> >> > > > > > > > > > > > locally,
> >> > > > > > > > > > > > > use the label bot to trigger a full CI build, or
> >> > defer
> >> > > to
> >> > > > > > > > nightly.
> >> > > > > > > > > > > > > * Provide a couple of basic sanity performance
> >> tests
> >> > on
> >> > > > > small
> >> > > > > > > > > models
> >> > > > > > > > > > > that
> >> > > > > > > > > > > > > are run on CI and can be echoed by the label bot
> >> as a
> >> > > > > comment
> >> > > > > > > for
> >> > > > > > > > > > PRs.
> >> > > > > > > > > > > > > * Address unit tests that take more than 10-20s,
> >> > > > streamline
> >> > > > > > > them
> >> > > > > > > > or
> >> > > > > > > > > > > move
> >> > > > > > > > > > > > > them to nightly if it can't be done.
> >> > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
> >> > scripts
> >> > > > so
> >> > > > > > the
> >> > > > > > > > > > > community
> >> > > > > > > > > > > > > can contribute.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I think our goal should be turnaround under
> 30min.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I would also like to touch base with the
> community
> >> > that
> >> > > > > some
> >> > > > > > > PRs
> >> > > > > > > > > are
> >> > > > > > > > > > > not
> >> > > > > > > > > > > > > being followed up by committers asking for
> changes.
> >> > For
> >> > > > > > example
> >> > > > > > > > > this
> >> > > > > > > > > > PR
> >> > > > > > > > > > > > is
> >> > > > > > > > > > > > > important and has been hanging for a long time.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> https://github.com/apache/incubator-mxnet/pull/15051
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > This is another, less important but more
> trivial to
> >> > > > review:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> https://github.com/apache/incubator-mxnet/pull/14940
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I think committers requesting changes and not following up in reasonable
> >> > > > > > > > > > > > > time is not healthy for the project. I suggest configuring GitHub
> >> > > > > > > > > > > > > notifications for a good SNR and following up.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Regards.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Pedro.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
