Issue with marking replicas down at startup

2023-10-05 Thread Vincent Primault
Hello,

I have been looking at a previous investigation we had into an unexpected
behaviour where a node was taking traffic for a replica that was not ready
to take it. This seems to happen when the node is marked as live and the
replica is marked as active while the corresponding core has not yet been
loaded on the node.

I looked at the code, and in theory this should not happen, since
ZkController#init does the following: mark the node as down, wait for
replicas to be marked as down, and then register the node as live. However,
after looking at the code of publishAndWaitForDownStates, I observed that
we wait for down states only for replicas associated with cores returned by
CoreContainer#getCoreDescriptors... which is empty at this point, since
ZkController#init is called before cores are discovered (which happens
later, in CoreContainer#load).

It therefore seems to me that we basically never wait for any replicas to
be marked as down; we continue the startup sequence by marking the node as
live, and so *might* take traffic for a short period of time for a replica
that is not ready (e.g., if the node previously crashed and the replica
stayed active).
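To make the ordering concrete, here is a tiny toy model of what I believe
happens (hypothetical class and method names, not the actual Solr code):
because the wait keys off core descriptors that have not been discovered
yet, the wait set is empty and the node goes live immediately.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the startup ordering described above (NOT real Solr code).
public class StartupOrderSketch {
    final List<String> coreDescriptors = new ArrayList<>(); // empty at init time
    boolean live = false;

    // Stand-in for publishAndWaitForDownStates: it can only wait for
    // replicas tied to cores that have already been discovered.
    int publishAndWaitForDownStates() {
        return coreDescriptors.size(); // 0 here -> nothing is actually awaited
    }

    void init() {
        int awaited = publishAndWaitForDownStates(); // waits for 0 replicas
        live = true;                                  // node goes live regardless
        coreDescriptors.add("collection1_shard1_replica_n1"); // discovery happens later
        System.out.println("awaited=" + awaited + ", live=" + live);
    }

    public static void main(String[] args) {
        new StartupOrderSketch().init(); // prints awaited=0, live=true
    }
}
```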

As I am new to investigating this kind of issue in SolrCloud, I want to
share my findings and get feedback on whether my analysis is correct (in
which case I'd be happy to contribute a bug fix) or whether I am missing
something.

Thank you,

Vincent Primault.


Re: 8.11.3 release

2023-10-05 Thread Pierre Salagnac
I just opened a pull request.
https://github.com/apache/lucene-solr/pull/2679
Details are in the PR.

Thanks !

On Mon, 2 Oct 2023 at 20:25, Ishan Chattopadhyaya wrote:

> Thanks Pierre, I'll have it included in 8.11 once you are able to have a PR
> for this. Thanks!
>
> On Mon, 2 Oct 2023 at 22:22, Pierre Salagnac 
> wrote:
>
> > Hi Ishan,
> > Sorry for the late chime in.
> >
> > Some time ago I filed a Jira for a Solr 8 specific bug:
> > https://issues.apache.org/jira/browse/SOLR-16843
> >
> > At that time, I wasn't expecting more 8.x releases, so I did not open a
> > PR for it.
> > I can work on a fix if we have a few more days before the release. I
> > think it is worth having it in Solr 8 (that's not a backport).
> >
> > Thanks
> >
> >
> > On Fri, 29 Sept 2023 at 23:24, Ishan Chattopadhyaya <
> > ichattopadhy...@gmail.com> wrote:
> >
> > > I'm going to track items from 9x releases in the following sheet:
> > >
> > > https://docs.google.com/spreadsheets/d/1FADkjDS0yfXpaUi3fbwsa7Izy5HOxE9AR_NKiss1sIo/edit#gid=0
> > >
> > > Please let me know if there's any item there that you think would be
> > > useful to backport (if easy) to 8.11, and whether you want me to take
> > > a look at backporting it.
> > > Regards,
> > > Ishan
> > >
> > > On Tue, 19 Sept 2023 at 01:15, Ishan Chattopadhyaya <
> > > ichattopadhy...@gmail.com> wrote:
> > >
> > > > Just a reminder to backport any issues one sees fit for a 8.11.3
> > release.
> > > > I'll try to get an RC out by the end of September, so still more
> than a
> > > > week away.
> > > >
> > > > On Wed, 23 Aug 2023 at 17:09, Ishan Chattopadhyaya <
> > > > ichattopadhy...@gmail.com> wrote:
> > > >
> > > >> Hi Jan,
> > > >> Yes, still targeting September. But I will slip on my initial plan
> of
> > > >> doing it by first week of September. I'm foreseeing mid September
> > > timeframe.
> > > >> Thanks for checking in.
> > > >> Regards,
> > > >> Ishan
> > > >>
> > > >> On Wed, 23 Aug, 2023, 5:05 pm Jan Høydahl, 
> > > wrote:
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> Following up on Ishan's proposed 8.11.3 release (
> > > >>> https://lists.apache.org/thread/3xjtv1sxqx8f9nvhkc0cb90b2p76nfx2)
> > > >>>
> > > >>> Does the Lucene project have any bugfix candidates for backporting?
> > > >>>
> > > >>> Ishan, are you still targeting September?
> > > >>>
> > > >>> Jan
> > > >>>
> > > >>>
> > > >>> > On 1 Aug 2023, at 14:57, Ishan Chattopadhyaya <
> > > >>> ichattopadhy...@gmail.com> wrote:
> > > >>> >
> > > >>> > Oh yes, good idea. Forgot about the split!
> > > >>> >
> > > >>> > +Lucene Dev 
> > > >>> >
> > > >>> > On Tue, 1 Aug, 2023, 6:17 pm Uwe Schindler, 
> > wrote:
> > > >>> >
> > > >>> >> Maybe ask on the Lucene list, too, if there are some bugs
> > > >>> >> people would like to have fixed in Lucene.
> > > >>> >>
> > > >>> >> Uwe
> > > >>> >>
> > > >>> >> On 01.08.2023 at 11:10, Ishan Chattopadhyaya wrote:
> > > >>> >>> Hi all,
> > > >>> >>> There have been lots of bug fixes that have gone into 9x that
> > > should
> > > >>> >>> benefit all 8x users as well.
> > > >>> >>> I thought of volunteering for such a 8.x release based on this
> > > >>> comment
> > > >>> >> [0].
> > > >>> >>>
> > > >>> >>> Unless someone has any objections or concerns, can we
> > > >>> >>> tentatively plan 1st September 2023 (1 month from now) as the
> > > >>> >>> release date for 8.11.3? I think we will get ample time to
> > > >>> >>> backport relevant fixes to 8x by then.
> > > >>> >>>
> > > >>> >>> Best regards,
> > > >>> >>> Ishan
> > > >>> >>>
> > > >>> >>> [0] -
> > > >>> >>> https://issues.apache.org/jira/browse/SOLR-16777?focusedCommentId=17742854&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17742854
> > > >>> >>>
> > > >>> >> --
> > > >>> >> Uwe Schindler
> > > >>> >> Achterdiek 19, D-28357 Bremen
> > > >>> >> https://www.thetaphi.de
> > > >>> >> eMail: u...@thetaphi.de
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > -
> > > >>> >> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> > > >>> >> For additional commands, e-mail: dev-h...@solr.apache.org
> > > >>> >>
> > > >>> >>
> > > >>>
> > > >>>
> > > >>>
> -
> > > >>>
> > > >>>
> > >
> >
>


[VOTE] Release Solr 9.4.0 RC1

2023-10-05 Thread Alex Deparvu
Please vote for release candidate 1 for Solr 9.4.0

The artifacts can be downloaded from:
https://dist.apache.org/repos/dist/dev/solr/solr-9.4.0-RC1-rev-ee474b7db483c2242ce1d75074258236ca22103b

You can run the smoke tester directly with this command:

python3 -u dev-tools/scripts/smokeTestRelease.py \
https://dist.apache.org/repos/dist/dev/solr/solr-9.4.0-RC1-rev-ee474b7db483c2242ce1d75074258236ca22103b

You can build a release-candidate of the official docker images (full &
slim) using the following command:

SOLR_DOWNLOAD_SERVER=https://dist.apache.org/repos/dist/dev/solr/solr-9.4.0-RC1-rev-ee474b7db483c2242ce1d75074258236ca22103b/solr && \
  docker build $SOLR_DOWNLOAD_SERVER/9.4.0/docker/Dockerfile.official-full \
    --build-arg SOLR_DOWNLOAD_SERVER=$SOLR_DOWNLOAD_SERVER \
    -t solr-rc:9.4.0-1 && \
  docker build $SOLR_DOWNLOAD_SERVER/9.4.0/docker/Dockerfile.official-slim \
    --build-arg SOLR_DOWNLOAD_SERVER=$SOLR_DOWNLOAD_SERVER \
    -t solr-rc:9.4.0-1-slim

The vote will be open for at least 72 hours i.e. until 2023-10-08 20:00 UTC.

[ ] +1  approve
[ ] +0  no opinion
[ ] -1  disapprove (and reason why)

Here is my +1 (which I am not completely sure about, but I think it is
non-binding).

best,
alex


Crave is doing well

2023-10-05 Thread David Smiley
I believe the Crave issues with branch merging have been fixed. If someone
sees otherwise, please let me know.

And boy, Crave is fast! The whole GHA action takes 8m, but the Crave side
is 6m, of which 4m is tests running. It's faster than "precommit", which is
still running in a standard GHA. Isn't that crazy! Yes, there's room for
improvement.

There are opportunities for Crave to come up with a GHA self-hosted runner
to substantially eat away at that 2m, like avoiding the needless checkout
of all the code on the GHA side that basically isn't used.

There are opportunities for our project to optimize the Gradle build so
that it can start running tests (or whatever task) as soon as possible, no
matter where it runs. There's a whole section of the Gradle docs on build
optimization. Maybe someone would like to explore that, like trying the
"configuration cache":
https://docs.gradle.org/current/userguide/configuration_cache.html
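For anyone who wants to try it, I believe the configuration cache is opted
into via gradle.properties, per the Gradle docs (whether our build is
actually compatible with it is exactly what would need exploring):

```properties
# gradle.properties -- opt in to Gradle's configuration cache.
# Caches the result of the configuration phase so repeat builds can
# skip straight to task execution.
org.gradle.configuration-cache=true
# While experimenting, report incompatible tasks as warnings instead of failing:
org.gradle.configuration-cache.problems=warn
```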

I have access to build analytics in Crave that give some insights:  The
first 48 seconds is not very concurrent and not downloading anything.  The
next 36 seconds it downloads 100MB of something (don't know what).  Then
CPUs go full tilt with tests.  It's very apparent that Gradle testing has
no "work stealing" algorithm amongst the runners.
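To illustrate why the lack of work stealing matters, here is a toy
scheduling simulation (hypothetical code, nothing to do with Gradle
internals): with a fixed upfront split, one runner can end up with all the
slow suites, while a shared work queue keeps the runners balanced.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Toy comparison: static partitioning of test-suite durations vs. a
// shared work queue ("work stealing" style greedy scheduling).
public class WorkStealingSketch {
    // Fixed upfront round-robin split; returns the busiest worker's total time.
    static int staticSplit(int[] costs, int workers) {
        int[] load = new int[workers];
        for (int i = 0; i < costs.length; i++) load[i % workers] += costs[i];
        return Arrays.stream(load).max().getAsInt();
    }

    // Each idle worker pulls the next suite off a shared queue.
    static int workQueue(int[] costs, int workers) {
        PriorityQueue<Integer> load = new PriorityQueue<>(); // min-heap of worker loads
        for (int i = 0; i < workers; i++) load.add(0);
        for (int c : costs) load.add(load.poll() + c); // next suite -> least-loaded worker
        int max = 0;
        for (int l : load) max = Math.max(max, l);
        return max;
    }

    public static void main(String[] args) {
        int[] costs = {30, 1, 1, 1, 30, 1, 1, 1}; // seconds per test suite
        System.out.println("static=" + staticSplit(costs, 2)); // prints static=62
        System.out.println("queue=" + workQueue(costs, 2));    // prints queue=33
    }
}
```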


I'm a bit perplexed by the downloading of 100MB, because the image for the
build machine has commands I added to pre-download stuff. Those look like
the following:


# Pre-download what we can through Gradle
./gradlew --write-verification-metadata sha256 --dry-run
rm gradle/verification-metadata.dryrun.xml
./gradlew -p solr/solr-ref-guide downloadAntora
./gradlew -p solr/packaging downloadBats
# May need more memory
sed -i 's/-Xmx1g/-Xmx2g/g' gradle.properties
# Use lots of CPUs
sed -i 's/org.gradle.workers.max=.*/org.gradle.workers.max=96/' gradle.properties
sed -i 's/tests.jvms=.*/tests.jvms=96/' gradle.properties

./gradlew assemble || true


~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley