Issue with marking replicas down at startup
Hello,

I have been looking at a previous investigation we had into unexpected behaviour where a node was taking traffic for a replica that was not ready for it. It seems to happen when the node is marked as live and the replica is marked as active, while the corresponding core has not been loaded yet on the node.

I looked at the code, and in theory this should not happen, since ZkController#init does the following: mark the node as down, wait for its replicas to be marked as down, and then register the node as live. However, after looking at the code of publishAndWaitForDownStates, I observed that we only wait for down states for replicas associated with the cores returned by CoreContainer#getCoreDescriptors... which is empty at this point, since ZkController#init is called before cores are discovered (that happens later, in CoreContainer#load).

It therefore seems to me that we basically never wait for any replicas to be marked as down, continue the startup sequence by marking the node as live, and hence *might* take traffic for a short period of time for a replica that is not ready (e.g., if the node previously crashed and the replica stayed active).

As I am new to investigating this kind of thing in SolrCloud, I want to share my findings and get feedback on whether my analysis is correct (in which case I'd be happy to contribute a bug fix) or whether I am missing something.

Thank you,
Vincent Primault
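P.S. To make the ordering concrete, here is a rough, condensed sketch of what I believe happens. This is not the actual Solr source; the names mirror the ZkController/CoreContainer methods involved, but the bodies are stubs for illustration only:

    import java.util.ArrayList;
    import java.util.Collection;

    public class StartupOrderingSketch {

      // Stand-in for CoreContainer#getCoreDescriptors(): during ZkController#init,
      // core discovery has not happened yet (it runs later in CoreContainer#load),
      // so this collection is still empty.
      private final Collection<String> coreDescriptors = new ArrayList<>();

      void init() {
        publishNodeAsDown();            // 1. mark this node's replicas as down
        publishAndWaitForDownStates();  // 2. supposedly wait for the down states...
        createEphemeralLiveNode();      // 3. ...then register the node as live
      }

      void publishAndWaitForDownStates() {
        // The wait is keyed off the cores the container already knows about.
        // Since the collection is empty at this point in startup, the loop never
        // runs and the method returns immediately: nothing is actually waited on.
        for (String coreName : coreDescriptors) {
          waitForReplicaState(coreName, "down");
        }
      }

      void publishNodeAsDown() { /* publish a "down node" message */ }

      void createEphemeralLiveNode() { /* create the live-node entry in ZooKeeper */ }

      void waitForReplicaState(String coreName, String state) { /* watch cluster state */ }

      public static void main(String[] args) {
        new StartupOrderingSketch().init();
      }
    }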
Re: 8.11.3 release
I just opened a pull request: https://github.com/apache/lucene-solr/pull/2679
Details are in the PR. Thanks!

On Mon, 2 Oct 2023 at 20:25, Ishan Chattopadhyaya wrote:
> Thanks Pierre, I'll have it included in 8.11 once you are able to have a PR for this. Thanks!
>
> On Mon, 2 Oct 2023 at 22:22, Pierre Salagnac wrote:
> > Hi Ishan,
> > Sorry for the late chime in.
> >
> > Some time ago I filed a Jira for a Solr 8-specific bug:
> > https://issues.apache.org/jira/browse/SOLR-16843
> >
> > At the time, I wasn't expecting more 8.x releases, so I did not open a PR for it.
> > I can work on a fix if we have a few more days before the release. I think it is worth having it in Solr 8 (it's not a backport).
> >
> > Thanks
> >
> > On Fri, 29 Sept 2023 at 23:24, Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
> > > I'm going to track items from 9x releases in the following sheet:
> > > https://docs.google.com/spreadsheets/d/1FADkjDS0yfXpaUi3fbwsa7Izy5HOxE9AR_NKiss1sIo/edit#gid=0
> > > Please let me know if there's any item there that you think would be useful to backport (if easy) to 8.11 and want me to take a look at backporting it.
> > > Regards,
> > > Ishan
> > >
> > > On Tue, 19 Sept 2023 at 01:15, Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
> > > > Just a reminder to backport any issues one sees fit for an 8.11.3 release. I'll try to get an RC out by the end of September, so still more than a week away.
> > > >
> > > > On Wed, 23 Aug 2023 at 17:09, Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
> > > > > Hi Jan,
> > > > > Yes, still targeting September, but I will slip on my initial plan of doing it by the first week of September. I'm foreseeing a mid-September timeframe.
> > > > > Thanks for checking in.
> > > > > Regards,
> > > > > Ishan
> > > > >
> > > > > On Wed, 23 Aug 2023 at 17:05, Jan Høydahl wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Following up on Ishan's proposed 8.11.3 release (https://lists.apache.org/thread/3xjtv1sxqx8f9nvhkc0cb90b2p76nfx2):
> > > > > >
> > > > > > Does the Lucene project have any bugfix candidates for backporting?
> > > > > >
> > > > > > Ishan, are you still targeting September?
> > > > > >
> > > > > > Jan
> > > > > >
> > > > > > On 1 Aug 2023 at 14:57, Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
> > > > > > > Oh yes, good idea. Forgot about the split!
> > > > > > >
> > > > > > > +Lucene Dev
> > > > > > >
> > > > > > > On Tue, 1 Aug 2023 at 18:17, Uwe Schindler wrote:
> > > > > > > > Maybe ask on the Lucene list, too, if there are some bugs people would like to have fixed in Lucene.
> > > > > > > >
> > > > > > > > Uwe
> > > > > > > >
> > > > > > > > On 01.08.2023 at 11:10, Ishan Chattopadhyaya wrote:
> > > > > > > > > Hi all,
> > > > > > > > > There have been lots of bug fixes that have gone into 9x that should benefit all 8x users as well. I thought of volunteering for such an 8.x release based on this comment [0].
> > > > > > > > >
> > > > > > > > > Unless someone has any objections or concerns, can we tentatively plan 1st September 2023 (1 month from now) as the release date for 8.11.3? I think we will have ample time to backport relevant fixes to 8x by then.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Ishan
> > > > > > > > >
> > > > > > > > > [0] - https://issues.apache.org/jira/browse/SOLR-16777?focusedCommentId=17742854&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17742854
> > > > > > > >
> > > > > > > > --
> > > > > > > > Uwe Schindler
> > > > > > > > Achterdiek 19, D-28357 Bremen
> > > > > > > > https://www.thetaphi.de
> > > > > > > > eMail: u...@thetaphi.de
[VOTE] Release Solr 9.4.0 RC1
Please vote for release candidate 1 for Solr 9.4.0.

The artifacts can be downloaded from:
https://dist.apache.org/repos/dist/dev/solr/solr-9.4.0-RC1-rev-ee474b7db483c2242ce1d75074258236ca22103b

You can run the smoke tester directly with this command:

python3 -u dev-tools/scripts/smokeTestRelease.py \
  https://dist.apache.org/repos/dist/dev/solr/solr-9.4.0-RC1-rev-ee474b7db483c2242ce1d75074258236ca22103b

You can build a release-candidate version of the official docker images (full & slim) using the following command:

SOLR_DOWNLOAD_SERVER=https://dist.apache.org/repos/dist/dev/solr/solr-9.4.0-RC1-rev-ee474b7db483c2242ce1d75074258236ca22103b/solr && \
  docker build $SOLR_DOWNLOAD_SERVER/9.4.0/docker/Dockerfile.official-full \
    --build-arg SOLR_DOWNLOAD_SERVER=$SOLR_DOWNLOAD_SERVER \
    -t solr-rc:9.4.0-1 && \
  docker build $SOLR_DOWNLOAD_SERVER/9.4.0/docker/Dockerfile.official-slim \
    --build-arg SOLR_DOWNLOAD_SERVER=$SOLR_DOWNLOAD_SERVER \
    -t solr-rc:9.4.0-1-slim

The vote will be open for at least 72 hours, i.e. until 2023-10-08 20:00 UTC.

[ ] +1 approve
[ ] +0 no opinion
[ ] -1 disapprove (and reason why)

Here is my +1 (which I am not completely sure about, but I think is non-binding).

best,
alex
Crave is doing well
I believe the Crave issues with branch merging have been fixed. If someone sees otherwise, please let me know.

And boy, Crave is fast! The whole GHA action takes 8 minutes, of which the Crave side is 6 minutes, and 4 of those are the tests actually running. It finishes faster than "precommit", which is still running in a standard GHA runner. Isn't that crazy!

Yes, there's room for improvement. There are opportunities for Crave to come up with a GHA self-hosted runner to substantially eat away at that extra 2 minutes, like the needless checkout of all the code on the GHA side that basically isn't used. There are also opportunities for our project to optimize the Gradle build so that it can start running tests (or whatever task) as soon as possible, no matter where it runs. There's a whole section of the Gradle docs on build optimization. Maybe someone would like to explore that, like trying the "configuration cache" (a minimal way to experiment with it is sketched at the end of this message): https://docs.gradle.org/current/userguide/configuration_cache.html

I have access to build analytics in Crave that give some insights: the first 48 seconds are not very concurrent and don't download anything. Over the next 36 seconds it downloads ~100MB of something (I don't know what). Then the CPUs go full tilt with tests. It's very apparent that Gradle testing has no "work stealing" algorithm amongst the test runners.

I'm a bit perplexed by the 100MB download, because the image for the build machine has commands I added to pre-download stuff. That looks like the following:

# Pre-download what we can through Gradle
./gradlew --write-verification-metadata sha256 --dry-run
rm gradle/verification-metadata.dryrun.xml
./gradlew -p solr/solr-ref-guide downloadAntora
./gradlew -p solr/packaging downloadBats
# May need more memory
sed -i 's/-Xmx1g/-Xmx2g/g' gradle.properties
# Use lots of CPUs
sed -i 's/org.gradle.workers.max=.*/org.gradle.workers.max=96/' gradle.properties
sed -i 's/tests.jvms=.*/tests.jvms=96/' gradle.properties
./gradlew assemble || true

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
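P.S. A minimal way to experiment with the configuration cache, assuming a recent Gradle (the property below is the stable spelling as of Gradle 8.1+) and noting that our build may well not be compatible with it yet:

    # Try it for a single invocation:
    ./gradlew --configuration-cache check
    # Or persist it for every invocation by adding this line to gradle.properties:
    # org.gradle.configuration-cache=true

If the build isn't compatible, Gradle reports which plugins or tasks break the cache, which would itself be a useful starting point for the optimization work.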