Re: Issue with marking replicas down at startup

2024-05-05 Thread Rajani M
Thanks a ton for this contribution, Houston. I tried to work on this myself but it seemed pretty complicated; I could only spot the issue in the ZkController, not the rest of the workflow, and I couldn't tell where to implement the getReplicaNamesPerCollectionOnNode method. Appreciate your time and
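A rough sketch of what a helper like the getReplicaNamesPerCollectionOnNode mentioned above could do, based only on what the thread describes (gather every replica a node hosts, grouped by collection, so they can all be published as down at startup). The data shapes and names here are hypothetical stand-ins, not Solr's actual ZkController API:

```java
import java.util.*;

// Hypothetical sketch, not Solr's real code: map each collection to the
// replica names it has on a given node, using a simplified cluster-state
// shape of collection -> (replicaName -> nodeName).
public class ReplicasOnNode {
    static Map<String, List<String>> getReplicaNamesPerCollectionOnNode(
            Map<String, Map<String, String>> clusterState, String nodeName) {
        Map<String, List<String>> result = new HashMap<>();
        for (var coll : clusterState.entrySet()) {
            for (var replica : coll.getValue().entrySet()) {
                if (nodeName.equals(replica.getValue())) {
                    result.computeIfAbsent(coll.getKey(), k -> new ArrayList<>())
                          .add(replica.getKey());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> state = Map.of(
            "books", Map.of("core_node1", "nodeA:8983", "core_node2", "nodeB:8983"),
            "films", Map.of("core_node3", "nodeB:8983"));
        // Only the replicas hosted on nodeA:8983 are returned.
        System.out.println(getReplicaNamesPerCollectionOnNode(state, "nodeA:8983"));
    }
}
```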

Re: Issue with marking replicas down at startup

2024-04-30 Thread Houston Putman
I've created a PR to address this: https://github.com/apache/solr/pull/2432 Open to other ways of approaching it, though. - Houston On Tue, Apr 30, 2024 at 4:44 AM Mark Miller wrote: > There is a publish node as down and wait method that just waits until the > down states show up in the cluste

Re: Issue with marking replicas down at startup

2024-04-30 Thread Mark Miller
There is a publish node as down and wait method that just waits until the down states show up in the cluster state. But waiting won't do any good until down is actually published, and it still is not. I'm pretty sure down has never been published on startup, despite appearances. I've seen two ramificatio
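The "publish down and wait" pattern described above can be sketched as a poll loop: publishing is only useful if the DOWN state actually reaches the cluster state, so the wait spins until it becomes visible or a timeout expires. publishDown and allReplicasDown are hypothetical stand-ins, not Solr APIs, and this is only an illustration of why waiting alone cannot help if the publish never happens:

```java
import java.util.function.Supplier;

// Illustrative sketch of "publish node as down and wait". If publishDown is
// effectively a no-op (the bug discussed in the thread), the wait loop simply
// times out -- waiting does no good until down is actually published.
public class PublishDownAndWait {
    static boolean publishDownAndWait(Runnable publishDown,
                                      Supplier<Boolean> allReplicasDown,
                                      long timeoutMs) throws InterruptedException {
        publishDown.run();
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (allReplicasDown.get()) return true;  // DOWN visible in cluster state
            Thread.sleep(50);                        // poll, don't busy-spin
        }
        return false;  // DOWN never showed up before the timeout
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] published = {false};
        boolean ok = publishDownAndWait(() -> published[0] = true,
                                        () -> published[0], 1000);
        System.out.println(ok);  // true: the simulated publish landed
    }
}
```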

Re: Issue with marking replicas down at startup

2024-04-29 Thread Rajani M
Makes sense, thank you, Vincent. On Mon, Apr 29, 2024 at 9:45 AM Vincent Primault wrote: > Hello, > > The cause is similar but SOLR-17200 being fixed does not mean that > SOLR-17049 is. The latter might be a bit trickier to fix. > > Vincent > > On Mon, Apr 29, 2024 at 3:41 PM Rajani M wrote: >

Re: Issue with marking replicas down at startup

2024-04-29 Thread Vincent Primault
Hello, The cause is similar but SOLR-17200 being fixed does not mean that SOLR-17049 is. The latter might be a bit trickier to fix. Vincent On Mon, Apr 29, 2024 at 3:41 PM Rajani M wrote: > Hi All, > Saw this SOLR-17200 fixed in 9.6, which

Re: Issue with marking replicas down at startup

2024-04-29 Thread Rajani M
Hi All, Saw this SOLR-17200 fixed in 9.6, which seems to be similar to SOLR-17049. Could you please take a look and let me know your thoughts? Thank you, Rajani On Thu, Oct 26, 2023 at 9:43 AM Vi

Re: Issue with marking replicas down at startup

2023-10-26 Thread Vincent Primault
Hello, I created a JIRA to track this: https://issues.apache.org/jira/browse/SOLR-17049 On Thu, Oct 26, 2023 at 3:30 PM rajani m wrote: > Is this an issue in that case? If so, should we create a jira to address > it? > > On Sat, Oct 7, 2023 at 8:32 PM Mark Miller wrote: > > > Yeah, it’s not goi

Re: Issue with marking replicas down at startup

2023-10-26 Thread rajani m
Is this an issue in that case? If so, should we create a jira to address it? On Sat, Oct 7, 2023 at 8:32 PM Mark Miller wrote: > Yeah, it’s not going to fix that updates can come in too early if you just > delay when the replica publishes active. It’s still going to show up active > when it’s no

Re: Issue with marking replicas down at startup

2023-10-07 Thread Mark Miller
Yeah, it’s not going to fix that updates can come in too early if you just delay when the replica publishes active. It’s still going to show up active when it’s not. That gets rectified if you end up replicating the index; it’s when you peer sync that it can be a persistent problem. And in both cas

Re: Issue with marking replicas down at startup

2023-10-06 Thread Mark Miller
Yes, you are correct. It doesn’t really work. Depending on the distributed mode you are running in, it may still publish the cores as down; in one of the modes it sends a down-node cmd to the Overseer, which should do it based on what cores are in the cluster state. In that case it should still publ

Re: Issue with marking replicas down at startup

2023-10-06 Thread rajani m
Hi Vincent, I have seen that behavior: the node gets re-provisioned, the replica on that node comes back up live, and ZK starts routing traffic to it; however, the response time from that replica is really high for a short period. Worked around it by adding a few hundred warming queries, which puts the replica i
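The warming-queries workaround mentioned above can be expressed declaratively in solrconfig.xml via a firstSearcher event listener, which is a standard Solr mechanism for running queries before a new searcher serves traffic. The queries below are placeholders; real warming queries would mirror the collection's actual traffic:

```xml
<!-- solrconfig.xml: run warming queries when the first searcher opens.
     The queries shown are illustrative placeholders. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="rows">10</str></lst>
    <lst><str name="q">popular query here</str><str name="rows">10</str></lst>
  </arr>
</listener>
```

Note this only reduces the slow-response window after the core is up; it does not address the underlying issue of traffic arriving before the replica is actually ready.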

Issue with marking replicas down at startup

2023-10-05 Thread Vincent Primault
Hello, I have been looking at a previous investigation we had about an unexpected behaviour where a node was taking traffic for a replica that was not ready to take it. It seems to happen when the node is marked as live and the replica is marked as active, while the corresponding core was not load
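The window described above can be reduced to a minimal illustration: routing considers only "node is live" and "replica state is ACTIVE", while whether the core is actually loaded is a separate condition that lags behind after a restart. All names here are hypothetical, not Solr's real routing code:

```java
// Minimal sketch of the race: a stale ACTIVE state plus a live node is enough
// for routing to pick a replica, even though its core is not loaded yet.
public class RoutingWindow {
    enum State { DOWN, RECOVERING, ACTIVE }

    // What routing effectively checks, per the thread's description.
    static boolean wouldRouteTo(boolean nodeLive, State replicaState) {
        return nodeLive && replicaState == State.ACTIVE;
    }

    public static void main(String[] args) {
        boolean nodeLive = true;           // node registered as live
        State replicaState = State.ACTIVE; // stale state left from before restart
        boolean coreLoaded = false;        // core container still starting up

        // Routing says yes even though the core cannot serve the request yet.
        System.out.println(wouldRouteTo(nodeLive, replicaState) && !coreLoaded);
    }
}
```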