Re: Issue with marking replicas down at startup

2023-10-06 Thread rajani m
Hi Vincent,

I have seen that behavior: the node gets re-provisioned, the replica on that
node comes back up live and ZK starts routing traffic to it, but the response
time from that replica is really high for a short period.

I worked around it by adding a few hundred warming queries, which keeps the
replica in recovery until all the queries are replayed and hence delays the
live state. But that's not a good solution, since it always keeps the replica
in recovery for minutes, which may not be necessary if the issue is what you
described: the replica going live before the core is loaded.
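
For reference, this is roughly what that warming setup looks like in
solrconfig.xml (the queries and counts are illustrative, not from any real
config):

<query>
  <!-- firstSearcher queries are replayed when the core loads; useColdSearcher=false
       makes search requests wait until that warming completes -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">laptop</str><str name="rows">10</str></lst>
      <lst><str name="q">phone case</str><str name="rows">10</str></lst>
      <!-- ...a few hundred more entries, which is what stretches startup into minutes -->
    </arr>
  </listener>
  <useColdSearcher>false</useColdSearcher>
</query>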

Thank you,
Rajani

On Thu, Oct 5, 2023, 3:26 AM Vincent Primault wrote:

> Hello,
>
> I have been looking at a previous investigation we had about an unexpected
> behaviour where a node was taking traffic for a replica that was not ready
> to take it. It seems to happen when the node is marked as live and the
> replica is marked as active, while the corresponding core was not loaded
> yet on the node.
>
> I looked at the code and in theory it should not happen, since the
> following happens in ZkController#init: mark node as down, wait for
> replicas to be marked as down, and then register the node as live. However,
> after looking at the code of publishAndWaitForDownStates, I observed that
> we wait for down states for replicas associated with cores as returned by
> CoreContainer#getCoreDescriptors... which is empty at this point since
> ZkController#init is called before cores are discovered (which happens
> later in CoreContainer#load).
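> 
> In simplified form (a sketch only; the real methods live in ZkController and
> CoreContainer and take more arguments, but the ordering is the point):
> 
> import java.util.Collection;
> import java.util.Collections;
> 
> // Stand-in names mirror ZkController#init, publishAndWaitForDownStates and
> // CoreContainer#getCoreDescriptors; the bodies are simplified stand-ins.
> public class StartupOrderSketch {
> 
>   // Before CoreContainer#load runs, no cores have been discovered yet.
>   static Collection<String> getCoreDescriptors() {
>     return Collections.emptyList();
>   }
> 
>   static void publishAndWaitForDownStates() {
>     Collection<String> cores = getCoreDescriptors();
>     if (cores.isEmpty()) {
>       // Nothing to wait for, so this returns immediately, even though ZooKeeper
>       // may still hold stale "active" states for this node's replicas.
>       System.out.println("no local cores known yet; not waiting for down states");
>       return;
>     }
>     // (the real method blocks here until every listed replica shows state=down)
>   }
> 
>   public static void main(String[] args) {
>     // ZkController#init then continues: wait for down states, then go live.
>     publishAndWaitForDownStates();
>     System.out.println("node registered as live, possibly before its cores are ready");
>   }
> }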
>
> It therefore seems to me that we basically never wait for any replicas to
> be marked as down, and continue the startup sequence by marking the node as
> live, so the node *might* take traffic for a short period of time for a
> replica that is not ready (e.g., if the node previously crashed and the
> replica stayed active).
>
> As I am new to investigating this kind of thing in SolrCloud, I want to
> share my findings and get feedback on whether my reading is correct (in
> which case I'd be happy to contribute a bug fix), or whether I am missing
> something.
>
> Thank you,
>
> Vincent Primault.
>


Re: Issue with marking replicas down at startup

2023-10-26 Thread rajani m
Is this an issue in that case? If so, should we create a jira to address it?

On Sat, Oct 7, 2023 at 8:32 PM Mark Miller  wrote:

> Yeah, it’s not going to fix that updates can come in too early if you just
> delay when the replica publishes active. It’s still going to show up active
> when it’s not. That gets rectified if you end up replicating the index,
> it’s when you peer sync that it can be a persistent problem. And in both
> cases, you can end up with a window of incorrect queries.
>
> The most straightforward way to handle it is to use the cluster state
> rather than the cores from the core container when publishing down (where
> it doesn’t currently use the downnode command) and when waiting to see the
> state.
>


Re: Query Limits - extending timeAllowed to implement thread CPU/memory limits for a query

2024-02-06 Thread rajani m
I have updated the Jira with my vote. I totally agree that Solr lacks a way
to limit resource usage, and that a single query can bring all the nodes in
the cluster down. I'd love to review/contribute to the PR.

Thank you for bringing up this issue Andrzej.



On Mon, Feb 5, 2024 at 4:16 PM Andrzej Białecki  wrote:

> Hi all,
>
> I’d like to draw your attention to SOLR-17138. Jira description gives a
> more detailed background of the issue - but in short, Solr lacks a robust
> way to limit the resource usage (be it CPU time or memory consumption or
> other resource) on a per-query basis. CircuitBreakers help to an extent but
> they are a global measure. This work focuses on monitoring and terminating
> individual queries that exceed configured limits.
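> 
> As a trivial illustration of one building block (this is not the SOLR-17138
> design, just the JDK primitive a per-query CPU budget could be built on):
> 
> import java.lang.management.ManagementFactory;
> import java.lang.management.ThreadMXBean;
> 
> // A per-query CPU budget: capture the thread's CPU time when the query starts
> // and check it at safe points while the query runs. Unlike timeAllowed, this
> // measures CPU actually burned by the executing thread, not wall-clock time.
> public class CpuBudgetSketch {
>   private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();
> 
>   private final long startCpuNanos;
>   private final long budgetNanos;
> 
>   public CpuBudgetSketch(long budgetMillis) {
>     this.budgetNanos = budgetMillis * 1_000_000L;
>     this.startCpuNanos = THREADS.getCurrentThreadCpuTime();
>   }
> 
>   /** Call between units of work; true means the query should be terminated. */
>   public boolean shouldExit() {
>     return THREADS.getCurrentThreadCpuTime() - startCpuNanos > budgetNanos;
>   }
> 
>   public static void main(String[] args) {
>     if (!THREADS.isCurrentThreadCpuTimeSupported()) {
>       System.out.println("per-thread CPU time is not supported on this JVM");
>       return;
>     }
>     CpuBudgetSketch budget = new CpuBudgetSketch(50); // 50 ms of CPU for this "query"
>     long work = 17;
>     while (!budget.shouldExit()) {
>       for (int i = 0; i < 10_000; i++) {
>         work = work * 31 + 17; // simulated query work between limit checks
>       }
>     }
>     System.out.println("query terminated after exceeding its CPU budget: " + work);
>   }
> }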
>
> Reviews / comments / suggestions are highly appreciated!
>
> —
>
> Andrzej Białecki
>
>


Multiple query fields with block max WAND

2024-02-15 Thread rajani m
Hi Solr Devs,

   Does the Block Max WAND feature support querying multiple fields in a
single query? I tested using the dismax query fields parameter "qf" with two
fields ("qf=title description") and it didn't work, but it works perfectly
when "qf" is a single field. Is this expected?

Thank you,
Rajani


Re: Issue with marking replicas down at startup

2024-04-29 Thread Rajani M
Hi All,
Saw this SOLR-17200 <https://issues.apache.org/jira/browse/SOLR-17200> fixed
in 9.6, which seems to be similar to SOLR-17049
<https://issues.apache.org/jira/browse/SOLR-17049>. Could you please take a
look and let me know your thoughts?

Thank you,
Rajani

On Thu, Oct 26, 2023 at 9:43 AM Vincent Primault wrote:

> Hello, I created a JIRA to track this:
> https://issues.apache.org/jira/browse/SOLR-17049
>
> On Thu, Oct 26, 2023 at 3:30 PM rajani m  wrote:
>
> > Is this an issue in that case? If so, should we create a jira to address
> > it?
> >
> > On Sat, Oct 7, 2023 at 8:32 PM Mark Miller 
> wrote:
> >
> > > Yeah, it’s not going to fix that updates can come in too early if you
> > just
> > > delay when the replica publishes active. It’s still going to show up
> > active
> > > when it’s not. That gets rectified if you end up replicating the index,
> > > it’s when you peer sync that it can be a persistent problem. And in
> both
> > > cases, you can end up with a window of incorrect queries.
> > >
> > > The most straightforward way to handle it is to use the cluster state
> > > rather than the cores from the core container when publishing down
> (where
> > > it doesn’t currently use the downnode command) and when waiting to see
> > the
> > > state.
> > >
> >
>


Re: Issue with marking replicas down at startup

2024-04-29 Thread Rajani M
Makes sense, thank you, Vincent.

On Mon, Apr 29, 2024 at 9:45 AM Vincent Primault wrote:

> Hello,
>
> The cause is similar but SOLR-17200 being fixed does not mean that
> SOLR-17049 is. The latter might be a bit trickier to fix.
>
> Vincent
>
> On Mon, Apr 29, 2024 at 3:41 PM Rajani M  wrote:
>
> > Hi All,
> > Saw this SOLR-17200 <https://issues.apache.org/jira/browse/SOLR-17200>
> > fixed
> > in 9.6. which seems to be similar to SOLR-17049
> > <https://issues.apache.org/jira/browse/SOLR-17049>. Could you please
> take
> > a
> > look and let me know your thoughts?
> >
> > Thank you,
> > Rajani
> >
> > On Thu, Oct 26, 2023 at 9:43 AM Vincent Primault wrote:
> >
> > > Hello, I created a JIRA to track this:
> > > https://issues.apache.org/jira/browse/SOLR-17049
> > >
> > > On Thu, Oct 26, 2023 at 3:30 PM rajani m 
> wrote:
> > >
> > > > Is this an issue in that case? If so, should we create a jira to
> > address
> > > > it?
> > > >
> > > > On Sat, Oct 7, 2023 at 8:32 PM Mark Miller 
> > > wrote:
> > > >
> > > > > Yeah, it’s not going to fix that updates can come in too early if
> you
> > > > just
> > > > > delay when the replica publishes active. It’s still going to show
> up
> > > > active
> > > > > when it’s not. That gets rectified if you end up replicating the
> > index,
> > > > > it’s when you peer sync that it can be a persistent problem. And in
> > > both
> > > > > cases, you can end up with a window of incorrect queries.
> > > > >
> > > > > The most straightforward way to handle it is to use the cluster
> state
> > > > > rather than the cores from the core container when publishing down
> > > (where
> > > > > it doesn’t currently use the downnode command) and when waiting to
> > see
> > > > the
> > > > > state.
> > > > >
> > > >
> > >
> >
>


Re: Issue with marking replicas down at startup

2024-05-05 Thread Rajani M
Thanks a ton for this contribution, Houston. I tried to work on this myself
but it seemed pretty complicated: I could only spot the issue in the
ZkController, not in the rest of the workflow, and I couldn't tell where to
implement the getReplicaNamesPerCollectionOnNode method. I appreciate your
time and effort. This PR was a great learning experience for me, and I look
forward to this fix being merged.
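
For anyone else digging into it, my rough guess at that helper's shape was
along these lines. The method name is from the PR discussion; the body below
is only my guess at using the cluster state (as Mark suggested) rather than
CoreContainer's not-yet-discovered cores, and is not the code in PR 2432:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

public class NodeReplicasSketch {

  // Collect the names of all replicas hosted on the given node, grouped by
  // collection, by walking the cluster state instead of local core descriptors.
  public static Map<String, List<String>> getReplicaNamesPerCollectionOnNode(
      ClusterState clusterState, String nodeName) {
    Map<String, List<String>> replicasPerCollection = new HashMap<>();
    for (DocCollection collection : clusterState.getCollectionsMap().values()) {
      for (Replica replica : collection.getReplicas()) {
        if (nodeName.equals(replica.getNodeName())) {
          replicasPerCollection
              .computeIfAbsent(collection.getName(), c -> new ArrayList<>())
              .add(replica.getName());
        }
      }
    }
    return replicasPerCollection;
  }
}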

This is definitely a major bug fix for large collections (> 50 shards)
hosted on multiple nodes. Any node restart was causing at least 2-4 minutes
of failed requests, which is not acceptable for a cluster serving hundreds of
requests per second. The workaround was not feasible.

Thank you,
Rajani


On Tue, Apr 30, 2024 at 3:42 PM Houston Putman  wrote:

> I've created a PR to address this:
> https://github.com/apache/solr/pull/2432
>
> Open to other ways of approaching it though.
>
> - Houston
>
> On Tue, Apr 30, 2024 at 4:44 AM Mark Miller  wrote:
>
> > There is a publish-node-as-down-and-wait method that just waits until the
> > down states show up in the cluster state. But waiting won't do any good
> > until down is actually published, and it still is not. I'm pretty sure
> > down has never been published on startup, despite appearances. I've seen
> > two ramifications from this. One is that it's much easier for replicas to
> > get out of sync when restarting a cluster while updates are coming in. I
> > say easier, because that's not air tight regardless, but it does make it
> > much easier to happen. The second is that cores that are not ready can
> > participate in leader elections, receiving updates and queries. And if a
> > core fails to load, it will participate indefinitely. Retries and fault
> > tolerance will plaster over a good chunk of that, though, less so for
> > leader election when a core fails to load.
> >
>