I just backported Michael’s fix; it will be released in 8.5.2.
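For anyone following along, the thread turns on how shards.preference breaks ties among equally preferred replicas. Below is a minimal, hypothetical sketch in plain Python (not Solr's actual routing code) of the behavior SOLR-14471 addresses: when the preference matches every available replica, the tie must be broken randomly, or the clusterstate ordering leaks through and the same replica is always picked first. Replica names are invented for illustration.

```python
import random

# Hypothetical sketch of replica routing; this is NOT Solr's actual code.
# One shard with 10 TLOG replicas, as in Wei's setup.
replicas = [f"core_node{i}" for i in range(1, 11)]

def pick_replica(replicas, preference_rank, shuffle_ties=True):
    """Return the replica a distributed shard request would be sent to."""
    ranked = sorted(replicas, key=preference_rank)   # stable sort
    best = preference_rank(ranked[0])
    tied = [r for r in ranked if preference_rank(r) == best]
    if shuffle_ties:
        return random.choice(tied)   # fixed behavior: random tie-break
    return tied[0]                   # the bug, roughly: clusterstate order wins

# shards.preference=replica.type:TLOG matches *all* replicas here,
# so every replica gets the same rank and they all tie.
all_equal = lambda r: 0

# Without the shuffle, one replica absorbs every request (what Wei observed).
assert len({pick_replica(replicas, all_equal, shuffle_ties=False)
            for _ in range(1000)}) == 1

# With the shuffle, load spreads across all 10 replicas.
hits = {pick_replica(replicas, all_equal) for _ in range(1000)}
assert hits == set(replicas)
```

The real change lives in Solr's replica-ordering code; this only illustrates why a missing shuffle pins traffic to the first replica in clusterstate order when the preference cannot distinguish the candidates.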

On Fri, May 15, 2020 at 6:38 AM Michael Gibney <mich...@michaelgibney.net>
wrote:

> Hi Wei,
> SOLR-14471 has been merged, so this issue should be fixed in 8.6.
> Thanks for reporting the problem!
> Michael
>
> On Mon, May 11, 2020 at 7:51 PM Wei <weiwan...@gmail.com> wrote:
> >
> > Thanks Michael!  Yes, in each shard I have 10 TLOG replicas, no other
> > type of replicas, and each TLOG replica is an individual Solr instance
> > on its own physical machine.  In the Jira you mentioned 'when "last
> > place matches" == "first place matches" – e.g. when shards.preference
> > specified matches *all* available replicas'.  My setting is
> > shards.preference=replica.location:local,replica.type:TLOG
> > I also tried just shards.preference=replica.location:local and it
> > still has the issue. Can you explain a bit more?
> >
> > On Mon, May 11, 2020 at 12:26 PM Michael Gibney <mich...@michaelgibney.net>
> > wrote:
> >
> > > FYI: https://issues.apache.org/jira/browse/SOLR-14471
> > > Wei, assuming you have only TLOG replicas, your "last place" matches
> > > (to which the random fallback ordering would not be applied -- see
> > > above issue) would be the same as the "first place" matches selected
> > > for executing distributed requests.
> > >
> > >
> > > On Mon, May 11, 2020 at 1:49 PM Michael Gibney
> > > <mich...@michaelgibney.net> wrote:
> > > >
> > > > Wei, probably no need to answer my earlier questions; I think I see
> > > > the problem here, and believe it is indeed a bug, introduced in 8.3.
> > > > Will file an issue and submit a patch shortly.
> > > > Michael
> > > >
> > > > On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> > > > <mich...@michaelgibney.net> wrote:
> > > > >
> > > > > Hi Wei,
> > > > >
> > > > > In considering this problem, I'm stumbling a bit on terminology
> > > > > (particularly, where you mention "nodes", I think you're referring
> > > > > to "replicas"?). Could you confirm that you have 10 TLOG replicas
> > > > > per shard, for each of 6 shards? How many *nodes* (i.e., running
> > > > > Solr server instances) do you have, and what is the replica
> > > > > placement like across those nodes? What, if any, non-TLOG replicas
> > > > > do you have per shard (not that it's necessarily relevant, but just
> > > > > to get a complete picture of the situation)?
> > > > >
> > > > > If you're able without too much trouble, can you determine what
> > > > > the behavior is like on Solr 8.3? (There were different changes
> > > > > introduced to potentially relevant code in 8.3 and 8.4, and
> > > > > knowing whether the behavior you're observing manifests on 8.3
> > > > > would help narrow down where to look for an explanation.)
> > > > >
> > > > > Michael
> > > > >
> > > > > On Fri, May 8, 2020 at 7:34 PM Wei <weiwan...@gmail.com> wrote:
> > > > > >
> > > > > > Update: after I removed the shards.preference parameter from
> > > > > > solrconfig.xml, the issue is gone and internal shard requests
> > > > > > are now balanced. The same parameter works fine with Solr 7.6.
> > > > > > Still not sure of the root cause, but I observed a strange
> > > > > > coincidence: the nodes most frequently picked for shard requests
> > > > > > are the first nodes in each shard returned from the CLUSTERSTATUS
> > > > > > API. Seems something is wrong with shuffling equally compared
> > > > > > nodes when shards.preference is set. Will report back if I find
> > > > > > more.
> > > > > >
> > > > > > On Mon, Apr 27, 2020 at 5:59 PM Wei <weiwan...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Erick,
> > > > > > >
> > > > > > > I am measuring the number of shard requests, and it's for
> > > > > > > queries only, no indexing requests.  I have an external load
> > > > > > > balancer and see that each node receives about an equal number
> > > > > > > of external queries. However, for the internal shard queries,
> > > > > > > the distribution is uneven: 6 nodes (one in each shard, some of
> > > > > > > them leaders and some non-leaders) get about 80% of the shard
> > > > > > > requests, while the other 54 nodes get about 20%.  I checked a
> > > > > > > few other parameters set:
> > > > > > >
> > > > > > > -Dsolr.disable.shardsWhitelist=true
> > > > > > > shards.preference=replica.location:local,replica.type:TLOG
> > > > > > >
> > > > > > > Nothing seems to explain the strange behavior.  Any suggestions
> > > > > > > on how to debug this?
> > > > > > >
> > > > > > > -Wei
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <erickerick...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Wei:
> > > > > > >>
> > > > > > >> How are you measuring utilization here? The number of
> > > > > > >> incoming requests, or CPU?
> > > > > > >>
> > > > > > >> The leaders for each shard are certainly handling all of the
> > > > > > >> indexing requests since they're TLOG replicas, so that's one
> > > > > > >> thing that might be skewing your measurements.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Erick
> > > > > > >>
> > > > > > >> > On Apr 27, 2020, at 7:13 PM, Wei <weiwan...@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> > Hi everyone,
> > > > > > >> >
> > > > > > >> > I have a strange issue after upgrading from 7.6.0 to 8.4.1.
> > > > > > >> > My cloud has 6 shards with 10 TLOG replicas per shard.
> > > > > > >> > After the upgrade I noticed that one of the replicas in each
> > > > > > >> > shard is handling most of the distributed shard requests, so
> > > > > > >> > 6 nodes are heavily loaded while the other nodes are idle.
> > > > > > >> > There is no change in the shard handler configuration:
> > > > > > >> >
> > > > > > >> > <shardHandlerFactory name="shardHandlerFactory"
> > > > > > >> >                      class="HttpShardHandlerFactory">
> > > > > > >> >   <int name="socketTimeout">30000</int>
> > > > > > >> >   <int name="connTimeout">30000</int>
> > > > > > >> >   <int name="maxConnectionsPerHost">500</int>
> > > > > > >> > </shardHandlerFactory>
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > What could cause the unbalanced internal distributed
> > > > > > >> > requests?
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Thanks in advance.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Wei
> > > > > > >>
> > > > > > >>
> > >
>
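To check the coincidence Wei describes, that the busy replica in each shard is the one listed first in the CLUSTERSTATUS response, one can inspect the replica ordering in that response. A minimal sketch against a trimmed, CLUSTERSTATUS-shaped payload follows; the collection, core, and node names here are invented for illustration, not taken from the thread.

```python
import json

# Trimmed-down, CLUSTERSTATUS-shaped payload (invented names).
# A real response comes from /solr/admin/collections?action=CLUSTERSTATUS.
status = json.loads("""
{
  "cluster": {
    "collections": {
      "mycollection": {
        "shards": {
          "shard1": {"replicas": {
            "core_node1": {"node_name": "host1:8983_solr", "type": "TLOG"},
            "core_node2": {"node_name": "host2:8983_solr", "type": "TLOG"}
          }},
          "shard2": {"replicas": {
            "core_node3": {"node_name": "host3:8983_solr", "type": "TLOG"},
            "core_node4": {"node_name": "host4:8983_solr", "type": "TLOG"}
          }}
        }
      }
    }
  }
}
""")

# The first replica listed per shard: on the affected versions, Wei observed
# these replicas absorbing most of the internal shard requests.
shards = status["cluster"]["collections"]["mycollection"]["shards"]
first_per_shard = {shard: next(iter(info["replicas"]))
                   for shard, info in shards.items()}
print(first_per_shard)
```

Comparing this per-shard "first replica" against the nodes that show elevated shard-request counts would confirm whether the clusterstate ordering is leaking into request routing.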
