Re: field collapsing performance in sharded environment

Paul Masurel Fri, 15 Nov 2013 08:09:29 -0800

That's not the way grouping is done.
In the first round, all shards return their 10 best groups (represented as
their 10 best grouping values).


As a result, grouping takes three rounds instead of the two for a regular
search, so some increase in latency is normal, but not on the scale of what
you are seeing here.
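
To make the first round concrete, here is a minimal sketch in Python (not
Solr's actual code). It assumes groups are ranked by the score of their best
document; the function names and the sample hash values are hypothetical.
Each shard sends back only its n best group values, and the coordinator
merges those candidates into the global top n:

```python
import heapq
from typing import Dict, List, Tuple

# A shard's first-round answer: (group_value, best_score_in_group) pairs.
ShardGroups = List[Tuple[str, float]]


def shard_top_groups(docs: List[Tuple[str, float]], n: int) -> ShardGroups:
    """Round 1 on one shard: keep the best score per group, return the n best groups."""
    best: Dict[str, float] = {}
    for group_value, score in docs:
        if group_value not in best or score > best[group_value]:
            best[group_value] = score
    return heapq.nlargest(n, best.items(), key=lambda item: item[1])


def merge_top_groups(per_shard: List[ShardGroups], n: int) -> List[str]:
    """Coordinator: merge per-shard candidates into the global top n group values."""
    best: Dict[str, float] = {}
    for groups in per_shard:
        for group_value, score in groups:
            if group_value not in best or score > best[group_value]:
                best[group_value] = score
    ranked = heapq.nlargest(n, best.items(), key=lambda item: item[1])
    return [group_value for group_value, _ in ranked]


# Hypothetical documents as (near-dupe hash, score) pairs on two shards.
shard1 = [("h1", 0.9), ("h2", 0.4), ("h1", 0.3), ("h3", 0.7)]
shard2 = [("h4", 0.8), ("h5", 0.2), ("h2", 0.95)]

candidates = [shard_top_groups(shard, 2) for shard in (shard1, shard2)]
top = merge_top_groups(candidates, 2)
print(top)
```

The point of the sketch is that the payload per shard is bounded by n, not by
the number of distinct group values on that shard.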

Most probably it is due to the performance issue in TermAllGroupsCollector,
which you can patch very easily.


On Thu, Nov 14, 2013 at 3:56 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> bq:   Of the 10k docs,
> most have a unique near duplicate hash value, so there are about 10k unique
> values for the field that I'm grouping on.
>
> I suspect (but don't know the grouping code well) that this is the issue.
> You're getting the top N groups, right? But in the general case, you can't
> ensure that the topN from shard1 has any relation to the topN from shard2.
> So I _suspect_ that the code returns all of the groups. Say that group 5
> has 3 docs on shard1 but 3,000 docs on shard2. To get the true top N, you
> need to collate all the values from all the groups; you can't just return
> the top 10 groups from each shard and get correct counts.
>
> Since your group cardinality is about 10K/shard, you're pushing 10 packets
> each containing 10K entries back to the originating shard, which has to
> combine/sort them all to get the true top N. At least that's my theory.
>
> Your situation is special in that you say that your groups don't appear on
> more than one shard, so you'd probably have to write something that aborted
> this behavior and returned only the top N, if I'm right.
>
> But that raises the question of why you're doing this. What purpose is
> served by grouping when most groups probably have only 1 member?
>
> Best,
> Erick
>
>
> On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano <
> dtroi...@basistech.com> wrote:
>
> > Hello,
> >
> > I'm hitting a performance issue when using field collapsing in a
> > distributed Solr setup and I'm wondering if others have seen it and if
> > anyone has an idea for a workaround.
> >
> > I'm using field collapsing to deduplicate documents that have the same
> > near duplicate hash value, and deduplicating at query time (as opposed to
> > filtering at index time) is a requirement.  I have a sharded setup with 10
> > cores (not SolrCloud), each having ~1000 documents.  Of the 10k docs, most
> > have a unique near duplicate hash value, so there are about 10k unique
> > values for the field that I'm grouping on.  The grouping parameters that
> > I'm using are:
> >
> > group=true
> > group.field=<near dupe hash field>
> > group.main=true
> >
> > I'm attempting distributed queries (&shards=s1,s2,...,s10) where the only
> > difference is the absence or presence of these three grouping parameters
> > and I'm consistently seeing a marked difference in performance (as a
> > representative data point, 200ms latency without grouping and 1600ms with
> > grouping).  Interestingly, if I put all 10k docs on the same core and
> > query that core independently with and without grouping, I don't see much
> > of a latency difference, so the performance degradation seems to exist
> > only in the sharded setup.
> >
> > Is there a known performance issue when field collapsing in a sharded
> > setup (perhaps one that only manifests when the grouping field has many
> > unique values), or have other people observed this?  Any ideas for a
> > workaround?  Note that docs in my sharded setup can only have the same
> > signature if they're in the same shard, so perhaps that can be used to
> > boost perf, though I don't see an exposed way to do so.
> >
> > A follow-on question is whether we're likely to see the same issue if /
> > when we move to SolrCloud.
> >
> > Thanks,
> > Dave
> >
>



-- 
______________________________________________

 Masurel Paul
 e-mail: paul.masu...@gmail.com
