Re: Bug in Collapsing QParserPlugin : Sort by 3 or more fields is broken

Umesh Prasad Tue, 24 Jun 2014 00:19:32 -0700

Hi Joel,
   I had missed this email because of an issue with my Gmail settings.

CollapsingQParserPlugin is more performant than regular grouping for the
following reason, and at the following cost:


1.  The QParser refers to global ords for group.field and avoids storing
strings in a set. This has two advantages:
  a) Memory: storing millions of ints instead of strings results in major
savings.
  b) No binary search / lookup is necessary when the segment changes,
resulting in huge computation savings.

2. The cost
    CollapsingFieldValue has to maintain a score/field value for each unique
ord.
   Memory requirement = number of ords * size of 1 field value.
   The basic types (byte, int, float, long, etc.) will consume reasonable
memory.
    String/Text values can be stored as ords and will consume only 4 bytes
each.
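
A minimal sketch of the dense per-ord bookkeeping described in points 1 and 2,
assuming every document already carries an integer group ord. The function
name collapse_by_ord and the (doc_id, group_ord, score) tuple layout are
hypothetical and are not taken from the Solr source; the sketch only
illustrates why memory is proportional to the number of ords.

```python
from array import array

def collapse_by_ord(docs, num_ords):
    """Keep the highest-scoring doc per group ord.

    docs: iterable of (doc_id, group_ord, score) tuples (hypothetical layout).
    Two dense arrays are allocated up front, one slot per ord, so memory
    depends on num_ords and not on how many documents match.
    """
    head_doc = array("i", [-1]) * num_ords      # current group head, -1 = empty
    head_score = array("f", [0.0]) * num_ords   # score of the current head
    for doc_id, group_ord, score in docs:
        # Direct array indexing by ord: no string hashing or comparison.
        if head_doc[group_ord] == -1 or score > head_score[group_ord]:
            head_doc[group_ord] = doc_id
            head_score[group_ord] = score
    return {o: head_doc[o] for o in range(num_ords) if head_doc[o] != -1}

heads = collapse_by_ord(
    [(0, 0, 1.0), (1, 1, 0.5), (2, 0, 2.0), (3, 1, 0.25)], num_ords=2
)
```

Each additional sort field would need one more dense array of the same
length, which is where the linear memory cost comes from.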

The memory requirement arises because the arrays are dense and are allocated
per request.
    Taking an example:
     Index size = 100 million documents
     Unique ords = 10 million
     Sort fields = 4 (1 int field + 1 long field + 2 string/text fields)
     Memory requirement = 40 MB for the int field + 80 MB for the long field
+ 80 MB for the string ords (2 fields * 40 MB) = 200 MB
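
The arithmetic above can be checked with a short sketch. It assumes 4 bytes
per int, 8 bytes per long, 4 bytes per string ord, and 1 MB = 10^6 bytes; the
helper name collapse_memory_bytes is made up for illustration.

```python
# Assumed per-value sizes in bytes for one slot of a dense per-ord array.
BYTES_PER_VALUE = {"int": 4, "long": 8, "string_ord": 4}

def collapse_memory_bytes(unique_ords, sort_fields):
    """One dense array per sort field, one slot per unique ord."""
    return sum(unique_ords * BYTES_PER_VALUE[kind] for kind in sort_fields)

total = collapse_memory_bytes(
    10_000_000, ["int", "long", "string_ord", "string_ord"]
)
print(total / 1_000_000, "MB")  # 200.0 MB
```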


I agree 200 MB per request just for collapsing the search results is huge,
but at least it increases linearly with the number of sort fields. For my use
case, I am willing to pay the linear cost, especially when I can't combine
the sort fields intelligently into a sort function. Plus, it allows me to
sort by String/Text fields as well, which is a big win.
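
To make the intended behaviour concrete, here is a small sketch that picks
group heads with the full sort criteria (f1 asc, f2 desc, id desc) on the
eight-document example quoted further down in this thread. It is only an
illustration of the expected semantics using plain tuples and a dict, not the
plugin's implementation; sort_key and collapse are hypothetical names.

```python
# (id, group, f1, f2) rows from the example quoted later in this thread.
docs = [
    (1, "g1", 5, 10), (2, "g1", 5, 1000), (3, "g1", 5, 1000), (4, "g1", 10, 100),
    (5, "g2", 5, 10), (6, "g2", 5, 1000), (7, "g2", 5, 1000), (8, "g2", 10, 100),
]

def sort_key(doc):
    doc_id, _group, f1, f2 = doc
    # f1 asc, f2 desc, id desc: negate the descending numeric fields.
    return (f1, -f2, -doc_id)

def collapse(rows):
    """Keep, per group, the row that sorts first under sort_key."""
    heads = {}
    for row in rows:
        group = row[1]
        if group not in heads or sort_key(row) < sort_key(heads[group]):
            heads[group] = row
    # The surviving group heads are then ordered by the same criteria.
    return sorted(heads.values(), key=sort_key)

result = [(row[0], row[1]) for row in collapse(docs)]
print(result)  # [(7, 'g2'), (3, 'g1')]
```

The comparison is a lexicographic tuple compare, so it does not need to know
the value range of any field, unlike a single combined function value.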

PS:
1. We can also store long/string fields as byte/short ords. For sort
fields where the number of unique values is small (for example, sort by date,
sales rank, etc.), this will result in significant memory savings.
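
A rough sketch of that idea, assuming the distinct values of the sort field
can be enumerated up front: each value is replaced by its rank among the
sorted distinct values, and the narrowest array type that fits is chosen.
The function name compact_ords is hypothetical.

```python
from array import array

def compact_ords(values):
    """Map each value to its rank among the sorted distinct values.

    Ranks preserve sort order, so comparing ranks is equivalent to
    comparing the original values.
    """
    distinct = sorted(set(values))
    rank = {v: i for i, v in enumerate(distinct)}
    if len(distinct) <= 1 << 8:
        typecode = "B"   # unsigned byte, 1 byte per entry
    elif len(distinct) <= 1 << 16:
        typecode = "H"   # unsigned short, 2 bytes on common platforms
    else:
        typecode = "I"   # unsigned int, typically 4 bytes
    return array(typecode, (rank[v] for v in values)), distinct

# Dates stored as longs, but only three distinct values occur.
dates = [20140601, 20140615, 20140601, 20140620]
ords, distinct = compact_ords(dates)
```

With 10 million ords, a byte-wide array would take about 10 MB instead of the
80 MB needed for raw longs.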








On 19 June 2014 19:40, Joel Bernstein <joels...@gmail.com> wrote:

> Umesh, this is a good summary.
>
> So, the question is what is the cost (performance and memory) of having the
> CollapsingQParserPlugin choose the group head by using the Solr sort
> criteria?
>
> Keep in mind that the CollapsingQParserPlugin's main design goal is to
> provide fast performance when collapsing on a high cardinality field. How
> you choose the group head can have a big impact here, both on memory
> consumption and performance.
>
> The function query collapse criteria was added to allow you to come up with
> custom formulas for selecting the group head, with little or no impact on
> performance and memory. Using Solr's recip() function query it seems like
> you could come up with some nice scenarios where two variables could be
> used to select the group head. For example:
>
> fq={!collapse field=a max='sub(prod(cscore(),1000), recip(field(x),1, 1000,
> 1000))'}
>
> This seems like it would basically give you two sort criteria: cscore(),
> which returns the score, would be the primary criteria. The recip of field
> "x" would be the secondary criteria.
>
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Thu, Jun 19, 2014 at 2:18 AM, Umesh Prasad <umesh.i...@gmail.com>
> wrote:
>
> > Continuing the discussion on mailing list from Jira.
> >
> > An Example
> >
> >
> > id      group           f1              f2
> > 1       g1              5               10
> > 2       g1              5               1000
> > 3       g1              5               1000
> > 4       g1              10              100
> > 5       g2              5               10
> > 6       g2              5               1000
> > 7       g2              5               1000
> > 8       g2              10              100
> >
> > sort= f1 asc, f2 desc , id desc
> >
> >
> > *Without collapse will give : *
> > (7,g2), (6,g2),  (3,g1), (2,g1), (5,g2), (1,g1), (8,g2), (4,g1)
> >
> >
> > *On collapsing by group_s  expected output is : *  (7,g2), (3,g1)
> >
> > solr standard collapsing does give this output  with
> > group=on,group.field=group_s,group.main=true
> >
> > * Collapsing with CollapsingQParserPlugin* fq={!collapse field=group_s} :
> >   (5,g2), (1,g1)
> >
> >
> >
> > * Summarizing Jira Discussion :*
> > 1. CollapsingQParserPlugin picks up the group heads from matching results
> > and passes those further. So in essence filtering some of the matching
> > documents, so that subsequent collectors never see them. It can also pass
> > on score to subsequent collectors using a dummy scorer.
> >
> > 2. TopDocCollector comes later in hierarchy and it will sort on the
> > collapsed set. That works fine.
> >
> > The issue is with step 1. Collapsing is done by a single comparator which
> > can take its value from a field or function. It defaults to score.
> > Function queries do allow us to combine multiple fields / value sources,
> > however it would be difficult to construct a function for given sort
> > fields. Primarily because
> >     a) The range of values for a given sort field is not known in advance.
> > It is possible for one sort field to be unbounded while another is bounded
> > within a small range.
> >     b) The sort field can itself hold custom logic.
> >
> > Because of (a) the group head selected by CollapsingQParserPlugin will be
> > incorrect and subsequent sorting will break.
> >
> >
> >
> > On 14 June 2014 12:38, Umesh Prasad <umesh.i...@gmail.com> wrote:
> >
> >> Thanks Joel for the quick response. I have opened a new jira ticket.
> >>
> >> https://issues.apache.org/jira/browse/SOLR-6168
> >>
> >>
> >>
> >>
> >> On 13 June 2014 17:45, Joel Bernstein <joels...@gmail.com> wrote:
> >>
> >>> Let's open a new ticket.
> >>>
> >>> Joel Bernstein
> >>> Search Engineer at Heliosearch
> >>>
> >>>
> >>> On Fri, Jun 13, 2014 at 8:08 AM, Umesh Prasad <umesh.i...@gmail.com>
> >>> wrote:
> >>>
> >>> > The patch in SOLR-5408 fixes the issue with sorting only for two sort
> >>> > fields. Sorting still breaks when 3 or more sort fields are used.
> >>> >
> >>> > I have attached a test case, which demonstrates the broken behavior
> >>> > when 3 sort fields are used.
> >>> >
> >>> > The failing test case patch is against Lucene/Solr 4.7 revision
> >>> > number 1602388
> >>> >
> >>> > Can someone apply and verify the bug ?
> >>> >
> >>> > Also, should I re-open SOLR-5408  or open a new ticket ?
> >>> >
> >>> >
> >>> > ---
> >>> > Thanks & Regards
> >>> > Umesh Prasad
> >>> >
> >>>
> >>
> >>
> >>
> >> --
> >> ---
> >> Thanks & Regards
> >> Umesh Prasad
> >>
> >
> >
> >
> > --
> > ---
> > Thanks & Regards
> > Umesh Prasad
> >
>



-- 
---
Thanks & Regards
Umesh Prasad
