Hi Joel,

I had missed this email because of an issue with my Gmail settings.

The reason CollapsingQParserPlugin is more performant than regular grouping, and what that costs:

1. The QParser refers to global ords for group.field and avoids storing
   strings in a set. This has two advantages:
   a) Memory: storing millions of ints instead of strings results in major
      savings.
   b) No binary search / lookup is necessary when the segment changes, which
      results in huge computation savings.

2. The cost: CollapsingFieldValue has to maintain a score/field value for each
   unique ord.

   Memory requirement = number of ords * size of 1 field value

   The basic types (byte, int, float, long, etc.) will consume reasonable
   memory. String/Text values can be stored as ords and will consume only
   4 bytes each. The memory requirement arises because the arrays are dense
   and are allocated per request.

Taking an example:

   Index size  = 100 million documents
   Unique ords = 10 million
   Sort fields = 4 (1 int field + 1 long field + 2 string/text fields)

   Memory requirement = 40 MB for the int field
                      + 80 MB for the long field
                      + 80 MB for the string ords (2 fields * 40 MB)
                      = 200 MB

I agree that 200 MB per request just for collapsing the search results is huge,
but at least it increases linearly with the number of sort fields. For my use
case, I am willing to pay the linear cost, especially when I can't combine the
sort fields intelligently into a sort function. Plus, it allows me to sort by
String/Text fields as well, which is a big win.

PS:
1. We can also store long/string fields as byte/short ords. For sort fields
   where the number of unique values is small (for example, sort by date,
   sales rank, etc.), this will result in significant memory savings.

On 19 June 2014 19:40, Joel Bernstein <joels...@gmail.com> wrote:

> Umesh, this is a good summary.
>
> So, the question is what is the cost (performance and memory) of having the
> CollapsingQParserPlugin choose the group head by using the Solr sort
> criteria?
>
> Keep in mind that the CollapsingQParserPlugin's main design goal is to
> provide fast performance when collapsing on a high cardinality field. How
> you choose the group head can have a big impact here, both on memory
> consumption and performance.
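
To put rough numbers on the memory side, the 200 MB estimate can be reproduced
with a small back-of-the-envelope script. This is only a sketch: the function
name and the size table are made up for illustration (they are not Solr APIs),
and it assumes one dense array per sort field with one slot per unique ord,
with string/text values held as 4-byte ords.

```python
# Assumed bytes per slot in a dense per-ord array (illustrative, not Solr code).
# String/text sort values are assumed to be stored as 4-byte ords.
BYTES_PER_VALUE = {"int": 4, "float": 4, "long": 8, "string_ord": 4}


def collapse_memory_bytes(unique_ords, sort_field_types):
    """Estimate per-request memory: one dense array per sort field."""
    return sum(unique_ords * BYTES_PER_VALUE[t] for t in sort_field_types)


# 10 million unique ords; 1 int + 1 long + 2 string/text sort fields.
total = collapse_memory_bytes(
    10_000_000, ["int", "long", "string_ord", "string_ord"]
)
print(total / 1_000_000, "MB")  # prints: 200.0 MB
```

Each extra sort field adds one more array of the same length, which is why the
cost grows linearly with the number of sort fields.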
>
> The function query collapse criteria was added to allow you to come up with
> custom formulas for selecting the group head, with little or no impact on
> performance and memory. Using Solr's recip() function query it seems like
> you could come up with some nice scenarios where two variables could be
> used to select the group head. For example:
>
> fq={!collapse field=a max='sub(prod(cscore(),1000), recip(field(x),1, 1000, 1000))'}
>
> This seems like it would basically give you two sort criteria: cscore(),
> which returns the score, would be the primary criteria. The recip of field
> "x" would be the secondary criteria.
>
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Thu, Jun 19, 2014 at 2:18 AM, Umesh Prasad <umesh.i...@gmail.com>
> wrote:
>
> > Continuing the discussion on mailing list from Jira.
> >
> > An Example
> >
> > id   group   f1   f2
> > 1    g1      5    10
> > 2    g1      5    1000
> > 3    g1      5    1000
> > 4    g1      10   100
> > 5    g2      5    10
> > 6    g2      5    1000
> > 7    g2      5    1000
> > 8    g2      10   100
> >
> > sort= f1 asc, f2 desc , id desc
> >
> > *Without collapse will give : *
> > (7,g2), (6,g2), (3,g1), (2,g1), (5,g2), (1,g1), (8,g2), (4,g1)
> >
> > *On collapsing by group_s expected output is : * (7,g2), (3,g1)
> >
> > solr standard collapsing does give this output with
> > group=on,group.field=group_s,group.main=true
> >
> > * Collapsing with CollapsingQParserPlugin* fq={!collapse field=group_s} :
> > (5,g2), (1,g1)
> >
> > * Summarizing Jira Discussion :*
> > 1. CollapsingQParserPlugin picks up the group heads from matching results
> > and passes those further. So in essence it filters some of the matching
> > documents, so that subsequent collectors never see them. It can also pass
> > on score to subsequent collectors using a dummy scorer.
> >
> > 2. TopDocCollector comes later in hierarchy and it will sort on the
> > collapsed set. That works fine.
> >
> > The issue is with step 1.
> > Collapsing is done by a single comparator which
> > can take its value from a field or function. It defaults to score.
> > Function queries do allow us to combine multiple fields / value sources,
> > however it would be difficult to construct a function for given sort
> > fields. Primarily because
> >     a) The range of values for a given sort field is not known in advance.
> > It is possible for one sort field to be unbounded, but the other to be
> > bounded within a small range.
> >     b) The sort field can itself hold custom logic.
> >
> > Because of (a) the group head selected by CollapsingQParserPlugin will be
> > incorrect and subsequent sorting will break.
> >
> >
> > On 14 June 2014 12:38, Umesh Prasad <umesh.i...@gmail.com> wrote:
> >
> >> Thanks Joel for the quick response. I have opened a new jira ticket.
> >>
> >> https://issues.apache.org/jira/browse/SOLR-6168
> >>
> >>
> >> On 13 June 2014 17:45, Joel Bernstein <joels...@gmail.com> wrote:
> >>
> >>> Let's open a new ticket.
> >>>
> >>> Joel Bernstein
> >>> Search Engineer at Heliosearch
> >>>
> >>>
> >>> On Fri, Jun 13, 2014 at 8:08 AM, Umesh Prasad <umesh.i...@gmail.com>
> >>> wrote:
> >>>
> >>> > The patch in SOLR-5408 fixes the issue with sorting only for two sort
> >>> > fields. Sorting still breaks when 3 or more sort fields are used.
> >>> >
> >>> > I have attached a test case, which demonstrates the broken behavior
> >>> > when 3 sort fields are used.
> >>> >
> >>> > The failing test case patch is against Lucene/Solr 4.7 revision number
> >>> > 1602388
> >>> >
> >>> > Can someone apply and verify the bug ?
> >>> >
> >>> > Also, should I re-open SOLR-5408 or open a new ticket ?
> >>> >
> >>> >
> >>> > ---
> >>> > Thanks & Regards
> >>> > Umesh Prasad
> >>> >
> >>>
> >>
> >> --
> >> ---
> >> Thanks & Regards
> >> Umesh Prasad
> >>
> >
> > --
> > ---
> > Thanks & Regards
> > Umesh Prasad
> >

--
---
Thanks & Regards
Umesh Prasad
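
For anyone following the thread, the example from the 19 June message can be
replayed with a small standalone script. It is only a sketch in plain Python,
not Solr code, and all function names are made up for illustration. It assumes
that every document has the same score, so that collapsing on the default
criterion keeps the first document seen in each group; that assumption is one
way to arrive at the (5,g2), (1,g1) result reported above.

```python
# (id, group, f1, f2) rows from the example in this thread.
DOCS = [
    (1, "g1", 5, 10),
    (2, "g1", 5, 1000),
    (3, "g1", 5, 1000),
    (4, "g1", 10, 100),
    (5, "g2", 5, 10),
    (6, "g2", 5, 1000),
    (7, "g2", 5, 1000),
    (8, "g2", 10, 100),
]


def sort_key(doc):
    """Key for sort = f1 asc, f2 desc, id desc (desc fields are negated)."""
    doc_id, _group, f1, f2 = doc
    return (f1, -f2, -doc_id)


def collapse_by_full_sort(docs):
    """Keep, per group, the document that ranks first under the full sort."""
    heads = {}
    for doc in docs:
        group = doc[1]
        if group not in heads or sort_key(doc) < sort_key(heads[group]):
            heads[group] = doc
    return sorted(heads.values(), key=sort_key)


def collapse_first_seen(docs):
    """Assumed stand-in for a tied single criterion: first doc per group wins."""
    heads = {}
    for doc in docs:
        heads.setdefault(doc[1], doc)
    return sorted(heads.values(), key=sort_key)


def ids_and_groups(docs):
    return [(doc[0], doc[1]) for doc in docs]


print(ids_and_groups(sorted(DOCS, key=sort_key)))
print(ids_and_groups(collapse_by_full_sort(DOCS)))  # [(7, 'g2'), (3, 'g1')]
print(ids_and_groups(collapse_first_seen(DOCS)))    # [(5, 'g2'), (1, 'g1')]
```

The second result is the expected output; the third shows how picking the group
head without the full sort criteria leaves the later sort working on the wrong
documents.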