I'm wondering what the expected behavior is for the following scenario...

We receive the same document in multiple formats and we handle this by
grouping, sorting the group by date received, and limiting the group to 1,
resulting in getting the most recent version of a document.

Here is an example. The id field has the form "identifier!date_format":

doc {
 id: doc1!20130618_formatX
 docId: doc1
 dateReceived: 20130618
}

doc {
 id: doc1!20130621_formatY
 docId: doc1
 dateReceived: 20130621
}

doc {
 id: doc2!20130619_formatX
 docId: doc2
 dateReceived: 20130619
}


So in this case we want to group on docId so that all the doc1 docs are
together and all the doc2 docs are together, sort within each group on
dateReceived descending, limit each group to 1 to get the most recent
doc in the group, and then sort the whole result set on dateReceived ascending.
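
Roughly, the request described above would look like the following. This is
only a minimal sketch, assuming Solr's standard result-grouping parameters
(group, group.field, group.sort, group.limit) plus the top-level sort. The
host, port, collection name, q=*:*, and wt=json are placeholders for
illustration, not the actual setup.

```python
from urllib.parse import urlencode

# Placeholder endpoint: host, port, and collection name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_grouped_query_url(base_url=SOLR_SELECT_URL):
    """Build a select URL that asks for the newest doc per docId."""
    params = [
        ("q", "*:*"),                         # placeholder query
        ("group", "true"),                    # turn on result grouping
        ("group.field", "docId"),             # one group per document identifier
        ("group.sort", "dateReceived desc"),  # newest doc first inside each group
        ("group.limit", "1"),                 # keep only the newest doc per group
        ("sort", "dateReceived asc"),         # order of the groups themselves
        ("wt", "json"),                       # placeholder response format
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_grouped_query_url())
```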

So we expect to get:
doc2!20130619_formatX
doc1!20130621_formatY

On a regular single-node Solr instance running Solr 4.3, everything I
described above works perfectly fine. On a sharded configuration with two
nodes, the results are different. It still does the grouping, the sorting
within each group, and the limiting as expected, but the overall ordering
on dateReceived is not the same.

The results end up being:
doc1!20130621_formatY
doc2!20130619_formatX

It seems like this is because the doc1 group has another document with a
dateReceived of 0618, which is somehow being used for the overall sort, with
group.sort and group.limit only being applied after that. Is that right?
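
To make that guess concrete, here is a small plain-Python sketch (not Solr
code, and not a claim about how Solr implements distributed grouping). It
uses the 0618 value mentioned above for the older doc1 document and compares
two ways of ordering the groups: by the newest doc in each group, and by the
earliest doc in each group. The function names are made up for illustration.

```python
from collections import defaultdict

# Sample docs from the example; the older doc1 uses the 0618 date.
docs = [
    {"id": "doc1!20130618_formatX", "docId": "doc1", "dateReceived": 20130618},
    {"id": "doc1!20130621_formatY", "docId": "doc1", "dateReceived": 20130621},
    {"id": "doc2!20130619_formatX", "docId": "doc2", "dateReceived": 20130619},
]


def group_by_doc_id(documents):
    """Collect documents into lists keyed by docId."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["docId"]].append(doc)
    return groups


def newest(group):
    """Return the doc with the latest dateReceived in a group."""
    return max(group, key=lambda d: d["dateReceived"])


def order_by_newest(documents):
    """Pick the newest doc per group, then sort those docs ascending."""
    heads = [newest(g) for g in group_by_doc_id(documents).values()]
    return [d["id"] for d in sorted(heads, key=lambda d: d["dateReceived"])]


def order_by_earliest(documents):
    """Order groups by their earliest doc, then show the newest doc per group."""
    groups = group_by_doc_id(documents).values()
    ordered = sorted(groups, key=lambda g: min(d["dateReceived"] for d in g))
    return [newest(g)["id"] for g in ordered]


if __name__ == "__main__":
    print(order_by_newest(docs))    # matches what we expect
    print(order_by_earliest(docs))  # matches what the sharded setup returns
```

The first ordering reproduces the result we expect, and the second reproduces
what the two-node setup returns, which is why I suspect the older doc1
document is driving the overall sort.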

I realize there could be limitations of grouping and sorting in a sharded
setup, but I wanted to know if this is correct behavior, or if there is
something I am doing wrong.

Any help would be appreciated.

Thanks,

Bryan
