>From my reading of the Solr docs (e.g. 
>https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results 
>and https://cwiki.apache.org/confluence/display/solr/Result+Grouping), I've 
>been under the impression that both result grouping and the collapsing query 
>parser can be used to eliminate duplicates from a result set. In our case we 
>have a deduplication field containing a 'signature' that identifies 
>duplicates; we use our own signature for a variety of reasons tied to complex 
>business requirements.

In a test environment I scattered 15 duplicate records (plus another 10 unique 
records) across a SolrCloud system (Solr 5.2.1) with 4 shards and a 
replication factor of 2. I tried both result grouping and the collapsing query 
parser to remove the duplicates. Result grouping worked as expected; the 
collapsing query parser did not.

With the collapsing query parser, Solr returned one of the duplicate records 
from each shard; that is, I received FOUR duplicate records, and turning on 
debug showed that each of the four came from a different shard. I was 
expecting Solr to collapse the aggregated result and return only ONE of the 
duplicated records across ALL shards. It appears that Solr performs the 
collapse on each individual shard but does NOT repeat the operation on the 
aggregated results from the shards.

I have searched the forums and checked the documentation as carefully as I 
can, and I find no mention of this effect (one record returned per shard) when 
using the collapsing query parser.

Is this a known behavior? Am I just doing something wrong? Am I missing some 
search parameter? Am I simply not understanding correctly how this is supposed 
to work?

For reference, I am including below the search url and the response I received. 
Any insights would be appreciated.

Query: 
http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true

Response (note that dupid_s = 900 is the duplicate value and that I have added 
comments in the output ***<comment>*** pointing out which shard responses came 
from):

{
  "responseHeader":{
    "status":0,
    "QTime":31,
    "params":{
      "debugQuery":"true",
      "indent":"true",
      "q":"*:*",
      "wt":"json",
      "fq":"{!collapse field=dupid_s}",
      "rows":"1000"}},
  "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
      {
        "storeid_s":"1002",
        "dupid_s":"900", ***AcaColl_shard2_replica2***
        "title_pqth":["Dupe Record #2"],
        "_version_":1508241005512491008,
        "indexTime_dt":"2015-07-31T19:25:09.914Z"},
      {
        "storeid_s":"8020",
        "dupid_s":"2005",
        "title_pqth":["Unique Record #5"],
        "_version_":1508241005539753984,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8023",
        "dupid_s":"2008",
        "title_pqth":["Unique Record #8"],
        "_version_":1508241005540802560,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8024",
        "dupid_s":"2009",
        "title_pqth":["Unique Record #9"],
        "_version_":1508241005541851136,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"1007",
        "dupid_s":"900", ***AcaColl_shard4_replica2***
        "title_pqth":["Dupe Record #7"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8016",
        "dupid_s":"2001",
        "title_pqth":["Unique Record #1"],
        "_version_":1508241005526122496,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8019",
        "dupid_s":"2004",
        "title_pqth":["Unique Record #4"],
        "_version_":1508241005528219648,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"1003",
        "dupid_s":"900", ***AcaColl_shard1_replica1***
        "title_pqth":["Dupe Record #3"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"8017",
        "dupid_s":"2002",
        "title_pqth":["Unique Record #2"],
        "_version_":1508241005518782464,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"8018",
        "dupid_s":"2003",
        "title_pqth":["Unique Record #3"],
        "_version_":1508241005519831040,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"1001",
        "dupid_s":"900", ***AcaColl_shard3_replica1***
        "title_pqth":["Dupe Record #1"],
        "_version_":1508241005511442432,
        "indexTime_dt":"2015-07-31T19:25:09.912Z"},
      {
        "storeid_s":"8021",
        "dupid_s":"2006",
        "title_pqth":["Unique Record #6"],
        "_version_":1508241005532413952,
        "indexTime_dt":"2015-07-31T19:25:09.929Z"},
      {
        "storeid_s":"8022",
        "dupid_s":"2007",
        "title_pqth":["Unique Record #7"],
        "_version_":1508241005533462528,
        "indexTime_dt":"2015-07-31T19:25:09.938Z"},
      {
        "storeid_s":"8015",
        "dupid_s":"2010",
        "title_pqth":["Unique Record #10"],
        "_version_":1508241005534511104,
        "indexTime_dt":"2015-07-31T19:25:09.938Z"}]
  }
  ...(debug output omitted)

More background information:

The following lists show the StoreIDs (unique key values) present on each 
shard. The asterisked StoreID is the duplicate (dupid_s=900) record that was 
returned in the response shown above; it is easy to see that exactly one 
duplicate per shard was returned.
=Shard 1 StoreIDs=
*1003
1010
8017
8018

=Shard 2 StoreIDs=
*1002
1004
1005
1006
1011
1015
8020
8023
8024

= Shard 3 StoreIDs=
*1001
1008
1014
8015
8021
8022

= Shard 4 StoreIDs=
*1007
1009
1012
1013
8016
8019
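If the collapse really is applied per shard only, my understanding from the SolrCloud documentation is that these features require all documents sharing a collapse key to live on the same shard, which can be arranged with composite-ID routing: index with document IDs of the form <routeKey>!<id> so that all documents with the same dupid co-locate, making the local collapse globally correct. A minimal sketch of the routing idea follows; it uses md5 purely for illustration in place of Solr's actual hash, so the shard numbers are hypothetical:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(doc_id):
    """Route on the composite-ID prefix (the part before '!'), as in Solr's
    compositeId router. md5 stands in for Solr's real hash for illustration."""
    prefix = doc_id.split("!", 1)[0]
    h = int(hashlib.md5(prefix.encode()).hexdigest(), 16)
    return h % NUM_SHARDS

# All four duplicates share the prefix "900", so they hash identically...
ids = ["900!1001", "900!1002", "900!1003", "900!1007"]
shards_hit = {shard_for(i) for i in ids}
print(len(shards_hit))  # ...and land on a single shard, so the local collapse
                        # leaves exactly one survivor globally
```

If that understanding is right, re-indexing with routed IDs (or otherwise co-locating documents by dupid_s) would be the workaround, at the cost of potentially uneven shard sizes when one signature is very common.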

Any relevant insights that can be offered would be appreciated...
