From my reading of the Solr docs (e.g. https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results and https://cwiki.apache.org/confluence/display/solr/Result+Grouping), I've been under the impression that these two methods (result grouping and the collapsing query parser) can both be used to eliminate duplicates from a result set. In our case, we have a duplication field that contains a 'signature' that identifies duplicates. We use our own signature for a variety of reasons that are tied to complex business requirements.

In a test environment I scattered 15 duplicate records (along with another 10 unique records) across a test system running SolrCloud (Solr version 5.2.1) with 4 shards and a replication factor of 2. I tried both result grouping and the collapsing query parser to remove duplicates. Result grouping worked as expected; the collapsing query parser did not.

With the collapsing query parser, Solr included in the result set one of the duplicate records from each shard. That is, I received FOUR duplicate records, and turning on debug showed that each of the four came from a different shard. I was expecting Solr to do the collapsing on the aggregated result and return only ONE of the duplicated records across ALL shards. It appears that Solr is performing the collapse on each individual shard, but then NOT performing it on the aggregated results from the shards.

I have searched through the forums and checked the documentation as carefully as I can. I find no documentation or mention of this effect (one record being returned per shard) when using the collapsing query parser.

Is this a known behavior? Am I just doing something wrong? Am I missing some search parameter? Am I simply not understanding correctly how this is supposed to work?

For reference, I am including below the search URL and the response I received. Any insights would be appreciated.

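To make my reading of this behavior concrete, here is a small Python sketch. It is not Solr code, only a simulation of "collapse on each shard, then concatenate" versus "collapse once over everything", using the per-shard StoreID lists given at the end of this message. The function names are hypothetical, and keeping the first document seen per signature is an assumption on my part, not a statement about how Solr picks the group head.

```python
# Simulation only (not Solr code). StoreIDs per shard, as listed at the end
# of this message.
SHARDS = {
    "shard1": [1003, 1010, 8017, 8018],
    "shard2": [1002, 1004, 1005, 1006, 1011, 1015, 8020, 8023, 8024],
    "shard3": [1001, 1008, 1014, 8015, 8021, 8022],
    "shard4": [1007, 1009, 1012, 1013, 8016, 8019],
}


def dupid(store_id):
    """Return the dupid_s signature for a StoreID in this test data set."""
    if 1001 <= store_id <= 1015:
        return "900"  # all 15 duplicate records share one signature
    if store_id == 8015:
        return "2010"  # "Unique Record #10" in the response
    return str(2000 + (store_id - 8015))  # 8016 -> 2001 ... 8024 -> 2009


def collapse(store_ids):
    """Keep one StoreID per signature (assumption: the first one seen)."""
    heads = {}
    for sid in store_ids:
        heads.setdefault(dupid(sid), sid)
    return list(heads.values())


# What I appear to be getting: collapse per shard, then merge the shard results.
per_shard = [sid for ids in SHARDS.values() for sid in collapse(ids)]

# What I expected: collapse once over the aggregated documents.
aggregated = collapse([sid for ids in SHARDS.values() for sid in ids])

dupes_per_shard = [sid for sid in per_shard if dupid(sid) == "900"]
print(len(per_shard), dupes_per_shard)  # 14 [1003, 1002, 1001, 1007]
print(len(aggregated))                  # 11
```

The per-shard variant yields 14 documents with four carrying dupid_s = 900, which matches the numFound of 14 in the response, whereas a single collapse over all 25 documents would yield 11.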
Query:

http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true

Response (note that dupid_s = 900 is the duplicate value and that I have added comments in the output ***<comment>*** pointing out which shard responses came from):

{
  "responseHeader":{
    "status":0,
    "QTime":31,
    "params":{
      "debugQuery":"true",
      "indent":"true",
      "q":"*:*",
      "wt":"json",
      "fq":"{!collapse field=dupid_s}",
      "rows":"1000"}},
  "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
      {
        "storeid_s":"1002",
        "dupid_s":"900",   ***AcaColl_shard2_replica2***
        "title_pqth":["Dupe Record #2"],
        "_version_":1508241005512491008,
        "indexTime_dt":"2015-07-31T19:25:09.914Z"},
      {
        "storeid_s":"8020",
        "dupid_s":"2005",
        "title_pqth":["Unique Record #5"],
        "_version_":1508241005539753984,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8023",
        "dupid_s":"2008",
        "title_pqth":["Unique Record #8"],
        "_version_":1508241005540802560,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8024",
        "dupid_s":"2009",
        "title_pqth":["Unique Record #9"],
        "_version_":1508241005541851136,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"1007",
        "dupid_s":"900",   ***AcaColl_shard4_replica2***
        "title_pqth":["Dupe Record #7"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8016",
        "dupid_s":"2001",
        "title_pqth":["Unique Record #1"],
        "_version_":1508241005526122496,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8019",
        "dupid_s":"2004",
        "title_pqth":["Unique Record #4"],
        "_version_":1508241005528219648,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"1003",
        "dupid_s":"900",   ***AcaColl_shard1_replica1***
        "title_pqth":["Dupe Record #3"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"8017",
        "dupid_s":"2002",
        "title_pqth":["Unique Record #2"],
        "_version_":1508241005518782464,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"8018",
        "dupid_s":"2003",
        "title_pqth":["Unique Record #3"],
        "_version_":1508241005519831040,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"1001",
        "dupid_s":"900",   ***AcaColl_shard3_replica1***
        "title_pqth":["Dupe Record #1"],
        "_version_":1508241005511442432,
        "indexTime_dt":"2015-07-31T19:25:09.912Z"},
      {
        "storeid_s":"8021",
        "dupid_s":"2006",
        "title_pqth":["Unique Record #6"],
        "_version_":1508241005532413952,
        "indexTime_dt":"2015-07-31T19:25:09.929Z"},
      {
        "storeid_s":"8022",
        "dupid_s":"2007",
        "title_pqth":["Unique Record #7"],
        "_version_":1508241005533462528,
        "indexTime_dt":"2015-07-31T19:25:09.938Z"},
      {
        "storeid_s":"8015",
        "dupid_s":"2010",
        "title_pqth":["Unique Record #10"],
        "_version_":1508241005534511104,
        "indexTime_dt":"2015-07-31T19:25:09.938Z"}]
  },

More background information:

The following lists show the StoreIDs (unique key values) present on each shard. The asterisked StoreID is the one that was returned in the response shown above. Easy to see that one record per shard was returned.

=Shard 1 StoreIDs=
*1003 1010 8017 8018

=Shard 2 StoreIDs=
*1002 1004 1005 1006 1011 1015 8020 8023 8024

=Shard 3 StoreIDs=
*1001 1008 1014 8015 8021 8022

=Shard 4 StoreIDs=
*1007 1009 1012 1013 8016 8019

Any relevant insights that can be offered would be appreciated...