I have a SolrCloud setup with two shards.
When I use query.set("fq", "{!collapse field=title_s}"); the results
still contain duplicates, apparently because of the sharding.
Example:
{status=0,QTime=1141,params={fl=id,code_s,issuedate_tdt,pageno_i,subhead_s,title_s,type_s,citation_articleTitle_s,citation_articlePageNo_i,citation_corp_s,citation_publicationCode_s,citation_issn_s,citation_articleId_i,citation_scPublicationCode_i,citation_publicationTitle_s,citation_articleIssueDate_dt,score,df=[plain_abstract_en,
plain_title_en,
plain_subhead_en],debugQuery=false,uf=-*,start=0,q={!boost
b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}"dog
cancer",bf=[plain_title_en, plain_subhead_en],wt=javabin,fq={!collapse
field=title_s},version=2,defType=edismax,rows=5}}
{"articleid":573891,"code":"LB","formattedCitation":"(2004-07-04),
Plasmid-based hormone therapy could improve disease or age-induced
wasting, <i>Lab Business Week</i>, 14, ISSN:
1552-647X","issuedate":1088913600000,"pageno":14,"score":0.04496974,"subhead":"ADViSYS
Inc.","title":"Plasmid-based hormone therapy could improve disease or
age-induced wasting","type":"PressRelease","weight":0}
{"articleid":574262,"code":"NH","formattedCitation":"(2004-07-04),
Plasmid-based hormone therapy could improve disease or age-induced
wasting, <i>Nursing Home & Elder Business Week</i>, 2, ISSN:
1552-2571","issuedate":1088913600000,"pageno":2,"score":0.044759396,"subhead":"ADViSYS
Inc.","title":"Plasmid-based hormone therapy could improve disease or
age-induced wasting","type":"PressRelease","weight":0}
FACET COUNTS: subhead_s: ADViSYS Inc. -> 2
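
My understanding of why both documents above survive (an assumption on my part, not something I have verified in the Solr source): the collapse filter runs on each shard independently, so if two articles with the same title_s sit on different shards, each shard keeps its own copy and the merged list contains both. Below is a minimal plain-Java sketch of that behaviour with no Solr involved. The Doc class and the collapse/merge helpers are made up for illustration; only the ids, title, and scores come from the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ShardCollapseSketch {

    /** Hypothetical stand-in for a Solr document: id, collapse key, score. */
    static final class Doc {
        final int id;
        final String title;
        final double score;

        Doc(int id, String title, double score) {
            this.id = id;
            this.title = title;
            this.score = score;
        }
    }

    /** Keep only the highest-scoring document per title within ONE shard. */
    static List<Doc> collapse(List<Doc> shardDocs) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : shardDocs) {
            Doc current = best.get(d.title);
            if (current == null || d.score > current.score) {
                best.put(d.title, d);
            }
        }
        return new ArrayList<>(best.values());
    }

    /** Merge already-collapsed shard results by score, without de-duplicating again (assumed behaviour). */
    static List<Doc> merge(List<List<Doc>> perShard) {
        List<Doc> merged = new ArrayList<>();
        for (List<Doc> shardResult : perShard) {
            merged.addAll(shardResult);
        }
        merged.sort(Comparator.comparingDouble((Doc d) -> d.score).reversed());
        return merged;
    }

    public static void main(String[] args) {
        String title = "Plasmid-based hormone therapy could improve disease or age-induced wasting";
        List<Doc> shard1 = Arrays.asList(new Doc(573891, title, 0.04496974));
        List<Doc> shard2 = Arrays.asList(new Doc(574262, title, 0.044759396));

        // Same title on two different shards: each shard keeps its own copy.
        List<Doc> split = merge(Arrays.asList(collapse(shard1), collapse(shard2)));
        System.out.println("split across shards: " + split.size() + " docs");

        // Same two documents on a single shard: only the best one survives.
        List<Doc> oneShard = new ArrayList<>(shard1);
        oneShard.addAll(shard2);
        List<Doc> colocated = merge(Arrays.asList(collapse(oneShard), collapse(new ArrayList<Doc>())));
        System.out.println("co-located: " + colocated.size() + " docs");
    }
}
```

If this picture is right, collapsing can only remove a duplicate when every document sharing a title_s value ends up on the same shard.
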
If I instead use:
query.set("group", "true");
query.set("group.field", "title_s");
query.set("group.main", "true");
query.set("group.truncate", "true");
query.set("group.facet", "true");
I receive back:
{status=0,QTime=72,params={uf=-*,group.main=true,wt=javabin,group.facet=true,version=2,rows=5,defType=edismax,fl=id,code_s,issuedate_tdt,pageno_i,subhead_s,title_s,type_s,citation_articleTitle_s,citation_articlePageNo_i,citation_corp_s,citation_publicationCode_s,citation_issn_s,citation_articleId_i,citation_scPublicationCode_i,citation_publicationTitle_s,citation_articleIssueDate_dt,score,debugQuery=false,df=[plain_abstract_en,
plain_title_en, plain_subhead_en],start=0,q={!boost
b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}"dog
cancer",group.truncate=true,bf=[plain_title_en,
plain_subhead_en],group.field=title_s,group=true}}
{"articleid":573891,"code":"LB","formattedCitation":"(2004-07-04),
Plasmid-based hormone therapy could improve disease or age-induced
wasting, <i>Lab Business Week</i>, 14, ISSN:
1552-647X","issuedate":1088913600000,"pageno":14,"score":0.04494973,"subhead":"ADViSYS
Inc.","title":"Plasmid-based hormone therapy could improve disease or
age-induced wasting","type":"PressRelease","weight":0}
FACET COUNTS: subhead_s: ADViSYS Inc. -> 2
If I combine the two, I get no results back at all. I was trying
to combine them because I will be searching 45+ million records, with
title_s duplicated by a factor of roughly 10.
Response header for the combined request:
{status=0,QTime=1103,params={facet=true,facet.mincount=1,uf=-*,facet.limit=10,group.main=true,wt=javabin,group.facet=true,version=2,rows=5,defType=edismax,fl=id,code_s,issuedate_tdt,pageno_i,subhead_s,title_s,type_s,citation_articleTitle_s,citation_articlePageNo_i,citation_corp_s,citation_publicationCode_s,citation_issn_s,citation_articleId_i,citation_scPublicationCode_i,citation_publicationTitle_s,citation_articleIssueDate_dt,score,debugQuery=false,df=[plain_abstract_en,
plain_title_en, plain_subhead_en],start=0,q={!boost
b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}"dog
cancer",group.truncate=true,bf=[plain_title_en,
plain_subhead_en],group.field=title_s,facet.field=subhead_s,group=true,fq={!collapse
field=title_s}}}