Hi Yonik:

On Jul 17, 2011, at 9:30 AM, Yonik Seeley wrote:

> On Sun, Jul 17, 2011 at 10:38 AM, Jeff Schmidt <j...@535consulting.com> wrote:
>> I don't want to query for a particular facet value, but rather have Solr do
>> a grouping of facet values. I'm not sure about the appropriate nomenclature
>> there. But, I have a multi-valued field named "process" that can have values
>> such as "catalysis", "activation", "inhibition", "expression",
>> "modification", "reaction" etc. About ~100K documents are indexed where
>> this field may have none or one or more of these processes.
>>
>> When the client makes a request, I need to tell it that for the process
>> "catalysis", refer to documents 1,5,6,8,32 etc., and for "modification",
>> documents 43545,22,2134, etc.
>
> This sounds like grouping:
> http://wiki.apache.org/solr/FieldCollapsing
>
> Unfortunately it only works on single values fields, and you can't
> sort based on numbers of matches either.

Oh man, so close! That looks very usable for dealing with my problem, well, except for the multi-valued fields thing... :(

> The closest you can get today is to issue 2 requests... the first a
> faceting request to get the top constraints, and then a second that
> uses group.query for each constraint you are interested in.

Hmm, this gets onerous rather quickly. I need to get the document IDs for all (non-zero count) facet values, not just the top ones. Still, I can see where you're going with this.

For example, I issue the faceting query to learn all relevant values for the disease facet:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">15</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="fl">id</str>
      <str name="facet.mincount">1</str>
      <str name="q.alt">*:*</str>
      <str name="facet.field">n_cellreg_diseaseExact</str>
      <str name="qt">partner-xyz</str>
      <str name="fq">n_pathway_id:ING\:ci0</str>
      <str name="rows">0</str>
    </lst>
  </lst>
  <result name="response" numFound="59" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="n_cellreg_diseaseExact">
        <int name="hypertrophy">29</int>
        <int name="neoplasia">26</int>
        <int name="cancer">21</int>
        <int name="tumorigenesis">21</int>
        <int name="insulin resistance">18</int>
        <int name="anaphylaxis">15</int>
        <int name="infection by Vaccinia virus WR">15</int>
        ...
        <int name="autosomal recessive polycystic kidney disease">2</int>
        <int name="bone cancer">2</int>
        <int name="carcinoma">2</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>

Note the "..." in the list: there are actually 100 diseases returned for this one facet (of five). The filter query on n_pathway_id defines a set of documents that represent nodes on a biological pathway.

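To make the two-request idea concrete, here is a rough Python sketch of the first half: pull the facet values out of a response like the one above and turn each into a group.query parameter for the follow-up request. The sample XML is a trimmed copy of the response above; the helper names (facet_values, group_query_params) and the phrase-quoting of values are assumptions for illustration, not anything Solr or SolrJ provides.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Trimmed copy of the faceting response shown above (three of the values).
SAMPLE_FACET_RESPONSE = """<response>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="n_cellreg_diseaseExact">
        <int name="hypertrophy">29</int>
        <int name="neoplasia">26</int>
        <int name="insulin resistance">18</int>
      </lst>
    </lst>
  </lst>
</response>"""


def facet_values(xml_text, field):
    """Return [(value, count), ...] for one facet field, in response order."""
    root = ET.fromstring(xml_text)
    path = (
        "./lst[@name='facet_counts']/lst[@name='facet_fields']"
        f"/lst[@name='{field}']/int"
    )
    return [(node.get("name"), int(node.text)) for node in root.findall(path)]


def group_query_params(field, values, fq, group_limit):
    """Build request parameters with one group.query per facet value.

    Values are phrase-quoted (an assumption) so that multi-word values
    such as "insulin resistance" stay a single term.
    """
    def as_query(value):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'{field}:"{escaped}"'

    params = [
        ("q.alt", "*:*"),
        ("fl", "id"),
        ("fq", fq),
        ("group", "true"),
        ("group.limit", str(group_limit)),
    ]
    params += [("group.query", as_query(value)) for value in values]
    return params


counts = facet_values(SAMPLE_FACET_RESPONSE, "n_cellreg_diseaseExact")
params = group_query_params(
    "n_cellreg_diseaseExact",
    [value for value, _ in counts],
    r"n_pathway_id:ING\:ci0",
    group_limit=max(count for _, count in counts),
)
query_string = urlencode(params)  # repeated group.query keys are preserved
```

Setting group.limit to the largest facet count is one way to ask for every document ID in each group rather than just the first few. With 100 values per facet the query string gets long, so sending the parameters in a POST body may be preferable to a GET URL.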
Using just the top three values for that particular facet, the grouping query gives me what I want:

<?xml version="1.0"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="fl">id</str>
      <str name="group.limit">3</str>
      <str name="q.alt">*:*</str>
      <arr name="group.query">
        <str>n_cellreg_diseaseExact:hypertrophy</str>
        <str>n_cellreg_diseaseExact:neoplasia</str>
        <str>n_cellreg_diseaseExact:cancer</str>
      </arr>
      <str name="group">true</str>
      <str name="qt">partner-xyz</str>
      <str name="fq">n_pathway_id:ING\:ci0</str>
    </lst>
  </lst>
  <lst name="grouped">
    <lst name="n_cellreg_diseaseExact:hypertrophy">
      <int name="matches">59</int>
      <result name="doclist" numFound="29" start="0">
        <doc><str name="id">ING:5z7</str></doc>
        <doc><str name="id">ING:61b</str></doc>
        <doc><str name="id">ING:6ii</str></doc>
      </result>
    </lst>
    <lst name="n_cellreg_diseaseExact:neoplasia">
      <int name="matches">59</int>
      <result name="doclist" numFound="26" start="0">
        <doc><str name="id">ING:61b</str></doc>
        <doc><str name="id">ING:6ii</str></doc>
        <doc><str name="id">ING:592</str></doc>
      </result>
    </lst>
    <lst name="n_cellreg_diseaseExact:cancer">
      <int name="matches">59</int>
      <result name="doclist" numFound="21" start="0">
        <doc><str name="id">ING:5fz</str></doc>
        <doc><str name="id">ING:61b</str></doc>
        <doc><str name="id">ING:6ii</str></doc>
      </result>
    </lst>
  </lst>
</response>

So, for each value of the facet, I get the document IDs (only the first three per group here, because of group.limit=3). But if I want this for all 100 diseases, I need to add 100 group.query parameters. Is that a problem, other than URL length? I also have other facets that can have a large number of values with non-zero counts.

Also, it seems SolrJ 3.3.0 does not support grok'ing the group query response. Just for grins I did try using group.field, and like you said, Solr does not like that on multi-valued fields. :) I guess I'll have to keep thinking on this one.

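Since SolrJ 3.3.0 does not seem to parse the grouped response, one workaround is to read the XML directly. The following is a minimal Python sketch, assuming the response has the shape shown above; the sample is a trimmed copy of that response (two of the three groups), and the function name doc_ids_by_group is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the grouped response shown above (two of the three groups).
SAMPLE_GROUPED_RESPONSE = """<response>
  <lst name="grouped">
    <lst name="n_cellreg_diseaseExact:hypertrophy">
      <int name="matches">59</int>
      <result name="doclist" numFound="29" start="0">
        <doc><str name="id">ING:5z7</str></doc>
        <doc><str name="id">ING:61b</str></doc>
        <doc><str name="id">ING:6ii</str></doc>
      </result>
    </lst>
    <lst name="n_cellreg_diseaseExact:cancer">
      <int name="matches">59</int>
      <result name="doclist" numFound="21" start="0">
        <doc><str name="id">ING:5fz</str></doc>
        <doc><str name="id">ING:61b</str></doc>
        <doc><str name="id">ING:6ii</str></doc>
      </result>
    </lst>
  </lst>
</response>"""


def doc_ids_by_group(xml_text):
    """Map each group.query string to its numFound and returned document IDs."""
    root = ET.fromstring(xml_text)
    groups = {}
    for group in root.findall("./lst[@name='grouped']/lst"):
        doclist = group.find("./result[@name='doclist']")
        ids = [node.text for node in doclist.findall("./doc/str[@name='id']")]
        groups[group.get("name")] = {
            "numFound": int(doclist.get("numFound")),
            "ids": ids,
        }
    return groups


groups = doc_ids_by_group(SAMPLE_GROUPED_RESPONSE)
for query, info in groups.items():
    # A gap between numFound and len(ids) means group.limit truncated the list.
    print(query, info["numFound"], info["ids"])
```

Comparing numFound with the number of IDs actually returned is a cheap check that group.limit was high enough to cover every document in the group.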
If perchance I get inspired to look at the Solr source code for how the facet counts are calculated, to see if the document IDs can be made available, can you help me narrow down where I should be looking? Or, better yet, do you have any idea when group.field will support multi-valued fields?

Thanks!

Jeff
--
Jeff Schmidt
535 Consulting
j...@535consulting.com
http://www.535consulting.com
(650) 423-1068