Re: Document IDs instead of count for facets?

Jeff Schmidt Mon, 25 Jul 2011 21:21:17 -0700

Hi Yonik:

On Jul 17, 2011, at 9:30 AM, Yonik Seeley wrote:


> On Sun, Jul 17, 2011 at 10:38 AM, Jeff Schmidt <j...@535consulting.com> wrote:
>> I don't want to query for a particular facet value, but rather have Solr do 
>> a grouping of facet values. I'm not sure about the appropriate nomenclature 
>> there. But, I have a multi-valued field named "process" that can have values 
>> such as "catalysis", "activation", "inhibition", "expression", 
>> "modification", "reaction" etc.  About ~100K documents are indexed where 
>> this field may have none or one or more of these processes.
>> 
>> When the client makes a request, I need to tell it that for the process 
>> "catalysis", refer to documents 1,5,6,8,32 etc., and for "modification", 
>> documents 43545,22,2134, etc.
> 
> This sounds like grouping:
> http://wiki.apache.org/solr/FieldCollapsing
> 
> Unfortunately it only works on single-valued fields, and you can't
> sort based on numbers of matches either.

Oh man, so close!  That looks very usable for dealing with my problem, well 
except for the multi-valued fields thing... :(

> The closest you can get today is to issue 2 requests... the first a
> faceting request to get the top constraints, and then a second that
> uses group.query for each constraint you are interested in.

Hmm, this gets onerous rather quickly. I need to get the document IDs for all 
(non-zero count) facet values, not just the top ones. I can see where you're 
going with this.  For example, I issue the faceting query to learn all relevant 
values for the disease facet:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">15</int>
        <lst name="params">
            <str name="facet">true</str>
            <str name="fl">id</str>
            <str name="facet.mincount">1</str>
            <str name="q.alt">*:*</str>
            <str name="facet.field">n_cellreg_diseaseExact</str>
            <str name="qt">partner-xyz</str>
            <str name="fq">n_pathway_id:ING\:ci0</str>
            <str name="rows">0</str>
        </lst>
    </lst>
    <result name="response" numFound="59" start="0"/>
    <lst name="facet_counts">
        <lst name="facet_queries"/>
        <lst name="facet_fields">
            <lst name="n_cellreg_diseaseExact">
                <int name="hypertrophy">29</int>
                <int name="neoplasia">26</int>
                <int name="cancer">21</int>
                <int name="tumorigenesis">21</int>
                <int name="insulin resistance">18</int>
                <int name="anaphylaxis">15</int>
                <int name="infection by Vaccinia virus WR">15</int>
                ...
                <int name="autosomal recessive polycystic kidney disease">2</int>
                <int name="bone cancer">2</int>
                <int name="carcinoma">2</int>
            </lst>
        </lst>
        <lst name="facet_dates"/>
        <lst name="facet_ranges"/>
    </lst>
</response>

Note the "...". There are actually 100 diseases returned for this one facet 
(of five). The filter query on n_pathway_id defines a set of documents that 
represent nodes on a biological pathway.  Using just the top three values for 
that particular facet, the grouping query gives me what I want:

<?xml version="1.0"?>
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="fl">id</str>
            <str name="group.limit">3</str>
            <str name="q.alt">*:*</str>
            <arr name="group.query">
                <str>n_cellreg_diseaseExact:hypertrophy</str>
                <str>n_cellreg_diseaseExact:neoplasia</str>
                <str>n_cellreg_diseaseExact:cancer</str>
            </arr>
            <str name="group">true</str>
            <str name="qt">partner-xyz</str>
            <str name="fq">n_pathway_id:ING\:ci0</str>
        </lst>
    </lst>
    <lst name="grouped">
        <lst name="n_cellreg_diseaseExact:hypertrophy">
            <int name="matches">59</int>
            <result name="doclist" numFound="29" start="0">
                <doc>
                    <str name="id">ING:5z7</str>
                </doc>
                <doc>
                    <str name="id">ING:61b</str>
                </doc>
                <doc>
                    <str name="id">ING:6ii</str>
                </doc>
            </result>
        </lst>
        <lst name="n_cellreg_diseaseExact:neoplasia">
            <int name="matches">59</int>
            <result name="doclist" numFound="26" start="0">
                <doc>
                    <str name="id">ING:61b</str>
                </doc>
                <doc>
                    <str name="id">ING:6ii</str>
                </doc>
                <doc>
                    <str name="id">ING:592</str>
                </doc>
            </result>
        </lst>
        <lst name="n_cellreg_diseaseExact:cancer">
            <int name="matches">59</int>
            <result name="doclist" numFound="21" start="0">
                <doc>
                    <str name="id">ING:5fz</str>
                </doc>
                <doc>
                    <str name="id">ING:61b</str>
                </doc>
                <doc>
                    <str name="id">ING:6ii</str>
                </doc>
            </result>
        </lst>
    </lst>
</response>
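
Pulling the IDs back out of a grouped response like this one can be done by hand with any XML parser. Here is a minimal Python sketch, assuming the XML response format shown above; the function name ids_by_group and the trimmed sample document are made up for illustration and are not part of Solr or SolrJ.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the grouped response above (two of the three groups).
GROUPED_XML = """<response>
  <lst name="grouped">
    <lst name="n_cellreg_diseaseExact:hypertrophy">
      <int name="matches">59</int>
      <result name="doclist" numFound="29" start="0">
        <doc><str name="id">ING:5z7</str></doc>
        <doc><str name="id">ING:61b</str></doc>
        <doc><str name="id">ING:6ii</str></doc>
      </result>
    </lst>
    <lst name="n_cellreg_diseaseExact:neoplasia">
      <int name="matches">59</int>
      <result name="doclist" numFound="26" start="0">
        <doc><str name="id">ING:61b</str></doc>
        <doc><str name="id">ING:6ii</str></doc>
        <doc><str name="id">ING:592</str></doc>
      </result>
    </lst>
  </lst>
</response>"""


def ids_by_group(xml_text):
    """Map each group.query string to the document IDs listed under it."""
    root = ET.fromstring(xml_text)
    mapping = {}
    # Each child <lst> of <lst name="grouped"> is one group.query result.
    for group in root.findall("./lst[@name='grouped']/lst"):
        id_nodes = group.findall("./result[@name='doclist']/doc/str[@name='id']")
        mapping[group.get("name")] = [node.text for node in id_nodes]
    return mapping


groups = ids_by_group(GROUPED_XML)
for query, ids in groups.items():
    print(query, ids)
```

Note that the doclist only carries as many documents as group.limit allows (3 here), so group.limit would have to be at least as large as the biggest facet count to see every ID.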

So, for each value of the facet, there are the document IDs.  But, if I want 
this for all 100 diseases, I need to add 100 group.query parameters.  Is that a 
problem, other than URL length? But, I have other facets that can also have a 
large number of values with non-zero counts. Also, it seems SolrJ 3.3.0 does 
not support grok'ing the group query response. 
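
To make the two-request idea concrete, here is a rough Python sketch of how the second request's parameters might be generated from the first (faceting) response. It assumes the XML format shown above; the helper names facet_values and group_query_params are hypothetical, and wrapping each value in double quotes is my own assumption for handling values with spaces such as "insulin resistance". Sending the encoded string as a POST body instead of in the URL should sidestep the URL length concern.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Trimmed copy of the faceting response shown earlier.
FACET_XML = """<response>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="n_cellreg_diseaseExact">
        <int name="hypertrophy">29</int>
        <int name="neoplasia">26</int>
        <int name="insulin resistance">18</int>
      </lst>
    </lst>
  </lst>
</response>"""


def facet_values(xml_text, field):
    """Return the facet values Solr listed for `field`, in response order."""
    root = ET.fromstring(xml_text)
    path = f".//lst[@name='facet_fields']/lst[@name='{field}']/int"
    return [node.get("name") for node in root.findall(path)]


def group_query_params(field, values, fq, group_limit):
    """Build the parameter list for the grouping request, one group.query per value."""
    params = [
        ("qt", "partner-xyz"),
        ("q.alt", "*:*"),
        ("fl", "id"),
        ("fq", fq),
        ("group", "true"),
        ("group.limit", str(group_limit)),
    ]
    for value in values:
        # Assumption: quote each value as a phrase, escaping backslashes and quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("group.query", f'{field}:"{escaped}"'))
    return params


values = facet_values(FACET_XML, "n_cellreg_diseaseExact")
params = group_query_params(
    "n_cellreg_diseaseExact", values, r"n_pathway_id:ING\:ci0", group_limit=100
)
body = urlencode(params)  # form-encoded string, usable as a POST body
print(body)
```

With 100 diseases this produces 100 group.query parameters in one request, and the same loop could be repeated per facet field.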

Just for grins I did try using group.field, and like you said, Solr does not 
like that on multi-valued fields. :) I guess I'll have to keep thinking on this 
one.  If by chance I get inspired to look at the Solr source code for how the 
facet counts are calculated to see if the document IDs can be made available, 
can you help localize where I should be looking?  Or, better yet, do you have 
any idea when group.field will support multi-valued fields?

Thanks!

Jeff
--
Jeff Schmidt
535 Consulting
j...@535consulting.com
http://www.535consulting.com
(650) 423-1068
