how do do most efficient: collapsing facets into top-N results

Britske Thu, 13 Dec 2007 07:48:21 -0800

I've subclassed StandardRequestHandler to be able to show top-N results for
some of the facet values that I'm interested in. The functionality resembles
the SOLR-236 field collapsing a bit, with the difference that I can
arbitrarily specify which facet query to collapse and to what extent
(N can be specified independently).


The code for this is now quite simple, but (maybe because of that) I've got
the feeling that it can be optimized quite a bit. The question is how? 

First some explanation and code:

I extended StandardRequestHandler and call
super.handleRequestBody(req, rsp) to fetch the facet query results.
From that I copy the facets that I wish to collapse to a NamedList
(facetresults) and execute the code below, which basically splits a (possibly
combined) facet query into independent queries that are added to an FQ list.
That FQ list is added to the original query and its FQ list, and the new query
is executed.

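// query, sort, start, rows and fqsExisting come from the original request.
// The searcher is the same for every facet, so fetch it once.
SolrIndexSearcher searcher = req.getSearcher();

for (int i = 0; i < facetresults.size(); i++) {
    // A combined facet query such as "a:[0 TO 50] + b:1" is split on '+'
    // into separate filter queries.
    List<Query> fqList = new ArrayList<Query>();
    for (String part : facetresults.getName(i).split("[+]")) {
        fqList.add(QueryParsing.parseQuery(part.trim(), req.getSchema()));
    }
    fqList.addAll(fqsExisting); // filters of the original request

    // Same query, sort, start and rows as the overall request; only the
    // filters differ.
    DocList docList = searcher.getDocList(query, fqList, sort, start, rows, 0);

    // Replace the facet count with the top-N documents for this facet.
    NamedList facetValue = new SimpleOrderedMap();
    facetValue.add("results", docList);
    facetresults.setVal(i, facetValue);
}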

This all works okay, but I still think there must be a better way
than executing queries over and over again when only the FQs
differ: Q and Sort are the same for each per-facet query as for
the overall query that has already been executed.

Obviously, intersecting with the original result would be by far the
fastest solution, but Mike mentioned that this isn't doable, since the
overall sorted result list is not available. See:
http://www.nabble.com/showing-results-per-facet-value-efficiently-to13133815.html
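
To make the intersection idea concrete, here is a minimal sketch in plain Java. It uses java.util.BitSet as a stand-in for a set of doc ids rather than Solr's own classes, and the class and method names are made up for illustration. It also shows the limitation mentioned above: the intersection comes back in doc-id order, so by itself it cannot give the top-N under the requested sort.

```java
import java.util.BitSet;

/** Hypothetical illustration only; this is not Solr's DocSet API. */
final class DocSetIntersection {

    /** Returns the doc ids present in both sets without modifying either input. */
    static BitSet intersect(BitSet baseResult, BitSet facetFilter) {
        BitSet out = (BitSet) baseResult.clone();
        out.and(facetFilter);
        return out;
    }

    public static void main(String[] args) {
        // Docs matching Q plus the existing FQs (made-up ids).
        BitSet baseResult = new BitSet();
        for (int doc : new int[] {1, 4, 7, 9}) baseResult.set(doc);

        // Docs matching one facet query (made-up ids).
        BitSet facetFilter = new BitSet();
        for (int doc : new int[] {4, 5, 9}) facetFilter.set(doc);

        BitSet collapsed = intersect(baseResult, facetFilter);
        // Iteration is in ascending doc-id order, which says nothing about
        // the sort order of the original query.
        System.out.println(collapsed + " size=" + collapsed.cardinality());
    }
}
```

The intersection itself is cheap, but the sort still has to be applied to the surviving docs, which is where the per-facet queries spend their time.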

Is there anything else I can do to speed up the queries?

For reference, I'm now seeing 15-16 ms for each executed query that is not in
the query cache.
This seems independent of whether or not the FQs are already in the filter
cache, which strikes me as odd.

For example, see the timings of the collapsed facet queries below
(together they make up 1 call to Solr). Tested on an unwarmed Solr server with
20,000 docs, an Intel Core 2 Duo at 2 GHz, and 800 MB of RAM assigned to Solr.

15 : ms for: _ddp_p_dc_dc_2_dc_dc:[0 TO 50]
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[51 TO 100]
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[101 TO 200]
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[201 TO 300]
16 : ms for: idA:2140479
15 : ms for: idA:1456928
16 : ms for: idA:2601889
0 : ms for: _ddp_p_dc_dc_2_dc_dc:[0 TO 50]
0 : ms for: _ddp_p_dc_dc_2_dc_dc:[51 TO 100]
0 : ms for: _ddp_p_dc_dc_2_dc_dc:[101 TO 200]
0 : ms for: _ddp_p_dc_dc_2_dc_dc:[201 TO 300]
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[0 TO 50] + idA:2140479
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[0 TO 50] + idA:1456928
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[0 TO 50] + idA:2601889
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[51 TO 100] + idA:2140479
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[51 TO 100] + idA:1456928
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[51 TO 100] + idA:2601889
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[101 TO 200] + idA:2140479
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[101 TO 200] + idA:1456928
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[101 TO 200] + idA:2601889
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[201 TO 300] + idA:2140479
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[201 TO 300] + idA:1456928
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[201 TO 300] + idA:2601889
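
A side note on the numbers above: every value is 0, 15, or 16 ms. The post does not say how the times were taken, so this is only an assumption, but if System.currentTimeMillis() was used, its granularity depends on the operating system and can be in the tens of milliseconds, which could by itself produce readings like these. Repeating the measurement with System.nanoTime() would rule that out. A minimal sketch (the Timing class and simulatedQuery() are hypothetical stand-ins, not part of the handler):

```java
/** Hypothetical timing helper based on System.nanoTime(). */
final class Timing {

    /** Runs the task once and returns the elapsed time in microseconds. */
    static long elapsedMicros(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000L;
    }

    /** Stand-in for one getDocList call; just burns a little CPU. */
    static long simulatedQuery() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        long micros = elapsedMicros(new Runnable() {
            @Override public void run() { simulatedQuery(); }
        });
        System.out.println("elapsed: " + micros + " us");
    }
}
```

In the handler, the body of the Runnable would be the getDocList call for one facet.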
 
The strange thing here is that, for example, the query:

_ddp_p_dc_dc_2_dc_dc:[0 TO 50] + idA:2140479

takes 15 ms,
although its independent parts:
-  _ddp_p_dc_dc_2_dc_dc:[0 TO 50]
-  idA:2140479

have already been executed (they also took 15/16 ms).

So all FQs for _ddp_p_dc_dc_2_dc_dc:[0 TO 50] + idA:2140479 must be in the
filter cache, and hence I'd expect the query to execute quicker than the very
first query,
_ddp_p_dc_dc_2_dc_dc:[0 TO 50], for which the FQ wasn't in the filter cache
at that moment.

So to summarize my 2 questions:
1. Is there any way to get better performance for what I'm trying to achieve?
Perhaps a custom HitCollector or something?
2. Do you have any explanation for the fact that the filter cache doesn't
seem to matter for executing the queries?
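
To illustrate what the custom collector from question 1 might look like, here is a rough sketch in plain Java. It is not wired into Lucene or Solr: the class, its methods, and the use of java.util.BitSet for the per-facet filters are all assumptions made for the example. It also simplifies in two ways: it ranks by score only, and it uses one N for all facets. The idea is to run the main query once and, for every hit, offer the doc to a bounded heap for each facet whose filter contains it.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch: one pass over the hits, one bounded heap per facet. */
final class TopNPerFacetCollector {

    /** A hit and the score it is ranked by (higher score ranks first). */
    static final class Hit implements Comparable<Hit> {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
        @Override public int compareTo(Hit other) {
            int c = Float.compare(score, other.score);
            // On equal scores the lower doc id counts as the better hit.
            return c != 0 ? c : Integer.compare(other.doc, doc);
        }
    }

    private final Map<String, BitSet> facetFilters;
    private final Map<String, PriorityQueue<Hit>> heaps = new LinkedHashMap<String, PriorityQueue<Hit>>();
    private final int n;

    TopNPerFacetCollector(Map<String, BitSet> facetFilters, int n) {
        this.facetFilters = facetFilters;
        this.n = n;
        for (String name : facetFilters.keySet()) {
            heaps.put(name, new PriorityQueue<Hit>()); // min-heap: worst hit on top
        }
    }

    /** Called once per hit of the main query, in any order. */
    void collect(int doc, float score) {
        Hit hit = new Hit(doc, score);
        for (Map.Entry<String, BitSet> e : facetFilters.entrySet()) {
            if (!e.getValue().get(doc)) continue; // doc is not in this facet
            PriorityQueue<Hit> heap = heaps.get(e.getKey());
            heap.offer(hit);
            if (heap.size() > n) heap.poll(); // drop the current worst hit
        }
    }

    /** Doc ids of the best n hits collected for a facet, best first. */
    List<Integer> topDocs(String facet) {
        List<Hit> hits = new ArrayList<Hit>(heaps.get(facet));
        Collections.sort(hits, Collections.reverseOrder());
        List<Integer> docs = new ArrayList<Integer>();
        for (Hit h : hits) docs.add(h.doc);
        return docs;
    }
}
```

The cost per hit grows with the number of facets being collapsed, so whether this beats separate queries would have to be measured. Sorting on a field instead of score would need the sort value for each doc, which this sketch does not cover.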

Thanks in advance for making it to the end of this post and for any help you
might give me ;-)

Geert-Jan

-- 
View this message in context: 
http://www.nabble.com/how-do-do-most-efficient%3A-collapsing-facets-into-top-N-results-tp14318577p14318577.html
Sent from the Solr - User mailing list archive at Nabble.com.
