Re: Trimming the list of docs returned.

Tom Wed, 15 Nov 2006 13:22:57 -0800

Hi -

Recap:

 > > I'd like to be able to limit the number of documents returned from
 > > any particular group of documents, much as Google only shows a max of
 > > two results from any one website.
 > >
 > > The docs are all marked as to which group they belong to. There will
 > > probably be multiple groups returned from any search. Documents
 > > belong to only one group

It looks like, for trimming, the places I want to modify are in ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to return the top item in the group that matches, whether by score or sort, not just the first one that goes through the HitCollector.
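
A minimal sketch of that trimming rule, under stated assumptions: GroupTrimmer and Hit are hypothetical names, not Solr or Lucene classes, and the String[] stands in for the per-document group values a FieldCache lookup would supply. It sorts the full hit list by score instead of modifying a priority queue, so it only illustrates the selection logic (best hits first, at most N per group), not where the change would live in Solr.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Solr or Lucene. */
final class GroupTrimmer {

    /** A hit as a collector would see it: internal doc id plus score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Keeps at most maxPerGroup hits from each group, preferring higher scores.
     *
     * @param hits        collected hits, in any order
     * @param groupOfDoc  group value per document, indexed by doc id
     *                    (assumed stand-in for a FieldCache array)
     * @param maxPerGroup maximum number of hits to keep from one group
     */
    static List<Hit> trim(List<Hit> hits, String[] groupOfDoc, int maxPerGroup) {
        List<Hit> sorted = new ArrayList<>(hits);
        // Best score first; ties are broken by the lower doc id.
        sorted.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });

        Map<String, Integer> keptPerGroup = new HashMap<>();
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : sorted) {
            String group = groupOfDoc[hit.doc];
            int count = keptPerGroup.getOrDefault(group, 0);
            if (count < maxPerGroup) {
                kept.add(hit);
                keptPerGroup.put(group, count + 1);
            }
        }
        return kept;
    }
}
```

Because the hits are ordered before the per-group counter is applied, the survivor from each group is its best hit, not whichever one the collector happened to see first. Sorting by a field instead of score would only change the comparator.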

But since I want to enable this on a per-request basis, I need some way to get the parameters from the original request and pass them down to my implementation of ScorePriorityQueue.

I'm trying to minimize the number of changes I'd have to make, so I've defined another flag (like SolrIndexSearcher.GET_SCORES), and I check and set it in a modified version of StandardRequestHandler. This seems to work, and doesn't require me to change any method signatures. Suggestions for other implementations welcome!


Index: src/java/org/apache/solr/request/StandardRequestHandler.java
===================================================================

--- src/java/org/apache/solr/request/StandardRequestHandler.java (revision 470495)
+++ src/java/org/apache/solr/request/StandardRequestHandler.java (working copy)

@@ -97,6 +97,10 @@
       // find fieldnames to return (fieldlist)
       String fl = p.get(SolrParams.FL);
       int flags = 0;
+      String trim = p.get("trim");
+      if ((trim == null) || !trim.equals("0"))
+       flags |= SolrIndexSearcher.TRIM_RESULTS;
+
       if (fl != null) {
         flags |= U.setReturnFields(fl, rsp);
       }

But, unsurprisingly, trimming vs. not trimming is being ignored with regard to caching. How would I indicate that a query with trim=0 is not the same as one with trim=1? I do still want to cache. But obviously, my implementation won't work at the moment, since every later run of a query will get the cached results produced with whatever value of trim was on the initial query.


Any suggestions for where to go poking around to fix this vs. caching?
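
One possible direction, sketched under assumptions: make the flag part of whatever key the result cache uses, so that trim=0 and trim=1 land in separate entries. Where Solr builds that key is not shown in this thread, so TrimAwareKey below is a hypothetical stand-in, the String query stands in for the real query, filters and sort, and the value of TRIM_RESULTS is made up for the example.

```java
import java.util.Objects;

/** Hypothetical cache key for illustration; not Solr's own key class. */
final class TrimAwareKey {

    /** Assumed flag value; the real constant would come from the patch above. */
    static final int TRIM_RESULTS = 0x01;

    private final String query; // stand-in for the parsed query, filters and sort
    private final int flags;    // request flags that change the returned list

    TrimAwareKey(String query, int flags) {
        this.query = query;
        this.flags = flags;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof TrimAwareKey)) {
            return false;
        }
        TrimAwareKey that = (TrimAwareKey) other;
        // Two requests share a cache entry only if the query AND the flags match.
        return flags == that.flags && query.equals(that.query);
    }

    @Override
    public int hashCode() {
        return Objects.hash(query, flags);
    }
}
```

With a key like this, a map-backed cache keeps the trimmed and untrimmed lists for the same query side by side. Flags that do not change which documents come back would probably need to be masked out before building the key, or they would split the cache for no benefit.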

Thanks,

Tom





At 11:10 AM 11/8/2006, you wrote:

On 11/8/06, Tom <[EMAIL PROTECTED]> wrote:

On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
 > Yes, a custom hit collector would work.  Searcher.doc() would be
 > deadly... but since each doc has at most one category, the FieldCache
 > could be used (it quickly maps id to field value and was historically
 > used for sorting).

Not to be dense, but how do I use a custom HitCollector with Solr?


You would need a custom request handler, then just use the
SolrIndexSearcher you get with a request... it exposes all of the
Lucene IndexSearcher methods.

-Yonik

On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
 > Hi Tom, I moderated your email in... you need to subscribe to prevent
 > your emails being blocked in the future.

Thanks. That's fixed, I hope. I was using the wrong address.

 > http://incubator.apache.org/solr/mailing_lists.html
 >
 > On 10/30/06, Tom <[EMAIL PROTECTED]> wrote:
 > > I'd like to be able to limit the number of documents returned from
 > > any particular group of documents, much as Google only shows a max of
 > > two results from any one website.
 >
 > You bring up an interesting problem that may be of general use.
 > Solr doesn't currently do this, but it should be possible (with some
 > work in the internals).
 >
 > > The docs are all marked as to which group they belong to. There will
 > > probably be multiple groups returned from any search. Documents
 > > belong to only one group
 >
 > Documents belonging to only one group does make things easier.
 >
 > > I could just examine each returned document, and discard documents
 > > from groups I have seen before, but that seems slow (but I'm not sure
 > > there is a better alternative).
 > >
 > > The number of groups is fairly high percentage of the number of
 > > documents (maybe 5% of all documents), so building something like a
 > > filter for each group doesn't seem feasible.
 > >
 > > Custom HitCollector of some sort could work, but there is the comment
 > > in the javadoc about "should not call Searcher.doc(int)
 > > or IndexReader.document(int) on every document number encountered."
 > > which would seem to be necessary to get the group id.
 >
 > Yes, a custom hit collector would work.  Searcher.doc() would be
 > deadly... but since each doc has at most one category, the FieldCache
 > could be used (it quickly maps id to field value and was historically
 > used for sorting).
 >
 > It might be useful to see what Nutch does in this regard too.
 >
 > -Yonik
