Hi -
Recap:
> > I'd like to be able to limit the number of documents returned from
> > any particular group of documents, much as Google only shows a max of
> > two results from any one website.
> >
> > The docs are all marked as to which group they belong to. There will
> > probably be multiple groups returned from any search. Documents
> > belong to only one group.
It looks like, for trimming, the places I want to modify are in
ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to
return the top item in the group that matches, whether by score or
sort, not just the first one that goes through the HitCollector.
But since I want to enable this on a per-request basis, I need some way to
get the parameters from the original request, and pass it down to my
implementation of ScorePriorityQueue.
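As a rough sketch of the trimming behavior described above (this is not Solr or Lucene code; the class GroupTrimSketch, the Hit type, and the groupOfDoc array are hypothetical stand-ins), the idea is to keep the best-scoring hit per group regardless of the order in which hits are collected, and only then take the top n:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupTrimSketch {
    /** Hypothetical hit type: a document number plus its score. */
    public static final class Hit {
        public final int doc;
        public final float score;
        public Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /**
     * Keeps only the best-scoring hit of each group, then returns the top n.
     * groupOfDoc[doc] holds the group id of each document; it is assumed to be
     * a precomputed per-document lookup array, so no stored document is loaded.
     */
    public static List<Hit> trim(List<Hit> hits, String[] groupOfDoc, int n) {
        Map<String, Hit> bestPerGroup = new HashMap<>();
        for (Hit h : hits) {
            // Replace the group's current entry only if the new hit scores higher.
            bestPerGroup.merge(groupOfDoc[h.doc], h,
                    (old, cur) -> cur.score > old.score ? cur : old);
        }
        List<Hit> out = new ArrayList<>(bestPerGroup.values());
        // Highest score first; ties broken by document number for stable output.
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.doc));
        return new ArrayList<>(out.subList(0, Math.min(n, out.size())));
    }

    public static void main(String[] args) {
        String[] groupOfDoc = {"a", "a", "b", "c", "b"};
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(0, 0.5f));
        hits.add(new Hit(1, 0.9f));
        hits.add(new Hit(2, 0.7f));
        hits.add(new Hit(3, 0.2f));
        hits.add(new Hit(4, 0.8f));
        for (Hit h : trim(hits, groupOfDoc, 2)) {
            System.out.println(h.doc + " " + groupOfDoc[h.doc] + " " + h.score);
        }
    }
}
```

This keeps one hit per group; allowing a maximum of two per group would mean holding a small bounded list per group instead of a single entry. Sorting by a field rather than by score would only change the comparison.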
I'm trying to minimize the number of changes I'd have to make, so
I've defined another flag (like SolrIndexSearcher.GET_SCORES), and I
check and set it in a modified version of StandardRequestHandler.
This seems to work, and doesn't require me to change any method
signatures. Suggestions for other implementations welcome!
Index: src/java/org/apache/solr/request/StandardRequestHandler.java
===================================================================
--- src/java/org/apache/solr/request/StandardRequestHandler.java	(revision 470495)
+++ src/java/org/apache/solr/request/StandardRequestHandler.java	(working copy)
@@ -97,6 +97,10 @@
       // find fieldnames to return (fieldlist)
       String fl = p.get(SolrParams.FL);
       int flags = 0;
+      String trim = p.get("trim");
+      if ((trim == null) || !trim.equals("0"))
+        flags |= SolrIndexSearcher.TRIM_RESULTS;
+
       if (fl != null) {
         flags |= U.setReturnFields(fl, rsp);
       }
But, unsurprisingly, the trim setting is being ignored by the cache.
How would I indicate that a query with trim=0 is not the same as one
with trim=1? I do still want to cache. But obviously, my implementation
won't work at the moment, since every later query gets the cached
result that was computed with whatever value of trim the first query
used.
Any suggestions for where to go poking around to fix this vs. caching?
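One general way to keep the two cases apart is to make the flag part of the cache key. The following is only a sketch of that idea, not Solr's actual cache key: the class TrimCacheKeySketch, its Key type, the use of a plain query string, and the bit value given to TRIM_RESULTS are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class TrimCacheKeySketch {
    /** Hypothetical bit value; the real flag would live next to GET_SCORES. */
    public static final int TRIM_RESULTS = 0x01;

    /** Cache key that distinguishes otherwise identical queries by their flags. */
    public static final class Key {
        private final String query;
        private final int flags;

        public Key(String query, int flags) {
            this.query = query;
            this.flags = flags;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Key)) return false;
            Key other = (Key) o;
            // Two keys match only if both the query and the flags match.
            return flags == other.flags && query.equals(other.query);
        }

        @Override
        public int hashCode() {
            return Objects.hash(query, flags);
        }
    }

    public static void main(String[] args) {
        Map<Key, String> cache = new HashMap<>();
        cache.put(new Key("title:solr", 0), "untrimmed results");
        cache.put(new Key("title:solr", TRIM_RESULTS), "trimmed results");
        // The same query text with different flags now occupies separate entries.
        System.out.println(cache.size());
        System.out.println(cache.get(new Key("title:solr", TRIM_RESULTS)));
    }
}
```

If the existing key already carries a flags value but masks some bits out before comparing, the new bit would need to survive that mask. Whether that is the case here is something to check in the source.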
Thanks,
Tom
At 11:10 AM 11/8/2006, you wrote:
> On 11/8/06, Tom <[EMAIL PROTECTED]> wrote:
> > On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > > Yes, a custom hit collector would work. Searcher.doc() would be
> > > deadly... but since each doc has at most one category, the FieldCache
> > > could be used (it quickly maps id to field value and was historically
> > > used for sorting).
> >
> > Not to be dense, but how do I use a custom HitCollector with Solr?
>
> You would need a custom request handler, then just use the
> SolrIndexSearcher you get with a request... it exposes all of the
> Lucene IndexSearcher methods.
>
> -Yonik
On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Hi Tom, I moderated your email in... you need to subscribe to prevent
> your emails being blocked in the future.
Thanks. That's fixed, I hope. I was using the wrong address.
> http://incubator.apache.org/solr/mailing_lists.html
>
> On 10/30/06, Tom <[EMAIL PROTECTED]> wrote:
> > I'd like to be able to limit the number of documents returned from
> > any particular group of documents, much as Google only shows a max of
> > two results from any one website.
>
> You bring up an interesting problem that may be of general use.
> Solr doesn't currently do this, but it should be possible (with some
> work in the internals).
>
> > The docs are all marked as to which group they belong to. There will
> > probably be multiple groups returned from any search. Documents
> > belong to only one group.
>
> Documents belonging to only one group does make things easier.
>
> > I could just examine each returned document, and discard documents
> > from groups I have seen before, but that seems slow (but I'm not sure
> > there is a better alternative).
> >
> > The number of groups is a fairly high percentage of the number of
> > documents (maybe 5% of all documents), so building something like a
> > filter for each group doesn't seem feasible.
> >
> > A custom HitCollector of some sort could work, but there is the comment
> > in the javadoc about "should not call Searcher.doc(int)
> > or IndexReader.document(int) on every document number encountered."
> > which would seem to be necessary to get the group id.
>
> Yes, a custom hit collector would work. Searcher.doc() would be
> deadly... but since each doc has at most one category, the FieldCache
> could be used (it quickly maps id to field value and was historically
> used for sorting).
>
> It might be useful to see what Nutch does in this regard too.
>
> -Yonik