Re: Index & search questions; special cases
Erik Hatcher wrote:
> Yeah, the Nutch code is highly intertwined with its unique configuration
> infrastructure and makes it hard to pull pieces of it out like this.

This is a critique that has been heard a lot (mainly because it's true :)
It would be really cool if different camps of lucene could build these
nice utilities to be usable between projects. Not exactly sure how this
could be accomplished but anyway something to consider.

--
Sami Siren
Re: New Feature: ${solr.home}/lib/ dir for "plugins"
Very nice. This will help me also. I will try this out and let you know how it goes. (Windows XP with a custom request handler and some other custom classes)
Re: Trimming the list of docs returned.
Hi -

Recap:

> > I'd like to be able to limit the number of documents returned from
> > any particular group of documents, much as Google only shows a max of
> > two results from any one website.
> >
> > The docs are all marked as to which group they belong to. There will
> > probably be multiple groups returned from any search. Documents
> > belong to only one group

It looks like that for trimming, the places I want to modify are in
ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to
return the top item in the group that matches, whether by score or
sort, not just the first one that goes through the HitCollector.

But since I want to enable this on a per-request basis, I need some way
to get the parameters from the original request and pass them down to my
implementation of ScorePriorityQueue.

I'm trying to minimize the number of changes I'd have to make, so I've
defined another flag (like SolrIndexSearcher.GET_SCORES), and I check
and set it in a modified version of StandardRequestHandler. This seems
to work, and doesn't require me to change any method signatures.
Suggestions for other implementations welcome!

Index: src/java/org/apache/solr/request/StandardRequestHandler.java
===================================================================
--- src/java/org/apache/solr/request/StandardRequestHandler.java (revision 470495)
+++ src/java/org/apache/solr/request/StandardRequestHandler.java (working copy)
@@ -97,6 +97,10 @@
       // find fieldnames to return (fieldlist)
       String fl = p.get(SolrParams.FL);
       int flags = 0;
+      String trim = p.get("trim");
+      if ((trim == null) || !trim.equals("0"))
+        flags |= SolrIndexSearcher.TRIM_RESULTS;
+
       if (fl != null) {
         flags |= U.setReturnFields(fl, rsp);
       }

But, unsurprisingly, trimming vs. not trimming is being ignored with
regard to caching. How would I indicate that a query with trim=0 is not
the same as trim=1? I do still want to cache.
But obviously, my implementation won't work at the moment, since all
queries will cache the value generated using the results generated by
the value of trim on the initial query.

Any suggestions for where to go poking around to fix this vs. caching?

Thanks,

Tom

At 11:10 AM 11/8/2006, you wrote:
> On 11/8/06, Tom <[EMAIL PROTECTED]> wrote:
> > On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > > Yes, a custom hit collector would work. Searcher.doc() would be
> > > deadly... but since each doc has at most one category, the FieldCache
> > > could be used (it quickly maps id to field value and was historically
> > > used for sorting).
> >
> > Not to be dense, but how do I use a custom HitCollector with Solr?
>
> You would need a custom request handler, then just use the
> SolrIndexSearcher you get with a request... it exposes all of the
> Lucene IndexSearcher methods.
>
> -Yonik
>
> > On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > > Hi Tom, I moderated your email in... you need to subscribe to prevent
> > > your emails being blocked in the future.
> >
> > Thanks. That's fixed, I hope. I was using the wrong address.
> >
> > > http://incubator.apache.org/solr/mailing_lists.html
> > >
> > > On 10/30/06, Tom <[EMAIL PROTECTED]> wrote:
> > > > I'd like to be able to limit the number of documents returned from
> > > > any particular group of documents, much as Google only shows a max of
> > > > two results from any one website.
> > >
> > > You bring up an interesting problem that may be of general use.
> > > Solr doesn't currently do this, but it should be possible (with some
> > > work in the internals).
> > >
> > > > The docs are all marked as to which group they belong to. There will
> > > > probably be multiple groups returned from any search. Documents
> > > > belong to only one group
> > >
> > > Documents belonging to only one group does make things easier.
> > >
> > > > I could just examine each returned document, and discard documents
> > > > from groups I have seen before, but that seems slow (but I'm not sure
> > > > there is a better alternative).
> > > >
> > > > The number of groups is fairly high percentage of the number of
> > > > documents (maybe 5% of all documents), so building something like a
> > > > filter for each group doesn't seem feasible.
> > > >
> > > > CustomHitCollector of some sort could work, but there is the comment
> > > > in the javadoc about "should not call Searcher.doc(int)
> > > > or IndexReader.document(int) on every document number encountered."
> > > > which would seem to be necessary to get the group id.
> > >
> > > Yes, a custom hit collector would work. Searcher.doc() would be
> > > deadly... but since each doc has at most one category, the FieldCache
> > > could be used (it quickly maps id to field value and was historically
> > > used for sorting).
> > >
> > > It might be useful to see what Nutch does in this regard too.
> > >
> > > -Yonik
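
The simplest approach mentioned in the quoted message above (walk the ranked results and discard documents from groups already seen) can be sketched without loading any stored documents, provided the group of each document is available as an in-memory array indexed by document id, which is roughly the role the FieldCache is suggested for. This is only an illustration: `GroupLimiter`, `limitPerGroup`, and the `groupOfDoc` array are hypothetical names, not Solr or Lucene APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr or Lucene.
final class GroupLimiter {

    /**
     * Keeps at most maxPerGroup documents from each group.
     *
     * @param rankedDocIds doc ids already ordered best-first (by score or sort)
     * @param groupOfDoc   groupOfDoc[docId] is the group of that document; an
     *                     assumed stand-in for a FieldCache-style lookup array
     * @param maxPerGroup  maximum number of documents to keep per group
     */
    static List<Integer> limitPerGroup(int[] rankedDocIds, String[] groupOfDoc, int maxPerGroup) {
        Map<String, Integer> keptPerGroup = new HashMap<>();
        List<Integer> kept = new ArrayList<>();
        for (int docId : rankedDocIds) {
            String group = groupOfDoc[docId];            // array lookup, no document load
            int count = keptPerGroup.getOrDefault(group, 0);
            if (count < maxPerGroup) {
                kept.add(docId);
                keptPerGroup.put(group, count + 1);
            }
        }
        return kept;
    }
}
```

Because the input is already ranked, the first documents seen for a group are its best ones, so this keeps the top entries per group. The cost is that the full ranked list has to exist before trimming.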
Re: Trimming the list of docs returned.
On 11/15/06, Tom <[EMAIL PROTECTED]> wrote:
> It looks like that for trimming, the places I want to modify are in
> ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to
> return the top item in the group that matches, whether by score or
> sort, not just the first one that goes through the HitCollector.

Wouldn't you actually need a priority queue per group?

> But, unsurprisingly, trimming vs. not trimming is being ignored with
> regard to caching. How would I indicate that a query with trim=0 is not
> the same as trim=1? I do still want to cache.

One hack: implement a simple query that delegates to another query and
encapsulates the trim value... that way hashCode/equals won't match
unless the trim does.

-Yonik

> But obviously, my implementation won't work at the moment, since all
> queries will cache the value generated using the results generated by
> the value of trim on the initial query.
>
> Any suggestions for where to go poking around to fix this vs. caching?
>
> Thanks,
>
> Tom
Re: Trimming the list of docs returned.
At 01:35 PM 11/15/2006, you wrote:
> On 11/15/06, Tom <[EMAIL PROTECTED]> wrote:
> > It looks like that for trimming, the places I want to modify are in
> > ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to
> > return the top item in the group that matches, whether by score or
> > sort, not just the first one that goes through the HitCollector.
>
> Wouldn't you actually need a priority queue per group?

I'm still playing with implementations, but I think you just need a max
score for each group. You can't just use a PriorityQueue (of either max
scores or PriorityQueues), since I don't think the Lucene PriorityQueue
handles entries whose value changes after insertion.

> > But, unsurprisingly, trimming vs. not trimming is being ignored with
> > regard to caching. How would I indicate that a query with trim=0 is not
> > the same as trim=1? I do still want to cache.
>
> One hack: implement a simple query that delegates to another query and
> encapsulates the trim value... that way hashCode/equals won't match
> unless the trim does.

Not sure what you mean by "delegates to another query". Could you
clarify or give me a pointer?

I was thinking in terms of just adding some guaranteed true clause to
the end when trimming, is that similar to what you were talking about?

Thanks,

Tom

> -Yonik
>
> > But obviously, my implementation won't work at the moment, since all
> > queries will cache the value generated using the results generated by
> > the value of trim on the initial query.
> >
> > Any suggestions for where to go poking around to fix this vs. caching?
> >
> > Thanks,
> >
> > Tom
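
The "max score for each group" idea in the message above might look roughly like the sketch below: instead of updating entries inside a priority queue, keep the best hit seen so far per group in a map and rank the survivors once collection is finished. Everything here is an assumption made for illustration. `BestPerGroupCollector` does not extend Lucene's HitCollector (it only mirrors a per-hit `(doc, score)` callback), and `groupOfDoc` is a hypothetical array mapping document id to a group number.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical collector-like class; it is NOT a Lucene HitCollector subclass.
final class BestPerGroupCollector {

    /** A matching document and its score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int[] groupOfDoc;                        // assumed docId -> group lookup
    private final Map<Integer, Hit> bestByGroup = new HashMap<>();

    BestPerGroupCollector(int[] groupOfDoc) {
        this.groupOfDoc = groupOfDoc;
    }

    /** Called once per matching document, in any order. */
    void collect(int doc, float score) {
        int group = groupOfDoc[doc];
        Hit best = bestByGroup.get(group);
        if (best == null || score > best.score) {
            bestByGroup.put(group, new Hit(doc, score));   // replace, never mutate in place
        }
    }

    /** Returns the best hit of each group, highest score first, at most n hits. */
    List<Hit> topHits(int n) {
        List<Hit> hits = new ArrayList<>(bestByGroup.values());
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }
}
```

This sidesteps the changing-priority problem because ordering happens only once, after all hits are seen. The trade-off is memory proportional to the number of groups that matched, not to the number of results requested.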
Re: Trimming the list of docs returned.
On 11/15/06, Tom <[EMAIL PROTECTED]> wrote:
> >One hack: implement a simple query that delegates to another query and
> >encapsulates the trim value... that way hashCode/equals won't match
> >unless the trim does.
>
> Not sure what you mean by "delegates to another query". Could you
> clarify or give me a pointer?

Something like

public class TrimmedQuery extends Query {
  Query delegate;
  int trim;
  public TrimmedQuery(Query delegate, int trim) {
    this.delegate = delegate;
    this.trim = trim;
  }
  // now override hashCode + equals to include trim and implement all other
  // methods by delegating them.
}

> I was thinking in terms of just adding some guaranteed true clause to
> the end when trimming, is that similar to what you were talking about?

Yes, that should work too.

-Yonik
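
To see why folding the trim value into hashCode/equals keeps cache entries apart, here is a minimal standalone sketch of just that part. `TrimKey` is a hypothetical stand-in that wraps any object with sound equals/hashCode; it deliberately does not extend Lucene's Query, so the delegation of the remaining Query methods mentioned above is not shown.

```java
import java.util.Objects;

// Hypothetical stand-in for the wrapper idea; not a Lucene Query subclass.
final class TrimKey {
    private final Object delegate;   // the wrapped query (any object with equals/hashCode)
    private final int trim;          // the trim setting that must distinguish cache entries

    TrimKey(Object delegate, int trim) {
        this.delegate = Objects.requireNonNull(delegate);
        this.trim = trim;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof TrimKey)) {
            return false;
        }
        TrimKey other = (TrimKey) o;
        // Equal only when both the wrapped query and the trim value match.
        return trim == other.trim && delegate.equals(other.delegate);
    }

    @Override
    public int hashCode() {
        return 31 * delegate.hashCode() + trim;
    }
}
```

Used as a key in any hash-based cache, `new TrimKey(q, 0)` and `new TrimKey(q, 1)` occupy separate entries, while two wrappers around equal queries with the same trim value share one.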
Re: Trimming the list of docs returned.
One other thing you'll need to watch out for is the filterCache ... Solr
has a setting (i forget the name at the moment) which tells the
SolrIndexSearcher that for sorted queries, it can reuse the DocSet from a
previous invocation of the Query and sort the cached DocSet to generate
the list -- but your set of documents returned is dependent on your sort
order, so you may actually want to put the sort option in your
TrimmedQuery as well to denote the uniqueness of the set of matched
Documents.

if you think about it, a completely generalized solution would allow the
"trimming" order to be independent of the sorting order, so a user could
ask for "Books matching the word 'Lucene' trimmed so only the most
popular matching book per publisher is returned, sorted by price." .. in
which case your Query needs to know that "Publisher" is the field you
grouped on, and "Popularity"/"desc" is the trimming you applied to each
group -- and now the usual DocList and DocSet caching will work
flawlessly, regardless of the fact that you sorted on "Price" this time,
but next time you might sort on "Popularity".

: Date: Wed, 15 Nov 2006 19:26:39 -0500
: From: Yonik Seeley <[EMAIL PROTECTED]>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: Trimming the list of docs returned.
:
: On 11/15/06, Tom <[EMAIL PROTECTED]> wrote:
: > >One hack: implement a simple query that delegates to another query and
: > >encapsulates the trim value... that way hashCode/equals won't match
: > >unless the trim does.
: >
: > Not sure what you mean by "delegates to another query". Could you
: > clarify or give me a pointer?
:
: Something like
: public class TrimmedQuery extends Query {
:   Query delegate;
:   int trim;
:   public TrimmedQuery(Query delegate, int trim) {
:     this.delegate = delegate;
:     this.trim = trim;
:   }
:   // now override hashCode + equals to include trim and implement all other
:   // methods by delegating them.
: }
:
: > I was thinking in terms of just adding some guaranteed true clause to
: > the end when trimming, is that similar to what you were talking about?
:
: Yes, that should work too.
:
: -Yonik
:

-Hoss
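
The generalized case described above, where the trimming order is independent of the sort order, can be illustrated on plain in-memory objects. This is a sketch of the semantics only, under the assumption that all matches are available as a list; `TrimThenSort` and `Book` are hypothetical names and nothing here touches Solr's DocList or DocSet caches.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "trim by one field, sort by another".
final class TrimThenSort {

    static final class Book {
        final String title;
        final String publisher;
        final int popularity;
        final double price;

        Book(String title, String publisher, int popularity, double price) {
            this.title = title;
            this.publisher = publisher;
            this.popularity = popularity;
            this.price = price;
        }
    }

    /** Keeps the most popular matching book per publisher, then sorts by ascending price. */
    static List<Book> mostPopularPerPublisherSortedByPrice(List<Book> matches) {
        // Trimming step: group on publisher, keep the highest popularity in each group.
        Map<String, Book> bestByPublisher = new HashMap<>();
        for (Book book : matches) {
            Book best = bestByPublisher.get(book.publisher);
            if (best == null || book.popularity > best.popularity) {
                bestByPublisher.put(book.publisher, book);
            }
        }
        // Sorting step: independent of the trimming criteria.
        List<Book> trimmed = new ArrayList<>(bestByPublisher.values());
        trimmed.sort(Comparator.comparingDouble((Book b) -> b.price));
        return trimmed;
    }
}
```

The set of surviving books depends only on the group field and the trimming order, which is why those two values, and not the final sort, are what a cache key for the matched set would need to carry.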
Re: Index & search questions; special cases
: > Yeah, the Nutch code is highly intertwined with its unique configuration
: > infrastructure and makes it hard to pull pieces of it out like this.

that CacheGrams inner Filter class seemed like it could be extracted
easily enough.

: This is a critique that has been heard a lot (mainly because it's true :)
: It would be really cool if different camps of lucene could build these
: nice utilities to be usable between projects. Not exactly sure how this
: could be accomplished but anyway something to consider.

[EMAIL PROTECTED] is probably the best place to raise this discussion if
you're interested in pursuing it ... i think the best way to deal with it
may just be on a case by case basis ... if you find cool code in
sub-project XYZ, start by working with XYZ-dev to refactor it into an
extractable chunk, then work with java-dev to "promote" it up in the
lucene Java code base, and then circle back to XYZ-dev to deprecate the
copy in the XYZ code repository and replace it with a dependency on the
newly promoted version.

-Hoss