Re: Index & search questions; special cases

2006-11-15 Thread Sami Siren

Erik Hatcher wrote:

Yeah, the Nutch code is highly intertwined with its unique configuration 
infrastructure and makes it hard to pull pieces of it out like this.


This is a critique that has been heard a lot (mainly because it's true :)
It would be really cool if different camps of lucene could build these 
nice utilities to be usable between projects. Not exactly sure how this 
could be accomplished but anyway something to consider.


--
 Sami Siren


Re: New Feature: ${solr.home}/lib/ dir for "plugins"

2006-11-15 Thread Mike Austin

Very nice. This will help me also.  I will try this out and let you know how
it goes. (Windows XP with a custom request handler and some other custom
classes)


Re: Trimming the list of docs returned.

2006-11-15 Thread Tom

Hi -

Recap:

 > > I'd like to be able to limit the number of documents returned from
 > > any particular group of documents, much as Google only shows a max of
 > > two results from any one website.
 > >
 > > The docs are all marked as to which group they belong to. There will
 > > probably be multiple groups returned from any search. Documents
 > > belong to only one group



It looks like the places I want to modify for trimming are in 
ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to 
return the top item in the group that matches, whether by score or 
sort, not just the first one that goes through the HitCollector.


But since I want to enable this on a per-request basis, I need some way 
to get the parameters from the original request and pass them down to 
my implementation of ScorePriorityQueue.


I'm trying to minimize the number of changes I'd have to make, so 
I've defined another flag (like SolrIndexSearcher.GET_SCORES), and I 
check and set it in a modified version of StandardRequestHandler. 
This seems to work, and doesn't require me to change any method 
signatures. Suggestions for other implementations welcome!


Index: src/java/org/apache/solr/request/StandardRequestHandler.java
===================================================================
--- src/java/org/apache/solr/request/StandardRequestHandler.java (revision 470495)
+++ src/java/org/apache/solr/request/StandardRequestHandler.java (working copy)
@@ -97,6 +97,10 @@
   // find fieldnames to return (fieldlist)
   String fl = p.get(SolrParams.FL);
   int flags = 0;
+  String trim = p.get("trim");
+  if ((trim == null) || !trim.equals("0"))
+    flags |= SolrIndexSearcher.TRIM_RESULTS;
+
   if (fl != null) {
     flags |= U.setReturnFields(fl, rsp);
   }
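
The flag check in the patch can be read as the following standalone sketch. The class TrimFlags, its method names, and the constant values are hypothetical and chosen only for illustration; in the patch the flag is a constant on SolrIndexSearcher. Note that, as written in the patch, trimming is on by default and is only switched off by an explicit trim=0.

```java
// Hypothetical illustration of the bit-flag approach; values are made up.
final class TrimFlags {
    static final int GET_SCORES = 0x01;    // stand-in for an existing flag
    static final int TRIM_RESULTS = 0x02;  // stand-in for the new flag

    // Mirrors the patch: set TRIM_RESULTS unless the request says trim=0.
    static int fromTrimParam(String trim, int flags) {
        if (trim == null || !trim.equals("0")) {
            flags |= TRIM_RESULTS;
        }
        return flags;
    }

    // Code further down the call chain only needs the int to decide.
    static boolean isTrimming(int flags) {
        return (flags & TRIM_RESULTS) != 0;
    }
}
```

Because the flag travels inside an int that is already passed around, no method signature has to change, which is the property the message above is after.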

But, unsurprisingly, trimming vs. not trimming is being ignored with 
regard to caching. How would I indicate that a query with trim=0 is 
not the same as one with trim=1? I do still want to cache. But 
obviously, my implementation won't work at the moment, since every 
later query gets the cached results that were computed with whatever 
value of trim the initial query used.


Any suggestions for where to go poking around to fix this vs. caching?

Thanks,

Tom





At 11:10 AM 11/8/2006, you wrote:

On 11/8/06, Tom <[EMAIL PROTECTED]> wrote:

On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
 > Yes, a custom hit collector would work.  Searcher.doc() would be
 > deadly... but since each doc has at most one category, the FieldCache
 > could be used (it quickly maps id to field value and was historically
 > used for sorting).

Not to be dense, but how do I use a custom HitCollector with Solr?


You would need a custom request handler, then just use the
SolrIndexSearcher you get with a request... it exposes all of the
Lucene IndexSearcher methods.

-Yonik



On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
 > Hi Tom, I moderated your email in... you need to subscribe to prevent
 > your emails being blocked in the future.

Thanks. That's fixed, I hope. I was using the wrong address.

 > http://incubator.apache.org/solr/mailing_lists.html
 >
 > On 10/30/06, Tom <[EMAIL PROTECTED]> wrote:
 > > I'd like to be able to limit the number of documents returned from
 > > any particular group of documents, much as Google only shows a max of
 > > two results from any one website.
 >
 > You bring up an interesting problem that may be of general use.
 > Solr doesn't currently do this, but it should be possible (with some
 > work in the internals).
 >
 > > The docs are all marked as to which group they belong to. There will
 > > probably be multiple groups returned from any search. Documents
 > > belong to only one group
 >
 > Documents belonging to only one group does make things easier.
 >
 > > I could just examine each returned document, and discard documents
 > > from groups I have seen before, but that seems slow (but I'm not sure
 > > there is a better alternative).
 > >
 > > The number of groups is fairly high percentage of the number of
 > > documents (maybe 5% of all documents), so building something like a
 > > filter for each group doesn't seem feasible.
 > >
 > > CustomHitCollector of some sort could work, but there is the comment
 > > in the javadoc about "should not call  Searcher.doc(int)
 > > or  IndexReader.document(int) on every  document number encountered."
 > > which would seem to be necessary to get the group id.
 >
 > Yes, a custom hit collector would work.  Searcher.doc() would be
 > deadly... but since each doc has at most one category, the FieldCache
 > could be used (it quickly maps id to field value and was historically
 > used for sorting).
 >
 > It might be useful to see what Nutch does in this regard too.
 >
 > -Yonik




Re: Trimming the list of docs returned.

2006-11-15 Thread Yonik Seeley

On 11/15/06, Tom <[EMAIL PROTECTED]> wrote:

It looks like that for trimming, the places I want to modify are in
ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to
return the top item in the group that matches, whether by score or
sort, not just the first one that goes through the HitCollector.


Wouldn't you actually need a priority queue per group?


But, unsurprisingly, trimming vs. not trimming is being ignored with
regard to caching. How would I indicate that a query with trim=0 is
not the same as trim=1? I do still want to cache.


One hack: implement a simple query that delegates to another query and
encapsulates the trim value... that way hashCode/equals won't match
unless the trim does.

-Yonik




Re: Trimming the list of docs returned.

2006-11-15 Thread Tom

At 01:35 PM 11/15/2006, you wrote:

On 11/15/06, Tom <[EMAIL PROTECTED]> wrote:

It looks like that for trimming, the places I want to modify are in
ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to
return the top item in the group that matches, whether by score or
sort, not just the first one that goes through the HitCollector.


Wouldn't you actually need a priority queue per group?


I'm still playing with implementations, but I think you just need a 
max score for each group.


You can't just use a PriorityQueue (of either max scores or 
PriorityQueues), since I don't think the Lucene PriorityQueue handles 
entries whose value changes after insertion.
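
A minimal sketch of the "max score for each group" idea, assuming a plain int array stands in for the FieldCache lookup (doc id to group id). GroupTrimCollector, Hit, collect and topHits are hypothetical names, not Lucene or Solr API. It keeps one map entry per group and only orders the winners at the end, which sidesteps the problem of queue entries changing after insertion.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Lucene or Solr.
final class GroupTrimCollector {
    // One scored hit: a document number and its score.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int[] groupOfDoc;  // stand-in for a FieldCache array: doc id -> group id
    private final Map<Integer, Hit> bestPerGroup = new HashMap<>();

    GroupTrimCollector(int[] groupOfDoc) { this.groupOfDoc = groupOfDoc; }

    // Called once per matching document, in any order.
    void collect(int doc, float score) {
        int group = groupOfDoc[doc];
        Hit best = bestPerGroup.get(group);
        if (best == null || score > best.score) {
            bestPerGroup.put(group, new Hit(doc, score));
        }
    }

    // The n best group winners, highest score first (ties: lower doc id first).
    List<Hit> topHits(int n) {
        List<Hit> hits = new ArrayList<>(bestPerGroup.values());
        hits.sort((a, b) -> a.score != b.score
                ? Float.compare(b.score, a.score)
                : Integer.compare(a.doc, b.doc));
        return hits.subList(0, Math.min(n, hits.size()));
    }
}
```

The cost is one map entry per group that matched, so this is only a reasonable trade when the number of matching groups fits comfortably in memory.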




But, unsurprisingly, trimming vs. not trimming is being ignored with
regard to caching. How would I indicate that a query with trim=0 is
not the same as trim=1? I do still want to cache.


One hack: implement a simple query that delegates to another query and
encapsulates the trim value... that way hashCode/equals won't match
unless the trim does.


Not sure what you mean by "delegates to another query". Could you 
clarify or give me a pointer?


I was thinking in terms of just adding some guaranteed true clause to 
the end when trimming, is that similar to what you were talking about?


Thanks,

Tom








Re: Trimming the list of docs returned.

2006-11-15 Thread Yonik Seeley

On 11/15/06, Tom <[EMAIL PROTECTED]> wrote:

>One hack: implement a simple query that delegates to another query and
>encapsulates the trim value... that way hashCode/equals won't match
>unless the trim does.

Not sure what you mean by "delegates to another query". Could you
clarify or give me a pointer?


Something like
public class TrimmedQuery extends Query {
  final Query delegate;
  final int trim;
  public TrimmedQuery(Query delegate, int trim) {
    this.delegate = delegate;
    this.trim = trim;  // kept so hashCode/equals can include it
  }
  // now override hashCode + equals to include trim and implement all other
  // methods by delegating them.
}
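
To see why this keeps the cache entries apart, here is a self-contained sketch that uses a HashMap in place of Solr's query result cache and a plain String in place of the wrapped Query. TrimKey and TrimKeyDemo are hypothetical names; the real class would extend Query and delegate its remaining methods as described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the Query wrapper: the delegate plus the trim value.
final class TrimKey {
    private final Object delegate;  // the wrapped query (a String in this sketch)
    private final int trim;

    TrimKey(Object delegate, int trim) {
        this.delegate = delegate;
        this.trim = trim;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TrimKey)) return false;
        TrimKey other = (TrimKey) o;
        // Equal only when both the wrapped query and the trim value match.
        return trim == other.trim && delegate.equals(other.delegate);
    }

    @Override
    public int hashCode() {
        return Objects.hash(delegate, trim);
    }
}

final class TrimKeyDemo {
    public static void main(String[] args) {
        Map<TrimKey, String> cache = new HashMap<>();  // stand-in for a result cache
        cache.put(new TrimKey("title:lucene", 0), "untrimmed results");
        cache.put(new TrimKey("title:lucene", 1), "trimmed results");
        // The same query text with different trim values occupies two entries.
        System.out.println(cache.size());
        System.out.println(cache.get(new TrimKey("title:lucene", 1)));
    }
}
```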


I was thinking in terms of just adding some guaranteed true clause to
the end when trimming, is that similar to what you were talking about?


Yes, that should work too.

-Yonik


Re: Trimming the list of docs returned.

2006-11-15 Thread Chris Hostetter

One other thing you'll need to watch out for is the filterCache ... Solr
has a setting (i forget the name at the moment) which tells the
SolrIndexSearcher that for sorted queries, it can reuse the DocSet from a
previous invocation of the Query and sort the cached DocSet to generate
the list -- but your set of documents returned is dependent on your sort
order, so you may actually want to put the sort option in your
TrimmedQuery as well to denote the uniqueness of the set of matched
Documents.

if you think about it, a completely generalized solution would allow the
"trimming" order to be independent of the sorting order, so a user could
ask for "Books matching the word 'Lucene' trimmed so only the most popular
matching book per publisher is returned, sorted by price." .. in which
case your Query needs to know that "Publisher" is the field you grouped on,
and "Popularity"/"desc" is the trimming you applied to each group -- and
now the usual DocList and DocSet caching will work flawlessly, regardless
of the fact that you sorted on "Price" this time, but next time you might
sort on "Popularity".
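
The books example can be sketched in plain Java as "trim per group, then sort". Book, TrimThenSort and trimAndSort are hypothetical names used only for illustration and are not Solr API. The point to notice is that the trimmed set depends only on the matches, the group field and the trim order, so that set could be cached and re-sorted under a different sort order later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical document type for the example.
final class Book {
    final String title;
    final String publisher;
    final int popularity;
    final double price;

    Book(String title, String publisher, int popularity, double price) {
        this.title = title;
        this.publisher = publisher;
        this.popularity = popularity;
        this.price = price;
    }
}

final class TrimThenSort {
    // Keep the single best book per publisher according to trimOrder
    // (the book that trimOrder ranks first wins), then order the
    // survivors by sortOrder.
    static List<Book> trimAndSort(List<Book> matches,
                                  Comparator<Book> trimOrder,
                                  Comparator<Book> sortOrder) {
        Map<String, Book> bestPerPublisher = new LinkedHashMap<>();
        for (Book b : matches) {
            Book best = bestPerPublisher.get(b.publisher);
            if (best == null || trimOrder.compare(b, best) < 0) {
                bestPerPublisher.put(b.publisher, b);
            }
        }
        List<Book> result = new ArrayList<>(bestPerPublisher.values());
        result.sort(sortOrder);
        return result;
    }
}
```

A call such as trimAndSort(matches, Comparator.comparingInt((Book b) -> b.popularity).reversed(), Comparator.comparingDouble((Book b) -> b.price)) corresponds to "most popular book per publisher, sorted by price".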


: Date: Wed, 15 Nov 2006 19:26:39 -0500
: From: Yonik Seeley <[EMAIL PROTECTED]>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: Trimming the list of docs returned.
:
: On 11/15/06, Tom <[EMAIL PROTECTED]> wrote:
: > >One hack: implement a simple query that delegates to another query and
: > >encapsulates the trim value... that way hashCode/equals won't match
: > >unless the trim does.
: >
: > Not sure what you mean by "delegates to another query". Could you
: > clarify or give me a pointer?
:
: Something like
: public class TrimmedQuery extends Query {
:   final Query delegate;
:   final int trim;
:   public TrimmedQuery(Query delegate, int trim) {
:     this.delegate = delegate;
:     this.trim = trim;  // kept so hashCode/equals can include it
:   }
:   // now override hashCode + equals to include trim and implement all other
:   // methods by delegating them.
: }
:
: > I was thinking in terms of just adding some guaranteed true clause to
: > the end when trimming, is that similar to what you were talking about?
:
: Yes, that should work too.
:
: -Yonik
:



-Hoss



Re: Index & search questions; special cases

2006-11-15 Thread Chris Hostetter

: > Yeah, the Nutch code is highly intertwined with its unique configuration
: > infrastructure and makes it hard to pull pieces of it out like this.

that CacheGrams inner Filter class seemed like it could be extracted
easily enough.

: This is a critique that has been heard a lot (mainly because it's true :)
: It would be really cool if different camps of lucene could build these
: nice utilities to be usable between projects. Not exactly sure how this
: could be accomplished but anyway something to consider.

[EMAIL PROTECTED] is probably the best place to raise this discussion if
you're interested in pursuing it ... i think the best way to deal with it
may just be on a case by case basis ... if you find cool code in
sub-project XYZ, start by working with XYZ-dev to refactor it into an
extractable chunk, then work with java-dev to "promote" it up in the
lucene Java code base, and then circle back to XYZ-dev to deprecate the
copy in the XYZ code repository and replace it with a dependency on the
newly promoted version.


-Hoss