The HitCollector used by the Searcher is wrapped by a
TimeLimitedCollector
<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/TimeLimitedCollector.html>
which times out search requests that exceed the maximum allowed search
time during hit collection. Any hits collected before the time expires
are returned, and a partialResults flag is set.
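
For anyone who wants to do this at the Lucene level directly, a rough
sketch of the usage (based on my reading of the Lucene 2.4-era
HitCollector API; the timeout value and the wrapped collector are just
illustrative, and searcher/query are assumed to already exist):

    // All classes here are from org.apache.lucene.search.
    TopDocCollector collector = new TopDocCollector(10);   // keep the top 10 hits
    TimeLimitedCollector timeLimited =
        new TimeLimitedCollector(collector, 1000);          // 1000 ms budget
    try {
        searcher.search(query, timeLimited);
    } catch (TimeLimitedCollector.TimeExceededException e) {
        // Timed out: whatever the wrapped collector saw so far is still usable.
    }
    TopDocs hits = collector.topDocs();   // partial if the timeout fired

In Solr the wrapping is done for you, as described above.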
This is the use case that I had in mind:

The timeout is to protect the server side. The client side can be largely
protected by setting a read timeout, but if the client aborts before the
server responds, the server is just wasting resources processing a request
that will never be used. Partial results are useful in a couple of
scenarios; probably the most important is a large distributed deployment
where you would rather get whatever results you can from a slow shard
than throw them away.

As a real-world example, the query "contact us about our site" on a 2.3MM
document index (partial Dmoz crawl) takes several seconds to complete,
while the mean response time is sub-50 ms. We've had cases where a bot
walks the next-page links (including expensive queries such as this one).
Also, users are prone to repeatedly clicking the query button if they get
impatient on a slow site. Without a server-side timeout, this is a real
issue.
But you may find it useful for your scenario. Note, however, that you
aren't guaranteed to get the most relevant documents back, since they may
not have been collected before the time ran out. The new distributed
search features in 1.3 may also be something you want to look into; they
let you decrease your response time by dividing your index into smaller
partitions.
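
If you go that route, a distributed request simply lists the partitions
in the shards parameter; roughly like this (the hosts and paths here are
placeholders):

    http://search1:8983/solr/select?q=shirt
        &shards=search1:8983/solr,search2:8983/solr

Each shard holds a slice of the index, and the node you query merges the
per-shard results.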
-Sean
Grant Ingersoll wrote:
See also https://issues.apache.org/jira/browse/SOLR-502 (timeout
searches)
and https://issues.apache.org/jira/browse/LUCENE-997

This is committed on trunk and will be in 1.3. Don't ask me how it works,
b/c I haven't tried it yet, but maybe Sean Timm or someone can help out.
I'm not sure if it returns partial results or not.
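
(From a quick look at SOLR-502, it appears to be exposed as a timeAllowed
request parameter in milliseconds, something like

    http://localhost:8983/solr/select?q=shirt&timeAllowed=500

with the response flagged as partial when the limit is hit. I haven't
tried it, though, so treat that as a best guess.)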
Also, what kind of caching/warming do you do? How often do these slow
queries appear? Have you profiled your application yet? How many results
are you retrieving?

In some cases, you may just want to figure out how to return a cached set
of results for your most frequent, slow queries. I mean, if you know
"shirt" is going to retrieve 2 million docs, what difference does it make
if it really has 2 million and 1 documents? Do the query once, cache the
top, oh, 1000, and be done. It doesn't even necessarily need to hit Solr.
I know, I know, it's not search, but most search applications do these
kinds of things.
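
A minimal sketch of that idea, fronting the search call with a cache of
top results for known-expensive queries (the class and the queryTopIds
hook are made up for illustration; wire it to whatever client you use):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class FrequentQueryCache {
        private static final int TOP_N = 1000;
        // query string -> top document ids, computed once
        private final Map<String, int[]> topIds =
            new ConcurrentHashMap<String, int[]>();

        public int[] getTopIds(String q) {
            int[] cached = topIds.get(q);
            if (cached != null) {
                return cached;               // expensive query served from cache
            }
            int[] ids = queryTopIds(q, TOP_N);   // hit Solr once for the top N
            topIds.put(q, ids);
            return ids;
        }

        // Placeholder: run the real query (e.g. via SolrJ) and return doc ids.
        private int[] queryTopIds(String q, int n) {
            throw new UnsupportedOperationException("plug in your search client");
        }
    }

Paging then becomes a slice of the cached array rather than a new query.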
Still, it would be nice if there were a better solution for you.
On Sep 12, 2008, at 2:17 PM, Jason Rennie wrote:
Thanks for all the replies!
Mike: we're not using pf. Our qf is always "status:0". The "status" field
is "0" for all good docs (90%+) and some other integer for any docs we
don't want returned.
Jeyrl: federated search is definitely something we'll consider.
On Fri, Sep 12, 2008 at 8:39 AM, Grant Ingersoll
<[EMAIL PROTECTED]> wrote:
The bottleneck may simply be that there are a lot of docs to score since
you are using fairly common terms.

Yeah, I'm coming to the realization that it may be as simple as that.
Even a short, simple query like "shirt" can take seconds to return,
presumably because it hits ("numFound") 2 million docs.
Also, what file format (compound, non-compound) are you using? Is it
optimized? Have you profiled your app for these queries? When you say the
"query is longer", define "longer"... 5 terms? 50 terms? Do you have lots
of deleted docs? Can you share your DisMax params? Are you doing wildcard
queries? Can you share the syntax of one of the offending queries?
I think we're using the non-compound format. We see eight different files
(fdt, fdx, fnm, etc.) in an optimized index. Yes, it's optimized. It's
also read-only; we don't update/delete. DisMax: we specify qf, fl, mm,
fq; mm=1; we use boosts for qf. No wildcards. Example query: "shirt"; it
takes 2 secs to run according to the Solr log and hits 2 million docs.
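
For concreteness, the requests have roughly this shape, assuming a dismax
handler registered as qt=dismax (mm=1 matches the settings above; the
field names, boosts, filter, and fl list are placeholders rather than our
real config):

    http://localhost:8983/solr/select?qt=dismax&q=shirt
        &qf=name^2.0+description
        &mm=1
        &fq=status:0
        &fl=id,name
        &rows=10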
Since you want to keep "stopwords", you might consider a slightly
better
use of them, whereby you use them in n-grams only during query parsing.
Not sure what you mean here...
See also https://issues.apache.org/jira/browse/LUCENE-494 for related
stuff.
Thanks for the pointer.
Jason
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ