Re: What's the bottleneck?

Ken Krugler Fri, 12 Sep 2008 13:08:47 -0700

Thanks for all the replies!

Mike: we're not using pf.  Our qf is always "status:0".  The "status" field
is "0" for all good docs (90%+) and some other integer for any docs we don't
want returned.


Jeyrl: federated search is definitely something we'll consider.

On Fri, Sep 12, 2008 at 8:39 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

 The bottleneck may simply be there are a lot of docs to score since you are
 using fairly common terms.


Yeah, I'm coming to the realization that it may be as simple as that.  Even
a short, simple query like "shirt" can take seconds to return, presumably
because it hits ("numFound") 2 million docs.

 Also, what file format (compound, non-compound) are you using?  Is it
 optimized?  Have you profiled your app for these queries?  When you say the
 "query is longer", define "longer"...  5 terms?  50 terms?  Do you have lots
 of deleted docs?  Can you share your DisMax params?  Are you doing wildcard
 queries?  Can you share the syntax of one of the offending queries?



I think we're using the non-compound format.  We see eight different files
(fdt, fdx, fnm, etc.) in an optimized index.  Yes, it's optimized.  It's
also read-only---we don't update/delete.  DisMax: we specify qf, fl, mm, fq;
mm=1; we use boosts for qf.  No wildcards.  Example query: "shirt"; takes 2
secs to run according to the solr log, hits 2 million docs.
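
For readers following along, here is a rough sketch of what a request with those DisMax parameters might look like when built by hand. The field names and boosts in qf and fl are made up for illustration, defType=dismax assumes a Solr version that supports that parameter, and putting status:0 in fq is an assumption about how the filter is applied; none of this is taken from the actual configuration discussed above.

```python
from urllib.parse import urlencode


def dismax_query_string(user_query):
    """Build a URL query string for a DisMax request (illustrative only)."""
    params = {
        "defType": "dismax",                  # assumption: Solr version with defType
        "q": user_query,                      # e.g. the one-word query "shirt"
        "qf": "title^2.0 description^1.0",    # hypothetical fields and boosts
        "fl": "id,score",                     # hypothetical returned fields
        "mm": "1",                            # at least one clause must match
        "fq": "status:0",                     # assumption: status filter goes in fq
    }
    return urlencode(params)


if __name__ == "__main__":
    print(dismax_query_string("shirt"))
```

With mm=1, any document containing a single query term is a candidate, so a common term such as "shirt" still forces scoring of every matching document.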


 Since you want to keep "stopwords", you might consider a slightly better
 use of them, whereby you use them in n-grams only during query parsing.



Not sure what you mean here...

You might want to look at how Nutch handles this issue. Nutch also has
stopwords that it wants to keep around. So what it does is generate
combo terms like the-<next term> in the index. The query parser does the
same thing, so that if your query phrase has common terms, you wind up
searching across a much smaller slice of your index.

This comes, of course, at the expense of a larger index with a lot more
unique terms (due to all of the combo terms).

But this can be a big win - for example, at our site
(http://www.krugle.org) we index source files. Without this
optimization, searches could take several seconds. With it, we got down
to < 100ms with lots of breathing room.
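
To make the combo-term idea concrete, here is a minimal sketch in Python. It is a simplification for illustration, not Nutch's actual code: the stopword set, the "-" separator, and the function names index_terms and query_terms are all assumptions. At index time every token is kept and each common term is additionally fused with the token that follows it; at query time a common term and its successor are replaced by the single fused term.

```python
# Hypothetical list of common terms; a real one would be tuned to the corpus.
COMMON_TERMS = {"the", "a", "an", "of", "to", "in", "for", "if", "is"}


def index_terms(tokens, common=COMMON_TERMS, sep="-"):
    """Terms to index: every token, plus a combo term after each common token."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if tok in common and i + 1 < len(tokens):
            out.append(tok + sep + tokens[i + 1])
    return out


def query_terms(tokens, common=COMMON_TERMS, sep="-"):
    """Terms to search: a common token and its successor collapse into one combo term."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in common and i + 1 < len(tokens):
            out.append(tok + sep + tokens[i + 1])
            i += 2  # the successor is already covered by the combo term
        else:
            out.append(tok)
            i += 1
    return out


if __name__ == "__main__":
    doc = ["the", "blue", "shirt"]
    print(index_terms(doc))
    print(query_terms(doc))
```

The query side never looks up the bare common term when a combo term is available, so the posting list it walks is the one for "the-blue" rather than the far longer one for "the". The price is the extra unique terms added on the index side.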


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
