Thanks for all the replies!
Mike: we're not using pf. Our fq is always "status:0". The "status" field
is "0" for all good docs (90%+) and some other integer for any docs we don't
want returned.
Jeyrl: federated search is definitely something we'll consider.
On Fri, Sep 12, 2008 at 8:39 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> The bottleneck may simply be there are a lot of docs to score since you are
> using fairly common terms.
Yeah, I'm coming to the realization that it may be as simple as that. Even
a short, simple query like "shirt" can take seconds to return, presumably
because it hits ("numFound") 2 million docs.
> Also, what file format (compound, non-compound) are you using? Is it
> optimized? Have you profiled your app for these queries? When you say the
> "query is longer", define "longer"... 5 terms? 50 terms? Do you have lots
> of deleted docs? Can you share your DisMax params? Are you doing wildcard
> queries? Can you share the syntax of one of the offending queries?
I think we're using the non-compound format. We see eight different files
(fdt, fdx, fnm, etc.) in an optimized index. Yes, it's optimized. It's
also read-only: we don't update or delete. DisMax: we specify qf, fl, mm, fq;
mm=1; we use boosts for qf. No wildcards. Example query: "shirt"; takes 2
secs to run according to the solr log, hits 2 million docs.
> Since you want to keep "stopwords", you might consider a slightly better
> use of them, whereby you use them in n-grams only during query parsing.
Not sure what you mean here...
You might want to look at how Nutch handles this issue. Nutch also
has stopwords that it wants to keep around, so it generates
combo terms like the-<next term> in the index. The query
parser does the same thing, so that if your query phrase has common
terms, you wind up searching across a much smaller slice of your
index.
This comes, of course, at the expense of a larger index with a lot
more unique terms (due to all of the combo terms).
But this can be a big win - for example, at our site
(http://www.krugle.org) we index source files. Without this
optimization, searches could take several seconds. With it, we got
down to < 100ms with lots of breathing room.
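
To make the combo-term idea above concrete, here is a minimal sketch in
Python. It is not Nutch's actual code: the function names, the
COMMON_TERMS set and the sample docs are all made up for illustration,
and it ignores term positions, which a real analyzer would need to keep
for phrase queries. It only shows why the combo term's posting list is
smaller than the common term's.

```python
from collections import defaultdict

# Hypothetical list of common terms; a real list would come from index stats.
COMMON_TERMS = {"the", "of", "a", "to", "in"}


def index_terms(tokens, common=COMMON_TERMS):
    """Index time: keep every token, and add a combo term
    '<common>-<next>' for each common token that has a successor."""
    terms = []
    for i, tok in enumerate(tokens):
        terms.append(tok)
        if tok in common and i + 1 < len(tokens):
            terms.append(f"{tok}-{tokens[i + 1]}")
    return terms


def query_terms(tokens, common=COMMON_TERMS):
    """Query time: replace each common token that has a successor with
    its combo term, so the lookup never touches the huge posting list."""
    terms = []
    for i, tok in enumerate(tokens):
        if tok in common and i + 1 < len(tokens):
            terms.append(f"{tok}-{tokens[i + 1]}")
        else:
            terms.append(tok)
    return terms


def build_index(docs):
    """Toy inverted index: term -> set of doc ids (no positions)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text.lower().split()):
            postings[term].add(doc_id)
    return postings


def search(postings, query):
    """AND together the posting lists of the rewritten query terms."""
    lists = [postings.get(t, set()) for t in query_terms(query.lower().split())]
    return set.intersection(*lists) if lists else set()
```

With docs {1: "the red shirt", 2: "the blue hat", 3: "the red hat"},
the posting list for "the" covers all three docs, while "the-red" covers
only docs 1 and 3, so search(postings, "the red shirt") starts from the
smaller list. The price is the extra combo terms in the index, as noted
above.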
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"