On 10/20/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
On 10/20/06 10:24 AM, "Mike Klaas" <[EMAIL PROTECTED]> wrote:
> Finally, it can be much faster to search a single field rather than
> multiple fields. One hacky way of achieving this is to make a field
> which receives a single copy of contents and eight copies of title.
> This is imperfect, as it messes up length normalization and
> summarizing.
Matching a token eight times is probably faster than fetching
a second field. For titles, the normalization probably should
be turned off anyway. Normalization is really there to compare
1000 word docs with 8000 word docs, not 3 word titles with 6 word
titles.
Right. Depending on the nature of your titles, turning off length
normalization can sometimes improve relevance.
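
For reference, the single-field trick discussed above (one copy of contents, eight copies of title) could be sketched at indexing time roughly as below. This is only an illustration: the function names and the sample document are hypothetical, and the whitespace tokenizer stands in for whatever analyzer the real field uses.

```python
def build_weighted_field(title: str, contents: str, title_copies: int = 8) -> str:
    """Return the text for one searchable field: the title repeated
    title_copies times, followed by a single copy of the contents."""
    return " ".join([title] * title_copies + [contents])


def term_frequency(text: str, term: str) -> int:
    """Count occurrences of term using a naive lowercase whitespace tokenizer
    (an assumption; a real analyzer would tokenize differently)."""
    return text.lower().split().count(term.lower())


# Hypothetical document, for illustration only.
combined = build_weighted_field("Blade Runner", "A blade runner hunts replicants")
```

A term that appears once in the title and once in the contents ends up with a term frequency of 9 in the combined field, while a contents-only term stays at 1. The field also becomes longer, which is why length normalization gets distorted as noted above.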
Maybe I'll try one searchable field per weight and check that
for performance. Any rules of thumb about how the performance
changes when different numbers of fields are searched?
If it's a disjunction, it's pretty linear I'd say. I think
time(A OR B) will be close to time(A) + time(B).
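
A toy model of why a disjunction behaves this way: evaluating A OR B means walking every posting of both terms, so the work grows with the sum of the posting-list lengths. The sketch below is not Lucene's scorer, just a plain merge over sorted doc-id lists, and the lists themselves are made up.

```python
import heapq


def union_postings(*posting_lists):
    """Merge sorted doc-id lists for an OR query.

    Returns (matching doc ids, number of postings visited). Every posting
    in every list is visited exactly once.
    """
    visited = 0
    result = []
    for doc_id in heapq.merge(*posting_lists):
        visited += 1
        # Skip doc ids already emitted by another clause.
        if not result or result[-1] != doc_id:
            result.append(doc_id)
    return result, visited


# Hypothetical posting lists for terms A and B.
postings_a = [1, 3, 5, 7]
postings_b = [3, 4, 7, 9, 10]
docs, visited = union_postings(postings_a, postings_b)
```

Here `visited` equals `len(postings_a) + len(postings_b)`, which is the "time(A) + time(B)" behavior in miniature. Searching the same term in a second field adds another posting list to the merge in the same way.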
Thanks for all the help. I'm trying to avoid premature optimization,
but I'm starting with a load of 1-2 million queries/day, so I need
to be ready to make it perform.
That definitely seems doable.
How big is your index?
What's the form of your queries (AND, or sloppy phrase queries I'd imagine?)
If this is for Netflix (and isn't confidential), are you just
searching across DVD info/description, or in customer comments too?
If it is DVDs you're searching, that can't be a large collection, and
you should be in really good shape. You might even try indexing
things in separate fields and searching across all those fields while
assigning boosts separately... it should be fast enough. You might
also check out the dismax handler if you haven't yet.
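
As a rough illustration of the separate-fields approach, the dismax handler takes per-field boosts through its `qf` parameter in the form `field^boost`. The sketch below only builds the query string. The field names and boosts are hypothetical, and the `defType=dismax` parameter is an assumption about the Solr version in use: some setups instead select a handler registered under the name dismax with `qt`.

```python
from urllib.parse import urlencode


def dismax_query_string(user_query: str, field_boosts: dict) -> str:
    """Build URL query parameters for a dismax search with per-field boosts."""
    # qf is a space-separated list of field^boost pairs.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = {
        "q": user_query,
        "defType": "dismax",  # assumption: parser selection depends on Solr version
        "qf": qf,
    }
    return urlencode(params)


# Hypothetical fields and weights, for illustration only.
query_string = dismax_query_string("blade runner", {"title": 8, "contents": 1})
```

The resulting string would be appended to the select URL of your own Solr instance. Compared with the eight-copies field, this keeps title and contents separate, so each keeps its own length normalization.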
Any future plans for utilizing the faceted search?
-Yonik