On 10/20/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
> On 10/20/06 10:24 AM, "Mike Klaas" <[EMAIL PROTECTED]> wrote:
>> Finally, it can be much faster to search a single field rather than
>> multiple fields. One hacky way of achieving this is to make a field
>> which receives a single copy of contents and eight copies of title.
>> This is imperfect, as it messes up length normalization and
>> summarizing.
>
> Matching a token eight times is probably faster than fetching
> a second field.
Definitely. Particularly if you are not using span queries, in which
case the eight times is just a change in count.
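
A minimal sketch of the combined-field trick, in plain Python rather than
Solr configuration. The function names and the sample text are made up for
illustration; the only thing taken from the thread is "one copy of contents,
eight copies of title".

```python
from collections import Counter

def build_combined_field(title, contents, title_copies=8):
    """Return one searchable string: the title repeated, then the contents."""
    return " ".join([title] * title_copies + [contents])

def term_counts(text):
    """Naive whitespace tokenizer; a real analyzer would do much more."""
    return Counter(text.lower().split())

# Hypothetical document, for illustration only.
combined = build_combined_field("solr performance",
                                "tuning solr for heavy query load")
counts = term_counts(combined)

# A title term shows up with a higher count, not as a separate field.
print(counts["solr"], counts["performance"], counts["tuning"])
```

For a plain term match the repetition only changes the term count, which is
the point made above. Anything that depends on positions or on field length
sees the repeated copies as extra text.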
> For titles, the normalization probably should
> be turned off anyway. Normalization is really there to compare
> 1000 word docs with 8000 word docs, not 3 word titles with 6 word
> titles.
Ah, but normalization is extremely valuable to make the title weigh
more heavily than the 1000-word content field. I generally leave the
default normalization for title fields, and do a hack for content
fields where I set a minimum length (you generally don't prefer 5-word
docs to 1000-word docs).
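
A rough sketch of the minimum-length hack, assuming the classic Lucene
length norm of 1/sqrt(number of terms). The function name and the floor of
1000 terms are assumptions for illustration; in Lucene itself a change like
this would go in a custom Similarity, which is not shown here.

```python
import math

def length_norm(num_terms, min_length=1):
    """1/sqrt(length), with short fields treated as if they had min_length terms."""
    return 1.0 / math.sqrt(max(num_terms, min_length))

# Title field: default normalization, so short titles get a large norm.
title_norm = length_norm(3)

# Content field: hypothetical floor of 1000 terms, so a 5-word doc
# is not preferred over a 1000-word doc.
short_doc_norm = length_norm(5, min_length=1000)
long_doc_norm = length_norm(1000, min_length=1000)

print(title_norm, short_doc_norm, long_doc_norm)
```

With the floor in place the 5-word and 1000-word content fields get the same
norm, while the 3-word title still gets a much larger one than either.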
> Maybe I'll try one searchable field per weight and check that
> for performance. Any rules of thumb about how the performance
> changes when different numbers of fields are searched?
With OR queries I'd expect it to be linear for similarly-sized fields.
Smaller fields will be much faster than longer ones, of course
(searching title+contents should be much less than double the cost of
searching just contents).
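
As a back-of-the-envelope illustration only: if the cost of an OR query is
assumed to grow linearly with the number of indexed terms in each field
searched, the relative cost can be estimated as below. The field sizes are
hypothetical, and this toy model ignores everything else that affects real
query time.

```python
def relative_cost(field_sizes, baseline):
    """Cost of searching all given fields, relative to searching only
    the baseline field, under a purely linear cost assumption."""
    return sum(field_sizes.values()) / field_sizes[baseline]

# Hypothetical average sizes, in terms per document.
sizes = {"title": 5, "contents": 1000}
ratio = relative_cost(sizes, "contents")

# Two similarly-sized fields, for comparison.
equal_ratio = relative_cost({"a": 1000, "b": 1000}, "a")

print(ratio, equal_ratio)
```

Under these assumptions two similarly-sized fields cost about twice as much
as one, while adding a short title field to contents adds very little.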
> Thanks for all the help. I'm trying to avoid premature optimization,
> but I'm starting with a load of 1-2 million queries/day, so I need
> to be ready to make it perform.
With that kind of query load, your optimization work should be largely
focused on caching, imo. Also consider that Solr should be able to
scale well in terms of query rate by adding another server, using
the built-in replication, and throwing a load-balancer in front of it.
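
To illustrate why caching matters at that query rate, here is a small
sketch of a result cache in front of a search backend. It uses Python's
functools.lru_cache, not Solr's own caches; run_search is a hypothetical
stand-in for the real backend and the cache size is arbitrary.

```python
from functools import lru_cache

CALLS = {"backend": 0}

def run_search(query):
    """Hypothetical stand-in for the real search backend."""
    CALLS["backend"] += 1
    return ("result for " + query,)

@lru_cache(maxsize=10000)
def cached_search(query):
    """Serve repeated queries from memory instead of hitting the backend."""
    return run_search(query)

# Repeated queries only reach the backend once per distinct query string.
for q in ["solr", "lucene", "solr", "solr"]:
    cached_search(q)

print(CALLS["backend"], cached_search.cache_info().hits)
```

How much this helps depends entirely on how often queries repeat, which is
worth measuring on real traffic before tuning anything else.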
-Mike