On 2-Jan-08, at 9:52 PM, Alex Benjamen wrote:
> Thanks for the input, it's really valuable. Several forum users have
> suggested using fq to separate the caching of filters, and I can
> immediately see how this would help. I'm changing the code right now
> and going to run some benchmarks; hopefully I'll see a big gain just
> from that.
Sure. Make sure you are using a realistic query distribution. If
you are always picking random unique values for everything, there
might be less of a gain. Also, even without profiling, it can be
quite valuable to track the time for each query and look at the
reverse sorted list: it tends to quickly identify troublesome
inputs. For instance, you might find that the slowest search
contains a 64-clause disjunction (age:22-85).
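To make the reverse-sorted list concrete: if you can capture each
query string alongside its elapsed time during the benchmark run, a
throwaway sketch like this (hypothetical Java; the record shape is
made up) is all it takes to surface the troublesome inputs:

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Throwaway sketch: collect (query, elapsed ms) pairs during a benchmark
// run, then print them slowest-first to spot troublesome inputs.
public class SlowestQueries {
    static class Timing {
        final String query;
        final long millis;
        Timing(String query, long millis) { this.query = query; this.millis = millis; }
    }

    public static void printSlowestFirst(List<Timing> timings, int topN) {
        Collections.sort(timings, new Comparator<Timing>() {
            public int compare(Timing a, Timing b) {
                // descending by elapsed time
                return Long.valueOf(b.millis).compareTo(Long.valueOf(a.millis));
            }
        });
        for (int i = 0; i < Math.min(topN, timings.size()); i++) {
            Timing t = timings.get(i);
            System.out.println(t.millis + " ms  " + t.query);
        }
    }
}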
>> - use range queries when querying contiguous disjunctions
>> (age:[28 TO 33] rather than what you have above).
> I actually started with the above, using an int type field, and it
> somehow seemed slower than using explicit values, but I will
> certainly try again.
>> - convert the expensive, heap-based age filter disjunction into a
>> bitset created directly from the term enum
> Can you please elaborate a little more? Are you advising us to use
> fq=age:[28 TO 33], or should that simply be part of the regular
> query? Also, what is the best "type" to use when defining age? I'm
> currently using "text"; should I use "int" instead? I didn't see any
> difference when using the type "int".
It doesn't matter if you're just searching like this. I was going to
warn you about padding issues, but since this is for a dating app it
is unlikely that you will have to worry about 1- or 3-digit ages.
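As for "a bitset created directly from the term enum": the idea is
that instead of letting a many-clause BooleanQuery heap-merge one
scorer per age value, you walk the terms of the age field between the
two endpoints and OR each term's documents into a single bitset. A
rough sketch against the Lucene API of this vintage (the calls are
from memory, so treat it as illustrative; it also assumes ages are
all two digits, so lexicographic term order matches numeric order):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class AgeRangeBits {
    // Build a filter bitset for age:[lo TO hi] straight from the term dictionary.
    public static BitSet bits(IndexReader reader, String field, String lo, String hi)
            throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermEnum terms = reader.terms(new Term(field, lo));  // first term >= lo
        TermDocs docs = reader.termDocs();
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)
                        || t.text().compareTo(hi) > 0) {
                    break;                                   // walked past the range
                }
                docs.seek(terms);                            // all docs with this age
                while (docs.next()) {
                    bits.set(docs.doc());
                }
            } while (terms.next());
        } finally {
            docs.close();
            terms.close();
        }
        return bits;
    }
}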
> One of the issues is that the age ranges are not "pre-defined" -
> they can be any combination: 22-23, 22-85, 45-49, etc. I realize
> that pre-defining age ranges would drastically improve performance,
> but then we're greatly reducing the value of this type of search.
Yes, but you can compose age ranges to gain performance without
losing flexibility. Imagine you index a field "age_mod_five" where
age_mod_five:25 means the person is between the ages of 25-29
inclusive. Then you can transform a 64-clause disjunction into a
4-clause disjunction plus a 12-bucket range query:
fq=age:(22 OR 23 OR 24 OR 85) OR age_mod_five:[25 TO 80]
By the way, the OR's are implied, so they are not necessary in the
above (nor in the other examples you posted).
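In case it helps, the query-time bucketing is mechanical. A small
sketch (hypothetical helper; it assumes age_mod_five is indexed as
the age rounded down to a multiple of five, as above):

import java.util.ArrayList;
import java.util.List;

public class ModFiveRange {

    // e.g. fqFor(22, 85) -> "age:(22 OR 23 OR 24 OR 85) OR age_mod_five:[25 TO 80]"
    public static String fqFor(int lo, int hi) {
        int firstBucket = ((lo + 4) / 5) * 5;   // smallest bucket fully inside [lo, hi]
        int lastBucket  = ((hi - 4) / 5) * 5;   // largest bucket fully inside [lo, hi]

        List<Integer> leftovers = new ArrayList<Integer>();
        if (firstBucket > lastBucket) {         // range too narrow for any whole bucket
            for (int a = lo; a <= hi; a++) leftovers.add(a);
            return "age:(" + join(leftovers) + ")";
        }
        for (int a = lo; a < firstBucket; a++) leftovers.add(a);
        for (int a = lastBucket + 5; a <= hi; a++) leftovers.add(a);

        String buckets = "age_mod_five:[" + firstBucket + " TO " + lastBucket + "]";
        if (leftovers.isEmpty()) return buckets;
        return "age:(" + join(leftovers) + ") OR " + buckets;
    }

    private static String join(List<Integer> ages) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ages.size(); i++) {
            if (i > 0) sb.append(" OR ");
            sb.append(ages.get(i));
        }
        return sb.toString();
    }
}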
Perhaps a better option is to do an "inclusion-exclusion" trick.
Take again the example of age:[22 TO 85]. It is really just the
range 20-89, excluding a few years. So convert it into:
fq=age_mod_five:[20 TO 85]
fq=-age:20
fq=-age:21
fq=-age:86
fq=-age:87
fq=-age:88
fq=-age:89
Hopefully the mod-5 filters will be reused enough to be performant,
and so too with the individual ages. '-' is the NOT operator, btw.
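The inclusion-exclusion variant is just as mechanical; a sketch
(hypothetical helper, same assumed age_mod_five field) that returns
one fq string per filter, so each bitset is cached and reused
independently:

import java.util.ArrayList;
import java.util.List;

public class InclusionExclusionRange {

    // e.g. fqsFor(22, 85) -> [age_mod_five:[20 TO 85], -age:20, -age:21, -age:86 ... -age:89]
    public static List<String> fqsFor(int lo, int hi) {
        int loBucket = (lo / 5) * 5;            // bucket containing lo
        int hiBucket = (hi / 5) * 5;            // bucket containing hi
        List<String> fqs = new ArrayList<String>();
        fqs.add("age_mod_five:[" + loBucket + " TO " + hiBucket + "]");
        for (int a = loBucket; a < lo; a++) fqs.add("-age:" + a);          // excluded below lo
        for (int a = hi + 1; a <= hiBucket + 4; a++) fqs.add("-age:" + a); // excluded above hi
        return fqs;
    }
}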
If you go far down this route, one more idea for you is to use
open-ended ranges. Imagine a field called "younger_than", in which
you index _every_ age greater than the age of the person, up to some
maximum (say 99), so that younger_than:X matches exactly the people
whose age is below X. You can then create any range given two
constraints. age:22-85 becomes:
fq=younger_than:86
fq=-younger_than:22
There would be one bitset per valid age, and soon every possible
range will be served from the cache with one bitset operation. Your
index might be a tad large, depending on the median age of your
"documents", since the younger the person, the more terms get
indexed. If your target skews toward the youthful rather than the
gerontological, it is better to index "older_than" and invert the
logic.
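To make the indexing side concrete, a sketch of what would go into
the "younger_than" field for one person, plus the two filters for an
arbitrary range (hypothetical helper; 18 and 99 are assumed age
bounds):

import java.util.ArrayList;
import java.util.List;

public class YoungerThan {
    static final int MIN_AGE = 18;   // assumed floor
    static final int MAX_AGE = 99;   // assumed cap

    // Terms to index for one person: a 30-year-old gets younger_than:31 .. younger_than:99,
    // so the filter younger_than:X matches exactly the people whose age is below X.
    public static List<String> termsFor(int age) {
        List<String> terms = new ArrayList<String>();
        for (int x = age + 1; x <= MAX_AGE; x++) {
            terms.add(Integer.toString(x));
        }
        return terms;
    }

    // Two cached filters cover any range: e.g. fqsFor(22, 85) -> [younger_than:86, -younger_than:22]
    public static List<String> fqsFor(int lo, int hi) {
        List<String> fqs = new ArrayList<String>();
        if (hi < MAX_AGE) fqs.add("younger_than:" + (hi + 1));  // age <= hi
        if (lo > MIN_AGE) fqs.add("-younger_than:" + lo);       // age >= lo
        return fqs;
    }
}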
These gymnastics are somewhat silly, since (as others have mentioned)
full-text search logic is a poor tool for this job. Usually it
doesn't make much of a difference, but when you're scaling to a huge
level, you need to use the right tool for the job.
This doesn't necessarily mean that you shouldn't use Solr, though,
just that it would be best to solve this problem using a better data
structure. (In this case, the most memory-efficient and probably
fastest method is to implement a filter using the FieldCache that
looks up the age of each doc under consideration and does a range
check. Coaxing Solr into using it correctly might be a little
tricky.)
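For what it's worth, a minimal sketch of that filter against the
Lucene Filter API of this vintage (bits() returning a
java.util.BitSet); it assumes "age" is indexed as a single
un-tokenized term per document so the FieldCache can load it as ints:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Filter;

public class AgeFieldCacheFilter extends Filter {
    private final int lo, hi;

    public AgeFieldCacheFilter(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        // One int per document, loaded once per reader and then cached.
        int[] ages = FieldCache.DEFAULT.getInts(reader, "age");
        BitSet bits = new BitSet(reader.maxDoc());
        for (int doc = 0; doc < ages.length; doc++) {
            if (ages[doc] >= lo && ages[doc] <= hi) {
                bits.set(doc);
            }
        }
        return bits;
    }
}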
good luck,
-Mike