On 2-Jan-08, at 9:52 PM, Alex Benjamen wrote:


Thanks for the input, it's really valuable. Several forum users have suggested using fq to separate the caching of filters, and I can immediately see how this would help. I'm changing the code right now and am going to run some benchmarks; hopefully I'll see a big gain just from that.

Sure. Make sure you are using a realistic query distribution. If you are always picking random unique values for everything, there might be less of a gain. Also, even without profiling, it can be quite valuable to track the time for each query and look at the reverse-sorted list: it tends to quickly identify troublesome inputs. For instance, you might find that the slowest search contains a 64-clause disjunction (age:22-85).
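A minimal sketch of that timing idea, in Python. The `run_query` callable is a hypothetical stand-in for whatever sends a query to your server; nothing here is part of Solr itself.

```python
import time
from typing import Callable, Iterable

def slowest_queries(queries: Iterable[str],
                    run_query: Callable[[str], object],
                    top: int = 10) -> list[tuple[float, str]]:
    """Time each query and return the slowest ones first."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)  # hypothetical: executes q against the search server
        timings.append((time.perf_counter() - start, q))
    timings.sort(reverse=True)  # longest elapsed time first
    return timings[:top]
```

Feeding it a sample of real query logs, rather than synthetic random values, keeps the cache behaviour close to what production would see.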


- use range queries when querying contiguous disjunctions (age:[28 TO 33] rather than what you have above).
I actually started with the above, using an int-type field, and it somehow seemed slower than using the explicit disjunction, but I will certainly try again.


- convert the expensive, heap-based age filter disjunction into a bitset created directly from the term enum
Can you pls. elaborate a little more? Are you advising to use fq=age:[28 TO 33], or should that simply be part of the regular query? Also, what is the best "type" to use when defining age? I'm currently using "text", should I use "int" instead... I didn't see any difference with using the type "int".

It doesn't matter if you're just searching like this. I was going to warn you about padding issues, but since this is for a dating app it is unlikely that you will have to worry about 1- or 3-digit ages.
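To illustrate the padding issue, here is a small Python model of a range check on a field whose terms are compared as strings. That string comparison is the assumption here; the function names are made up for the example and are not Solr or Lucene APIs.

```python
def in_range_lexicographic(value: str, low: str, high: str) -> bool:
    """Range check that compares terms as strings, not as numbers."""
    return low <= value <= high

def pad(age: int, width: int = 3) -> str:
    """Zero-pad so that string order agrees with numeric order."""
    return str(age).zfill(width)

# Two-digit ages behave the same as numbers would.
assert in_range_lexicographic("30", "28", "33")
# A one-digit age sorts wrongly: the string "3" falls between "28" and "33".
assert in_range_lexicographic("3", "28", "33")
# A three-digit age also sorts wrongly: "100" is below "28" as a string.
assert not in_range_lexicographic("100", "28", "999")
# With padding, both cases come out right.
assert not in_range_lexicographic(pad(3), pad(28), pad(33))
assert in_range_lexicographic(pad(100), pad(28), pad(999))
```

As long as every indexed age has exactly two digits, the unpadded and padded forms give the same answers.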

One of the issues is that the age ranges are not "pre-defined" - they can be any combination, 22-23, 22-85, 45-49, etc. I realize that pre-defining age ranges would drastically improve performance, but then we're greatly reducing the value of this type of search.

Yes, but you can compose age ranges to gain performance without losing flexibility. Imagine you index a field "age_mod_five" where age_mod_five:25 means the person is between the ages of 25 and 29 inclusive. Then you can transform a 64-clause disjunction into a 4-clause disjunction and a 12-term range query:

fq=age:(22 OR 23 OR 24 OR 85) OR age_mod_five:[25 TO 80]

By the way, the OR's are implied, so they are not necessary in the above (nor in the other examples you posted).
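The decomposition can be generated mechanically. A minimal Python sketch, assuming the field names "age" and "age_mod_five" from above and five-year buckets labelled by their lowest age; the function name is hypothetical:

```python
STEP = 5  # width of one age_mod_five bucket

def compose_age_filter(low: int, high: int) -> str:
    """Cover [low, high] with whole buckets plus leftover single ages."""
    first_bucket = -(-low // STEP) * STEP   # smallest multiple of STEP >= low
    end = (high + 1) // STEP * STEP         # whole buckets cover [first_bucket, end)
    if first_bucket >= end:                 # range too narrow for a whole bucket
        ages = range(low, high + 1)
        return "age:(" + " OR ".join(map(str, ages)) + ")"
    # Ages below the first whole bucket and above the last one.
    singles = list(range(low, first_bucket)) + list(range(end, high + 1))
    buckets = f"age_mod_five:[{first_bucket} TO {end - STEP}]"
    if not singles:
        return buckets
    return "age:(" + " OR ".join(map(str, singles)) + ") OR " + buckets

print("fq=" + compose_age_filter(22, 85))
```

For 22-85 the leftover ages are 22, 23, 24 and 85, and the whole buckets run from 25 to 80.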

Perhaps a better option is to do an "inclusion-exclusion" trick. Take again the example of age:[22 TO 85]. It is really just the range 20-89, excluding a few years. So convert it into:

fq=age_mod_five:[20 TO 85]
fq=-age:20
fq=-age:21
fq=-age:86
fq=-age:87
fq=-age:88
fq=-age:89

Hopefully the mod-5 filters will be reused enough to be performant, and so too with the individual ages. '-' is the NOT operator, btw.
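A short Python sketch of how the inclusion-exclusion filters could be built for an arbitrary range, again assuming five-year buckets labelled by their lowest age. The function name is made up for illustration:

```python
STEP = 5  # width of one age_mod_five bucket

def inclusion_exclusion_filters(low: int, high: int) -> list[str]:
    """Return one fq per list entry: an enclosing bucket range, then exclusions."""
    outer_low = low // STEP * STEP    # bucket that contains low
    outer_high = high // STEP * STEP  # bucket that contains high
    filters = [f"age_mod_five:[{outer_low} TO {outer_high}]"]
    # Ages inside the enclosing buckets but outside the requested range.
    excluded = list(range(outer_low, low)) + list(range(high + 1, outer_high + STEP))
    filters += [f"-age:{age}" for age in excluded]
    return filters

for fq in inclusion_exclusion_filters(22, 85):
    print("fq=" + fq)
```

Each exclusion is its own filter, so each one can be cached and reused independently of the range it was first requested with.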

If you go far down this route, one more idea for you is to use open-ended ranges. Imagine a field called "younger_than", in which you index _every_ age greater than the age of the person, up to some maximum (like 100), so that younger_than:86 matches everyone aged 85 or less. You can then create any range given two constraints. age:22-85 becomes:

fq=younger_than:86
fq=-younger_than:22

There would be one bitset per valid age, and soon every possible range will be served from cache with one bitset operation. Your index might be a tad large, depending on the median age of your "documents". If your target tends toward the young, it is better to index "older_than" (every age less than the age of the person, down to some minimum like 18) and invert the logic.
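Here is a small Python model of the open-ended trick. It assumes "younger_than" holds every age strictly greater than the person's own, which is the reading under which younger_than:86 matches people aged 85 or less. The cap of 100 and the function names are assumptions made for the example.

```python
MAX_TERM = 100  # assumed largest indexed term; supports ages up to 99

def younger_than_terms(age: int) -> set[int]:
    """Terms indexed for one person: every age strictly greater than theirs."""
    return set(range(age + 1, MAX_TERM + 1))

def matches(age: int, low: int, high: int) -> bool:
    """Model of fq=younger_than:(high + 1) together with fq=-younger_than:low."""
    terms = younger_than_terms(age)
    return (high + 1) in terms and low not in terms

# age:22-85 becomes younger_than:86 AND NOT younger_than:22
hits = [age for age in range(18, 100) if matches(age, 22, 85)]
assert hits == list(range(22, 86))
```

Under this scheme a young person carries many terms and an old person few, which is where the index-size trade-off comes from.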

These gymnastics are somewhat silly, since (as others have mentioned) full-text search logic is a poor tool for this job. Usually it doesn't make much of a difference, but when you're scaling to a huge level, you need to use the right tool for the job.

This doesn't necessarily mean that you shouldn't use Solr, though, just that it would be best to solve this problem using a better data structure. (In this case, the most memory-efficient and probably fastest method is to implement a filter using a FieldCache that looks up the age of each doc under consideration and does a range check. Coaxing Solr into using it correctly might be a little tricky.)
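The idea behind such a filter can be modelled in a few lines of Python. This is only a sketch of the data structure, not Lucene's FieldCache API: the array stands in for a per-document cache of ages indexed by doc id, and all names are hypothetical.

```python
from array import array
from typing import Iterable, Iterator

def age_range_filter(ages_by_doc: array, candidates: Iterable[int],
                     low: int, high: int) -> Iterator[int]:
    """Yield the candidate doc ids whose cached age lies in [low, high]."""
    for doc_id in candidates:
        if low <= ages_by_doc[doc_id] <= high:
            yield doc_id

# One unsigned byte per document; the position in the array is the doc id.
ages_by_doc = array("B", [19, 22, 47, 85, 86])

print(list(age_range_filter(ages_by_doc, range(len(ages_by_doc)), 22, 85)))
```

The cost is one byte per document regardless of how many distinct ranges are queried, and no per-range bitset needs to be cached.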

good luck,
-Mike




