On 2-Jan-08, at 9:52 PM, Alex Benjamen wrote:
> Thanks for the input, it's really valuable. Several forum users have
> suggested using fq to separate the caching of filters, and I can
> immediately see how this would help. I'm changing the code right now
> and going to run some benchmarks; hopefully I'll see a big gain just
> from that.
Sure. Make sure you are using a realistic query distribution. If
you are always picking random unique values for everything, there
might be less of a gain. Also, even without profiling, it can be
quite valuable to track the time for each query and look at the
reverse sorted list: it tends to quickly identify troublesome
inputs. For instance, you might find that the slowest search
contains a 64-clause disjunction (age:22-85).
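To make the reverse-sorted list concrete: if you can capture each
query string alongside its elapsed time during the benchmark run, a
throwaway sketch like this (hypothetical Java; the record shape is
made up) is all it takes to surface the troublesome inputs:

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Throwaway sketch: collect (query, elapsed ms) pairs during a benchmark
// run, then print them slowest-first to spot troublesome inputs.
public class SlowestQueries {
    static class Timing {
        final String query;
        final long millis;
        Timing(String query, long millis) { this.query = query; this.millis = millis; }
    }

    public static void printSlowestFirst(List<Timing> timings, int topN) {
        Collections.sort(timings, new Comparator<Timing>() {
            public int compare(Timing a, Timing b) {
                // descending by elapsed time
                return Long.valueOf(b.millis).compareTo(Long.valueOf(a.millis));
            }
        });
        for (int i = 0; i < Math.min(topN, timings.size()); i++) {
            Timing t = timings.get(i);
            System.out.println(t.millis + " ms  " + t.query);
        }
    }
}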
>> - use range queries when querying contiguous disjunctions
>> (age:[28 TO 33] rather than what you have above).
> I actually started with the above, using an int type field, and it
> somehow seemed slower than using explicit values, but I will
> certainly try again.
>> - convert the expensive, heap-based age filter disjunction into a
>> bitset created directly from the term enum
> Can you please elaborate a little more? Are you advising us to use
> fq=age:[28 TO 33], or should that simply be part of the regular
> query? Also, what is the best "type" to use when defining age? I'm
> currently using "text"; should I use "int" instead? I didn't see any
> difference when using the type "int".
It doesn't matter if you're just searching like this. I was going to
warn you about padding issues, but since this is for a dating app it
is unlikely that you will have to worry about 1- or 3-digit ages.
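As for "a bitset created directly from the term enum": the idea is
that instead of letting a many-clause BooleanQuery heap-merge one
scorer per age value, you walk the terms of the age field between the
two endpoints and OR each term's documents into a single bitset. A
rough sketch against the Lucene API of this vintage (the calls are
from memory, so treat it as illustrative; it also assumes ages are
all two digits, so lexicographic term order matches numeric order):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class AgeRangeBits {
    // Build a filter bitset for age:[lo TO hi] straight from the term dictionary.
    public static BitSet bits(IndexReader reader, String field, String lo, String hi)
            throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermEnum terms = reader.terms(new Term(field, lo));  // first term >= lo
        TermDocs docs = reader.termDocs();
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)
                        || t.text().compareTo(hi) > 0) {
                    break;                                   // walked past the range
                }
                docs.seek(terms);                            // all docs with this age
                while (docs.next()) {
                    bits.set(docs.doc());
                }
            } while (terms.next());
        } finally {
            docs.close();
            terms.close();
        }
        return bits;
    }
}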
> One of the issues is that the age ranges are not "pre-defined" -
> they can be any combination: 22-23, 22-85, 45-49, etc. I realize
> that pre-defining age ranges would drastically improve performance,
> but then we're greatly reducing the value of this type of search.
Yes, but you can compose age ranges to gain performance without
losing flexibility. Imagine you index a field "age_mod_five" where
age_mod_five:25 means the person is between the ages of 25-29
inclusive. Then you can transform a 64-clause disjunction into a
4-clause disjunction plus a 12-bucket range query:
fq=age:(22 OR 23 OR 24 OR 85) OR age_mod_five:[25 TO 80]
By the way, the OR's are implied, so they are not necessary in the
above (nor in the other examples you posted).
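In case it helps, the query-time bucketing is mechanical. A small
sketch (hypothetical helper; it assumes age_mod_five is indexed as
the age rounded down to a multiple of five, as above):

import java.util.ArrayList;
import java.util.List;

public class ModFiveRange {

    // e.g. fqFor(22, 85) -> "age:(22 OR 23 OR 24 OR 85) OR age_mod_five:[25 TO 80]"
    public static String fqFor(int lo, int hi) {
        int firstBucket = ((lo + 4) / 5) * 5;   // smallest bucket fully inside [lo, hi]
        int lastBucket  = ((hi - 4) / 5) * 5;   // largest bucket fully inside [lo, hi]

        List<Integer> leftovers = new ArrayList<Integer>();
        if (firstBucket > lastBucket) {         // range too narrow for any whole bucket
            for (int a = lo; a <= hi; a++) leftovers.add(a);
            return "age:(" + join(leftovers) + ")";
        }
        for (int a = lo; a < firstBucket; a++) leftovers.add(a);
        for (int a = lastBucket + 5; a <= hi; a++) leftovers.add(a);

        String buckets = "age_mod_five:[" + firstBucket + " TO " + lastBucket + "]";
        if (leftovers.isEmpty()) return buckets;
        return "age:(" + join(leftovers) + ") OR " + buckets;
    }

    private static String join(List<Integer> ages) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ages.size(); i++) {
            if (i > 0) sb.append(" OR ");
            sb.append(ages.get(i));
        }
        return sb.toString();
    }
}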
Perhaps a better option is to do an "inclusion-exclusion" trick.
Take again the example of age:[22 TO 85]. It is really just the
range 20-89, excluding a few years. So convert it into:
fq=age_mod_five:[20 TO 85]
fq=-age:20
fq=-age:21
fq=-age:86
fq=-age:87
fq=-age:88
fq=-age:89
Hopefully the mod-5 filters will be reused enough to be performant,
and so too with the individual ages. '-' is the NOT operator, btw.
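The inclusion-exclusion variant is just as mechanical; a sketch
(hypothetical helper, same assumed age_mod_five field) that returns
one fq string per filter, so each bitset is cached and reused
independently:

import java.util.ArrayList;
import java.util.List;

public class InclusionExclusionRange {

    // e.g. fqsFor(22, 85) -> [age_mod_five:[20 TO 85], -age:20, -age:21, -age:86 ... -age:89]
    public static List<String> fqsFor(int lo, int hi) {
        int loBucket = (lo / 5) * 5;            // bucket containing lo
        int hiBucket = (hi / 5) * 5;            // bucket containing hi
        List<String> fqs = new ArrayList<String>();
        fqs.add("age_mod_five:[" + loBucket + " TO " + hiBucket + "]");
        for (int a = loBucket; a < lo; a++) fqs.add("-age:" + a);          // excluded below lo
        for (int a = hi + 1; a <= hiBucket + 4; a++) fqs.add("-age:" + a); // excluded above hi
        return fqs;
    }
}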
If you go far down this route, one more idea for you is to use
open-ended ranges. Imagine a field called "younger_than", in which
you index _every_ age greater than the age of the person, up to some
maximum (say 99), so that younger_than:X matches exactly the people
whose age is below X. You can then create any range given two
constraints. age:22-85 becomes:
fq=younger_than:86
fq=-younger_than:22
There would be one bitset per valid age, and soon every possible
range will be served from the cache with one bitset operation. Your
index might be a tad large, depending on the median age of your
"documents", since the younger the person, the more terms get
indexed. If your target skews toward the youthful rather than the
gerontological, it is better to index "older_than" and invert the
logic.
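To make the indexing side concrete, a sketch of what would go into
the "younger_than" field for one person, plus the two filters for an
arbitrary range (hypothetical helper; 18 and 99 are assumed age
bounds):

import java.util.ArrayList;
import java.util.List;

public class YoungerThan {
    static final int MIN_AGE = 18;   // assumed floor
    static final int MAX_AGE = 99;   // assumed cap

    // Terms to index for one person: a 30-year-old gets younger_than:31 .. younger_than:99,
    // so the filter younger_than:X matches exactly the people whose age is below X.
    public static List<String> termsFor(int age) {
        List<String> terms = new ArrayList<String>();
        for (int x = age + 1; x <= MAX_AGE; x++) {
            terms.add(Integer.toString(x));
        }
        return terms;
    }

    // Two cached filters cover any range: e.g. fqsFor(22, 85) -> [younger_than:86, -younger_than:22]
    public static List<String> fqsFor(int lo, int hi) {
        List<String> fqs = new ArrayList<String>();
        if (hi < MAX_AGE) fqs.add("younger_than:" + (hi + 1));  // age <= hi
        if (lo > MIN_AGE) fqs.add("-younger_than:" + lo);       // age >= lo
        return fqs;
    }
}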
These gymnastics are somewhat silly, since (as others have mentioned)
full-text search logic is a poor tool for this job. Usually it
doesn't make much of a difference, but when you're scaling to a huge
level, you need to use the right tool for the job.
This doesn't necessarily mean that you shouldn't use Solr, though,
just that it would be best to solve this problem using a better data
structure. (In this case, the most memory-efficient and probably
fastest method is to implement a filter using the FieldCache that
looks up the age of each doc under consideration and does a range
check. Coaxing Solr into using it correctly might be a little
tricky.)
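For what it's worth, a minimal sketch of that filter against the
Lucene Filter API of this vintage (bits() returning a
java.util.BitSet); it assumes "age" is indexed as a single
un-tokenized term per document so the FieldCache can load it as ints:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Filter;

public class AgeFieldCacheFilter extends Filter {
    private final int lo, hi;

    public AgeFieldCacheFilter(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        // One int per document, loaded once per reader and then cached.
        int[] ages = FieldCache.DEFAULT.getInts(reader, "age");
        BitSet bits = new BitSet(reader.maxDoc());
        for (int doc = 0; doc < ages.length; doc++) {
            if (ages[doc] >= lo && ages[doc] <= hi) {
                bits.set(doc);
            }
        }
        return bits;
    }
}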
good luck,
-Mike