RegexQuery performance

Jay Luker Thu, 08 Dec 2011 08:02:13 -0800

Hi,

I am trying to provide a means to search our corpus of nearly 2
million fulltext astronomy and physics articles using regular
expressions. A small percentage of our users need to be able to
locate, for example, certain types of identifiers that are present
within the fulltext (grant numbers, dataset identifiers, etc.).


My straightforward attempts to do this using RegexQuery have been
successful only in the sense that I get the results I'm looking for.
The performance, however, is pretty terrible, with most queries taking
five minutes or longer. Is this the performance I should expect
considering the size of my index and the massive number of terms? Are
there any alternative approaches I could try?

Things I've already tried:
  * reducing the sheer number of terms by adding a LengthFilter,
min=6, to my index analysis chain
  * swapping in the JakartaRegexpCapabilities
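
To illustrate the reasoning behind the LengthFilter item, here is a minimal sketch in plain Java. It is not Lucene code. It assumes the regular expression is tested against every unique term in the index, so the cost of a query grows with the number of terms. The class name, method names, and sample tokens (including the made-up "nag5-" grant-number format) are hypothetical.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class TermScanSketch {

    // Hypothetical stand-in for a length filter with min=6:
    // keep only tokens that are at least minLength characters long,
    // and collapse duplicates into a sorted set of unique terms.
    public static TreeSet<String> buildTermDictionary(List<String> tokens, int minLength) {
        return tokens.stream()
                .filter(t -> t.length() >= minLength)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Assumed cost model: the pattern is tried against every unique term,
    // so fewer terms means fewer regex evaluations.
    public static List<String> matchingTerms(TreeSet<String> terms, Pattern pattern) {
        return terms.stream()
                .filter(t -> pattern.matcher(t).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up tokens; real analysis output will look different.
        List<String> tokens = List.of(
                "the", "star", "nag5-12345", "nag5-99999", "galaxy", "nag5-12345");
        Pattern grant = Pattern.compile("nag5-\\d{5}");

        TreeSet<String> unfiltered = buildTermDictionary(tokens, 0);
        TreeSet<String> filtered = buildTermDictionary(tokens, 6);

        System.out.println("terms without filter: " + unfiltered.size());
        System.out.println("terms with min=6:     " + filtered.size());
        System.out.println("matches: " + matchingTerms(filtered, grant));
    }
}
```

The filter only helps to the extent that short tokens make up a large share of the unique terms, and it cannot be used if any identifier of interest is shorter than the minimum.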

Things I intend to try if no one has any better suggestions:
  * chunk up the index and search concurrently, either by sharding or
using a RangeQuery based on document id
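
A rough sketch of the sharding variant, again in plain Java rather than Lucene: each shard is modeled as a list of terms, the same pattern is run against every shard on its own thread, and the per-shard hits are merged. The class and method names are hypothetical, and searchShard is only a placeholder for whatever per-shard search would really be used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

final class ShardedRegexSketch {

    // Placeholder for searching a single shard: scan its terms with the pattern.
    public static List<String> searchShard(List<String> shardTerms, Pattern pattern) {
        List<String> hits = new ArrayList<>();
        for (String term : shardTerms) {
            // A fresh Matcher per term; Pattern itself is safe to share across threads.
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    // Submit one task per shard, then merge the results in shard order.
    public static List<String> searchAll(List<List<String>> shards, Pattern pattern) {
        int threads = Math.max(1,
                Math.min(shards.size(), Runtime.getRuntime().availableProcessors()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> shard : shards) {
                futures.add(pool.submit(() -> searchShard(shard, pattern)));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                merged.addAll(future.get()); // blocks until that shard finishes
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while searching shards", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a shard search failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Made-up shard contents for illustration only.
        List<List<String>> shards = List.of(
                List.of("galaxy", "nag5-12345"),
                List.of("quasar", "nag5-99999"));
        System.out.println(searchAll(shards, Pattern.compile("nag5-\\d{5}")));
    }
}
```

With this layout the total wall-clock time is bounded by the slowest shard, so the benefit depends on the shards being of similar size and on there being enough cores and I/O capacity to run them in parallel.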

Any suggestions appreciated.

Thanks,
--jay
