RE: Improving proximity search performance

Bryan Loofbourrow Fri, 17 Feb 2012 16:52:02 -0800

Apologies. I meant to type “1.4 TB” and somehow typed “1.4 GB.” Little
wonder that no one thought the question was interesting, or figured I must
be using Sneakernet to run my searches.




-- Bryan Loofbourrow


  ------------------------------

*From:* Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
*Sent:* Thursday, February 16, 2012 7:07 PM
*To:* 'solr-user@lucene.apache.org'
*Subject:* Improving proximity search performance



Here’s my use case. I expect to set up a Solr index that is approximately
1.4GB. This is a real number from the proof-of-concept using the real data,
which consists of about 10 million documents, many of significant size. The
index makes use of the FastVectorHighlighter to do highlighting on the body
text field, which is of course stored, with termVectors, termPositions, and
termOffsets on.



I no longer have the proof-of-concept Solr core available (our live site
uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical
answer to this question: Will storing that extra information about the
location of terms help the performance of proximity searches?



A significant and important subset of my users make extensive use of
proximity searches. These sophisticated users have found that they are best
able to locate what they want by doing searches for THISWORD within 5
words of THATWORD, or much more sophisticated variants on that theme,
including plenty of booleans and wildcards. The problem I’m facing is
performance. Some of these searches, when common words are used, can take
many minutes, even with the index on an SSD.
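
For readers unfamiliar with the syntax, a "within 5 words" search is normally
written in the Lucene query syntax as a quoted phrase followed by a slop
value. The sketch below is only an illustration: the helper name and the field
name "body" are hypothetical, and it does no escaping of special characters.

```python
def proximity_query(field, first, second, slop):
    """Build a Lucene-syntax proximity (sloppy phrase) query string.

    Hypothetical helper for illustration; it does not escape quotes or
    other query-parser special characters in the terms.
    """
    if slop < 0:
        raise ValueError("slop must be non-negative")
    # A quoted phrase followed by ~N asks for the terms within N positions.
    return f'{field}:"{first} {second}"~{slop}'


# "body" is an assumed field name, not taken from the real schema.
print(proximity_query("body", "thisword", "thatword", 5))
```

The resulting string would be passed as the q parameter of a Solr request.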



The question is how to improve the performance. It occurred to me that all
of that term vector information, stored for the benefit of the
FastVectorHighlighter, might significantly help the performance of these
searches.



First question: is that already the case? Will storing this extra
information automatically improve my proximity search performance?



Second question: If not, I’m very willing to dive into the code and come up
with a patch that would do this. Can someone with knowledge of the
internals comment on whether this is a plausible strategy for improving
performance, and, if so, give tips about the outlines of what a successful
approach to the problem might look like?



Third question: Any tips in general for improving the performance of these
proximity searches? I have explored the question of whether the customers
might be weaned off of them, and that does not appear to be an option.



Thanks,



-- Bryan Loofbourrow
