Apologies. I meant to type “1.4 TB” and somehow typed “1.4 GB.” Little wonder that no one thought the question was interesting, or figured I must be using Sneakernet to run my searches.
-- Bryan Loofbourrow

------------------------------
*From:* Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
*Sent:* Thursday, February 16, 2012 7:07 PM
*To:* 'solr-user@lucene.apache.org'
*Subject:* Improving proximity search performance

Here’s my use case. I expect to set up a Solr index that is approximately 1.4GB. This is a real number from the proof-of-concept using the real data, which consists of about 10 million documents, many of significant size. The proof-of-concept made use of the FastVectorHighlighter to do highlighting on the body text field, which is of course stored, with termVectors, termPositions, and termOffsets on.

I no longer have the proof-of-concept Solr core available (our live site uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical answer to this question: Will storing that extra information about the location of terms help the performance of proximity searches?

A significant and important subset of my users make extensive use of proximity searches. These sophisticated users have found that they are best able to locate what they want by searching for THISWORD within 5 words of THATWORD, or much more sophisticated variants on that theme, including plenty of booleans and wildcards.

The problem I’m facing is performance. Some of these searches, when common words are used, can take many minutes, even with the index on an SSD.

The question is how to improve the performance. It occurred to me that all of that term vector information, stored for the benefit of the FastVectorHighlighter, might be a significant aid to the performance of these searches.

First question: is that already the case? Will storing this extra information automatically improve my proximity search performance?

Second question: If not, I’m very willing to dive into the code and come up with a patch that would do this. Can someone with knowledge of the internals comment on whether this is a plausible strategy for improving performance, and, if so, give tips about the outlines of what a successful approach to the problem might look like?

Third question: Any tips in general for improving the performance of these proximity searches? I have explored the question of whether the customers might be weaned off them, and that does not appear to be an option.

Thanks,

-- Bryan Loofbourrow
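
For concreteness, the simplest form of the "THISWORD within 5 words of THATWORD" search described above can be written in Lucene/Solr query syntax as a sloppy phrase query. The following is a minimal sketch only: the field name `body` and the helper `proximity_query` are illustrative assumptions, not names from the actual index or application, and the slop value only roughly corresponds to "within N words". The boolean and wildcard variants mentioned above are not covered.

```python
def proximity_query(field: str, first: str, second: str, slop: int) -> str:
    """Build a sloppy phrase query string of the form field:"first second"~slop.

    `field` and this helper are hypothetical; adjust to the real schema.
    """
    if slop < 0:
        raise ValueError("slop must be non-negative")

    def esc(term: str) -> str:
        # Escape backslashes and double quotes so a term cannot
        # terminate the quoted phrase early.
        return term.replace("\\", "\\\\").replace('"', '\\"')

    return f'{field}:"{esc(first)} {esc(second)}"~{slop}'


if __name__ == "__main__":
    # Assumed field name "body"; prints body:"thisword thatword"~5
    print(proximity_query("body", "thisword", "thatword", 5))
```

The resulting string would be passed as the `q` parameter of a Solr request. Queries of this shape have to consult term positions for every document containing both words, which is why common words make them expensive.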