Hi Jaco,

On 4/9/2009 at 2:58 PM, Jaco wrote:
> I'm struggling with some ideas, maybe somebody can help me with past
> experiences or tips. I have loaded a dictionary into a Solr index,
> using stemming and some stopwords in the analysis part of the schema.
> Each record holds a term from the dictionary, which can consist of
> multiple words. For some data analysis work, I want to send pieces
> of text (sentences actually) to Solr to retrieve all possible
> dictionary terms that could occur. Ideally, I want to construct a
> query that only returns those Solr records for which all individual
> words in that record are matched.
> 
> For instance, my dictionary holds the following terms:
> 1 - a b c d
> 2 - c d e
> 3 - a b
> 4 - a e f g h
> 
> If I put the sentence [a b c d f g h] in as a query, I want to receive
> dictionary items 1 (matching all words a b c d) and 3 (matching words a
> b) as matches.
> 
> I have been puzzling about how to do this. The only way I found so far
> was to construct an OR query with all words of the sentence in it. In
> this case, that would result in all dictionary items being returned.
> This would then require some code to go over the search results and
> analyse each of them (i.e. by using the highlight function) to kick
> out 'false' matches, but I am looking for a more efficient way.
> 
> Is there a way to do this with Solr functionality, or do I need to
> start looking into the Lucene API ..?

Your problem could be modeled as a set of standing queries, where your 
dictionary entries are the *queries* (with all words required, maybe using a 
PhraseQuery or a SpanNearQuery), and the sentence is the document.
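To make the inversion concrete, here is a minimal Python sketch (hypothetical names, plain set logic standing in for real Lucene queries): each dictionary entry matches only if *all* of its words occur in the sentence, which reproduces the example above.

```python
# Dictionary entries act as standing "queries": an entry matches a
# sentence only if every one of its words appears in that sentence.
DICTIONARY = {
    1: "a b c d",
    2: "c d e",
    3: "a b",
    4: "a e f g h",
}

def matching_entries(sentence, dictionary=DICTIONARY):
    """Return ids of entries whose words are ALL present in the sentence."""
    sentence_words = set(sentence.split())
    return sorted(
        entry_id
        for entry_id, term in dictionary.items()
        if set(term.split()) <= sentence_words  # subset = all words required
    )

print(matching_entries("a b c d f g h"))  # -> [1, 3]
```

In Lucene terms, the subset test corresponds to a BooleanQuery whose clauses are all MUST (or a PhraseQuery/SpanNearQuery if word order and proximity matter).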

Solr may not be usable in this context, depending on your throughput
requirements (you would be issuing a very high volume of queries), but
Lucene's MemoryIndex was designed for exactly this kind of thing:

<http://lucene.apache.org/java/2_4_1/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html>
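To illustrate what MemoryIndex does (the real API is Java; the names below are a hypothetical Python emulation), you index a single in-memory "document" -- here, the sentence -- and then run each dictionary entry against it as an all-terms-required query:

```python
from collections import defaultdict

def build_memory_index(text):
    """Index ONE document in memory: term -> positions, like MemoryIndex."""
    index = defaultdict(list)
    for pos, term in enumerate(text.split()):
        index[term].append(pos)
    return index

def all_terms_match(index, query_terms):
    """Emulate a BooleanQuery where every clause is required (MUST)."""
    return all(term in index for term in query_terms.split())

# The sentence is the document; each dictionary entry is a query.
index = build_memory_index("a b c d f g h")
for entry_id, term in [(1, "a b c d"), (2, "c d e"),
                       (3, "a b"), (4, "a e f g h")]:
    if all_terms_match(index, term):
        print(entry_id)  # -> 1, then 3
```

The stored positions are what would let a real MemoryIndex also answer PhraseQuery or SpanNearQuery, where word order and distance matter, not just presence.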

Steve
