Hello,

Thanks for replying! Yes, I am storing the whole document. The document is 
indexed with a unique id. There are only 3 fields in the schema - id, 
rawDocument, tikaDocument. Search uses the tikaDocument field. Against this I 
am throwing 2- to 5-word phrases, and the highlighting marks each individual 
word in the phrase instead of the phrase as a whole. The highlighted text 
that is matched is read by another application for display in the front end UI. 
Right now my app has logic to figure out that multiple highlights indicate a 
phrase, but it isn't perfect. 
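
As a rough illustration of that kind of merging logic (a simplified sketch, not
the actual application code), the snippet below collapses highlight spans that
are separated only by whitespace into one span. The <em> tags are an assumption
(Solr's default pre/post markers), and merge_adjacent_highlights is a
hypothetical helper name.

```python
import re

# Matches the boundary between two highlight spans separated only by
# whitespace, e.g. "<em>a</em> <em>b</em>".
# Assumption: highlights use Solr's default <em>...</em> markers.
_ADJACENT = re.compile(r"</em>(\s+)<em>")


def merge_adjacent_highlights(snippet: str) -> str:
    """Collapse whitespace-separated <em> spans into a single span."""
    # Drop the closing/opening tag pair and keep the whitespace between them.
    return _ADJACENT.sub(r"\1", snippet)


if __name__ == "__main__":
    print(merge_adjacent_highlights(
        "the <em>quick</em> <em>brown</em> <em>fox</em> jumps"))
```

This only helps when all the words of the phrase land in the same snippet. It
does nothing for a phrase that has been split across two separate hits, and it
would also merge two unrelated single-word matches that happen to be adjacent.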

In this case Solr is reporting a single 3 word phrase as 2 hits: one with 2 of 
the phrase words, the other with 1 of the phrase words. This only happens in 
large documents where the multi word phrase appears across the boundary of one 
of the document fragments that Solr is analyzing (this is a hunch - I really 
don't know the mechanics for certain, but the next statement shows how I came 
to this conclusion). However, if I make a one sentence document with the same 
multi word phrase, Solr will report 1 hit with all three words individually 
highlighted. So at the very least I know Solr is matching the phrase correctly. 
The problems are the method of highlighting (I'm trying to get one set of tags 
per phrase) and the occasional breaking of a single phrase into 2 hits.
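
For concreteness, here is a sketch of the kind of highlighting request under
discussion. It only builds the query string. The parameter values are
assumptions, not the settings actually in use: hl.fragsize=0 asks the standard
highlighter to treat the whole field as one fragment, and hl.maxAnalyzedChars
raises the limit on how much of a large document is analyzed. Whether these
settings avoid the split hits has not been verified, and
hl.usePhraseHighlighter still tags each word of the phrase separately.
build_highlight_query is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_highlight_query(phrase: str) -> str:
    """Return a URL-encoded Solr query string for a phrase search with highlighting."""
    params = {
        # Phrase query against the field named in this thread.
        "q": f'tikaDocument:"{phrase}"',
        "hl": "true",
        "hl.fl": "tikaDocument",
        # Only highlight terms that occur as part of the phrase.
        "hl.usePhraseHighlighter": "true",
        # Assumed value: 0 means the whole field is a single fragment.
        "hl.fragsize": "0",
        # Assumed value: analyze more of each large document.
        "hl.maxAnalyzedChars": "1000000",
    }
    return urlencode(params)


if __name__ == "__main__":
    print(build_highlight_query("quick brown fox"))
```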

Given that setup, what do you recommend? I'm not sure I understand the approach 
you're describing. I appreciate the help!

-Teague James

> On Dec 2, 2015, at 10:09 AM, Rick Leir <richard.l...@canadiana.ca> wrote:
> 
> For performance, if you have many large documents, you want to index the
> whole document but only store some identifiers. (Maybe this is not a
> consideration for you, stop reading now )
> 
> If you are not storing the whole document, then Solr cannot do the
> highlighting.  You would get an id, then locate your source document (maybe
> in your filesystem) and do highlighting yourself.
> 
>> Can anyone offer any solutions for searching large documents and
>> returning a single phrase highlight?
