I'm running into some highlighting issues that appear to arise only when I'm using a bigram shingle (ShingleFilterFactory) analyzer.
I started with a bigram-free setup along these lines:

    <field name="body" type="noshingleText" indexed="false" stored="false" />
    <!-- Stored text for use with highlighting: -->
    <field name="kwic" type="noshingleText" indexed="false" stored="true" compressed="true" multiValued="false" />
    <copyField source="body" dest="kwic" maxLength="100000" />

    <fieldType name="noshingleText" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

For performance reasons, though, I wanted to turn on bigram shingle indexing on the body field. (For more information see http://www.nabble.com/Using-Shingles-to-Increase-Phrase-Search-Performance-td19015758.html#a19015758)

In particular, I wanted to use this field type:

    <fieldType name="shingleText" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="true" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="false" outputUnigramIfNoNgram="true" />
      </analyzer>
    </fieldType>

(Regarding the outputUnigramIfNoNgram parameter, see http://issues.apache.org/jira/browse/SOLR-744.)

I wasn't sure whether I should define my kwic field (the one I use for highlighting) as type shingleText, to match the body field, or as type noshingleText. So I tried both. Neither works quite as desired.
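
To make the two analyzers above concrete, here is a rough Python sketch of the token streams I would expect them to produce. It is only a toy model: the regex tokenizer stands in for StandardTokenizer, the function names are made up for illustration, and the real ShingleFilter's position increments are not modeled at all.

```python
import re

def tokenize(text):
    """Lowercased word tokens as (term, start_offset, end_offset)."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]

def shingle(tokens, output_unigrams=True, unigram_if_no_ngram=False):
    """Toy bigram shingler: each unigram, then the bigram that starts at it."""
    out = []
    for i, (term, start, end) in enumerate(tokens):
        if output_unigrams:
            out.append((term, start, end))
        if i + 1 < len(tokens):
            next_term, _, next_end = tokens[i + 1]
            # The bigram covers the characters of both of its words.
            out.append((term + " " + next_term, start, next_end))
    if not out and unigram_if_no_ngram:
        # No bigram could be formed (single-word input): fall back to unigrams.
        out = list(tokens)
    return out

# Index-time analyzer: outputUnigrams="true"
index_stream = shingle(tokenize("The cat went for"), output_unigrams=True)

# Query-time analyzer: outputUnigrams="false", outputUnigramIfNoNgram="true"
query_stream = shingle(tokenize("big dog"), output_unigrams=False,
                       unigram_if_no_ngram=True)
single_word = shingle(tokenize("car"), output_unigrams=False,
                      unigram_if_no_ngram=True)

print([term for term, _, _ in index_stream])
print(query_stream)
print(single_word)
```

In this model the index-time stream alternates unigrams and bigrams, and every token starts before the previous token's end offset. At query time, a two-word phrase collapses into a single bigram term, while a one-word query stays a plain unigram.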

[kwic as type shingleText]

If both body and kwic are of type shingleText, then highlighting more or less works, but there are some anomalies. The main one is that the highlighter strongly prefers fragments where the highlighted term (e.g. "car") is the last term in the fragment:

    ... la la la la la <em>car</em> ...
    ... foo foo foo foo foo <em>car</em> ...

This should obviously happen some of the time, but it happens with about 95% of my fragments, which is far more often than chance would predict. And unfortunate. And it doesn't happen if I turn off shingling.

Another issue is that, if there are two instances of a highlighted term within a given fragment, it will often highlight not just those instances, but all the terms in between, like this:

    ... boo boo bar <em>car la la la car</em> bar bar bar ...

This too doesn't seem to happen if I disable bigram indexing.

I haven't figured out why this is the case. One potential issue is that the TokenGroup abstraction doesn't necessarily make sense if you have a token stream of alternating unigrams and bigrams like this:

    the, the cat, cat, cat went, went, went for, for, ...

Even if you could have a TokenGroup abstraction that makes sense, the current implementation of TokenGroup.isDistinct looks like this:

    return token.startOffset() >= endOffset;

and it returns false most of the time in this case. (I can give some explanation of why, but maybe I'll save that for later.)

I'm not sure whether the highlighter can easily be made to accommodate sequences of alternating unigrams and bigrams, or whether highlighting should really only be attempted on bigram-free token streams.

[kwic with type noshingleText]

If I set kwic to be of type noshingleText, then the above symptoms go away. Some things are still not quite right, though. The particular symptom now is that if I do a quoted query like "big dog", the correct results get returned, but no preview fragments are returned.
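
Here is a toy Python model of that symptom. It assumes the query-side shingle analyzer turns the quoted phrase into a single bigram term, and that the highlighter can only mark query terms that literally occur in the highlighting field's token stream. The helper names are invented for illustration and this is not how Lucene's QueryScorer is actually implemented.

```python
def terms(text):
    """Unigram terms, as a noshingleText-style analyzer might produce them."""
    return text.lower().split()

def bigram_query_terms(phrase):
    """Query-side shingling: a multi-word phrase becomes bigram terms only."""
    words = terms(phrase)
    if len(words) < 2:
        return words  # single word: fall back to the unigram
    return [" ".join(pair) for pair in zip(words, words[1:])]

def matching_terms(query_terms, field_terms):
    """Query terms that a term-by-term highlighter could find in the field."""
    field = set(field_terms)
    return [t for t in query_terms if t in field]

kwic_tokens = terms("The big dog barked")    # unigram-only highlighting field

search_terms = bigram_query_terms("big dog") # what the search query contains
phrase_terms = terms("big dog")              # what a phrase query would contain

print(matching_terms(search_terms, kwic_tokens))
print(matching_terms(phrase_terms, kwic_tokens))
```

In this model the single term "big dog" never appears among the unigram tokens of kwic, so nothing can be highlighted, whereas the separate terms "big" and "dog" both match.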

The underlying reason no fragments are returned is that an inappropriate Query object is being passed to the constructor for QueryScorer. The query that gets passed is

    TermQuery:"big dog"

That is the Query that should be used for *searching* on my bigram body field, but it's *not* the Query that should be used for *highlighting*; the Query that should be used for highlighting is something like

    PhraseQuery:"big dog"~0

What apparently is going on is that the highlighter is using the Query object generated by the *search* component to do highlighting. One possibility is that the highlighter should instead create a separate Query object for each hl.fl parameter; each one would use the analyzer particular to the given *highlighting* field, rather than the one for the default search field. There might be reasons why that would be crazy, though.

Sorry this post is a little half-baked, but I'd really like to hear if anyone has any ideas about how I might proceed with debugging.

Chris