Highlighting Trouble With Bigram Shingle Index

Chris Harris Mon, 12 Jan 2009 16:30:15 -0800

I'm running into some highlighting issues that appear to arise only
when I'm using a bigram shingle (ShingleFilterFactory) analyzer.


I started with a bigram-free situation along these lines:

   <field name="body" type="noshingleText" indexed="true" stored="false" />
   <!-- Stored text for use with highlighting: -->
   <field name="kwic" type="noshingleText" indexed="false"
stored="true" compressed="true" multiValued="false" />
   <copyField source="body" dest="kwic" maxLength="100000" />

    <fieldType name="noshingleText" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

For performance reasons, though, I wanted to turn on bigram shingle
indexing on the body field. (For more information see
http://www.nabble.com/Using-Shingles-to-Increase-Phrase-Search-Performance-td19015758.html#a19015758)
In particular, I wanted to use this field type:

    <fieldType name="shingleText" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="true" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="false"
outputUnigramIfNoNgram="true" />
      </analyzer>
    </fieldType>

(Regarding the outputUnigramIfNoNgram parameter, see
http://issues.apache.org/jira/browse/SOLR-744.)

I wasn't sure whether to define my kwic field (the one I use
for highlighting) as type shingleText, to match the body field, or as
type noshingleText. So I tried both. Neither works quite as desired.

[kwic as type shingleText]

If I have both body and kwic as type shingleText, then highlighting
more or less works, but there are some anomalies. The main thing is
that it really likes to pick fragments where the highlighted term
(e.g. "car") is the last term in the fragment:

... la la la la la <em>car</em> ...
... foo foo foo foo foo <em>car</em> ...

This should obviously happen some of the time, but it is happening
with about 95% of my fragments, which is statistically unexpected. And
unfortunate. And it doesn't happen if I turn off shingling.

Another issue is that, if there are two instances of a highlighted
term within a given fragment, it will often highlight not just those
instances, but all the terms in between, like this:

... boo boo bar <em>car la la la car</em> bar bar bar ...

This too doesn't seem to happen if I disable bigram indexing.

I haven't figured out why this is the case. One potential issue is that
the TokenGroup abstraction doesn't necessarily make sense if you have
a token stream of alternating unigrams and bigrams like this:

  the, the cat, cat, cat went, went, went for, for, ...

Even if you could have a TokenGroup abstraction that makes sense, the current
implementation of TokenGroup.isDistinct looks like this:

  return token.startOffset()>=endOffset

and it returns false most of the time in this case. (I can give some
explanation of why, but maybe I'll save that for later.)
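
A minimal sketch of why that check rarely succeeds on such a stream.
It assumes that a group's endOffset is the largest end offset of the
tokens added to it so far, and that a new group starts only when
isDistinct returns true. The class and method names below are
hypothetical plain-Java stand-ins, not Lucene's Token or TokenGroup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the highlighter's grouping logic; not Lucene code.
class ShingleGroupSketch {

    // A token plus its character offsets into the original text.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Split on spaces, recording start and end offsets for each word.
    static List<Tok> unigrams(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++;
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++;
            if (i > start) out.add(new Tok(text.substring(start, i), start, i));
        }
        return out;
    }

    // Interleave unigrams and bigrams: the, the cat, cat, cat went, ...
    static List<Tok> withBigrams(List<Tok> uni) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < uni.size(); i++) {
            Tok a = uni.get(i);
            out.add(a);
            if (i + 1 < uni.size()) {
                Tok b = uni.get(i + 1);
                // A bigram spans from the first word's start to the second word's end.
                out.add(new Tok(a.text + " " + b.text, a.start, b.end));
            }
        }
        return out;
    }

    // Count groups, opening a new one only when start >= current endOffset
    // (the assumed isDistinct rule).
    static int countGroups(List<Tok> tokens) {
        int groups = 0;
        int endOffset = -1; // -1 means no group has been opened yet
        for (Tok t : tokens) {
            boolean distinct = endOffset < 0 || t.start >= endOffset;
            if (distinct) {
                groups++;
                endOffset = t.end;
            } else {
                endOffset = Math.max(endOffset, t.end);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Tok> uni = unigrams("the cat went for a walk");
        System.out.println(countGroups(uni));              // prints 6
        System.out.println(countGroups(withBigrams(uni))); // prints 1
    }
}
```

Under these assumptions each bigram pushes endOffset past the start of
the next unigram, so no later token is ever distinct and the whole
stream collapses into a single group.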

I'm not sure if the highlighter can easily be made to accommodate
sequences of alternating unigrams and bigrams, or if highlighting
should really only be attempted on bigram-free token streams.

[kwic with type noshingleText]

If I set kwic to be of type noshingleText, then the above symptoms go
away. Some things are not quite right, though. The particular symptom
now is that if I do a quoted query like

  "big dog"

then the correct results get returned, but no preview fragments are returned.

The underlying reason this happens is that an inappropriate Query
object is being passed to the constructor for QueryScorer. The query
that gets passed is

  TermQuery:"big dog"

That is the Query that should be used for *searching* on my bigram body
field, but it's *not* the Query that should be used for *highlighting*; the
Query that should be used for highlighting is something like

  PhraseQuery:"big dog"~0

What apparently is going on is that the highlighter is using the Query
object generated by the *search* component to do highlighting. One
possibility is that the highlighter should instead create a separate
Query object for each hl.fl parameter; each one would use the analyzer
particular to the given *highlighting* field, rather than the one for
the default search field. There might be reasons why that would be
crazy, though.
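
A rough sketch of the mismatch, in plain Java rather than against the
Solr or Lucene APIs. The helper names are hypothetical, and the two
"analyzers" are crude whitespace-based stand-ins for the noshingleText
analyzer and the shingleText query analyzer. It only illustrates that
a single "big dog" term cannot match a token stream that contains no
bigrams, whereas the two terms produced by the highlighting field's
own analyzer can match as a phrase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not Solr's highlighter or Lucene's Query classes.
class HighlightQuerySketch {

    // Stand-in for the noshingleText analyzer: lowercase, split on whitespace.
    static List<String> plainTerms(String text) {
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) return new ArrayList<>();
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Stand-in for the shingleText query analyzer: bigrams only,
    // falling back to the lone unigram when no bigram can be formed.
    static List<String> shingleQueryTerms(String text) {
        List<String> uni = plainTerms(text);
        if (uni.size() < 2) return uni;
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < uni.size(); i++) {
            out.add(uni.get(i) + " " + uni.get(i + 1));
        }
        return out;
    }

    // True if the query terms occur consecutively and in order in the field
    // tokens (a zero-slop phrase match; one term degenerates to a term match).
    static boolean matchesInOrder(List<String> fieldTokens, List<String> queryTerms) {
        if (queryTerms.isEmpty()) return false;
        return Collections.indexOfSubList(fieldTokens, queryTerms) >= 0;
    }

    public static void main(String[] args) {
        // Tokens of the stored kwic text, analyzed without shingles.
        List<String> kwic = plainTerms("The big dog barked");

        // Terms built for searching the shingled body field: ["big dog"]
        List<String> searchSide = shingleQueryTerms("big dog");
        // Terms built with the highlighting field's own analyzer: ["big", "dog"]
        List<String> highlightSide = plainTerms("big dog");

        System.out.println(matchesInOrder(kwic, searchSide));    // prints false
        System.out.println(matchesInOrder(kwic, highlightSide)); // prints true
    }
}
```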

Sorry this post is a little half-baked, but I'd really like to hear if
anyone has any ideas about how I might proceed with debugging.

Chris
