There is also PostingsHighlighter -- I recommend it, if only for the performance improvement, which is substantial, but I'm not completely sure how it handles this issue. The one drawback I *am* aware of is that it is insensitive to positions, so words from a phrase query get highlighted even where they occur in isolation.

-Mike


On 02/24/2015 12:46 PM, Erik Hatcher wrote:
BooleanQuery’s extractTerms looks like this:

public void extractTerms(Set<Term> terms) {
   for (BooleanClause clause : clauses) {
     if (clause.isProhibited() == false) {
       clause.getQuery().extractTerms(terms);
     }
   }
}
That’s generally the method the Highlighter calls to decide which terms should
be highlighted.  So even if a term didn’t match the document, the query that
the term was in matched the document, and it just blindly highlights all the
terms (minus prohibited ones).  That at least explains the behavior you’re
seeing, but it’s not ideal.  I’ve seen specialized highlighters that convert to
spans, which are accurate to the exact matches within the document.  It’s been
a while since I dug into the HighlightComponent, so maybe there are some other
options available out of the box?
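
To make the effect concrete, here is a minimal, self-contained sketch. It is a toy model, not Lucene's actual classes: HighlightTermsSketch, TermNode, BoolNode, Occur and both extraction methods are hypothetical names invented for illustration, and a document is reduced to a plain set of terms. It contrasts the blind extraction described above with a match-aware variant that only collects terms from clauses that actually matched, using the query a OR (b AND c) OR d against a document that contains c and d but neither a nor b.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: these types are hypothetical stand-ins, not Lucene classes.
public class HighlightTermsSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    interface Node {
        boolean matches(Set<String> docTerms);
        void extractTerms(Set<String> out);                                // blind
        void extractMatchingTerms(Set<String> docTerms, Set<String> out);  // match-aware
    }

    static final class TermNode implements Node {
        private final String term;
        TermNode(String term) { this.term = term; }
        public boolean matches(Set<String> docTerms) { return docTerms.contains(term); }
        public void extractTerms(Set<String> out) { out.add(term); }
        public void extractMatchingTerms(Set<String> docTerms, Set<String> out) {
            if (matches(docTerms)) out.add(term);
        }
    }

    static final class BoolNode implements Node {
        private final List<Node> nodes = new ArrayList<>();
        private final List<Occur> occurs = new ArrayList<>();

        BoolNode add(Node node, Occur occur) {
            nodes.add(node);
            occurs.add(occur);
            return this;
        }

        // Simplified boolean semantics: every MUST matches, no MUST_NOT matches,
        // and if there is no MUST clause at least one SHOULD has to match.
        public boolean matches(Set<String> docTerms) {
            boolean hasMust = false;
            boolean anyShould = false;
            for (int i = 0; i < nodes.size(); i++) {
                boolean m = nodes.get(i).matches(docTerms);
                Occur o = occurs.get(i);
                if (o == Occur.MUST) { hasMust = true; if (!m) return false; }
                else if (o == Occur.MUST_NOT) { if (m) return false; }
                else { anyShould |= m; }
            }
            return hasMust || anyShould;
        }

        // Mirrors the quoted extractTerms: skips prohibited clauses, ignores the document.
        public void extractTerms(Set<String> out) {
            for (int i = 0; i < nodes.size(); i++) {
                if (occurs.get(i) != Occur.MUST_NOT) nodes.get(i).extractTerms(out);
            }
        }

        // Only descends into this group if the group as a whole matched the document.
        public void extractMatchingTerms(Set<String> docTerms, Set<String> out) {
            if (!matches(docTerms)) return;
            for (int i = 0; i < nodes.size(); i++) {
                if (occurs.get(i) != Occur.MUST_NOT) nodes.get(i).extractMatchingTerms(docTerms, out);
            }
        }
    }

    // a (+b +c) d, i.e. the parsed form of "a OR (b AND c) OR d"
    static Node exampleQuery() {
        BoolNode inner = new BoolNode()
                .add(new TermNode("b"), Occur.MUST)
                .add(new TermNode("c"), Occur.MUST);
        return new BoolNode()
                .add(new TermNode("a"), Occur.SHOULD)
                .add(inner, Occur.SHOULD)
                .add(new TermNode("d"), Occur.SHOULD);
    }

    static Set<String> naiveTerms(Node query) {
        Set<String> out = new TreeSet<>();
        query.extractTerms(out);
        return out;
    }

    static Set<String> matchingTerms(Node query, Set<String> docTerms) {
        Set<String> out = new TreeSet<>();
        query.extractMatchingTerms(docTerms, out);
        return out;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("c", "d", "x"));
        Node query = exampleQuery();
        System.out.println("naive:       " + naiveTerms(query));          // [a, b, c, d]
        System.out.println("match-aware: " + matchingTerms(query, doc));  // [d]
    }
}
```

With the blind term set, any highlighter that simply marks occurrences of those terms will wrap both c and d in the document, even though the (b AND c) clause never matched. The match-aware variant leaves only d. A real span-based highlighter works on positions rather than on a bag of terms, so this is only an approximation of the idea.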

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com




On Feb 24, 2015, at 3:16 AM, Dmitry Kan <solrexp...@gmail.com> wrote:

Erick,

Our default operator is AND.

Both queries below parse the same:

a OR (b c) OR d
a OR (b AND c) OR d

The parsed query:

<str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c)
Contents:d</str>

So this part is consistent with our expectation.


I'm a bit puzzled by your statement that "c" didn't contribute to the
score.
What I meant was that the term c was not hit by the scorer: the explain
section does not refer to it. I'm using made-up terms here, but the
concept holds.

The code suggests that we could benefit from storing term offsets and
positions:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470

Is that a correct assumption?

On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

Highlighting is such a pain...

What does the parsed query look like? If the default operator is OR,
then this seems correct as both 'd' and 'c' appear in the doc. So
I'm a bit puzzled by your statement that "c" didn't contribute to the
score.

If the parsed query is, indeed
a +b +c d

then it does look like something with the highlighter. Whether other
highlighters are better for this case.. no clue ;(

Best,
Erick

On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan <solrexp...@gmail.com> wrote:
Erick,

Nope, we are using the standard Lucene qparser with some customizations that
do not affect the boolean query parsing logic.

Should we try some other highlighter?

On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

Are you using edismax?

On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan <solrexp...@gmail.com>
wrote:
Hello!

In Solr 4.3.1 there seems to be some inconsistency in the highlighting of
the boolean query:

a OR (b c) OR d

This returns a proper hit, which shows that only d was included in the
document score calculation.

But the highlighter returns both d and c in <em> tags.

Is this a known issue of the standard highlighter? Can it be mitigated?

--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info



