Re: highlighting the boolean query

Michael Sokolov Tue, 24 Feb 2015 10:23:32 -0800

There is also PostingsHighlighter -- I recommend it, if only for the performance improvement, which is substantial, but I'm not completely sure how it handles this issue. The one drawback I *am* aware of is that it is insensitive to positions (so words from phrases get highlighted even in isolation).
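
To illustrate the position-insensitivity drawback mentioned above, here is a minimal sketch. It does not use Lucene and is not how PostingsHighlighter is implemented; `PositionInsensitiveSketch` and `highlightByTerm` are hypothetical names, and whitespace tokenization is an assumption made to keep the example short.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy illustration (NOT PostingsHighlighter's implementation): a highlighter
// that only knows the query's terms, not their positions in the text.
public class PositionInsensitiveSketch {

    // Wraps every token that equals one of the phrase's words in <em> tags,
    // regardless of whether the words are adjacent in the text.
    static String highlightByTerm(String text, String phrase) {
        Set<String> terms =
            new HashSet<>(Arrays.asList(phrase.toLowerCase().split("\\s+")));
        StringBuilder out = new StringBuilder();
        for (String token : text.split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            boolean hit = terms.contains(token.toLowerCase());
            out.append(hit ? "<em>" + token + "</em>" : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The phrase "boolean query" never occurs here, yet both words are marked.
        System.out.println(
            highlightByTerm("a boolean flag and a slow query", "boolean query"));
        // prints: a <em>boolean</em> flag and a slow <em>query</em>
    }
}
```

A position-aware highlighter would leave this text unmarked, because the two words never appear next to each other.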


-Mike



On 02/24/2015 12:46 PM, Erik Hatcher wrote:

BooleanQuery’s extractTerms looks like this:

public void extractTerms(Set<Term> terms) {
   for (BooleanClause clause : clauses) {
     if (clause.isProhibited() == false) {
       clause.getQuery().extractTerms(terms);
     }
   }
}
That’s generally the method the Highlighter calls to decide which terms should
be highlighted. So even if a term didn’t match the document, the query that
the term was in matched the document, and it just blindly highlights all the
terms (minus prohibited ones). That at least explains the behavior you’re
seeing, but it’s not ideal. I’ve seen specialized highlighters that convert to
spans, which are accurate to the exact matches within the document. It’s been
a while since I dug into the HighlightComponent, so maybe there are some other
options available out of the box?
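
To make the effect concrete, here is a minimal self-contained sketch of the term extraction described above, applied to the query from this thread. It does not use Lucene: `ToyQuery`, `TermQ`, `BoolQ`, and `Clause` are hypothetical stand-ins for the real classes, the document is reduced to a set of tokens (the extra token "x" is made up), and the "match-aware" variant is only one possible way to restrict highlighting to clauses that actually matched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model (NOT Lucene's API): shows why collecting terms from every
// non-prohibited clause highlights terms whose clause did not match.
public class ExtractTermsSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    interface ToyQuery {
        void extractTerms(Set<String> terms);                 // all terms, matching or not
        boolean matches(Set<String> doc);                     // does this (sub)query match?
        void matchingTerms(Set<String> doc, Set<String> out); // terms of matching clauses only
    }

    static final class TermQ implements ToyQuery {
        final String term;
        TermQ(String term) { this.term = term; }
        public void extractTerms(Set<String> terms) { terms.add(term); }
        public boolean matches(Set<String> doc) { return doc.contains(term); }
        public void matchingTerms(Set<String> doc, Set<String> out) {
            if (matches(doc)) out.add(term);
        }
    }

    static final class Clause {
        final ToyQuery query;
        final Occur occur;
        Clause(ToyQuery query, Occur occur) { this.query = query; this.occur = occur; }
    }

    static final class BoolQ implements ToyQuery {
        final List<Clause> clauses = new ArrayList<>();
        BoolQ add(ToyQuery q, Occur o) { clauses.add(new Clause(q, o)); return this; }

        // Same shape as the quoted method: only prohibited clauses are skipped.
        public void extractTerms(Set<String> terms) {
            for (Clause c : clauses) {
                if (c.occur != Occur.MUST_NOT) c.query.extractTerms(terms);
            }
        }

        public boolean matches(Set<String> doc) {
            boolean hasMust = false, anyShould = false;
            for (Clause c : clauses) {
                boolean m = c.query.matches(doc);
                switch (c.occur) {
                    case MUST: hasMust = true; if (!m) return false; break;
                    case MUST_NOT: if (m) return false; break;
                    default: anyShould |= m;
                }
            }
            return hasMust || anyShould;
        }

        // Descend only into non-prohibited clauses that matched the document.
        public void matchingTerms(Set<String> doc, Set<String> out) {
            if (!matches(doc)) return;
            for (Clause c : clauses) {
                if (c.occur != Occur.MUST_NOT && c.query.matches(doc)) {
                    c.query.matchingTerms(doc, out);
                }
            }
        }
    }

    // a OR (b AND c) OR d, i.e. Contents:a (+Contents:b +Contents:c) Contents:d
    static ToyQuery exampleQuery() {
        BoolQ inner = new BoolQ().add(new TermQ("b"), Occur.MUST)
                                 .add(new TermQ("c"), Occur.MUST);
        return new BoolQ().add(new TermQ("a"), Occur.SHOULD)
                          .add(inner, Occur.SHOULD)
                          .add(new TermQ("d"), Occur.SHOULD);
    }

    static Set<String> naiveHighlightTerms(ToyQuery q, Set<String> doc) {
        Set<String> terms = new LinkedHashSet<>();
        q.extractTerms(terms);
        terms.retainAll(doc); // only terms present in the text can be wrapped in <em>
        return terms;
    }

    static Set<String> matchAwareTerms(ToyQuery q, Set<String> doc) {
        Set<String> terms = new LinkedHashSet<>();
        q.matchingTerms(doc, terms);
        return terms;
    }

    public static void main(String[] args) {
        Set<String> doc = new LinkedHashSet<>(Arrays.asList("c", "d", "x"));
        System.out.println("naive:       " + naiveHighlightTerms(exampleQuery(), doc)); // [c, d]
        System.out.println("match-aware: " + matchAwareTerms(exampleQuery(), doc));     // [d]
    }
}
```

In this toy model the naive extraction marks both c and d, which mirrors the behavior reported in the thread, while the match-aware walk marks only d because the (b AND c) clause did not match.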

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

On Feb 24, 2015, at 3:16 AM, Dmitry Kan <solrexp...@gmail.com> wrote:

Erick,

Our default operator is AND.

Both queries below parse the same:

a OR (b c) OR d
a OR (b AND c) OR d

The parsed query:

<str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c)
Contents:d</str>

So this part is consistent with our expectation.

I'm a bit puzzled by your statement that "c" didn't contribute to the score.

What I meant was that the term c was not hit by the scorer: the explain
section does not refer to it. I'm using made-up terms here, but the
concept holds.

The code suggests that we could benefit from storing term offsets and
positions:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470

Is this a correct assumption?

On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

Highlighting is such a pain...

What does the parsed query look like? If the default operator is OR,
then this seems correct as both 'd' and 'c' appear in the doc. So
I'm a bit puzzled by your statement that "c" didn't contribute to the
score.

If the parsed query is, indeed
a +b +c d

then it does look like something with the highlighter. Whether other
highlighters are better for this case.. no clue ;(

Best,
Erick

On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan <solrexp...@gmail.com> wrote:

Erick,

Nope, we are using the standard Lucene qparser with some customizations that
do not affect the boolean query parsing logic.

Should we try some other highlighter?

On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

Are you using edismax?

On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan <solrexp...@gmail.com> wrote:

Hello!

In Solr 4.3.1 there seems to be some inconsistency in the highlighting of
the boolean query:

a OR (b c) OR d

This returns a proper hit, which shows that only d was included in the
document score calculation.

But the highlighter returns both d and c in <em> tags.

Is this a known issue of the standard highlighter? Can it be mitigated?


--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info



