Erick, Erik and Mike,

Thanks for your help and ideas.

It sounds like we'd need to do a bit of revamping in the highlighter.
Perhaps even PostingsHighlighter should be taken as the baseline, since it
is faster. It uses the same extractTerms() method that Erik has shown.
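
To make the extractTerms() behaviour concrete, here is a minimal sketch in
plain Java. It does not use the Lucene API: Node, TermNode, BoolNode and both
helper methods are hypothetical stand-ins, a document is modelled as a bare
set of terms, and prohibited clauses and positions are ignored. It contrasts
collecting every term of the query with collecting terms only from
sub-queries that actually matched, using the made-up terms from this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class HighlightTermsSketch {

    /** Hypothetical stand-in for a query node; NOT the Lucene Query class. */
    interface Node {
        boolean matches(Set<String> docTerms);
        /** Blind collection: every term in the query, matched or not. */
        void extractTerms(Set<String> out);
        /** Match-aware collection: only terms from sub-queries that matched. */
        void extractMatchingTerms(Set<String> docTerms, Set<String> out);
    }

    static final class TermNode implements Node {
        private final String term;
        TermNode(String term) { this.term = term; }
        @Override public boolean matches(Set<String> docTerms) { return docTerms.contains(term); }
        @Override public void extractTerms(Set<String> out) { out.add(term); }
        @Override public void extractMatchingTerms(Set<String> docTerms, Set<String> out) {
            if (matches(docTerms)) out.add(term);
        }
    }

    static final class BoolNode implements Node {
        private final boolean allRequired; // true = AND, false = OR
        private final List<Node> clauses;
        BoolNode(boolean allRequired, Node... clauses) {
            this.allRequired = allRequired;
            this.clauses = Arrays.asList(clauses);
        }
        @Override public boolean matches(Set<String> docTerms) {
            for (Node clause : clauses) {
                boolean m = clause.matches(docTerms);
                if (allRequired && !m) return false; // AND fails on first miss
                if (!allRequired && m) return true;  // OR succeeds on first hit
            }
            return allRequired;
        }
        @Override public void extractTerms(Set<String> out) {
            for (Node clause : clauses) clause.extractTerms(out);
        }
        @Override public void extractMatchingTerms(Set<String> docTerms, Set<String> out) {
            if (!matches(docTerms)) return; // skip a sub-query that did not match as a whole
            for (Node clause : clauses) clause.extractMatchingTerms(docTerms, out);
        }
    }

    /** Mimics the current behaviour: all query terms that occur in the document. */
    static Set<String> blindHighlights(Node query, Set<String> docTerms) {
        Set<String> terms = new TreeSet<>();
        query.extractTerms(terms);
        terms.retainAll(docTerms); // only terms present in the text can be highlighted
        return terms;
    }

    /** Assumed alternative: only terms from sub-queries that matched the document. */
    static Set<String> matchAwareHighlights(Node query, Set<String> docTerms) {
        Set<String> terms = new TreeSet<>();
        query.extractMatchingTerms(docTerms, terms);
        return terms;
    }

    /** The query from this thread: a OR (b AND c) OR d. */
    static Node sampleQuery() {
        return new BoolNode(false,
                new TermNode("a"),
                new BoolNode(true, new TermNode("b"), new TermNode("c")),
                new TermNode("d"));
    }

    public static void main(String[] args) {
        Set<String> doc = new TreeSet<>(Arrays.asList("c", "d", "x"));
        System.out.println(blindHighlights(sampleQuery(), doc));      // prints [c, d]
        System.out.println(matchAwareHighlights(sampleQuery(), doc)); // prints [d]
    }
}
```

With a document containing c and d, the blind variant highlights both c and d,
while the match-aware variant highlights only d, which is what the explain
output would lead a user to expect. A real fix would presumably also need
positions or offsets; this sketch says nothing about that.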

The user story here is that the user is led to believe, judging from the
highlights, that the boolean query did not work correctly. Otherwise the
issue is minor, since the search *does* work as expected.

Dmitry

On Tue, Feb 24, 2015 at 8:19 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> There is also PostingsHighlighter -- I recommend it, if only for the
> performance improvement, which is substantial, but I'm not completely sure
> how it handles this issue.  The one drawback I *am* aware of is that it is
> insensitive to positions (so words from phrases get highlighted even in
> isolation)
>
> -Mike
>
>
>
> On 02/24/2015 12:46 PM, Erik Hatcher wrote:
>
>> BooleanQuery’s extractTerms looks like this:
>>
>> public void extractTerms(Set<Term> terms) {
>>    for (BooleanClause clause : clauses) {
>>      if (clause.isProhibited() == false) {
>>        clause.getQuery().extractTerms(terms);
>>      }
>>    }
>> }
>>
>> That’s generally the method called by the Highlighter to decide which terms
>> should be highlighted.  So even if a term didn’t match the document, the
>> query that the term was in matched the document and it just blindly
>> highlights all the terms (minus prohibited ones).   That at least explains
>> the behavior you’re seeing, but it’s not ideal.  I’ve seen specialized
>> highlighters that convert to spans, which are accurate to the exact matches
>> within the document.  Been a while since I dug into the HighlightComponent,
>> so maybe there’s some other options available out of the box?
>>
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com <http://www.lucidworks.com/>
>>
>>
>>
>>
>>  On Feb 24, 2015, at 3:16 AM, Dmitry Kan <solrexp...@gmail.com> wrote:
>>>
>>> Erick,
>>>
>>> Our default operator is AND.
>>>
>>> Both queries below parse the same:
>>>
>>> a OR (b c) OR d
>>> a OR (b AND c) OR d
>>>
>>> The parsed query:
>>>
>>> <str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c)
>>> Contents:d</str>
>>>
>>> So this part is consistent with our expectation.
>>>
>>>
>>>> I'm a bit puzzled by your statement that "c" didn't contribute to the
>>>> score.
>>>
>>> What I meant was that the term c was not hit by the scorer: the explain
>>> section does not refer to it. I'm using made-up terms here, but the
>>> concept holds.
>>>
>>> The code suggests that we could benefit from storing term offsets and
>>> positions:
>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470
>>>
>>> Is this a correct assumption?
>>>
>>> On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>
>>>> Highlighting is such a pain...
>>>>
>>>> What does the parsed query look like? If the default operator is OR,
>>>> then this seems correct as both 'd' and 'c' appear in the doc. So
>>>> I'm a bit puzzled by your statement that "c" didn't contribute to the
>>>> score.
>>>>
>>>> If the parsed query is, indeed
>>>> a +b +c d
>>>>
>>>> then it does look like something with the highlighter. Whether other
>>>> highlighters are better for this case.. no clue ;(
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan <solrexp...@gmail.com>
>>>> wrote:
>>>>
>>>>> Erick,
>>>>>
>>>>> Nope, we are using the standard Lucene query parser with some
>>>>> customizations that do not affect the boolean query parsing logic.
>>>>>
>>>>> Should we try some other highlighter?
>>>>>
>>>>> On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson <erickerick...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Are you using edismax?
>>>>>>
>>>>>> On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan <solrexp...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> In Solr 4.3.1 there seems to be some inconsistency in the highlighting
>>>>>>> of the boolean query:
>>>>>>>
>>>>>>> a OR (b c) OR d
>>>>>>>
>>>>>>> This returns a proper hit, which shows that only d was included in the
>>>>>>> document score calculation.
>>>>>>>
>>>>>>> But the highlighter returns both d and c in <em> tags.
>>>>>>>
>>>>>>> Is this a known issue of the standard highlighter? Can it be mitigated?
>>>>>>>
>>>>>>> --
>>>>>>> Dmitry Kan
>>>>>>> Luke Toolbox: http://github.com/DmitryKey/luke
>>>>>>> Blog: http://dmitrykan.blogspot.com
>>>>>>> Twitter: http://twitter.com/dmitrykan
>>>>>>> SemanticAnalyzer: www.semanticanalyzer.info
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>


-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info
