Re: Error with highlighter and UTF-8 chars?

Peter Wolanin Tue, 24 Feb 2009 13:29:10 -0800

So it looks like something in the highlighting code is counting bytes
where it should be counting characters.  This looks like a Lucene bug,
so I'm surprised others have not hit it before.  It is probably this issue:
https://issues.apache.org/jira/browse/LUCENE-1500
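
To illustrate the kind of mismatch I mean, here is a minimal sketch in plain
Java.  It is not the Lucene highlighter code, and the class and method names
(OffsetDrift, utf8ByteOffset) are made up for this example.  It only shows
that an offset counted in UTF-8 bytes drifts ahead of the offset counted in
Java chars by one for every "ß" that precedes the match, which would be
consistent with the shifted tags in the output quoted below.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene or Solr.
class OffsetDrift {

    // Number of UTF-8 bytes that precede the given char index in text.
    static int utf8ByteOffset(String text, int charIndex) {
        return text.substring(0, charIndex)
                   .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "\u00df" is the German sharp s; it takes 2 bytes in UTF-8.
        String text = "gesagt, da\u00df ich, da\u00df er";
        String term = "da\u00df";

        int from = 0;
        int idx;
        while ((idx = text.indexOf(term, from)) >= 0) {
            int byteOffset = utf8ByteOffset(text, idx);
            System.out.println("char offset " + idx
                    + ", UTF-8 byte offset " + byteOffset
                    + ", drift " + (byteOffset - idx));
            from = idx + 1;
        }
    }
}
```

The first match has no drift and the second is off by one, because one "ß"
comes before it.  If byte offsets like these were ever passed to
String.substring, the highlight would slide to the right, and near the end of
a long field the offset could run past the end of the string.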


-Peter


On Tue, Feb 24, 2009 at 2:22 PM, Peter Wolanin <peter.wola...@acquia.com> wrote:
> Here you can see a manifestation of it when trying to highlight with ?q=daß
>
> <lst name="highlighting">
> <lst name="ebdcc46ab3791a12dccd0f915a463bd2/node/11622">
> <arr name="body">
> <str>
> -Community" einfach nicht mehr wahrnimmt.
> Hätte mir am letzten Montag Nachmittag jemand gesagt, <strong>daß
> </strong>ich am Abend
> </str>
> <str>
> recht, wenn er sagte, d<strong>aß d</strong>as wirklich wertvolle an
> Drupal schlichtweg seine (Entwickler- und Anwender-)
> </str>
> <str>
> die Entstehungsgeschichte des Portals) auch dokumentiert worden, denn
> Ihr vermutet schon richtig, da<strong>ß da</strong>
> </str>
> </arr>
> </lst>
> </lst>
>
>
> You can see that in each successive fragment the "strong" tags are
> shifted one character further from where they are supposed to be.
>
>
> -Peter
>
>
>
> On Mon, Feb 23, 2009 at 8:24 AM, Peter Wolanin <peter.wola...@acquia.com> wrote:
>> We are using Solr trunk (1.4)  - currently " nightly exported - yonik
>> - 2009-02-05 08:06:00"
>>
>> -Peter
>>
>> On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>>> Jacob,
>>>
>>> What Solr version are you using? There is a bug in SolrHighlighter in
>>> Solr 1.3; you may want to look at:
>>>
>>> https://issues.apache.org/jira/browse/SOLR-925
>>> https://issues.apache.org/jira/browse/LUCENE-1500
>>>
>>> regards,
>>>
>>> Koji
>>>
>>>
>>> Jacob Singh wrote:
>>>>
>>>> Hi,
>>>>
>>>> We ran into a weird one today.  We have a document which is written in
>>>> German, and every time we make a query which matches it, we get the
>>>> following:
>>>>
>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
>>>>        at java.lang.String.substring(String.java:1935)
>>>>        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)
>>>>
>>>>
>>>> From source diving, it looks like Lucene's highlighter is calling
>>>> substring with an offset that is outside the bounds of the body field
>>>> it is highlighting.  Running an fq against the ID of the
>>>> document returns it fine (because no highlighting is done).  I took
>>>> the body and tried to cut the first 2822 chars, and while that is near
>>>> the end of the body, it is still in range.
>>>>
>>>> Here is the related code:
>>>>
>>>> startOffset = tokenGroup.matchStartOffset;
>>>> endOffset = tokenGroup.matchEndOffset;
>>>> tokenText = text.substring(startOffset, endOffset);
>>>>
>>>>
>>>> This leads me to believe there is some problem with multibyte string
>>>> encoding and how Lucene counts offsets.
>>>>
>>>> Any ideas here?  Tomcat is configured with UTF-8 btw.
>>>>
>>>> Best,
>>>> Jacob
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Peter M. Wolanin, Ph.D.
>> Momentum Specialist,  Acquia. Inc.
>> peter.wola...@acquia.com
>>
>
>
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist,  Acquia. Inc.
> peter.wola...@acquia.com
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com
