Here is a manifestation of the bug when trying to highlight with ?q=daß:

<lst name="highlighting">
<lst name="ebdcc46ab3791a12dccd0f915a463bd2/node/11622">
<arr name="body">
<str>
-Community" einfach nicht mehr wahrnimmt.
Hätte mir am letzten Montag Nachmittag jemand gesagt, <strong>daß
</strong>ich am Abend
</str>
<str>
recht, wenn er sagte, d<strong>aß d</strong>as wirklich wertvolle an
Drupal schlichtweg seine (Entwickler- und Anwender-)
</str>
<str>
die Entstehungsgeschichte des Portals) auch dokumentiert worden, denn
Ihr vermutet schon richtig, da<strong>ß da</strong>
</str>
</arr>
</lst>
</lst>


You can see that each successive pair of "strong" tags is offset one character
further from where it is supposed to be.
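For what it's worth, a drift of one character per occurrence of "ß" is what you would expect if the match offsets were computed against an analyzed form of the text in which "ß" has been expanded to "ss" (e.g. by some accent/ASCII-folding step), while substring() is then applied to the original stored text. A minimal sketch of that mismatch (the expansion step is my guess at the cause, not Solr's actual code path):

```java
public class OffsetDrift {
    public static void main(String[] args) {
        String stored = "Straße, daß ich";            // original stored field text
        // Hypothetical analyzed form where each ß was expanded to "ss":
        String analyzed = stored.replace("ß", "ss");

        // Match offsets found against the analyzed text for the term "dass"...
        int start = analyzed.indexOf("dass");
        int end = start + "dass".length();

        // ...land one character too far when applied to the stored text,
        // because of the earlier ß in "Straße":
        System.out.println(stored.substring(start, end)); // prints "aß i"
    }
}
```

Each additional "ß" before the match shifts the highlight one more character, and once the accumulated shift pushes the end offset past stored.length(), you get exactly the StringIndexOutOfBoundsException Jacob reported.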


-Peter



On Mon, Feb 23, 2009 at 8:24 AM, Peter Wolanin <peter.wola...@acquia.com> wrote:
> We are using Solr trunk (1.4)  - currently " nightly exported - yonik
> - 2009-02-05 08:06:00"
>
> -Peter
>
> On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>> Jacob,
>>
>> What Solr version are you using? There is a bug in SolrHighlighter of Solr
>> 1.3,
>> you may want to look at:
>>
>> https://issues.apache.org/jira/browse/SOLR-925
>> https://issues.apache.org/jira/browse/LUCENE-1500
>>
>> regards,
>>
>> Koji
>>
>>
>> Jacob Singh wrote:
>>>
>>> Hi,
>>>
>>> We ran into a weird one today.  We have a document which is written in
>>> German, and every time we make a query which matches it, we get the
>>> following:
>>>
>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
>>>        at java.lang.String.substring(String.java:1935)
>>>        at
>>> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)
>>>
>>>
>>> From source diving, it looks like Lucene's highlighter is calling
>>> substring() with an offset that is outside the bounds of the body field
>>> it is highlighting against.  Running a fq against the ID of the
>>> document returns it fine (because no highlighting is done), and I took
>>> the body and tried to cut the first 2822 chars; while that is near
>>> the end of the body, it is still in range.
>>>
>>> Here is the related code:
>>>
>>> startOffset = tokenGroup.matchStartOffset;
>>> endOffset = tokenGroup.matchEndOffset;
>>> tokenText = text.substring(startOffset, endOffset);
>>>
>>>
>>> This leads me to believe there is some problem with multi-byte string
>>> encoding and Lucene's offset counting.
>>>
>>> Any ideas here?  Tomcat is configured with UTF-8 btw.
>>>
>>> Best,
>>> Jacob
>>>
>>>
>>>
>>
>>
>
>
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist,  Acquia. Inc.
> peter.wola...@acquia.com
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com
